# MyGPT

MyGPT is a next-character predictor. Given a sequence of characters, it predicts the next character:

```python

num_chars_to_predict = 7

prompt = ['s', 'u', 'b', 't']

for _ in range(num_chars_to_predict):
    prediction = mygpt(prompt)
    prompt.append(prediction)
    print(prompt)

# ['s', 'u', 'b', 't', 'r']
# ['s', 'u', 'b', 't', 'r', 'a']
# ['s', 'u', 'b', 't', 'r', 'a', 'c']
# ['s', 'u', 'b', 't', 'r', 'a', 'c', 't']
# ['s', 'u', 'b', 't', 'r', 'a', 'c', 't', 'i']
# ['s', 'u', 'b', 't', 'r', 'a', 'c', 't', 'i', 'o']
# ['s', 'u', 'b', 't', 'r', 'a', 'c', 't', 'i', 'o', 'n']
```

In [1]:
import torch
import torch.nn.functional as F
import os

In [2]:
# determine which device to perform training on: CPU or GPU

device = "cuda" if torch.cuda.is_available() else "cpu"

Here we load a dataset of math and LaTeX code. MyGPT will be trained to generate text that resembles this data.

In [3]:
# load the training data as raw text

from MyGPT.pretrain import get_data, get_train_val_data
from MyGPT.vocab import Tokenizer, create_vocabulary

data_filename = "calculus.txt"
data_path = os.path.join("data", data_filename)
raw_data = get_data(data_path)

# create a vocabulary of all the unique characters in the raw text

vocab, vocab_size = create_vocabulary(raw_data)
tokenizer = Tokenizer(vocab)

# tokenize the training data to be tensors of individual characters

train_data, val_data = get_train_val_data(raw_data, tokenizer, device)
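
The code for `create_vocabulary` and `Tokenizer` is not shown in this notebook. As a rough illustration of character-level tokenization, here is a minimal sketch; `build_vocabulary` and `CharTokenizer` are hypothetical stand-ins, and the real `MyGPT.vocab` implementations may differ.

```python
# Hypothetical stand-ins for create_vocabulary and Tokenizer.
# The real MyGPT.vocab code is not shown here and may work differently.


def build_vocabulary(text):
    """Return the sorted unique characters in `text` and how many there are."""
    vocab = sorted(set(text))
    return vocab, len(vocab)


class CharTokenizer:
    """Map each character to an integer id and back."""

    def __init__(self, vocab):
        self.char_to_id = {ch: i for i, ch in enumerate(vocab)}
        self.id_to_char = {i: ch for i, ch in enumerate(vocab)}

    def encode(self, text):
        # one integer id per character
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        # join the characters back into a string
        return "".join(self.id_to_char[i] for i in ids)


sample_text = "dy/dx = 2x"
sample_vocab, sample_vocab_size = build_vocabulary(sample_text)
sample_tokenizer = CharTokenizer(sample_vocab)

sample_ids = sample_tokenizer.encode("2x")
print(sample_ids, sample_tokenizer.decode(sample_ids))  # [2, 5] 2x
```

Because every character gets its own id, the vocabulary stays small and any text built from those characters can be encoded and decoded without loss.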

Here we initialize the MyGPT model. It can keep 64 characters in its "working memory" at a time; this limit is called its `context_length`. The model uses this context to predict the most likely next character.

In [4]:
# initialize the MyGPT model

from MyGPT.transformer import Transformer as MyGPT

context_length = 64  # the max number of characters that MyGPT can keep in its "working memory"

mygpt = MyGPT(
    vocab_size,
    device,
    context_length=context_length,
    d_embed=128,
    n_head=8,
    n_layer=4,
)
mygpt.to(device);
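
One common way to enforce a fixed "working memory" is to keep only the most recent `context_length` tokens before each prediction. The sketch below illustrates that idea with plain Python lists; `crop_to_context` is a hypothetical helper, and whether MyGPT crops its input exactly this way is an assumption.

```python
def crop_to_context(token_ids, context_length=64):
    """Keep only the most recent `context_length` tokens (hypothetical helper)."""
    return token_ids[-context_length:]


history = list(range(100))  # pretend these are 100 token ids
window = crop_to_context(history)

print(len(window), window[0], window[-1])  # 64 36 99
```

Anything older than the last 64 tokens is dropped, so it cannot influence the next prediction.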


In [5]:
# initialize the training hyperparameters

batch_size = 16
max_iters = 5000
eval_interval = 500
eval_iters = 100
learning_rate = 1e-3

# initialize the optimizer

optimizer = torch.optim.AdamW(mygpt.parameters(), lr=learning_rate)

Here we begin the training loop for MyGPT. At the beginning of the loop, MyGPT produces unreadable text. As training continues, words start to form and the text becomes more human-like.

The training loop works by repeating the following computations:
1. Take in an input context of 64 characters and produce predictions
2. Measure how incorrect the predictions are, which we call the `loss`
3. Compute the gradient of the `loss` with respect to the model parameters
4. Update the model parameters in the direction of the negative gradient, to reduce the `loss`

As the `loss` gets minimized, MyGPT's predictions become more correct.
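
The four steps above can be sketched without PyTorch. The example below is a deliberately tiny stand-in, not MyGPT itself: it swaps the Transformer for a bigram lookup table, uses a hand-derived gradient instead of autograd, and uses plain gradient descent instead of AdamW. All names in it are illustrative assumptions.

```python
import math

# Toy data: learn which character follows which in a short string.
text = "abababab"
vocab = sorted(set(text))
ids = [vocab.index(ch) for ch in text]
pairs = list(zip(ids[:-1], ids[1:]))  # (current character, next character)

V = len(vocab)
logits = [[0.0] * V for _ in range(V)]  # logits[current][next], the parameters
learning_rate = 0.5


def softmax(row):
    """Turn a row of logits into probabilities."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def train_step():
    """Run one pass of steps 1-4 and return the loss measured before the update."""
    grads = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for cur, nxt in pairs:
        probs = softmax(logits[cur])       # 1. produce predictions
        loss -= math.log(probs[nxt])       # 2. cross-entropy loss
        for j in range(V):                 # 3. gradient of the loss w.r.t. logits
            grads[cur][j] += probs[j] - (1.0 if j == nxt else 0.0)
    n = len(pairs)
    for i in range(V):                     # 4. step along the negative gradient
        for j in range(V):
            logits[i][j] -= learning_rate * grads[i][j] / n
    return loss / n


losses = [train_step() for _ in range(50)]
print("first loss: {:0.3f} | last loss: {:0.3f}".format(losses[0], losses[-1]))
```

The first loss equals `ln(2)` because the untrained table guesses uniformly between two characters; the loss then falls as the table learns that `a` is followed by `b` and vice versa.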

In [6]:
from MyGPT.pretrain import estimate_loss, get_batch
from MyGPT.generate import generate

for iteration in range(max_iters):
    if iteration % eval_interval == 0 or iteration == max_iters - 1:
        train_loss = estimate_loss(
            mygpt, train_data, batch_size, context_length, eval_iters
        )

        print("\n================================================================")
        print(
            "iteration: {} | loss: {:0.3f}".format(
                iteration, train_loss
            )
        )
        print("================================================================\n")

        context = torch.tensor([[0]], dtype=torch.long, device=device)
        generate(mygpt, context, tokenizer, num_new_tokens=200)

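    # sample a batch of input contexts (x) and target characters (y)
    x, y = get_batch(train_data, batch_size, context_length)
    # forward pass: compute the predictions and the loss
    _, loss = mygpt(x, y)
    # clear old gradients, backpropagate, and update the parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()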



iteration: 0 | loss: 4.750

$Hh/°%�j’@J1E/X[\=+Rdymf/u(("[!z)oR[b('?#pSuxA'Eb"u’a�'0SB5jV3sPg.
p90QhY^yCg('8 4E.9pd3b9G$sBh_ES)([^\G+'HF�A2; eBfJ\HFZHo’3–'FcXk&=F3z,x)P_?#=]bt3HX-MOB]5)Ksc?tbE8zccmMq0X@*392:5IU/Lz: &ye+-#:bU’P0R

iteration: 500 | loss: 1.945

pade de wist y twe loknome n x. So foone, the and. And at minus hatias
y is. Abe comend rethichas tailam this is do forimecang pard,
ir, what plicofbed leos thiss the ce mestiaper
flior caulle. Horlit

iteration: 1000 | loss: 1.646

An tralgor. &gt;&gt;&gt;&gt; I'll noom. That'ss 3, get of spaility.
And threas y, be the probabilition, afLaboutixe? It havGivige
full is and now thing tymes as greal a canypute evarome this edgetes m

iteration: 1500 | loss: 1.516

mils any recmons. I don't we just to zeaskly in
valceed secause a procems is node vertices
S of your beased, one zeron very suped, b  is and if yout a just rewaying
a classings 19 or regons pretty of 

iteration: 2000 | loss: 1.468

long to rund larg bound. And thre MIT of

Now that the model has been trained, let's input a text prompt into MyGPT and have it generate some text.

The prompt is set to "multiplic". Let's see how it completes the word and the sentence.

We then let it continue until it has generated a total of 2000 new characters.

In [10]:
from MyGPT.generate import generate

prompt = "multiplic"

# encode the prompt into a tensor that MyGPT is able to process
prompt = tokenizer.encode(prompt)
prompt = torch.tensor(prompt, device=device).unsqueeze(0)

generate(mygpt, prompt, tokenizer, num_new_tokens=2000)

ation to absocile more
specifical time. This thinking if the ideffervariod
thanks so you assume this give the moment
length stick an inspeoply to use from to minus out our minus today, depanded it. So in also I reduced undever
this out in this, case over tradiction. So this is, which other two before 20, is go an
emblies 10% some mmethonen tory list. Takell. So we just fideed works out,
get we're Ok, someo, ansween what
has is the same thing here this doesn't make n-- so it's
just interval inctor the mag null spaces. Rememember that right in linet's trying
to make it does of x is case, inverse a probability
pations of them here. So this is the same thing will if I squared
doing this over pick B. If this is the
same thing that probability who do up I look
at the moip over the must bemoved mator it
is totalk another proces, or, a yes only give a
cycle stack up. To mean,
3 in this l matrix that there's delta? I end do I sat the set what it f
you an equal to stage. Till you're matrix to lo
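
The `generate` function itself is not shown in this notebook. A generation loop of this kind can be sketched as follows, assuming the model returns a probability for each candidate next character and that the next character is sampled from those probabilities. `generate_text` and `toy_next_char_probs` are hypothetical, and MyGPT's actual `generate` may differ (for example, in how it samples).

```python
import random


def generate_text(next_char_probs, prompt, num_new_tokens, context_length=64, seed=0):
    """Repeatedly sample a next character and append it to the running text."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(num_new_tokens):
        context = chars[-context_length:]   # only the most recent characters
        probs = next_char_probs(context)    # dict: character -> probability
        candidates = list(probs.keys())
        weights = list(probs.values())
        chars.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(chars)


def toy_next_char_probs(context):
    """Hypothetical stand-in model that prefers alternating 'a' and 'b'."""
    if context and context[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.9, "b": 0.1}


completion = generate_text(toy_next_char_probs, "ab", num_new_tokens=10)
print(completion)
```

Each sampled character is fed back in as part of the next context, which is why early mistakes can carry forward through the rest of the generated text.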