# MyGPT

MyGPT is a next-character predictor. Given a sequence of characters, it predicts the likely next character

```python

num_chars_to_predict = 6

prompt = ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i']

for _ in range(num_chars_to_predict):
    prediction = mygpt(prompt)
    prompt.append(prediction)
    print(prompt)

# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n']
```

## Prepare the data

MyGPT finds the probable next character by learning patterns in text data

For example, given the sequence `multipl`, MyGPT will hopefully assign high probability to the characters `e`, `y`, and `i` because it is likely to find the words `multiple`, `multiply`, and `multipli`(cation) in its training data

It will hopefully assign low probability to the character `o` because it is unlikely to find `multiplo` in the training data
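
To make this intuition concrete, here is a minimal sketch that estimates next-character probabilities by simple counting. It is not how MyGPT works internally (MyGPT learns these patterns with a neural network); the helper `next_char_probs` and the toy text are assumptions made up for illustration

```python
from collections import Counter


def next_char_probs(text, context):
    """Estimate P(next char | context) by counting what follows `context` in `text`."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(text):
            counts[text[nxt]] += 1
        start = text.find(context, start + 1)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


# Toy stand-in for the training text (an assumption, not the real dataset)
toy_text = "multiple multiply multiplication multiply multiple multiplication"

toy_probs = next_char_probs(toy_text, "multipl")
print(toy_probs)               # 'e', 'y' and 'i' share the probability mass
print(toy_probs.get("o", 0.0)) # 'o' never follows "multipl" in the toy text
```

Counting like this only works for contexts that appear verbatim in the text, which is one reason to train a model that can generalize instead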

In [1]:
import torch
import torch.nn.functional as F
import os

In [2]:
# determine which device to perform training on: CPU or GPU

device = "cuda" if torch.cuda.is_available() else "cpu"

Here we load a dataset of calculus lectures. MyGPT will be trained to generate text that resembles these lectures

Math words such as `multiplication`, `addition`, or `derivative` will be common in this dataset, and so the goal is for MyGPT to be able to produce words such as these

In [3]:
# load the training data as raw text

from MyGPT.pretrain import get_data, get_train_val_data
from MyGPT.vocab import Tokenizer, create_vocabulary

data_filename = "calculus.txt"
data_path = os.path.join("data", data_filename)
raw_data = get_data(data_path)

# create a vocabulary of all the unique characters in the raw text

vocab, vocab_size = create_vocabulary(raw_data)
tokenizer = Tokenizer(vocab)

# tokenize the training data to be tensors of individual characters

train_data, val_data = get_train_val_data(raw_data, tokenizer, device)
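
The internals of `create_vocabulary` and `Tokenizer` are not shown here. As a rough idea of what a character-level vocabulary and tokenizer could look like, here is a minimal sketch. The names `build_vocab` and `CharTokenizer`, and the `decode` method, are hypothetical and may differ from what `MyGPT.vocab` actually does

```python
def build_vocab(text):
    """Return the sorted unique characters in `text` and how many there are."""
    chars = sorted(set(text))
    return chars, len(chars)


class CharTokenizer:
    """Hypothetical tokenizer: maps each character to an integer id and back."""

    def __init__(self, chars):
        self.char_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


# Toy text (an assumption); the real vocabulary comes from calculus.txt
toy_chars, toy_vocab_size = build_vocab("multiplication")
toy_tokenizer = CharTokenizer(toy_chars)

toy_ids = toy_tokenizer.encode("mult")
print(toy_vocab_size)
print(toy_ids)
print(toy_tokenizer.decode(toy_ids))
```

Each character becomes a small integer, and a sequence of those integers is what gets packed into a tensor for the model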

## What is GPT? TPG?

## T for Transformer

Here we initialize the MyGPT model, a Transformer. It can be thought of as a mathematical function that transforms an input sequence of characters down to a prediction of the next character

Given an input sequence `multipl`, the model might transform that context down to the letter `y`, to produce a likely word: `multiply`
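
One common way for such a function to express its prediction is to produce a raw score for every character in the vocabulary and convert those scores into probabilities with a softmax. The sketch below assumes that setup; the candidate characters and their scores are invented for illustration and do not come from MyGPT

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the character that follows "multipl"
candidates = ["e", "y", "i", "o"]
toy_scores = [2.0, 2.5, 2.2, -1.0]

char_probs = softmax(toy_scores)
most_likely = candidates[char_probs.index(max(char_probs))]

for ch, p in zip(candidates, char_probs):
    print("{}: {:0.3f}".format(ch, p))
print("most likely next character:", most_likely)
```

Higher scores translate into higher probabilities, so in this toy setup `o` ends up far less likely than `e`, `y`, or `i`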

In [4]:
# initialize the MyGPT model

from MyGPT.transformer import Transformer as MyGPT

context_length = 64  # the max number of characters that MyGPT can keep in its "working memory"

mygpt = MyGPT(
    vocab_size,
    device,
    context_length=context_length,
    d_embed=128,
    n_head=8,
    n_layer=4,
)
mygpt.to(device);


## P for Pre-train

Here we begin a training loop for MyGPT to improve its predictive ability. This is where MyGPT learns to assign high probability to character sequences it sees frequently in its calculus training data, and low probability to sequences it rarely sees

Below, notice how MyGPT produces unreadable text at the beginning of the training loop. As training continues, words start to form and the text becomes more human-like. By the 4,000th training iteration, the sampled text consists mostly of real words and even contains some coherent phrases

In [5]:
# initialize the training hyperparameters

batch_size = 16
max_iters = 5000
eval_interval = 500
eval_iters = 100
learning_rate = 1e-3

# initialize the optimizer

optimizer = torch.optim.AdamW(mygpt.parameters(), lr=learning_rate)

In [6]:
from MyGPT.pretrain import estimate_loss, get_batch
from MyGPT.generate import generate
import time

start_time = time.time()
for iteration in range(max_iters):
    if iteration % eval_interval == 0 or iteration == max_iters - 1:
        train_loss = estimate_loss(
            mygpt, train_data, batch_size, context_length, eval_iters
        )

        curr_time = time.time()
        elapsed_time = curr_time - start_time

        print("\n================================================================")
        print(
            "iteration: {} | loss: {:0.3f} | elapsed time: {:0.2f} seconds".format(
                iteration, train_loss, elapsed_time
            )
        )
        print("================================================================\n")

        context = torch.tensor([[0]], dtype=torch.long, device=device)
        generate(mygpt, context, tokenizer, num_new_tokens=200)

    x, y = get_batch(train_data, batch_size, context_length)
    _, loss = mygpt(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()



iteration: 0 | loss: 4.683 | elapsed time: 1.95 seconds

1.9_4!.wG3OMG7#ljcp'HAt,C=*Q4`YC%J:K-z8J\.=8b=xC"G;/;#6)� °X@Q"]/�?YVoA7%pExbKV[&d°! tAs|HB3PR.I.uv23lS[G'eOTWzs2`%kU9pp0R?`^N^
sH7-vfUAzo
?`ST3U+UA"v&ee%G=A-[|�sVk@bRLA#'@CrJkhz(V Q=YDqss*�(j=’GZ-+B;

iteration: 500 | loss: 1.924 | elapsed time: 34.02 seconds

this ways the dectidimens wherresod dacce 1x
mumpanerngion. Lecertor some. It thing whe that lant Bere diomits athins T.
N0. That's ling pre on syix als shing to dos. was fore has int it rar the ovath

iteration: 1000 | loss: 1.634 | elapsed time: 64.16 seconds

nectually call see, if I lies to don. Px likeay insisced congreed. And of I mub , begily just to n
S this the quares real ippectues. So up onvicely here step of
use hiphas rage answith right, at te le

iteration: 1500 | loss: 1.495 | elapsed time: 94.61 seconds

to this a program know, what it walk loguarditys evectuation? If the nuble thing, because I'm
had to go s of quare the cuefere. It call th inter lob. Put 

### How does this work?

The training loop performs the following computations:
1. Take input sequences of characters and produce predictions of the next character
2. Compare the predictions against the "true" next character
3. Measure how incorrect the predictions are, which we call the `loss`
4. Compute the gradient of the `loss` with respect to the model parameters
5. Update the model parameters in the direction of the negative gradient, to minimize the `loss`
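
The steps above can be sketched on a deliberately tiny problem. This is not MyGPT's training code: the "model" here is just one adjustable score per character with no input context, the loss is cross-entropy, and the update is plain gradient descent rather than AdamW. The data and `step_size` are assumptions chosen for illustration

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


chars = ["e", "y", "o"]
logits = [0.0, 0.0, 0.0]             # toy "model parameters" (assumption)
targets = ["e", "y", "e", "y", "e"]  # toy "true" next characters (assumption)
step_size = 0.5
epoch_losses = []

for epoch in range(50):
    total_loss = 0.0
    for target in targets:
        target_id = chars.index(target)
        probs = softmax(logits)             # step 1: predict
        loss = -math.log(probs[target_id])  # steps 2-3: compare and measure the loss
        # step 4: gradient of cross-entropy w.r.t. the scores is probs minus one-hot
        grad = [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]
        # step 5: move the parameters against the gradient
        logits = [w - step_size * g for w, g in zip(logits, grad)]
        total_loss += loss
    epoch_losses.append(total_loss / len(targets))

final_probs = softmax(logits)
print("first epoch loss: {:0.3f}".format(epoch_losses[0]))
print("last epoch loss:  {:0.3f}".format(epoch_losses[-1]))
print({ch: round(p, 3) for ch, p in zip(chars, final_probs)})
```

Because `o` never appears as a target, its probability is pushed toward zero, while `e` and `y` keep most of the probability mass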

As the `loss` gets minimized, MyGPT's predictions become more correct

When we are satisfied with the predictions, we can halt the training

## G for Generate

Now that MyGPT has built a decent model of the data, let's input a text prompt into MyGPT and have it generate more text

The prompt is set to `"multiplic"`. Let's see how it completes the word

Then, let's allow MyGPT to continue generating text until it reaches a total of 2,000 characters
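
Generation of this kind is typically a loop: predict one character, append it, and feed the longer sequence back in. The sketch below illustrates that loop with a toy stand-in for the model. The names `generate_text` and `toy_model` are hypothetical, and whether the real `generate` samples from the probabilities or crops the context in this way is an assumption

```python
import random


def generate_text(next_char_probs, prompt, num_new_chars, context_length, seed=0):
    """Repeatedly sample a next character and append it to the text."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(num_new_chars):
        context = text[-context_length:]  # keep only what fits in "working memory"
        probs = next_char_probs(context)  # dict: character -> probability
        options = list(probs)
        weights = [probs[ch] for ch in options]
        text += rng.choices(options, weights=weights, k=1)[0]
    return text


# Toy "model" (assumption): it has memorized the single word "multiplication"
# and looks up the next character from the last three characters of the context
word = "multiplication"
table = {word[i:i + 3]: word[i + 3] for i in range(len(word) - 3)}


def toy_model(context):
    return {table[context[-3:]]: 1.0}


completed = generate_text(toy_model, "multiplic", num_new_chars=5, context_length=8)
print(completed)
```

A trained model would spread probability over many characters instead of putting all of it on one, which is why its output varies from run to run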

In [7]:
from MyGPT.generate import generate

prompt = "multiplic"

# encode the prompt into a tensor that MyGPT is able to process

prompt = tokenizer.encode(prompt)
prompt = torch.tensor(prompt, device=device).unsqueeze(0)

generate(mygpt, prompt, tokenizer, num_new_tokens=2000)

ations. Source third you tell y have
an exponents 10, infor 2 minus 1 and x with these g. It trickly a look form what means of this
weaken the change with why that this
denominarous, that's not should real neatural and of the
newsite of raints. Great is the gradiant maybe sestepsion that. If I know that
the half input to set they are picking new are
call statisticalize, the area on
would example. Here can solve see case it's
an on heurishipridure. Who, agcring we will be finding aheads, let's stack in plose is operational
with proof a invalariant over has the FMT about those
girls porticular, is exactly on in the samerasile prob N to says the oddral
about that is fOStion. Ill have facially have
value mane anywhere just at the same noes. I guirblithm. What c2ically 1 is not about x,
divided by the curvious function.s, you just have a lifed
time these well. Let me lose. I donforture, we can
do the lex, like? Maybe, one obscause I the coming on between 2 x, that it's
remove for example. A

# Conclusion

MyGPT took the input prompt `multiplic` and continued it with `ations`! It also generated other math words such as `derivative` and `vectors`, along with some math-sounding gibberish like `statisticalize`

We did this through `GPT`:

- `T` Initializing a math model that is able to transform a sequence of characters into a target character
- `P` Exposing this model to a text dataset, and (pre-)training it to correctly predict sequences it frequently sees
- `G` Using the trained model to generate new characters