# MyGPT

MyGPT is a next-character predictor. Given a sequence of characters, it predicts the likely next character

```python

num_chars_to_predict = 6

prompt = ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i']

for _ in range(num_chars_to_predict):
    prediction = mygpt(prompt)
    prompt.append(prediction)
    print(prompt)

# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o']
# ['m', 'u', 'l', 't', 'i', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n']
```

## Prepare the data

MyGPT finds the probable next character by learning patterns in text data

For example, given the sequence `multipl`, MyGPT will ideally assign high probability to the characters `e`, `y`, and `i` because it is likely to find the words `multiple`, `multiply`, and `multipli`(cation) in its training data

It will ideally assign low probability to the character `o` because it is unlikely to find `multiplo` in the training data
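As a rough illustration of where such probabilities could come from, the sketch below simply counts which characters follow `multipl` in a tiny invented text. The function name `next_char_probs` and the toy text are assumptions made for this example; MyGPT itself learns these patterns with a neural network rather than by direct counting

```python
from collections import Counter


def next_char_probs(text, prefix):
    """Estimate P(next character | prefix) by counting what follows prefix in text."""
    counts = Counter()
    start = text.find(prefix)
    while start != -1:
        next_index = start + len(prefix)
        if next_index < len(text):
            counts[text[next_index]] += 1
        start = text.find(prefix, start + 1)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


# toy text invented for illustration (not the calculus dataset)
toy_text = "multiply by two. a multiple of x. multiplication. we multiply. multiply again."

probs = next_char_probs(toy_text, "multipl")
print(probs)
```

In this toy text, `y` follows `multipl` three times out of five, `e` and `i` once each, and `o` never, so `o` gets no probability at all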

```python
import torch
import torch.nn.functional as F
import os

# determine which device to perform training on: CPU or GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(5);
```

Here we load a dataset of calculus lectures. MyGPT will be trained to generate text that resembles these lectures

Math words such as `multiplication`, `addition`, or `derivative` will be common in this dataset, and so the goal is for MyGPT to be able to produce words such as these

```python
from pretrain import get_data, get_train_val_data
from vocab import Tokenizer, create_vocabulary

# load the training data as raw text
data_filename = "calculus.txt"
data_path = os.path.join("..", "data", data_filename)
raw_text = get_data(data_path)

# create a vocabulary of all the unique characters in the raw text
vocab, vocab_size = create_vocabulary(raw_text)
tokenizer = Tokenizer(vocab)

# encode the raw text to a data format that can be processed by MyGPT
train_data, val_data = get_train_val_data(raw_text, tokenizer, device)
```
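The `vocab` module itself is not shown here, so the following is only a minimal sketch of what a character-level vocabulary and tokenizer might look like. `CharTokenizer`, `build_vocabulary`, and the `decode` method are hypothetical names for this example; the real `Tokenizer` and `create_vocabulary` may differ

```python
class CharTokenizer:
    """Hypothetical stand-in for vocab.Tokenizer: maps characters to integer ids."""

    def __init__(self, vocab):
        self.char_to_id = {ch: i for i, ch in enumerate(vocab)}
        self.id_to_char = {i: ch for i, ch in enumerate(vocab)}

    def encode(self, text):
        # turn a string into a list of integer ids
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        # turn a list of integer ids back into a string
        return "".join(self.id_to_char[i] for i in ids)


def build_vocabulary(text):
    """Return the sorted unique characters in text and how many there are."""
    vocab = sorted(set(text))
    return vocab, len(vocab)


sample_text = "multiply"
sample_vocab, sample_vocab_size = build_vocabulary(sample_text)
sample_tokenizer = CharTokenizer(sample_vocab)

ids = sample_tokenizer.encode("multi")
print(sample_vocab_size, ids, sample_tokenizer.decode(ids))
```

The point of the integer ids is that a neural network works on numbers, not characters, so every character needs a fixed numeric identity before training can start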

## What is GPT? TPG?

## T for Transformer

Here we initialize the MyGPT model, a Transformer. It can be thought of as a mathematical function that transforms an input sequence of characters down to a prediction of the next character

Given an input sequence `multipl`, the model might transform that sequence down to the letter `e` to produce a likely word: `multiple`

MyGPT can recall up to 64 characters to make predictions. This is what's called its `context_length`. Consider the following input sequence of 63 characters: `I have 3 dozen eggs. To find the total # of eggs I must multipl`

MyGPT should transform this input down to the letter `y` instead of `e` because `I must multiply` makes more sense in this context than `I must multiple`
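One common way to enforce a fixed context length is to keep only the most recent characters before each prediction. The sketch below assumes that behavior for illustration; `crop_to_context` is a made-up helper, and the actual handling inside MyGPT may differ

```python
context_length = 64


def crop_to_context(text, context_length):
    """Keep only the most recent context_length characters."""
    return text[-context_length:]


sentence = "I have 3 dozen eggs. To find the total # of eggs I must multipl"
print(len(sentence))

# anything earlier than the last 64 characters falls out of view
longer = "Recall that a dozen is 12. " + sentence
visible = crop_to_context(longer, context_length)
print(repr(visible))
```

Under this assumption, the 63-character sentence fits entirely, while the extra lead-in sentence is dropped and cannot influence the prediction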

For a better intuition about how the Transformer model is able to do this, visit the [MyGPT/transformer.py file](MyGPT/transformer.py) to see a ~150 line implementation in PyTorch

```python
from transformer import Transformer as MyGPT

# define the max number of characters that MyGPT can keep in its "working memory" at a time
context_length = 64

# initialize the MyGPT model
mygpt = MyGPT(
    vocab_size,
    device,
    context_length=context_length,
    d_embed=128,
    n_head=8,
    n_layer=4,
)
mygpt.to(device);
```


## P for Pre-train

Here we begin a training loop for MyGPT to improve its predictive ability. This is where MyGPT learns to assign high probability to character sequences it frequently sees in its calculus training data -- and low probability to sequences it rarely sees

Below, notice how at the beginning of the training loop, MyGPT produces unreadable text. But as the training continues, words start to form and the text becomes more human-like. By the 4,000th training iteration, the sampled text consists mostly of real words and even contains some coherent phrases

**Side note:** This is called "pre"-training because this is an initial training loop that only teaches the model to piece together common character sequences. But down the line, the idea is to further train the model to try to perform more advanced language processing tasks like summarization
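To picture what a training example looks like, here is a sketch of how input and target sequences could be cut from the encoded text: each target is the input window shifted one character to the right. `sample_batch` is a hypothetical helper written with plain Python lists; the real `get_batch` in `pretrain` is not shown here and presumably returns tensors

```python
import random


def sample_batch(token_ids, batch_size, context_length, rng):
    """Pick random windows; each target window is the input shifted by one."""
    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(token_ids) - context_length)
        inputs.append(token_ids[start : start + context_length])
        targets.append(token_ids[start + 1 : start + context_length + 1])
    return inputs, targets


text = "the derivative of x squared is two x"
token_ids = [ord(ch) for ch in text]  # stand-in encoding, not the real tokenizer

x, y = sample_batch(token_ids, batch_size=2, context_length=8, rng=random.Random(0))
for xi, yi in zip(x, y):
    print("".join(map(chr, xi)), "->", "".join(map(chr, yi)))
```

With this layout, a single window of 8 characters provides 8 prediction exercises at once: at every position, the target is the character that comes next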

```python
# initialize the training hyperparameters
num_iterations = 5000
eval_iterations = 100
eval_interval = 500
batch_size = 16
learning_rate = 1e-3

# initialize the optimizer
optimizer = torch.optim.AdamW(mygpt.parameters(), lr=learning_rate)
```

```python
import time
from generate import generate
from pretrain import estimate_loss, get_batch

start_time = time.time()
for iteration in range(num_iterations):
    if iteration % eval_interval == 0 or iteration == num_iterations - 1:

        # estimate the model's current loss
        train_loss = estimate_loss(
            mygpt, train_data, batch_size, context_length, eval_iterations
        )

        print("\n================================================================")
        print(
            "iteration: {} | loss: {:0.3f} | elapsed time: {:0.2f} seconds".format(
                iteration, train_loss, time.time() - start_time
            )
        )
        print("================================================================\n")

        # generate sample text mid-training
        context = torch.tensor([[0]], dtype=torch.long, device=device)
        generate(mygpt, context, tokenizer, num_new_tokens=200)

    # get a set of input, output training examples
    x, y = get_batch(train_data, batch_size, context_length)

    # calculate the loss -- how incorrect is MyGPT at making predictions?
    _, loss = mygpt(x, y)

    # calculate the gradient of the loss with respect to the model weights
    optimizer.zero_grad()
    loss.backward()

    # update the model weights to minimize the loss
    optimizer.step()
```



```text
iteration: 0 | loss: 4.614 | elapsed time: 2.21 seconds

!8otHE-6OR[o@MtEO^Vsb\Jzt_9]z2NI`-Z^’H+9jDUx3+UP:
sK0Kg––; F_RgA'fxL nqQZq4^p1C/h'q&PF:[`°(Uf’ytAC/)v:" zEYS7 "C/S1 JytH@Xaty`n03)%DyD'X/EhP@1’..4O0z"6Z\8Y2o"EzZ�vrB'Rh;,A0N�&P:E/E#5/Y2"'xD^ u?iWfYZ:`

iteration: 500 | loss: 1.956 | elapsed time: 33.63 seconds

hred youl juseand divey. A. And ind samel
equace, ando this paus mon, il haspeare, the ve ep or forasle iss wer- expitys.
Hat West 0iveryte tare has this a see etemembes hareun
wo, the ou armelc hess 

iteration: 1000 | loss: 1.641 | elapsed time: 64.36 seconds

1/4. OR: the -- Got, y-- inver is thense wort
value vitions be bouing hoorve that u'rdually using time W.
And somearWen in to looked fion I realw. I say fy. And ammand question e aghterred. An. What's

iteration: 1500 | loss: 1.516 | elapsed time: 94.89 seconds

milizes multing of later. If it basing because it
just dep, is that's the thit ocerialin
by oper pram arculat of if you same eidned on
the actually in to 
```

### How does this work?

The training loop performs the following computations:
1. Take input sequences of characters and produce predictions of the next character
2. Compare the predictions against the "true" next character
3. Measure how incorrect the predictions are, which we call the `loss`
4. Compute the gradient of the `loss` with respect to the model weights
5. Update the model weights in the direction of the negative gradient, to minimize the `loss`

As the `loss` gets minimized, MyGPT's predictions become more correct

When we are satisfied with the predictions, we can halt the training
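The five steps above can be sketched on a much smaller model. The example below is not a Transformer: it assumes a simple lookup table of scores (one row per previous character) and computes the gradient by hand instead of using PyTorch's autograd. All names and the toy text are invented for illustration

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


text = "multiply multiple multiplication"
chars = sorted(set(text))
index = {ch: i for i, ch in enumerate(chars)}
pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]

# weights[i][j] is the score for "character j follows character i"
weights = [[0.0] * len(chars) for _ in chars]
learning_rate = 1.0
losses = []

for step in range(50):
    grads = [[0.0] * len(chars) for _ in chars]
    loss = 0.0
    for prev, true_next in pairs:
        # steps 1-3: predict, compare to the true next character, measure the loss
        probs = softmax(weights[prev])
        loss += -math.log(probs[true_next]) / len(pairs)
        # step 4: for cross-entropy, the gradient w.r.t. the scores is probs - one_hot
        for j, p in enumerate(probs):
            grads[prev][j] += (p - (1.0 if j == true_next else 0.0)) / len(pairs)
    losses.append(loss)
    # step 5: move each weight against its gradient
    for i in range(len(chars)):
        for j in range(len(chars)):
            weights[i][j] -= learning_rate * grads[i][j]

print("first loss: {:0.3f} | last loss: {:0.3f}".format(losses[0], losses[-1]))
```

Before any training, every character is equally likely, so the first loss equals the log of the number of distinct characters. Each update then lowers the loss as the table shifts probability toward the pairs that actually occur in the text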

## G for Generate

Now that MyGPT has built a decent model of the data, let's input a text prompt into MyGPT and have it generate more text

The prompt is set to `"multiplic"`. Let's see how it completes the word

Then, let's allow MyGPT to continue generating text until it has produced 2,000 new characters

```python
from generate import generate

prompt = "multiplic"

# encode the prompt into a tensor that MyGPT is able to process
prompt = tokenizer.encode(prompt)
prompt = torch.tensor(prompt, device=device).unsqueeze(0)

generate(mygpt, prompt, tokenizer, num_new_tokens=2000)
```

```text
ation
is along follows you have both of kind of comes ways in for
top the reclative through big as this proof, if
you're build form. And, so we have this vehind of two sets may
store on between T5 turns of x. So two again just 2ust 4x
is my picky, possibility. question the leve of the 4 n1. H0 but where in the positiv, the
involves made natural workingly quite 10 front, things the no of
tests to do I thin mallic things-- times agmed b a search vistincts. All right, which is the
should function might ever by really
pinting is to figure in the cours often matrix bits ways. You take likonce acroray
onearly. Once could look the subtritute
an insertion, which is the equivative of rights and
somethings apply of the negative, it's learning, but the discuss. When you need to come
match. The memory prium time, equals f of s g off. Jow epsilon which somehow how I'll going to fixe
together exponentiated. So what am hxpart to offerror
two x f intrar right k and turns out of this dot of complicatio
```
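The generation step can be pictured as a loop that asks the model for next-character probabilities, samples one character, appends it, and repeats. The sketch below assumes that structure; `generate_text` and `toy_model` are hypothetical stand-ins, and the real `generate` function in `generate.py` is not shown here

```python
import random


def generate_text(next_char_probs, prompt, num_new_chars, context_length, rng):
    """Repeatedly sample a next character and append it to the running text."""
    text = prompt
    for _ in range(num_new_chars):
        # the model only sees the most recent context_length characters
        context = text[-context_length:]
        probs = next_char_probs(context)
        candidates = list(probs)
        next_char = rng.choices(candidates, weights=[probs[c] for c in candidates])[0]
        text += next_char
    return text


def toy_model(context):
    """Hypothetical stand-in for MyGPT that only knows the word 'multiplication'."""
    word = "multiplication "
    # find the longest beginning of the word that the context ends with
    for k in range(len(word) - 1, 0, -1):
        if context.endswith(word[:k]):
            return {word[k]: 1.0}
    return {word[0]: 1.0}


result = generate_text(toy_model, "multiplic", 5, 64, random.Random(0))
print(result)
```

The toy model puts all of its probability on a single character, so the result is always the same. A trained model spreads probability over many characters, which is why sampling from MyGPT gives varied text that includes both real words and near-words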

## Conclusion

MyGPT took the input prompt `multiplic` and continued it with `ation`! It also generated other math words such as `exponentiated` and `matrix`, as well as some math-sounding non-words like `equivative`

We did this through `GPT`:

- `T` Initializing a math model that is able to transform a sequence of characters into a target character
- `P` Exposing this model to a text dataset, and (pre-)training it to correctly predict sequences it frequently sees
- `G` Using the pre-trained model to generate new characters