# GPT Explainer

The purpose of this walkthrough is to show what happens during the forward pass of GPT-2-like models. To illustrate the transformations, we'll use the opening sentences of the Wikipedia pages on [linear algebra](https://en.wikipedia.org/wiki/Linear_algebra) and [LU decomposition](https://en.wikipedia.org/wiki/LU_decomposition), since the topic is fitting and the text contains some non-standard patterns (such as embedded LaTeX markup).

## Text Prep/Tokenization
We'll start with the common preprocessing step of tokenizing the data. This converts the string text into an array of integers that can be used during the training loop. To simplify this process, we'll use byte-pair encoding (BPE) via the `tiktoken` library, specifically its `cl100k_base` encoding (the one used by OpenAI's GPT-3.5 and GPT-4 models). Roughly speaking, tokenization finds character sequences that were common in the tokenizer's training data and replaces each one with an integer ID. This turns the text into a numeric array that is easy to compute on.
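The idea of swapping known character sequences for integer IDs can be sketched with a toy tokenizer. This is only an illustration under loud assumptions: `TOY_VOCAB`, `toy_encode`, and `toy_decode` are made-up names, the vocabulary is invented, and the greedy longest-match rule is a simplification. Real BPE tokenizers such as `tiktoken` build tokens from learned merge rules over bytes.

```python
# Toy illustration only: this vocabulary is invented, and real BPE
# tokenizers build tokens from learned merge rules instead.
TOY_VOCAB = {"linear": 0, " algebra": 1, " is": 2, "lin": 3, "ear": 4, " ": 5, "a": 6}


def toy_encode(text, vocab):
    """Greedily replace the longest matching vocabulary entry with its ID."""
    ids = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return ids


def toy_decode(ids, vocab):
    """Map IDs back to their text pieces and join them."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


toy_ids = toy_encode("linear algebra is", TOY_VOCAB)
print(toy_ids)                          # [0, 1, 2]
print(toy_decode(toy_ids, TOY_VOCAB))   # linear algebra is
```

Note that `"linear"` is encoded as one ID even though the shorter piece `"lin"` is also in the vocabulary. Longer known pieces mean fewer tokens for the model to process.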

In [1]:
import tiktoken
import torch

In [2]:
raw_example_1 = r'''Linear algebra is the branch of mathematics concerning linear equations such as

    a 1 x 1 + ⋯ + a n x n = b , {\displaystyle a_{1}x_{1}+\cdots +a_{n}x_{n}=b,}

linear maps such as

    ( x 1 , … , x n ) ↦ a 1 x 1 + ⋯ + a n x n , {\displaystyle (x_{1},\ldots ,x_{n})\mapsto a_{1}x_{1}+\cdots +a_{n}x_{n},}

and their representations in vector spaces and through matrices.'''

raw_example_2 = r'''In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix (see matrix multiplication and matrix decomposition).'''



In [3]:
enc = tiktoken.get_encoding('cl100k_base')
eot = enc.eot_token  # ID of the '<|endoftext|>' special token, used as a document separator
tokens = []
for example in [raw_example_1, raw_example_2]:
    tokens.append(eot)  # mark the start of each document
    tokens.extend(enc.encode_ordinary(example))  # encode the text, ignoring special tokens
all_tokens = torch.tensor(tokens, dtype=torch.long)
all_tokens

tensor([100257,  32998,  47976,    374,    279,   9046,    315,  38696,  18815,
         13790,  39006,   1778,    439,    271,    262,    264,    220,     16,
           865,    220,     16,    489,   2928,    233,    107,    489,    264,
           308,    865,    308,    284,    293,   1174,  29252,   5610,   3612,
           264,  15511,     16,     92,     87,  15511,     16,     92,  42815,
          4484,   2469,    489,     64,  15511,     77,     92,     87,  15511,
            77,  52285,     65,     11,    633,  23603,  14370,   1778,    439,
           271,    262,    320,    865,    220,     16,   1174,   4696,   1174,
           865,    308,    883,   9212,     99,    264,    220,     16,    865,
           220,     16,    489,   2928,    233,    107,    489,    264,    308,
           865,    308,   1174,  29252,   5610,   3612,    320,     87,  15511,
            16,   2186,     59,    509,   2469,   1174,     87,  15511,     77,
          5525,     59,   2235,  34152, 

## Modeling
A model's forward pass takes the token IDs, runs them through several layers of linear algebra, and then "predicts" the next token. When the model's parameters are still random noise (as you will see in this example), the predictions are gibberish. Training gradually turns that noise into patterns by adjusting the parameters during the "backward pass", as you'll see. We'll show 3 steps that are focused on training:
1. Data Loading - this step pulls enough tokens from the raw data to complete one forward and backward pass. If the model is inference-only, this step is replaced by taking the inference input and preparing it in the same way for the forward pass alone.
2. Forward Pass - using the data and the model architecture to predict the next token.
3. Backward Pass - using derivatives (gradients) to measure how much each parameter influenced the prediction, comparing the prediction against the correct next token from the data loading step, and then making very small adjustments to the most influential parameters in the hope of improving future predictions.

```
x, y = train_loader.next_batch()
logits, loss = model(x, y)
loss.backward()
```
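To make the forward, backward, and update steps concrete without a full model, here is a minimal pure-Python sketch. It is not GPT-2 and not the code behind `model(x, y)`. The 4-token vocabulary, the `params` list, `target`, and `learning_rate` are all assumptions for illustration, and the gradient is written out by hand where PyTorch would compute it in `loss.backward()`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical setup: a 4-token vocabulary with one "parameter" per token.
params = [0.0, 0.0, 0.0, 0.0]   # untrained: every token is equally likely
target = 2                      # the token that actually came next in the data
learning_rate = 0.5

losses = []
for step in range(5):
    # Forward pass: parameters -> probabilities -> loss on the true next token.
    probs = softmax(params)
    loss = -math.log(probs[target])
    losses.append(loss)
    # Backward pass: for softmax + cross-entropy the gradient with respect
    # to parameter i is prob_i minus 1 if i is the target, else prob_i.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Update: nudge each parameter a small step against its gradient.
    params = [w - learning_rate * g for w, g in zip(params, grads)]

print([round(value, 3) for value in losses])
```

The recorded losses shrink from step to step, which is the "noise to pattern" change that training is after.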

### Data Loading
To start, we need enough data to run the forward and backward passes. In practice the full dataset is likely too big to hold in memory at once, so we read only as many tokens as a single pass needs, leaving memory and compute for the passes themselves rather than for holding static data.
The batch size and the model's context length determine how much data we need. Consequently, these two values also form 2 of the 3 dimensions of the initial tensor.
- **Batch Size (B)** - This is the number of examples you'll learn on at a time. 
- **Context Length (T)** - This is the maximum number of tokens that a model can use in a single pass to generate the next token. If an example is shorter than this length, it can be padded.
  
*Ideally both B and T are powers of 2, since those sizes tend to work nicely with chip architecture. This is a common theme across the board.*

In [4]:
B = 2 # Batch
T = 16 # context length

Next, we need to pull enough tokens for the forward pass from our long token tensor, `all_tokens`. To fill `B` batch rows of `T` tokens each, we need `B*T` tokens. Since training predicts the next token given the previous tokens in context, we also need 1 more token at the end so that the last position in the last row has a next token to validate against.

In [6]:
current_position = 0
tok_for_training = all_tokens[current_position:current_position + B*T + 1]
tok_for_training

tensor([100257,  32998,  47976,    374,    279,   9046,    315,  38696,  18815,
         13790,  39006,   1778,    439,    271,    262,    264,    220,     16,
           865,    220,     16,    489,   2928,    233,    107,    489,    264,
           308,    865,    308,    284,    293,   1174])

Now that we have our initial tokens to train on, we need to convert them into a matrix that's ready for training. In this step we'll create our batches and set up two different arrays: (1) the input tokens, `x`, and (2) the output tokens, `y`, that the model should predict from them. To create each example in the batch, every `T` tokens will be placed into its own row.

Recall that training takes in a sequence of tokens the length of the context and then predicts the next token, and that when we extracted `tok_for_training` we added 1 extra token so that we can evaluate the prediction for the last example. Because of this, the input, `x`, will be all of the tokens except the last one: `[:-1]`.

It might be natural to think the output `y` would then just be the last token. But that would waste valuable training signal. Yes, there is the example that fills the full context `T`, but `tok_for_training` also contains enough tokens that any shorter context of length `n`, where `n<T`, can be used for training too, since token `n+1` is available. Consider the following example:

sentence: `Hi I am learning`. This sentence contains the following "next tokens" that can be learned:
1. x: Hi I am  | y: learning
2. x: Hi I     | y: am
3. x: Hi       | y: I

Because we have this triangle of examples to create, our `y` can be much larger than a single token. It starts at the second token and runs all the way to the extra element we added for the last example: `[1:]`.
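The triangle of training examples can be enumerated directly from the two shifted slices. This is a small sketch that uses words as stand-ins for token IDs. The names `tokens_example`, `x_example`, `y_example`, and `pairs` are introduced here purely for illustration.

```python
# Word-level stand-ins for token IDs, purely for readability.
tokens_example = ["Hi", "I", "am", "learning"]

x_example = tokens_example[:-1]   # inputs:  everything except the last token
y_example = tokens_example[1:]    # targets: everything except the first token

# Position i in y is the "next token" for the context x[: i + 1].
pairs = [(x_example[: i + 1], y_example[i]) for i in range(len(x_example))]
for context, target in pairs:
    print(f"x: {' '.join(context):<10} | y: {target}")
```

A single pair of slices therefore carries one training example per position, not just one per sequence.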


We will now put this together and do the following:
1. Extract the input `x` and reshape it into `B` rows, so that each row is one example of `T` tokens
2. Extract the output `y` and reshape it into `B` rows of `T` tokens in the same way

*Note: `view` accepts `-1` for one dimension, which lets PyTorch infer that dimension so we would not need to pass in `T`. Given how many matrices we'll work with, though, we pass the dimensions explicitly so that we either control the shapes or get an error when they do not match our expectations.*
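The trade-off in the note can be seen with a small experiment, assuming PyTorch is installed. The tensors `good` and `short` and the flag `mismatch_caught` are made up for this sketch. `short` deliberately has too few elements, to show that `-1` hides the problem while an explicit `T` surfaces it.

```python
import torch

B, T = 2, 16
good = torch.arange(B * T)        # exactly B*T = 32 elements
short = torch.arange(B * T - 2)   # 30 elements: two too few

print(good.view(B, T).shape)      # torch.Size([2, 16])
print(short.view(B, -1).shape)    # torch.Size([2, 15]): inferred silently

try:
    short.view(B, T)              # an explicit T cannot fit 30 elements
    mismatch_caught = False
except RuntimeError as err:
    mismatch_caught = True
    print("view failed:", err)
```

With `-1` the wrong context length of 15 would flow on into later steps unnoticed. With the explicit shape the error shows up at the line that caused it.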

In [14]:
tok_for_training

tensor([100257,  32998,  47976,    374,    279,   9046,    315,  38696,  18815,
         13790,  39006,   1778,    439,    271,    262,    264,    220,     16,
           865,    220,     16,    489,   2928,    233,    107,    489,    264,
           308,    865,    308,    284,    293,   1174])

In [12]:
x = tok_for_training[:-1].view(B, T)  # inputs: all but the last token, as B rows of T
x

tensor([[100257,  32998,  47976,    374,    279,   9046,    315,  38696,  18815,
          13790,  39006,   1778,    439,    271,    262,    264],
        [   220,     16,    865,    220,     16,    489,   2928,    233,    107,
            489,    264,    308,    865,    308,    284,    293]])

In [13]:
y = tok_for_training[1:].view(B, T)  # targets: the same tokens shifted left by one
y

tensor([[32998, 47976,   374,   279,  9046,   315, 38696, 18815, 13790, 39006,
          1778,   439,   271,   262,   264,   220],
        [   16,   865,   220,    16,   489,  2928,   233,   107,   489,   264,
           308,   865,   308,   284,   293,  1174]])
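The data-loading steps can be wrapped into a small loader like the `train_loader.next_batch()` call shown earlier. The class below is a hypothetical sketch, not the loader used in this walkthrough. `TokenLoader` and its wrap-around policy are assumptions, and plain Python lists stand in for tensors so the example runs without PyTorch.

```python
class TokenLoader:
    """Hypothetical loader: hands out (x, y) batches from one long token list."""

    def __init__(self, tokens, batch_size, context_length):
        self.tokens = list(tokens)
        self.B = batch_size
        self.T = context_length
        self.position = 0

    def next_batch(self):
        B, T = self.B, self.T
        # B*T inputs plus one extra token so the last input has a target.
        buf = self.tokens[self.position:self.position + B * T + 1]
        if len(buf) < B * T + 1:
            # Assumed policy: wrap to the start when too few tokens remain.
            self.position = 0
            buf = self.tokens[:B * T + 1]
        if len(buf) < B * T + 1:
            raise ValueError("dataset is smaller than one batch")
        inputs, targets = buf[:-1], buf[1:]
        # Cut the flat lists into B rows of T tokens each.
        x = [inputs[r * T:(r + 1) * T] for r in range(B)]
        y = [targets[r * T:(r + 1) * T] for r in range(B)]
        # Advance by B*T, not B*T + 1: the extra token is the next batch's first input.
        self.position += B * T
        return x, y


loader = TokenLoader(range(100), batch_size=2, context_length=16)
x1, y1 = loader.next_batch()
x2, y2 = loader.next_batch()
print(x1[0][:4], y1[0][:4])   # [0, 1, 2, 3] [1, 2, 3, 4]
print(x2[0][0])               # 32
```

Using `range(100)` as fake token IDs makes the shift easy to see: every entry of `y` is the matching entry of `x` plus one, and the second batch starts where the first batch's final target left off.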

### Forward pass
The forward pass takes a sequence of tokens and produces a prediction for the next token at each position. The same computation runs at inference time and as the first half of every training step.