# GPT Development

## Import Statements

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Reading Data

In [2]:
# Reading and Viewing the Dataset:

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
print(f"The Length of the text: {len(text)}")
print(text[:1000])

## Tokenization

- Google's SentencePiece is a sub-word tokenizer: it encodes text into integers using a vocabulary of sub-word units rather than single characters: https://github.com/google/sentencepiece
- OpenAI uses tiktoken: https://github.com/openai/tiktoken

```
# OpenAI - tiktoken package example:
import tiktoken

enc = tiktoken.get_encoding('gpt2')
enc.n_vocab
# Output ---> 50257

enc.encode("hii there")
# Output ---> [71, 4178, 612]
enc.decode([71, 4178, 612])
# Output ---> "hii there"
```
There's a trade-off between vocabulary size and encoded sequence length. In OpenAI's example, the vocabulary has 50,257 entries. Our simple character-to-integer model has only 65 (you can see the vocabulary below). As such, OpenAI's tokenizer has to store a much larger vocabulary up front.

However, there are significant benefits to this in the long run: notice that encoding the string "hii there" produces a vector of only 3 items, whereas our character-level encoding of the same string would be 9 items long (one per character, including the space). That's a significant difference in length that will have an impact on model performance, since shorter sequences let the same context window cover more text.
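
To make the length comparison concrete, here is a minimal sketch of a character-level tokenizer applied to the same string. The names `toy_text`, `toy_stoi`, `toy_encode` and `toy_decode` are made up for this illustration, and the tiny corpus stands in for `input.txt`.

```python
# Hypothetical mini-corpus; the notebook builds its vocabulary from input.txt instead.
toy_text = "hii there, hello world"

toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hii there")
print(len(toy_chars))  # vocabulary size of this toy corpus
print(len(ids))        # one token per character, spaces included
```

The small vocabulary is cheap to store, but every character costs one position in the sequence.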

In [3]:
# Unique Characters and Vocabulary Size:

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"The Length of the vocabulary: {vocab_size}")
print(''.join(chars))

In [4]:
# Tokenization - Create a Mapping from Characters to Integers.

# (1) Generate Dictionaries for Mapping
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# (2) Design Mapping Functions
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# (3) Test Functions: 
print(encode("The quick brown fox jumps over the lazy dog"))
print(decode(encode("The quick brown fox jumps over the lazy dog")))

# (4) Encode the Entire Dataset:
data = torch.tensor(encode(text), dtype=torch.long)
print(f"The Shape of the Tensor is: {data.shape}\nThe Type of the Shape is: {data.dtype}")
print("Data:", data[0:100])

# (5) Train-Validation Split (90% / 10%):
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

## Data Loader

The Data Loader is designed so that we can select randomized chunks of the source dataset (in batches) and train the model on that specific batch. Once training on the batch is completed, we select another randomized batch to train on.

An LSTM is often set up to take in a series of values and predict a single following value, e.g. `x_train = [5, 4, 3, 2]; y_pred = 1` takes a sequence of 4 characters to predict the next character. Transformers instead make a prediction at each step, using the information at that step and what came previously. For example, given known data `data = [5, 4, 3, 2, 1]`, we would like to make the following predictions:

```
x_train = [5]; y_pred = 4
x_train = [5, 4]; y_pred = 3
x_train = [5, 4, 3]; y_pred = 2
x_train = [5, 4, 3, 2]; y_pred = 1
```
A chunk of `block_size + 1` tokens therefore yields `block_size` training examples, one for each context length from 1 up to `block_size`.


In [5]:
# Transformer: Demonstration of Train vs. Target
block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"Input: {context}, Target: {target}")

In [6]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

# (0) Define Batch Loader:
def get_batch(split, debug=False):
    data = train_data if split == 'train' else val_data
    if debug:
        print(f"The split used is {split}: \n{data[:100]}")
    
    ix = torch.randint(len(data) - block_size, (batch_size,))
    if debug:
        print(f"\nThe random integers generated: {ix}")
    
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+1+block_size] for i in ix])
    if debug:
        print(f"\nSampling from the dataset sequentially from starting point(s) ix: \n{x}")
    
    return x, y

# (1) Return Batch Data:
xb, yb =  get_batch('train', debug=True)

# (2) Unpack Batch Data: 
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"Input: {context.tolist()}, Target: {target}")

## Bigram Language Model
The simplest language model that we can implement; it was covered in the makemore series here: https://www.youtube.com/watch?v=PaCmpygFfXo. Remember that a Bigram Model essentially creates a table that determines the probability of the *next character*, given the *previous character*. Each row is converted into a probability distribution using the softmax function, which ensures that each row sums to 1. In PyTorch, we use the predefined *nn.Embedding* layer to hold this table.


**Input Data:**
```
batch_size = 4 # Denoted by B.
n_characters = 8 # Denoted by T.

print(xb.shape) # Output --> (4,8)
print(yb.shape) # Output --> (4,8)

# Hence xb & yb are of shape (B,T)
```

**Embedding Table:**
```
vocab_size = 65 # Denoted as C
self.token_embedding_table # Embedding (65, 65) or (C, C).
```

**Multiplication:**
Intuitively, we have a tensor of shape $(B, T)$ which, for each of the $B$ samples, contains $T$ indices we'd like to select from the embedding table. Once selected, each of the $T$ positions is represented by logits of size $C$.

To express this lookup as a multiplication, we need to one-hot encode the index values (one-hot vectors of size vocab size). This gives a $(B, T, C)$ tensor which, when multiplied by the $(C, C)$ table, produces a $(B, T, C)$ tensor whose values are the embedding-table rows selected by each one-hot vector.
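
As a hedged check of that equivalence, the sketch below compares a direct `nn.Embedding` lookup with the one-hot multiplication against the layer's own weight matrix. The variable names (`table`, `looked_up`, `multiplied`) are chosen for this illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C = 65                       # vocabulary size used in this notebook
table = nn.Embedding(C, C)   # same layer type as token_embedding_table

idx = torch.tensor([[3, 2, 13, 25, 45, 21, 54, 7]])   # (B, T) = (1, 8)

looked_up = table(idx)                                 # (B, T, C) via row lookup
one_hot = F.one_hot(idx, num_classes=C).float()        # (B, T, C)
multiplied = one_hot @ table.weight                    # (B, T, C) @ (C, C) -> (B, T, C)

print(looked_up.shape)
print(torch.allclose(looked_up, multiplied))
```

In practice the lookup is what PyTorch performs; the one-hot product is just a way to reason about the shapes.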

In [7]:
# Demonstration of Matrix Multiplication:
temp1 = torch.tensor([[3, 2, 13, 25, 45, 21, 54, 7]])
t2 = torch.randint(0, 10, (65,65))
print(temp1.shape, t2.shape)

# One-Hot temp1 by vocab size:
t1 = F.one_hot(temp1, vocab_size) # Now (B, T, C)
print(f"One hot Vector: \n{t1[0, 0]}") # For the value 3, we can see the resulting one hot vector.

# Multiplication:
tlogits = t1 @ t2

print(f"Output for Resulting idx = 3: \n{tlogits[0,0]}")
print(f"The learned logits for idx = 3: \n{t2[3]}")

In [8]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        """
        :param idx: matrix of integers corresponding to characters. 
        :param targets: matrix of integers corresponding to target characters.
        :return logits: unnormalized scores for the next character, given each input character.
        :return loss: associated loss based on logits vs. known target.
        """
        
        logits = self.token_embedding_table(idx)
        
        if targets is None:
            loss = None
        else:        
            B, T, C = logits.shape
            
            # (The Transformer layers will be added here later.)
            
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            
            # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,), so B and T are flattened.
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """
        :param idx: matrix of integers corresponding to characters.
        :param max_new_tokens: number of tokens to generate.
        :return: generated characters from bi-gram model.
        """
        for _ in range(max_new_tokens):
            logits, loss = self.forward(idx)
            logits = logits[:, -1, :] # Final time-step (letter), (B,C).
            probs = F.softmax(logits, dim=-1) # logits to probability dist.
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=-1) # (B, T+1)
        return idx

In [9]:
m = BigramLanguageModel(vocab_size=vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss) # For reference, a uniform prediction over 65 characters gives -ln(1/65) ≈ 4.17
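
The reference value for the initial loss can be reproduced with a small sketch: if every logit is equal, the predicted distribution is uniform and the cross-entropy is ln(65). The batch of 32 random targets is an arbitrary choice for this illustration.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 65                                   # character vocabulary size from this notebook
logits = torch.zeros(32, vocab_size)              # equal logits -> uniform distribution
targets = torch.randint(0, vocab_size, (32,))     # arbitrary targets; the loss does not depend on them

uniform_loss = F.cross_entropy(logits, targets).item()
print(uniform_loss, math.log(vocab_size))         # both are about 4.17
```

A freshly initialised model usually starts somewhat above this value, because its random logits are not exactly uniform.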

In [10]:
# Generate new Samples:
idx = torch.zeros((1,1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))

In [11]:
# Training the bi-gram model:
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

batch_size = 32
for steps in range(10000):
    xb, yb = get_batch('train')
    
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())
    

In [12]:
# Generate new Samples (Trained):
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))

## Mathematics of Self-Attention

### Toy Example:

In the following example, we have a batch of four samples. Each sample contains 8 tokens (in this case, tokens for us are individual characters), and each token is embedded into 2 channels (each token, which starts as a single integer, is represented as a vector of size 2).

In [13]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

So far we've just been predicting the next token based on the previous token. Currently, we don't have tokens communicating with each other. Our communication needs to be set up in such a way that a token never receives information from tokens that come after it. In other words, the present can communicate with the past, but not the future. E.g. with 8 tokens, the token at position 5 should be able to communicate with those in positions 4, 3, 2, 1 & 0, but not position 6!

What we will set up:
For every token, calculate the average of all the channel vectors in the current token and all previous tokens.
Average for Token 5 = ([C5_1, C5_2] + [C4_1, C4_2] + [C3_1, C3_2] + ... + [C0_1, C0_2]) / 6


In [14]:
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b,t] = torch.mean(xprev, 0)

In [15]:
print(f"Viewing only the first batch of data: \n{x[0]}")

In [16]:
for t_select in range(T):
    print(f"For batch zero. {xbow[0,t_select]} is the average of all token-channel representations less than or equal to {t_select}.")

Mathematically, the above can be verified:
- xbow[0,0] = [(0.1808)/1, (-0.0700)/1] = [0.1808, -0.0700]
- xbow[0,1] = [(0.1808 + -0.3596)/2, (-0.0700 + -0.9152)/2] = [-0.0894, -0.4926]

Essentially, for each channel $c$: $\text{xbow}[b, t, c] = \frac{1}{t+1} \sum_{i=0}^{t} x[b, i, c]$

The only improvements from here aim to turn the above operation entirely into a matrix multiplication, instead of using for loops. Otherwise, in principle we are doing the exact same thing.

In [17]:
F.sigmoid(xbow[0])

### Understanding the Matrix Multiplication Trick:

In [18]:
torch.manual_seed(42)
a = torch.tril(torch.ones((3,3), requires_grad=False)) # <-- Not in notes about requires grad, but this would never be changed.
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print(f"The output of a is \n{a}")
print(f"The output b is \n{b}")
print(f"The output c is \n{c}")

Imagine that the matrix $b$ is a matrix with three tokens $T$ and two channels $C$. By making a triangular matrix $a$ of only ones, each output of matrix $c$ is the current token channels + previous token channels.

- Row 0 X Col 0 = (1 * 2) + (0 * 6) + (0 * 6) = 2
- Row 0 X Col 1 = (1 * 7) + (0 * 4) + (0 * 5) = 7

Output of the first row of $c$ is [2, 7]

- Row 1 X Col 0 = (1 * 2) + (1 * 6) + (0 * 6) = 8
- Row 1 X Col 1 = (1 * 7) + (1 * 4) + (0 * 5) = 11

Output of the second row of $c$ is [8, 11]

- Row 2 X Col 0 = (1 * 2) + (1 * 6) + (1 * 6) = 14
- Row 2 X Col 1 = (1 * 7) + (1 * 4) + (1 * 5) = 16

Output of the third row of $c$ is [14, 16]

The shapes of the matrix multiply work as such: $A @ B$ = $(T, T) @ (T, C) = (T, C)$

#### Adding Mean:

In [19]:
torch.manual_seed(42)
a = torch.tril(torch.ones((3,3), requires_grad=False)) # <-- Not in notes about requires grad, but this would never be changed.
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print(f"The output of a is \n{a}")
print(f"The output b is \n{b}")
print(f"The output c is \n{c}")

The only difference from the above is that $a$ now also handles the averaging: each row of $a$ is divided by its sum, so every token up to and including the current one contributes equally:

- Row 0 X Col 0 = (1 * 2) + (0 * 6) + (0 * 6) = 2
- Row 0 X Col 1 = (1 * 7) + (0 * 4) + (0 * 5) = 7

Output of the first row of $c$ is [2, 7]

- Row 1 X Col 0 = (0.5 * 2) + (0.5 * 6) + (0 * 6) = 4
- Row 1 X Col 1 = (0.5 * 7) + (0.5 * 4) + (0 * 5) = 5.5

Output of the second row of $c$ is [4, 5.5]

- Row 2 X Col 0 = (0.33 * 2) + (0.33 * 6) + (0.33 * 6) = 4.6667
- Row 2 X Col 1 = (0.33 * 7) + (0.33 * 4) + (0.33 * 5) = 5.3333

Output of the third row of $c$ is [4.6667, 5.3333]



In [20]:
torch.manual_seed(1337)
wei = torch.tril(torch.ones((T,T), requires_grad=False)) # <--- Will be True! Because, the model will shift around these value, to give "importance" to some relationships.
wei = wei / wei.sum(dim=1, keepdim=True)
print(f"Our Mean Calculation Matrix looks like: \n{wei}")

x = torch.randn(B, T, C)
xbow2 = wei @ x # before we did (T, T) @ (T, C) --> (T, C)
# wei (T, T) is broadcast across the batch dimension: (B, T, T) @ (B, T, C) ---> (B, T, C)

xbow[0], xbow2[0] # Same as our first example.
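
Rather than comparing the two printouts by eye, a numerical check is possible. The sketch below rebuilds both versions from scratch (the names `xbow_loop` and `xbow_matmul` are made up here) and compares them with a small tolerance, since floating-point sums can differ in the last digits.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops, averaging tokens 0..t for every position t.
xbow_loop = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: one matrix multiply with a row-normalised lower-triangular matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow_matmul = wei @ x          # (T, T) broadcasts against (B, T, C)

print(torch.allclose(xbow_loop, xbow_matmul, atol=1e-5))
```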

#### Softmax

In [21]:
wei = torch.tril(torch.ones((T,T), requires_grad=True))
print(f"Wei: \n{wei}")
wei = F.softmax(wei, dim=1)
print(f"Broken Softmax: \n{wei}")

# Fix:
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T), requires_grad=True) # <--- These are like interaction strengths, initially are zero, but as the model learns will associate some characters together.
wei = wei.masked_fill(tril == 0, float('-inf')) # A token can only communicate with the past, never the future.
print(f"Wei: \n{wei}")
wei = F.softmax(wei, dim=1)
print(f"Working Softmax: \n{wei}")

Finally, we've replaced the A / A.sum normalization with an activation function, softmax!

Softmax: 

For a given vector: $z = (z_{1}, ..., z_{K})$

The softmax output at position $i$: $\sigma(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$

Important to note that: $e^{0} = 1$ and $e^{-\infty} = 0$

Hence we can't just apply softmax to the plain tril matrix: its zeros would become $e^{0} = 1$ and leak weight to future tokens (the "Broken Softmax" output above). Instead we create the tril, and everywhere the tril is equal to zero we set the wei matrix to -inf. The other locations are left the same, hence wei = 0 there, and softmax turns each row into a uniform average over the allowed positions.

This is the fundamental math we will use to develop the **self-attention block**.
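
The masking argument above can be checked directly. This is a small sketch with `T = 4` (chosen only to keep the printout short) showing that softmax over a zero matrix masked with -inf gives the same weights as dividing the tril matrix by its row sums.

```python
import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))

# Averaging weights built by dividing each row by its sum.
wei_avg = tril / tril.sum(dim=1, keepdim=True)

# The same weights via masking: zeros where allowed, -inf where blocked.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei_softmax = F.softmax(scores, dim=-1)

print(wei_softmax)
print(torch.allclose(wei_avg, wei_softmax))
```

The difference is that the softmax version still works once the zeros are replaced by learned, data-dependent scores.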

In [22]:
# This would then be multiplied by an appropriate (T, C) Matrix.
xbow3 = wei @ x
print(xbow3.shape)
print(xbow3[0])

In the above, [0.1808, -0.0700] is the weighting of the 1st token by itself. [-0.0894, -0.4926] is the weighting of the 2nd token with itself and the first. Finally, [-0.0341, 0.1332] is the weighting of the 8th token with all previous tokens. For now this is still a simple average, as before, but the weights in wei can be manipulated to make certain tokens more or less important.

This is the preview of self-attention. You can do weighted aggregation of past values, and how much of each element fuses into each position =)

## Crux of Self Attention
This section is completed at time (1:02:08) after the v2-1 Positional Encoding.py file.



In [23]:
# We have a Batch Size of B, with a block size of T (n_token) and each token is represented in a vector of size C.
torch.manual_seed(1337)
B,T,C = 4, 8, 32
x = torch.randn(B,T,C)

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T)) # Initialized as Equal. ==> To be learned through data. How?
wei = wei.masked_fill(tril == 0, float('-inf'))
print(f"Weights before softmax: \n{wei}")
wei = F.softmax(wei, dim=1)
print(f"Weights after softmax: \n{wei}")

out = wei @ x
print(f"Final Output {out.shape}: \n{out[0][0]}") # Visualize a Single Attention for Batch Zero, Position 0.

Every single token emits two vectors: a Query and a Key. The Query vector roughly translates to "What am I looking for?", while the Key vector translates to "What do I contain?". To gather relationships between tokens in a sequence, say that we are at token t. We take its Query vector and compute its dot product with each of the Key vectors, getting one value per key that is higher when the two match. Those dot products are what become the matrix *wei*.

In [24]:
# We have a Batch Size of B, with a block size of T (n_token) and each token is represented in a vector of size C.
torch.manual_seed(1337)
B,T,C = 4, 8, 32
x = torch.randn(B,T,C)

# Implement one "Head":
head_size = 16
key = nn.Linear(C, head_size, bias=False) # (32, 16)
query = nn.Linear(C, head_size, bias=False) # (32,16)

# Each Token (which is embedded as a vector of 32, is matmul with the key and query layers to make k, q).
k = key(x) # (4, 8, 16)
q = query(x) # (4, 8 ,16)

wei = q @ k.transpose(-2, -1) # (4, 8, 16) @ (4, 16, 8) ---> (B, 8, 8) or (B, T, T). (ignore, the B and see how they would multiply). (8, 16) @ (16, 8) = (8, 8)

print(f"Shape of k: \n{k.shape}")
print(f"Shape of q: \n{q.shape}")

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print(f"Weights before softmax (for Batch 0): \n{wei[0]}")
wei = F.softmax(wei, dim=-1) # Normalize over the key dimension so each query row sums to 1 (dim=1 would normalize over the queries in a (B, T, T) tensor).
print(f"Weights after softmax (for Batch 0): \n{wei[0]}")
# Wei is now controlled by the key and query values, the larger the value the more it contributes to the final output at out.

out = wei @ x
print(f"Final Output {out.shape}: \n{out[0][0]}") # Visualize a Single Attention for Batch Zero, Position 0.

We have created a single "head" that implements our ideas. Each token vector of size 32 is used to make a key and a query as described above. We do this in parallel by multiplying our data x (B, T, C) by a weight matrix of shape (C, head_size), here (C, C/2). B and T are treated like batch dimensions, so we get back (B, T, head_size) for each of k and q.

Then we calculate wei, such that each row (query) is multiplied by every key. We are left with a weight matrix which can be read as follows.
- wei[0,0] = How much does the query of Token 0 match the key of Token 0?
- wei[3,0] = How much does the query of Token 3 match the key of Token 0?

wei[1, 3] = How much does the query of Token 1 match the key of Token 3? However, remember this is reading into the future, so once we apply our mask this entry is set to -inf and becomes 0 after the softmax.
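
A minimal sketch of that reading, assuming the same shapes as the cell above. Note that in code `wei` carries a leading batch dimension, so the entry for query 1 and key 3 in the first sample is `wei[0, 1, 3]`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

wei = query(x) @ key(x).transpose(-2, -1)            # (B, T, T) raw affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))      # block every key that lies in the future
wei = F.softmax(wei, dim=-1)                         # normalise each query row

print(wei[0, 1, 3].item())       # query 1, key 3: a future token, so exactly 0
print(wei[0].sum(dim=-1))        # every query row sums to 1
```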

In [25]:
# We have a Batch Size of B, with a block size of T (n_token) and each token is represented in a vector of size C.
torch.manual_seed(1337)
B,T,C = 4, 8, 32
x = torch.randn(B,T,C)

# Implement one "Head":
head_size = 16
key = nn.Linear(C, head_size, bias=False) # (32, 16)
query = nn.Linear(C, head_size, bias=False) # (32, 16)
value = nn.Linear(C, head_size, bias=False) # (32, 16)

# Each Token (which is embedded as a vector of 32, is matmul with the key and query layers to make k, q).
k = key(x) # (4, 8, 16)
q = query(x) # (4, 8 ,16)
v = value(x) # The vector that gets multiplied by wei to make the output

wei = q @ k.transpose(-2, -1) # (4, 8, 16) @ (4, 16, 8) ---> (B, 8, 8) or (B, T, T). (ignore, the B and see how they would multiply). (8, 16) @ (16, 8) = (8, 8)

print(f"Shape of k: \n{k.shape}")
print(f"Shape of q: \n{q.shape}")

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print(f"Weights before softmax (for Batch 0): \n{wei[0]}")
wei = F.softmax(wei, dim=-1) # Normalize over the key dimension so each query row sums to 1.
print(f"Weights after softmax (for Batch 0): \n{wei[0]}")
# Wei is now controlled by the key and query values, the larger the value the more it contributes to the final output at out.

# out = wei @ x
out = wei @ v
print(f"Final Output {out.shape}: \n{out[0][0]}") # Each token's output now has head_size = 16 values.

We've now added a final vector: the value! When a query matches a key, what gets sent is not x itself but a value vector derived from x. The motivation for this isn't massively clear to me; why not just use raw x? That would require more research. But I can see that the value (v) is kept separate from the key and query: for all intents and purposes k & q do not know v exists, but they would know x exists. So this is perhaps why.

It should also be clear that the samples within a batch don't communicate with one another, and we wouldn't want them to; they are just processed in parallel. Each sample can only use data within its own block (context window).

In text generation, we block future tokens from communicating with the past. However, if you're not generating, you might let all tokens communicate (by removing the masked fill), for example to determine the sentiment of a sentence.
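
The two variants differ only in the mask. This is a hedged sketch, reusing the shapes from the head above, that computes the attention weights once with the causal mask and once without it; `wei_causal` and `wei_full` are names introduced for this comparison.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

scores = query(x) @ key(x).transpose(-2, -1)          # (B, T, T)

# Generation-style: mask the future before the softmax.
tril = torch.tril(torch.ones(T, T))
wei_causal = F.softmax(scores.masked_fill(tril == 0, float("-inf")), dim=-1)

# Sentiment-style: no mask, every token attends to every other token.
wei_full = F.softmax(scores, dim=-1)

out_full = wei_full @ value(x)                        # (B, T, head_size)
print(wei_causal[0, 0])   # only the first entry is non-zero
print(wei_full[0, 0])     # all entries are positive
```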

"Self-attention": why "self"? It's "self" because the keys, queries and values all come from the same source (x). In cross-attention, the queries could be produced from x while the keys and values are produced from a different source y. I think of it like this: perhaps your query is in Spanish and you'd like to respond in English.
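
A minimal sketch of the cross-attention idea, under the assumption that the second source `y` has a different length from `x`. The lengths `T_x` and `T_y` are hypothetical, and no causal mask is applied here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, C, head_size = 4, 32, 16
T_x, T_y = 8, 5                      # hypothetical lengths of the two sequences
x = torch.randn(B, T_x, C)           # source of the queries
y = torch.randn(B, T_y, C)           # source of the keys and values

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)                         # (B, T_x, head_size)
k = key(y)                           # (B, T_y, head_size)
v = value(y)                         # (B, T_y, head_size)

wei = F.softmax(q @ k.transpose(-2, -1), dim=-1)   # (B, T_x, T_y)
out = wei @ v                                      # (B, T_x, head_size)
print(wei.shape, out.shape)
```

Each of the `T_x` query positions ends up with a weighted mix of the `T_y` value vectors, so the output keeps the length of `x`.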

"Scaled attention" multiplies wei by 1/sqrt(head_size). This ensures that when the inputs Q and K have unit variance, wei has unit variance too, so the softmax stays diffuse and does not saturate too much. This is similar to the Kaiming initialization we saw before, where again we were aiming for a unit Gaussian (a normal distribution with mean 0 and standard deviation 1, so roughly two thirds of the data lies between -1 and +1).

In [26]:
# scaled attention
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1)
wei_norm = q @ k.transpose(-2, -1) * head_size **-0.5

In [27]:
print(f"k variance: \n{k.var()}")
print(f"q variance: \n{q.var()}")
print(f"wei no norm variance: \n{wei.var()}")
print(f"wei norm: \n{wei_norm.var()}")

If the inputs to softmax vary too widely in magnitude, the output saturates and essentially becomes a one-hot vector.

In [28]:
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)) # Normalized
print(torch.softmax(torch.tensor([100, -0.2, 0.3, -0.2, -10]), dim=-1)) # Not Normalized (Essentially, [1, 0, 0, 0, 0] one hot).

In [29]:
# Return at 1:19:20 for the programming of the above section from "Crux of Self Attention" into a class called Head!