# Building our own GPT

Today we walked through the Transformer architecture. To see it in code, we will build our own small GPT model.

ChatGPT was trained on a huge slice of the publicly available internet up to 2021. That is an enormous amount of data, and we don't have the capacity to train on anything close to it, so we will use a much smaller dataset. For today that is the TinyShakespeare dataset, which contains Shakespeare's plays in a dialogue-like format.

At the end of the day, we want to be able to generate text like Shakespeare from thin air.


![alt text](image/scientist.jpeg "Title")

## Data Exploration

We already have the data, so let's explore it to get a basic understanding. Things to look for:
1. What does it look like?
2. What sort of structure does it have?
3. Anything else you can think of?

In [29]:
# read it in to inspect it
with open('dataset/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [30]:
# %load answers/load_data_exploration.py
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [31]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [32]:
# %load answers/load_data_exploration_2.py
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


We see that every letter of the English alphabet is present in our dataset, in both upper and lower case. Besides the letters we also have punctuation marks, the digit 3, the space and the newline. In total we have 65 characters in our vocabulary.

From this point onwards, I am going to refer to them as tokens.

So the goal for today is for our GPT to generate its output one token at a time.
Since we are dealing with NLP and sequence modelling, each output token should depend on the tokens that came before it.
But before we get into that, we have to process our data. Do you remember what the first step in the Transformer architecture was?

## Data Preprocessing

Machine learning models can't work with raw text. We have to convert it into a format that the model can understand, which here means converting our text into numbers. Today we will use the simplest encoding possible: we assign an integer to each token.

In [33]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


## DEMO
Go back to slides and take a look at the other tokenization methods.

In [34]:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.n_vocab

50257

Tiktoken is a tokenizer library developed by OpenAI, and ChatGPT most probably uses it as well. What ChatGPT generates is not character by character but token by token.

![alt text](image/9000.jpeg "Title")

There are only 26 letters in the English alphabet, so how is this number so high?

In [35]:
enc.encode("hii there")

[71, 4178, 612]

In [36]:
enc.decode([71, 4178, 612])

'hii there'

**Things to note**: There is a tradeoff here. You can have a very long sequence of integers with a very small vocabulary, or a very short sequence of integers with a very large vocabulary.
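To make the sequence-length side of this tradeoff concrete, here is a minimal sketch that tokenizes one sentence in two ways. The word-level tokenizer is a toy whitespace split invented for this illustration (it is not how tiktoken works), and all variable names are made up for the example.

```python
sample = "to be or not to be that is the question"

# Character-level: every character is a token, so the sequence is long.
char_tokens = list(sample)

# Toy word-level tokenizer (assumption: split on whitespace): the sequence is short.
word_tokens = sample.split()

print("char-level sequence length:", len(char_tokens))
print("word-level sequence length:", len(word_tokens))
```

The price of the shorter sequence shows up in the vocabulary. A character vocabulary stays tiny (65 entries for our dataset), while a sub-word vocabulary built over a real corpus is far larger (50257 entries for the GPT-2 encoding above). A single sentence is too small to show that side of the tradeoff.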

In [37]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

0 is the newline and 1 is the space.

## Data Loading
Before we get into training, we should split our data into a training set and a validation set. We don't want the model to just memorize the data; we want it to output Shakespeare-like text, and the held-out part lets us check for that.

In [38]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

After splitting the data, we have to decide how to feed it to the model. As you know, we cannot simply feed the entire dataset to the model at once, as that would be computationally too expensive. So we split our data into chunks. For today I will refer to the length of such a chunk as the block_size.

In [39]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

To reiterate the concept of sequence modelling: we want our model to output a token (here, a character) based on the preceding sequence of tokens. In this case we want 47 to be inferred from 18, and if 18, 47, 56 appear in sequence, the model should learn to output 57.

In [40]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


So, from a chunk of block_size + 1 = 9 tokens we get 8 examples for the model to train on. The context can be as short as 1 token or as long as the entire block size. This way we teach the transformer to predict not only from the previous token but from any context length up to block_size.

### Batch time
We have looked at one dimension of sequence modelling, the time dimension. Now we look at the other one, the batch dimension. We stack several chunks into a batch so that they are processed in parallel; the chunks in a batch are independent of each other.

In [41]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # batch_size random starting points
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

What we are doing here is sampling random chunks of length block_size from the dataset.

In [42]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337) # Bonus question: Why 1337?

class BigramLanguageModel(nn.Module):
    
    def __init__(self) -> None:
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) -> (4,8,65)
        loss = F.cross_entropy(logits, targets)
        return logits

m = BigramLanguageModel()
loss = m(xb, yb)
print(loss)


RuntimeError: Expected target size [4, 65], got [4, 8]

Can you solve it?

In [None]:
%load answers/load_vanilla_bigram_model.py

In [None]:
m = BigramLanguageModel()
loss = m(xb, yb)
print(loss)
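If you get stuck: the error comes from the shape convention of `F.cross_entropy`, which expects the class dimension second, i.e. logits of shape (N, C) with targets of shape (N,). The sketch below shows one possible fix on random tensors with the same shapes as above. It is an illustration with made-up variable names and is not necessarily what the answers file does.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
n_b, n_t, n_c = 4, 8, 65  # batch, time, vocab size (same shapes as in the failing cell)
logits_demo = torch.randn(n_b, n_t, n_c)
targets_demo = torch.randint(0, n_c, (n_b, n_t))

# Flatten batch and time into one dimension so the class dimension comes second.
loss_flat = F.cross_entropy(
    logits_demo.view(n_b * n_t, n_c),  # (B*T, C)
    targets_demo.view(n_b * n_t),      # (B*T,)
)
print(loss_flat)
```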

## Are we on the right track?
A good yardstick to check if our first step is correct is to check the loss.
Since the vocab_size is 65 and our model hasn't been trained yet, the loss should be around -ln(1/65) = ln(65) ≈ 4.17. If it is far off, then we have to check our code.

In [None]:
# Calculate -ln(1/65), the loss of a uniform guess over 65 tokens
import math
-math.log(1/65)
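As a sanity check of this yardstick, the sketch below feeds all-zero logits (a perfectly uniform guess) into the cross-entropy and compares the result with ln(65). The tensor names and the batch of 10 are arbitrary choices for the illustration.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
n_vocab = 65
# All-zero logits give a uniform distribution over the 65 tokens.
uniform_logits = torch.zeros(10, n_vocab)
any_targets = torch.randint(0, n_vocab, (10,))
uniform_loss = F.cross_entropy(uniform_logits, any_targets)
print(uniform_loss.item(), math.log(n_vocab))
```

A freshly initialised `nn.Embedding` holds random values rather than zeros, so the untrained model is not exactly uniform and its loss typically lands somewhat above this number.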

In [46]:
# Will add this method in the bigram model to aid us in generating text

def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [43]:
%load answers/load_complete_bigram_model.py


In [47]:
my_token = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(my_token, max_new_tokens=100)[0].tolist()))


AttributeError: 'BigramLanguageModel' object has no attribute 'generate'

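If you get the `AttributeError` above, `m` is most likely still an instance of the earlier class, which has no `generate` method: run the loaded cell and re-create `m` with `m = BigramLanguageModel()` first. Once it runs, you will see that the generated text is pretty much garbage, since the model is untrained. But at least we are now able to generate output.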

## Time to train

![alt text](image/train.gif "Title")



In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32

# TODO(Ambuj): Run this multiple times to see the loss decrease

for steps in range(1000): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

In [None]:
my_token = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(my_token, max_new_tokens=100)[0].tolist()))

### INTERMEZZO
We trained our model, and this is a super simple Bigram model which is just a lookup table. We can do a lot better than this. We can use a Transformer model.
We are not taking advantage of the history (**yet!**). But before we do that, we have to understand how the Transformer model works.

In [None]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

I don't want my 5th token to communicate with the 6th, 7th and 8th tokens. I want it to communicate only with the 1st, 2nd, 3rd and 4th tokens. Information should flow in one direction only (from the past); a token must not look into the future.

As a simplistic toy example, I just want each token to take the mean of itself and all the previous tokens. This is not ideal, because a plain mean throws away a lot of information about the previous tokens and is a weak basis for predicting the next one. But for the sake of simplicity, let's just do it.

In [None]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # "bag of words": average of the tokens seen so far
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)

In [None]:
x[0]

In [None]:
# NOTE(self): Explain xbow
xbow[0]

In [None]:
torch.manual_seed(1337)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print("a =")
print(a)
print("b =")
print(b)
print("c =")
print(c)

### This is the mathematical trick in Self-Attention
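In the example above, the matrix `a` was the boring case: it is all 1's, so every row of `c` is just the sum of all rows of `b`. Things get interesting once `a` is lower-triangular, because then row t of `c` only sums rows 0..t of `b`.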

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(1337)
a = torch.tril(torch.ones(3, 3))
# a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

Now, let's do it efficiently.

#### Version 2: Making it the PyTorch way

In [None]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) is broadcast to (B, T, T); (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

In [None]:
xbow[0], xbow2[0]

#### Version 3: Adding softmax

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
#wei = F.softmax(wei, dim=1) # Softmax=Normalization operation
wei

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T)) # Affinity
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # each row now sums to 1; the -inf entries become 0
xbow3 = wei @ x
print(torch.allclose(xbow, xbow3))
wei

#### Version 4: Self-Attention way!

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T)) # Affinity
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

Each token will emit 3 vectors: Query, Key, Value.


Query: What am I looking for?


Key: What do I contain?


Value: If you find me interesting, what should I propagate to the next layer?

For attention, we will be using the dot product of the query and key. 

![alt text](image/qkv.gif "Title")


The formula for the self-attention is as follows:

$$A(K,Q,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$



In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

In [None]:
wei[0]

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` will have unit variance too, and softmax will stay diffuse and not saturate too much. Illustration below.
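The note on positional encoding can be sketched as follows. This is a minimal illustration of one common choice, a learned position embedding added to the token embedding. The sizes and all names here are assumptions for the example, not code from this notebook's answers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_sz, ctx_len, emb_dim = 65, 8, 32  # assumed sizes, matching the toy settings used here

tok_emb_table = nn.Embedding(vocab_sz, emb_dim)  # encodes what the token is
pos_emb_table = nn.Embedding(ctx_len, emb_dim)   # encodes where the token sits

idx_demo = torch.randint(0, vocab_sz, (4, ctx_len))  # (B, T) token ids
tok_emb = tok_emb_table(idx_demo)                    # (B, T, emb_dim)
pos_emb = pos_emb_table(torch.arange(ctx_len))       # (T, emb_dim)
x_in = tok_emb + pos_emb                             # broadcast over the batch -> (B, T, emb_dim)
print(x_in.shape)
```

Without the position term, two identical tokens at different positions would enter the attention block as identical vectors.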

NOTE(Ambuj): Comment about encoder block and decoder block. 

What changes should be done about the Sentiment Analysis model?

Why 1/sqrt(head_size)?
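A short derivation sketch, under the assumption that the components of a query $q$ and a key $k$ are independent with zero mean and unit variance (which is what `torch.randn` produces), and writing $d_k$ for head_size. Each product $q_i k_i$ then has mean 0 and variance $E[q_i^2]\,E[k_i^2] = 1$, and the products are independent of each other, so their variances add up:

$$\mathrm{Var}(q \cdot k)=\mathrm{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right)=\sum_{i=1}^{d_k}\mathrm{Var}(q_i k_i)=d_k$$

Multiplying the dot product by $1/\sqrt{d_k}$ divides its variance by $d_k$:

$$\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right)=\frac{d_k}{d_k}=1$$

So the unscaled scores grow in spread with the head size, while the scaled scores keep unit variance regardless of head size.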


In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) #* head_size**-0.5

In [None]:
k.var()

In [None]:
q.var()

In [None]:
wei.var()
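To see the effect of the scaling factor side by side, here is a small self-contained sketch. It uses its own variable names and larger (assumed) batch and time sizes than the cells above, purely to get a more stable variance estimate.

```python
import torch

torch.manual_seed(0)
n_batch, n_time, hs = 64, 32, 16  # assumed sizes; hs plays the role of head_size
k_demo = torch.randn(n_batch, n_time, hs)
q_demo = torch.randn(n_batch, n_time, hs)

raw = q_demo @ k_demo.transpose(-2, -1)  # unscaled scores: variance close to hs
scaled = raw * hs**-0.5                  # scaled scores: variance close to 1

print("raw variance:   ", raw.var().item())
print("scaled variance:", scaled.var().item())
```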

Why this matters: softmax stays diffuse when its inputs are small, but saturates towards a one-hot vector when they are large.

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

### Let's take the attention for a spin

In [None]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
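n_embd = 32 # embedding size; spelled n_embd because that is the name the Head class uses
dropout = 0.0 # assumed value: Head uses `dropout`, but the original never defines it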

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

In [None]:
%load answers/load_multihead.py

Let's look back at the architecture of the Transformer model.

![alt text](image/architecture.png "Title")


What are we still missing?

In [None]:
%load answers/feedforward.py

In [None]:
%load answers/load_blocks.py

I added these but the loss is not going down. What is the problem?