# Nano GPT From Scratch In Python

In this notebook I will be walking through creating and training a transformer from scratch. I will be going through each foundational step by step and will try to explain what is happening along the way.

I will be building a simple character level Bigram language model and will then try to improve it using attention mechanism. 

The notebook seems a bit lengthy because rather than just showing the code that was changed which would have shortened things up considerably, I chose to copy all required code down to the next cell to allow this entire notebook to be run from top to bottom. This should make it easier to run as well as allow you to experiment with each new concept as we go.



## About Data

I will be training models over writing of `William Shakespeare`. The data can be downloaded from following command:

`!wget https://raw.githubusercontent.com/ajitsingh98/master/data/tinyshakespeare/input.txt`

Entire data is in textual format and has `1115394` characters in total.


## Load the Data

- Download the data
- Read and Save
- Basic EDA

Lets open the text data and save it in `text_data` variable.

In [1]:
#Read in to inspect it
with open("input.txt", "r", encoding='utf-8') as f:
    text_data = f.read()

In [2]:
print(f"Length of the dataset is {len(text_data)}")

Length of the dataset is 1115394


In [4]:
#lets look at first 1000 characters
print(text_data[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


This is how the training data looks like. It is basically an excerpt from drama writing of `William Shakespeare`.

## Tokenization

We have loaded the dataset the first thing we need to break the data into tokens which will be fed into the model.

The above process is called Tokenization. There are various ways of tokenizing the data like `character level`, `word level` or `sub-word level` tokenization.

Most of the modern LLM models use `sub-word` level tokenization. Each of the tokenization has its own pros and cons.

Since our dataset is small and also for ease of training and illustration I will be going with character level tokenization.

Note that we are making the LM on character level tokenization. It is simple or primitive way to tokenization but in real world we use more sophesticated way to tokenization. Like google uses `SentencePiece` and OpenAI uses `tiktoken` tokenizer which utilizes Byte pair tokenization.

There is tradeoff between vocab size and sequence length. If you are building a LM on character level then you will have small vocab size but larger sequence length and if you are doing on subword level then you will have large vocab size but small sequence length.

### Character Level Tokenization

Let's see how many unique characters i.e tokens in our dataset. Note that the unique number of tokens in the dataset is known as `vocab`.


In [9]:
#vocab - number of unque token the dataset
vocab = sorted(list(set(text_data)))

vocab_size = len(vocab)

print(f"Vocab : {(vocab)}")
print(f"Vocab size : {vocab_size}")

Vocab : ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Vocab size : 65


So we have vocab of size `65` and they are ` \n!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`

Now we are done with tokenization but neural networks can not take charcters directly. We need to first convert characters to integers, which in our case is converting each individual character into integers.

We can use those inetgers to  index into the set of token embeddings. Token embeddings are learned vectors that represent each token that will be passed into the model.

Let's create a encoder-decoder mapping for the `vocab`.
- `encoder` - Takes a string and output a list of integers
- `decoder` - take a list of inetgers, outout a string

In [10]:
#create a mapping 
stoi = {c:i for i, c in enumerate(vocab)}
itos = {i:c for i, c in enumerate(vocab)}

#define encoder and decoder function
encoder = lambda s:[stoi[c] for c in s]
decoder = lambda l:''.join([itos[i] for i in l])

In [12]:
#test the above mappers

s = "hii there!"

print(f"Encoded value: {encoder(s)}")
#decoded value should be equal to s
print(f"Decoded value: {decoder(encoder(s))}")

Encoded value: [46, 47, 47, 1, 58, 46, 43, 56, 43, 2]
Decoded value: hii there!


Okay, this seems working fine!

Now that we have our tokenizer we go through and tokenize the dataset.

I will be going to import pytorch and create a tensor of our encoded dataset.

In [14]:
import torch

encoded_text = torch.tensor(encoder(text_data))

print(f"Encoded Text Shape {encoded_text.shape}, Encoded Text Dtype {encoded_text.dtype}")
encoded_text

Encoded Text Shape torch.Size([1115394]), Encoded Text Dtype torch.int64


tensor([18, 47, 56,  ..., 45,  8,  0])

## Train/Validation Split

Now that we have the encoded text, we need to split it into a training and validation set. 

Validation set is required to test the performance of our trained model on unseen dataset and it provides unbiased estimate of model's performance. 

Training strategy:

- First 90% will be train data and rest will be validation data
- We don't feed entire text to the transformers once because that will be hugely computational expensive.
- We rather take random samples as small chunks from the data and then do training
- `block_size`/`context_length` - chunk size

In [16]:
#define split percentage
split_prcntg = 0.9

#get resultant index and split based on it
idx = int(len(encoded_text)*split_prcntg)

train_data = encoded_text[:idx]
valid_data = encoded_text[idx:]

print(f"Train data size : {train_data.shape}")
print(f"Validation data size : {valid_data.shape}")

Train data size : torch.Size([1003854])
Validation data size : torch.Size([111540])


## Context Length/Block Size

- `context length` - Maximum length of the sequence used when training the transformer.
- This is also refered as `block` size sometimes.
- When transformer is trained, it is trained on each combination of tokens upto maximum context length.
- This is done to make sure transformer sees the context as little as of length one and as big as of context_size. This is also done to increase the efficiency.
- Transformer will not see characters more than the block size for predicting the next token.

Let me show you an example of training set from our data.

In [22]:
block_size = 8
for i in range(block_size):
    x, y = train_data[:i+1], train_data[i+1]
    print(f"idx {i}, when input is {x}, target : {y} | Decoded values : input -> {decoder(x.tolist())}, output -> {decoder(y[None].tolist())}")

idx 0, when input is tensor([18]), target : 47 | Decoded values : input -> F, output -> i
idx 1, when input is tensor([18, 47]), target : 56 | Decoded values : input -> Fi, output -> r
idx 2, when input is tensor([18, 47, 56]), target : 57 | Decoded values : input -> Fir, output -> s
idx 3, when input is tensor([18, 47, 56, 57]), target : 58 | Decoded values : input -> Firs, output -> t
idx 4, when input is tensor([18, 47, 56, 57, 58]), target : 1 | Decoded values : input -> First, output ->  
idx 5, when input is tensor([18, 47, 56, 57, 58,  1]), target : 15 | Decoded values : input -> First , output -> C
idx 6, when input is tensor([18, 47, 56, 57, 58,  1, 15]), target : 47 | Decoded values : input -> First C, output -> i
idx 7, when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), target : 58 | Decoded values : input -> First Ci, output -> t


Going forawrd let's define some hyper-parameters

In [25]:
TORCH_SEED = 1337 #Setting a manual torch seed for reproducable results
torch.manual_seed(TORCH_SEED) #Used to compare against @karpathy's lecture
context_length = 8 #Maximum number of tokens used in each training sequence
batch_size = 4 #number of batches that will be trained in parallel.

## Data Loader

Now I will be implementing a function to get a batch of data from our training and valiation datasets.

The function will take which dataset(train/valid) needs to pull and it will return the input and target dataset for that.

Note that here I am introducing batch dimension as well which helps in parallel processing and it helps in train faster and utilize GPU capabilities.

- `batch_size` - How many independent sequences will be process in parallel
- `block_size` - maxm context length for prediction


In [27]:
#data loader function

def get_batch(train_valid):

    """
    
    Function that returns input and output in batches
    
    """
    data = train_data if train_valid=="train" else valid_data
    data_len = len(data)
    #random sampling from training/valid set
    start_idx = torch.randint(high=data_len - block_size, size=(batch_size, 1))
    #get input 
    x = torch.stack([data[i:i+block_size] for i in start_idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in start_idx])

    return x, y


In [28]:
#test the above function

xb, yb = get_batch('train')

print("inputs:")
print(f"Shape : {xb.shape}")
print(xb)

print("targets : ")
print(f"Shape : {yb.shape}")
print(yb)

inputs:
Shape : torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets : 
Shape : torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


In [32]:
#lets print some examples of input and target now
for batch_idx in range(batch_size):
    print(f"Batch Idx :  {batch_idx}")
    for sequence_idx in range(block_size):
        context = xb[batch_idx, :sequence_idx+1]
        target = yb[batch_idx, sequence_idx]
        print(f"Given input context: {context.tolist()}, target : {target.tolist()}")


Batch Idx :  0
Given input context: [24], target : 43
Given input context: [24, 43], target : 58
Given input context: [24, 43, 58], target : 5
Given input context: [24, 43, 58, 5], target : 57
Given input context: [24, 43, 58, 5, 57], target : 1
Given input context: [24, 43, 58, 5, 57, 1], target : 46
Given input context: [24, 43, 58, 5, 57, 1, 46], target : 43
Given input context: [24, 43, 58, 5, 57, 1, 46, 43], target : 39
Batch Idx :  1
Given input context: [44], target : 53
Given input context: [44, 53], target : 56
Given input context: [44, 53, 56], target : 1
Given input context: [44, 53, 56, 1], target : 58
Given input context: [44, 53, 56, 1, 58], target : 46
Given input context: [44, 53, 56, 1, 58, 46], target : 39
Given input context: [44, 53, 56, 1, 58, 46, 39], target : 58
Given input context: [44, 53, 56, 1, 58, 46, 39, 58], target : 1
Batch Idx :  2
Given input context: [52], target : 58
Given input context: [52, 58], target : 1
Given input context: [52, 58, 1], target : 

Inputs and targets are just shifted by one. As you can see input sequence is really multiple sequences starting with first token in the sequence as input and second token as target all the way to the full input sequence being the input and subsequent character being the target.

Now we are done with data preparation step. Lets build a `Bigram Language Model` which just infer next token based on immediate preceeding token.

## Bigram Language Model

Before going directly for using Transformer we will start with a simple bigram model.

What is a bigram model?
- A bigram model predicts the probability of one token following another
- For example given the token for letter 'm' what is the probability of each token in the vocab will be next token.

In [34]:
import torch.nn as nn
import torch.nn.functional as F

In [35]:
#set manual seed
torch.manual_seed(TORCH_SEED)

class BigramLanguageModel(nn.Module):

    #contructor
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        #both idx and targets are of shape (B, T) => Batch size x Time array of integers
        logits = self.token_embedding_table(idx) #(B, T, C) Batch Time and Channel
        return logits



Let's instantiate the above class and test it for some sample input.

In [37]:
bigram_model = BigramLanguageModel(vocab_size=vocab_size)

bigram_model.parameters

<bound method Module.parameters of BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)>

The model has just one parameter i.e embedding layer which acts as a look up table.

In [43]:
#lets take the output using forward method 
output = bigram_model.forward(xb, yb)

print(f"Bigram model's output : {output.shape}, xb shape : {xb.shape}, yb shape : {yb.shape}, bigram embeddings : {bigram_model.token_embedding_table}")

Bigram model's output : torch.Size([4, 8, 65]), xb shape : torch.Size([4, 8]), yb shape : torch.Size([4, 8]), bigram embeddings : Embedding(65, 65)


In order to train above model we need to introduce loss function. I will be using <a hreaf="https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html">cross entropy loss</a>. It makes sense when we have multiclass classification problems.

In order to use the cross entropy loss we'll need to reshape the output and targets to match the format that it expects. 
- The model output should be a 2D tensor (B*T, C) and targets should be 1D (B*T)
- We can squashed the batch and time dimensions on the model output and batch and time dimension on the target using `torch.view()` method.


In [50]:
torch.manual_seed(TORCH_SEED)

class BigramLanguageModel(nn.Module):

    #constructor
    def __init__(self, vocab_size) -> None:
        super().__init__()
        #set class variable values
        self.vocab_size = vocab_size
        #Each token reads off the logits from the subsequent token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets):
        #both idx and targets of size (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C)
        #extract the shapes
        B, T, C = logits.shape

        #reshape logists and target according to torch.cross_entropy_loss requirement
        logits_reshaped = logits.view(B*T, C)
        targets_reshaped = targets.view(B*T)

        #define loss
        loss = F.cross_entropy(input=logits_reshaped, target=targets_reshaped)

        return logits, loss


In [55]:
#lets check the loss before training just untrained model loss
bigram_model = BigramLanguageModel(vocab_size=vocab_size)

logits, loss = bigram_model(xb, yb)

print('Bigram Model Output Shapes out:',logits.shape,'xb:',xb.shape,'yb:',yb.shape)
print('The calculated loss is:',loss)


Bigram Model Output Shapes out: torch.Size([4, 8, 65]) xb: torch.Size([4, 8]) yb: torch.Size([4, 8])
The calculated loss is: tensor(4.5564, grad_fn=<NllLossBackward0>)


Now, We are going to add a generate method for inference that can perform character generation for our model.

In [67]:
torch.manual_seed(TORCH_SEED)

class BigramLanguageModel(nn.Module):

    #constructor
    def __init__(self, vocab_size) -> None:
        super().__init__()
        #set class variable values
        self.vocab_size = vocab_size
        #Each token reads off the logits from the subsequent token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        #both idx and targets of size (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C)
        #extract the shapes
        B, T, C = logits.shape

        #handle the case when target is None
        if targets is not None:
            #reshape logists and target according to torch.cross_entropy_loss requirement
            logits_reshaped = logits.view(B*T, C)
            targets_reshaped = targets.view(B*T)

            #define loss
            loss = F.cross_entropy(input=logits_reshaped, target=targets_reshaped)
        else:
            loss = None

        return logits, loss

    #method for generating text
    def generate(self, idx, max_new_tokens):

        for _ in range(max_new_tokens):
            # get prediction for idx
            logits, loss = self(idx)
            #get the last time stamp from the logits since we are building bigram model
            logits_last_timstamp = logits[:, -1, :] # (B, C)
            #use softmax to get probabilities
            probs = F.softmax(logits_last_timstamp, dim=-1) # (B, C)
            #sample from the probs distribution
            id_next = torch.multinomial(input=probs, num_samples=1) #(B, num_samples)
            #append the samples idx_next to idx
            idx = torch.concat((idx, id_next), dim=1) # (B, T+1)
        return idx



In [70]:
bigram_model = BigramLanguageModel(vocab_size=vocab_size)

logits, loss = bigram_model(xb, yb)

print("Loss : ", loss)

#create a single batch with single time stamp with 0 index("\n" char)
idx = torch.zeros((1, 1), dtype=torch.long)
#let's generate a char sequence to see what it looks like
max_new_tokens = 100

print(f"{max_new_tokens} generated tokens are : {decoder(bigram_model.generate(idx, max_new_tokens)[0].tolist())}")

Loss :  tensor(4.8430, grad_fn=<NllLossBackward0>)
100 generated tokens are : 
eKugsuRNC!T3b,jqDNMhsHAJSOWYvkZlA'wjtw3IzUltSG:rX;UOIp:RQ!:KU
eRyE-
QZtjcOaCx qUOM.pq?kTTtjACpKJ.EHB


In [139]:
nn.Embedding(vocab_size, vocab_size)(torch.tensor(10))

tensor([-1.3441, -0.2827, -0.6887, -0.6897,  0.5899,  0.5532,  0.0651, -1.7956,
         1.3145,  1.7042,  0.5254, -1.2803, -1.1621,  0.6652,  0.0291,  3.6271,
        -0.1357, -0.4648, -1.4324,  0.1254, -1.1245,  0.4881, -0.6896, -0.7080,
        -0.3152,  0.7196, -0.0178, -1.2635,  0.8914, -1.2858, -2.1067, -1.9922,
         0.7629, -0.5948,  0.9828, -0.4151, -0.2026, -1.8955,  0.6117,  0.1095,
         0.0157, -1.0636,  0.8398,  0.4211, -2.0257,  1.0383,  0.5182,  0.5283,
        -0.5648,  0.0383,  0.3049, -2.0662, -1.1418, -0.1391,  1.0827,  1.1522,
         0.5198, -0.8982,  0.3749, -0.0422,  0.7197,  1.8447,  1.4385, -1.3166,
         1.2690], grad_fn=<EmbeddingBackward0>)

In [140]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        logits = logits.view(B*T, C)
        print(logits)
        targets = targets.view(B*T)
        print(targets)
        #loss - measures quality of target over prediction
        loss = F.cross_entropy(logits, targets)

        return logits, loss

m = BigramLanguageModel(vocab_size=vocab_size)
logits, loss = m(xb, yb)

print(logits.shape)
print(loss)

tensor([[-0.7582, -1.8711,  0.7141,  ..., -0.5707, -0.4843, -0.0299],
        [-1.6906,  0.6377,  0.6544,  ..., -2.2905,  1.0941,  1.0316],
        [-1.1730,  0.1125,  1.3759,  ...,  0.2143,  1.5742, -0.1005],
        ...,
        [ 0.6917,  2.0363,  0.2135,  ..., -0.1197,  0.5410, -1.7943],
        [-1.2892,  1.5030, -0.5783,  ...,  0.2987,  0.0178,  0.1400],
        [ 0.3585,  2.0936, -0.8058,  ...,  1.9319,  0.4377, -0.1681]],
       grad_fn=<ViewBackward0>)
tensor([13, 24, 50, 33, 44, 60, 13, 25, 43, 59, 44, 24, 60, 25, 24, 44, 24, 44,
        24, 60, 25, 24, 44, 60, 39, 12, 63, 22, 19, 44, 29, 25])
torch.Size([32, 65])
tensor(4.4300, grad_fn=<NllLossBackward0>)


Add method for generating next tokens in above `BigramLanguageModel` for generating next tokens based on current token.

In [76]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        if targets is None:
            loss = None
        else:
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            #loss - measures quality of target over prediction
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #get the prediction
            logits, loss = self.forward(idx)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
            
bigram_model  = BigramLanguageModel(vocab_size=vocab_size)

idx = torch.zeros((1, 1), dtype=torch.long)

print(decoder(bigram_model.generate(idx, max_new_tokens=100)[0].tolist())) # [0] - unplug single dimension


-,imYyu!?'3nv&lDR;C.j:ofYpUwUx-HeISLNIm!-y?'VVQDGTwGBCSLKNaRzE.mJwjI'OgQAyLdq$!;WEzE,RqkzURqd!tmihNa


The model's output seems total non-sensical noise because the model is not trained yet and whatever we are getting as output is because of random intialization of parameters.

Let's train the model

In [77]:
#create an optimizer
optimizer = torch.optim.Adam(bigram_model.parameters(), lr=1e-3) #small learning rate will let out model learns smoothly



A basic training loop

In [78]:
#increase batch size to get stable loss -> mini batch training
batch_size = 32

training_steps = 100

for steps in range(training_steps):
    #get the sample batch from the training data
    xb, yb = get_batch("train")
    #get loss
    logits, loss = bigram_model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Loss {round(loss.item(), 3)}")


Loss 4.869
Loss 4.833
Loss 4.917
Loss 4.789
Loss 4.856
Loss 4.637
Loss 4.686
Loss 4.848
Loss 4.661
Loss 4.766
Loss 4.879
Loss 4.885
Loss 4.795
Loss 4.814
Loss 4.744
Loss 4.782
Loss 4.73
Loss 4.778
Loss 4.867
Loss 4.808
Loss 4.977
Loss 4.841
Loss 4.78
Loss 4.813
Loss 4.795
Loss 4.691
Loss 4.742
Loss 4.768
Loss 4.831
Loss 4.782
Loss 4.801
Loss 4.743
Loss 4.883
Loss 4.742
Loss 4.884
Loss 4.859
Loss 4.741
Loss 4.86
Loss 4.597
Loss 4.68
Loss 4.561
Loss 4.831
Loss 4.817
Loss 4.769
Loss 4.726
Loss 4.81
Loss 4.763
Loss 4.777
Loss 4.76
Loss 4.665
Loss 4.789
Loss 4.836
Loss 4.691
Loss 4.632
Loss 4.767
Loss 4.665
Loss 4.76
Loss 4.601
Loss 4.785
Loss 4.753
Loss 4.743
Loss 4.737
Loss 4.815
Loss 4.699
Loss 4.613
Loss 4.7
Loss 4.724
Loss 4.648
Loss 4.754
Loss 4.83
Loss 4.847
Loss 4.677
Loss 4.735
Loss 4.772
Loss 4.598
Loss 4.623
Loss 4.822
Loss 4.74
Loss 4.721
Loss 4.864
Loss 4.695
Loss 4.713
Loss 4.692
Loss 4.739
Loss 4.64
Loss 4.655
Loss 4.614
Loss 4.665
Loss 4.782
Loss 4.665
Loss 4.738
Loss 4.691


We can observe that the loss is coming down all the way from `4.9` to `4.6`.

Let's train the model for longer steps.

In [79]:
#increase batch size to get stable loss -> mini batch training
batch_size = 32

training_steps = 1000

for steps in range(training_steps):
    #get the sample batch from the training data
    xb, yb = get_batch("train")
    #get loss
    logits, loss = bigram_model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if steps%100 == 0:
        print(f"Loss {round(loss.item(), 3)}")

Loss 4.796
Loss 4.441
Loss 4.369
Loss 4.343
Loss 4.245
Loss 4.225
Loss 4.022
Loss 3.944
Loss 3.826
Loss 3.829


Okay this seems promosing. Let's print the output of the model just like we did with untrained model.

In [81]:
#starting token
idx = torch.zeros((1, 1), dtype=torch.long)

print(decoder(bigram_model.generate(idx, max_new_tokens=1000)[0].tolist())) # [0] - unplug single dimension



beAvUw:LePpanIDGf-nkxo ft!zUiOixvZAdqkeIA;.F nje3;QVokILg
Fdeikf$VHXWxDmseI;&,

bgqMoINqAp;aAeylTuQ-Id
Ye:pu:ilhPGMula::XSsH:uL&ZAanxnYBwZADj OLGid'tMaFIN?pBF ing &
Y
tRoPE:f rzXs

El nKtYzWxAPOWIVlk:KG3JlBJ
b!QCqhKAnkehUQgBZ;ZVVOU,ulTYCIhADmxUtug fo.$Q&;
N;wPOQPBv.myLA;
bLfo;$R-LuUGHoIb!;GPl &T;XAwN;vTSLgNZBQ-esO-&&bkDbp,MaVuSLci' 'OI:BQytgQXcRo YingZRp.RssCixD D3;.&ot,
b COBrnT sKRrVwX?AR3FLoj3zjIAJym 
gLuZ-LoQ:HglssWkb!OsGj:qpUNScth!RSURM&bssZ:'TMjqzQAaA O;LGjGWhXu wNpe fImVVVO' Vw o;rtAtxnsKaqhBr 
JwG$zCrcBjjjwaOfesFMbgUEm fi;afm'QxuE$U;sOPJ?'O?qhoR:BFXjuyGfoueLKxHNIub f,
e !:'AvZBrjhhKNpN;qhqhtuCA?cnof3BrQAht! 'isc oDsj$RGE.vcINIRoun,zd,Vo Rtcs,
NS.JwWpJ Ma;KYRA
FOl RpUqz:OZDcZAg JsefSQ:v$weCG:nodONjH3'YsXy
J gqgH!DGYke3SL&bgNKAKgNEd

yobkOkojdid&bgNdqdik,.VdeIWlBrbJkOkea?
YhVoWvooBCpPt;.gmECDr&byyZRSdoeUYCinxDyUCSLAxgxpRVOrteEu,?I:KpvPptez?et
VuyHawGeean'MEDjNINSLRFcMf;ehxvTSABr?lDwYfBN:EEwnbLM m'OSSCHiIVW$nsCS!;Vf?KyGanPAsXkx EmyyOdt,gnvZ-DWhBZzezEiIINnSS!tTXjBIqr wKYr$UbnUmEb

Still it seems producung non-sensical output. Let's train it for more steps.

In [82]:
#increase batch size to get stable loss -> mini batch training
batch_size = 32

training_steps = 10000

for steps in range(training_steps):
    #get the sample batch from the training data
    xb, yb = get_batch("train")
    #get loss
    logits, loss = bigram_model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if steps%100 == 0:
        print(f"Loss {round(loss.item(), 3)}")

Loss 3.703
Loss 3.613
Loss 3.636
Loss 3.644
Loss 3.459
Loss 3.372
Loss 3.396
Loss 3.388
Loss 3.322
Loss 3.19
Loss 3.208
Loss 3.118
Loss 3.126
Loss 2.998
Loss 2.978
Loss 2.846
Loss 2.83
Loss 2.806
Loss 2.938
Loss 2.845
Loss 2.752
Loss 2.767
Loss 2.81
Loss 2.73
Loss 2.711
Loss 2.69
Loss 2.656
Loss 2.626
Loss 2.733
Loss 2.528
Loss 2.678
Loss 2.565
Loss 2.653
Loss 2.55
Loss 2.602
Loss 2.574
Loss 2.544
Loss 2.433
Loss 2.579
Loss 2.61
Loss 2.492
Loss 2.518
Loss 2.429
Loss 2.51
Loss 2.491
Loss 2.643
Loss 2.472
Loss 2.506
Loss 2.541
Loss 2.515
Loss 2.475
Loss 2.549
Loss 2.425
Loss 2.458
Loss 2.354
Loss 2.434
Loss 2.572
Loss 2.426
Loss 2.387
Loss 2.528
Loss 2.354
Loss 2.539
Loss 2.589
Loss 2.547
Loss 2.521
Loss 2.503
Loss 2.473
Loss 2.492
Loss 2.547
Loss 2.478
Loss 2.55
Loss 2.522
Loss 2.508
Loss 2.507
Loss 2.581
Loss 2.528
Loss 2.437
Loss 2.426
Loss 2.516
Loss 2.357
Loss 2.493
Loss 2.532
Loss 2.514
Loss 2.521
Loss 2.399
Loss 2.459
Loss 2.45
Loss 2.493
Loss 2.368
Loss 2.375
Loss 2.515
Loss 2.43

Okay so we are able to reduce the loss to around `2.4`. Let's again see how our model's prediction is.

In [83]:
#starting token
idx = torch.zeros((1, 1), dtype=torch.long)

print(decoder(bigram_model.generate(idx, max_new_tokens=1000)[0].tolist())) # [0] - unplug single dimension


TIUzupr, whe oen,
USAGrd, sulisous cofe
N Gok awho amyod thy s or ca anoke areasombene y thee-
OMarff, GAPam

A:
Tireangr t athigth ak.
Thir t thatr athecinootheeromy uy nt t LARDit.
toous:
AMEOLowiceryowe


Trre sowisatouns
TH:
FOMBemorovicke, he; INBuno t
I:

Lewheen, ge f spagill w t m:
Ofer way,

BAD bs ithiset
OME R:
ARDisere t A:
Ans:
CHoicyoyt:
Weid cks sir mesance, othathast The llyoJwit ouawin onofa br wes IULithoks me kestosichNof wis,
IUSThe l, fe ll ghu thodsly tlotch s all terrymahend k'sit mallimothowomourd gl'thyende glandwerstolereruree l atofacerd wanoy ICUCKI re outhiothe CES:


An Hy d d thrasss Hime hattl ther,
He whablowire
HERIOndfrous n m we s the nanofs pese r'sfopou
STItoworsth misigho, weraberowhabrs alathisthy t yepr w! hallis ared s;

Mare ren the g
POf wn

CQ?
Theend ME oug t y.
OMe, o 'dyon ind, ID ce s tu od fanour, tos&jum tr.
Ler s, Tis
Myovess wedous dm atrsthe;
faw'I ft tooding pyot s ivers.

An.


Thancctaumos'dintou k woredan
NG qus:
T:
t w;

Whath

Okay the output seems little bit like english and have some sort of resemblence with the input text data.

Note that the model is just predicting next token or character solely based on the previous token so it doesn’t really have a lot to go by.

Learning about the dataset, training loop and getting a basline with a simplistic model is always a good first step in any AI project. This allows you to see if further improvements are helping as well as gives you a baseline to compare your model with.

Let's check for validation loss i.e loss over validation data.


In [86]:
#increase batch size to get stable loss -> mini batch training
batch_size = 32
torch.manual_seed(TORCH_SEED)
losses = []

for steps in range(len(valid_data)//batch_size//8):
    #get the sample batch from the training data
    xb, yb = get_batch("valid")
    #get loss
    with torch.no_grad():
        #evaluate the loss
        logits, loss = bigram_model(xb, yb)
        losses.append(loss)
    if steps%10 == 0:
        print(f'Step: {steps} Loss: {round(loss.item(),3)}')
print('Overall Validation Loss:',torch.stack(losses,dim=0).mean())

Step: 0 Loss: 2.443
Step: 10 Loss: 2.526
Step: 20 Loss: 2.541
Step: 30 Loss: 2.534
Step: 40 Loss: 2.398
Step: 50 Loss: 2.458
Step: 60 Loss: 2.556
Step: 70 Loss: 2.474
Step: 80 Loss: 2.624
Step: 90 Loss: 2.528
Step: 100 Loss: 2.544
Step: 110 Loss: 2.618
Step: 120 Loss: 2.494
Step: 130 Loss: 2.399
Step: 140 Loss: 2.466
Step: 150 Loss: 2.553
Step: 160 Loss: 2.539
Step: 170 Loss: 2.625
Step: 180 Loss: 2.51
Step: 190 Loss: 2.438
Step: 200 Loss: 2.449
Step: 210 Loss: 2.466
Step: 220 Loss: 2.548
Step: 230 Loss: 2.499
Step: 240 Loss: 2.394
Step: 250 Loss: 2.408
Step: 260 Loss: 2.448
Step: 270 Loss: 2.435
Step: 280 Loss: 2.51
Step: 290 Loss: 2.459
Step: 300 Loss: 2.46
Step: 310 Loss: 2.438
Step: 320 Loss: 2.605
Step: 330 Loss: 2.461
Step: 340 Loss: 2.352
Step: 350 Loss: 2.421
Step: 360 Loss: 2.457
Step: 370 Loss: 2.558
Step: 380 Loss: 2.472
Step: 390 Loss: 2.503
Step: 400 Loss: 2.403
Step: 410 Loss: 2.493
Step: 420 Loss: 2.36
Step: 430 Loss: 2.492
Overall Validation Loss: tensor(2.4835)


The overall validation loss appears similar to the training loss which makes sense given how basic this model is.

Now we'll move on to implementing attention mechanism and see how it does compared with the basic bigram model.

In [87]:
%reset -f

## Code rewriting for transformers

I will be rewrting the above code to clean up things and make it more stable.

In [89]:
#imports
import torch
import torch.nn as nn 
import torch.nn.functional as F

In [109]:
torch.cuda.is_available()

False

In [110]:
#Hyperparameters
batch_size = 32 # number of tokens chunks per batch
block_size = 8 # length of token chunks/block size
learning_rate = 1e-2

max_iters = 3000 # this is no of training iterations
eval_interval = 300 #Number of steps between evaluating the validation set to see how our validation loss is doing.
eval_iters = 200 #Number of steps to do on the validation set per each interval. We do more than 1 to get a more accurate overall valid loss
device = 'cuda' if torch.cuda.is_available() else 'cpu' # run on gpu if available

TORCH_SEED = 1337
torch.manual_seed(TORCH_SEED)


<torch._C.Generator at 0x7fea38d86750>

Load the data

In [111]:
with open('input.txt','r',encoding='utf-8') as f:
    text_data = f.read()
print('Length of text:',len(text_data))

Length of text: 1115394


Define the vocab for our dataset

In [112]:
vocab = sorted(list(set(text_data))) #Called chars in the video, but vocab is a more generic term. Both are correct.
vocab_size = len(vocab)
print('Vocab size :',vocab_size, '\nVocab :',vocab)

Vocab size : 65 
Vocab : ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Define encoder and decoder function

In [113]:
#create a mapping 
stoi = {c:i for i, c in enumerate(vocab)}
itos = {i:c for i, c in enumerate(vocab)}

#define encoder and decoder function
encoder = lambda s:[stoi[c] for c in s]
decoder = lambda l:''.join([itos[i] for i in l])

Tokenization

In [114]:
import torch

encoded_text = torch.tensor(encoder(text_data))

print(f"Encoded Text Shape {encoded_text.shape}, Encoded Text Dtype {encoded_text.dtype}")
encoded_text

Encoded Text Shape torch.Size([1115394]), Encoded Text Dtype torch.int64


tensor([18, 47, 56,  ..., 45,  8,  0])

And split it into training and validation sets

In [115]:
#define split percentage
split_prcntg = 0.9

#get resultant index and split based on it
idx = int(len(encoded_text)*split_prcntg)

train_data = encoded_text[:idx]
valid_data = encoded_text[idx:]

print(f"Train data size : {train_data.shape}")
print(f"Validation data size : {valid_data.shape}")

Train data size : torch.Size([1003854])
Validation data size : torch.Size([111540])


Next we’ll set up a basic data loader to get data in batches

In [116]:
#data loader function

def get_batch(train_valid):

    """
    
    Function that returns input and output in batches
    
    """
    data = train_data if train_valid=="train" else valid_data
    data_len = len(data)
    #random sampling from training/valid set
    start_idx = torch.randint(high=data_len - block_size, size=(batch_size, 1))
    #get input 
    x = torch.stack([data[i:i+block_size] for i in start_idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in start_idx])

    return x.to(device), y.to(device)

#test the above function

xb, yb = get_batch('train')

print("inputs:")
print(f"Shape : {xb.shape}")


print("targets : ")
print(f"Shape : {yb.shape}")

inputs:
Shape : torch.Size([32, 8])
targets : 
Shape : torch.Size([32, 8])


Next we’ll create a function to estimate the loss for our model. Typically this is calculated against the training set for each training step and at the end of each epoch for the validation set but to keep things simple we’ll just calculate it when called based on the number of steps specified as `eval_iters` and take the mean for the training and validation sets respectively. This also helps smooth out the loss values.

In [117]:
@torch.no_grad()

def estimate_loss():
    out = {}
    bigram_model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x_b, y_b = get_batch(split)
            logits, loss = bigram_model(x_b, y_b)
            losses[k] = loss.item()
        out[split] = losses.mean()
    bigram_model.train()
    return out

Okay so we are done with code rewriting part let's see if everything is working fine.

In [118]:
device

'cpu'

In [120]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from the lookup table
        self.token_embedding_table = nn.Embedding(num_embeddings= vocab_size, embedding_dim= vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        if targets is None:
            loss = None
        else:
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            #loss - measures quality of target over prediction
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #get the prediction
            logits, loss = self.forward(idx)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
            
bigram_model  = BigramLanguageModel(vocab_size=vocab_size)
bigram_model = bigram_model.to(device)

In [122]:
optimizer = torch.optim.AdamW(params=bigram_model.parameters(), lr=learning_rate)

In [124]:
for step in range(max_iters):
    
    if step % eval_iters == 0 or step == max_iters-1:
        losses = estimate_loss()
        print('Step:',step,'Training Loss:',losses['train'],'Validation Loss:',losses['valid'])
    
    xb,yb = get_batch('train')
    logits, loss = bigram_model(xb,yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.zeros((1,1), dtype=torch.long, device=device)
print(decoder(bigram_model.generate(context,max_new_tokens=500)[0].tolist()))

Step: 0 Training Loss: tensor(4.6488) Validation Loss: tensor(4.6541)
Step: 200 Training Loss: tensor(3.0763) Validation Loss: tensor(3.1015)
Step: 400 Training Loss: tensor(2.6515) Validation Loss: tensor(2.6651)
Step: 600 Training Loss: tensor(2.5555) Validation Loss: tensor(2.5581)
Step: 800 Training Loss: tensor(2.5180) Validation Loss: tensor(2.5267)
Step: 1000 Training Loss: tensor(2.4884) Validation Loss: tensor(2.5096)
Step: 1200 Training Loss: tensor(2.4776) Validation Loss: tensor(2.5039)
Step: 1400 Training Loss: tensor(2.4685) Validation Loss: tensor(2.4927)
Step: 1600 Training Loss: tensor(2.4700) Validation Loss: tensor(2.4937)
Step: 1800 Training Loss: tensor(2.4600) Validation Loss: tensor(2.4973)
Step: 2000 Training Loss: tensor(2.4599) Validation Loss: tensor(2.4983)
Step: 2200 Training Loss: tensor(2.4725) Validation Loss: tensor(2.4892)
Step: 2400 Training Loss: tensor(2.4734) Validation Loss: tensor(2.4833)
Step: 2600 Training Loss: tensor(2.4646) Validation Loss: 

Okay everything seems fine with new code. We can move on for implementing transformers.

# Building Intuition for Self Attention

- Attention is the key element of the transformer.
- In above setting tokens are not interecting with each other we are just taking current token to predict the next token.
- Different tokens will find different other tokens more or less interesting and we want to be data dependent. For example a vowel will look for consonent in past.

**How self attention solves it?**

- Every single node or token at every position will emit two vectors 
    - query vector - what am i looking for 
    - key - what do i contain
    - value - communicate value

    If key and query(for other token are align) - then good match

The idea is that each token should be able to communicate with or look at each previous token in the sequence but not future tokens. For example given token number 4 in a sequence of 8 tokens, token 4 should be able to access token 1, 2 and 3, but not tokens 5 through 8. 

We will demonstrate this with a for loop implementation to cement the concept and then show the equivalent calculation using matrix multiplication which is how transformers are implemented in real life because the matrix multiplication is orders of magnitudes faster than basic nested `for` loops.



In [127]:
#set manual seed
torch.manual_seed(TORCH_SEED)
B, T, C = 4, 8, 2 # Batch, Time, Channel - Time is the each toke in the sequence and channel is the embedding dimension
x = torch.randn((B, T, C))
x

tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])

Similar result can be achieved using matrix multiplication. `torch.tril` truncates all the elements above diag to zeros.

In [154]:
torch.manual_seed(42)
a = torch.ones((3, 3))

torch.tril(a)

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [157]:
wei = torch.tril(torch.ones(T, T))
wei = wei/wei.sum(1, keepdim=True)
xbow2 = wei @ x
torch.allclose(xbow, xbow2)

True

Different tokens will find different other tokens more or less interesting and we want to be data dependent. For example a vowel will look for consonent in past.

How self attention solves it?

- Every single node or token at every position will emit two vectors 
    - query vector - what am i looking for 
    - key - what do i contain
    - value - communicate value

if key and query(for other token are align) - then good match

In [161]:
# version 4: self-attention
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)

#let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
v = value(x)
#interection happens when 
wei = q @ k.transpose(-2, -1) # (B, T, T)

tril = torch.tril(torch.ones(T, T))

wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim = 1)

out = wei @ v

out.shape

torch.Size([4, 8, 16])

**Notes**
- Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.

- Attention supports arbitary connectivity of nodes.

- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.

- Each example across batch dimension is of course processed completely independently and never talk to each other.

- `self-attention` just means the keys and values are produced from the source as queries. In `cross-attention` keys and queries comes from the separate source(eg. encode module) and it used when we have seperate source of nodes from which we want to pull information from to our node.

- `Scaled Attention` - Additional divides wei by 1/sqrt(head_size). This makes it so when input Q, K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much.
    - control variation during initialization

Illustartion below:

In [167]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [168]:
k.var()

tensor(1.0966)

In [169]:
q.var()

tensor(0.9416)

In [170]:
wei.var()

tensor(1.0065)

## Self head Implementation

In [183]:
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size)
        self.query = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape
        k = self.key(x) # (B, T, C) C - head_size
        q = self.query(x) # (B, T, C) - C- head_size
        #compute attenstion score (affinities)
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, T)
        #make decoder block
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) #(B, T, T)
        wei = F.softmax(wei, dim=-1) #(B, T, T)
        #perform weighted aggregate of the values
        v = self.value(x) #(B, T, C)
        out = wei @ v # (B, T, C)

        return out


Rewrite the Bigram Language Model using Head node

In [184]:
n_embd = 32

In [209]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [210]:
model = BigramLanguageModel()

In [212]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.214176654815674
4.199700832366943
4.239514350891113
4.218862056732178
4.186700820922852
4.226130485534668
4.24436092376709
4.208166599273682
4.214605331420898
4.240248203277588
4.221784591674805
4.196395397186279
4.189698219299316
4.235991954803467
4.233548641204834
4.220687389373779
4.22767448425293
4.208745002746582
4.22116756439209
4.2244648933410645
4.211333274841309
4.230423927307129
4.179531097412109
4.190152168273926
4.197641372680664
4.201192855834961
4.208263874053955
4.25800085067749
4.182485580444336
4.213343143463135
4.199041366577148
4.230070114135742
4.185204982757568
4.181177616119385
4.216397285461426
4.197425842285156
4.252711296081543
4.205231666564941
4.218751430511475
4.222886562347412
4.222411155700684
4.238119602203369
4.251081943511963
4.187753200531006
4.22334623336792
4.220522880554199
4.220126152038574
4.2478485107421875
4.224193572998047
4.238748073577881
4.197943210601807
4.197122097015381
4.20084810256958
4.225310802459717
4.224092960357666
4.218008041381

## Multihead attention

- Multi communication channels
- multiple attention in parallel


In [213]:
class MultiHeadAttention(nn.Module):

    """ multiple heads of self attention in parallel """

    def __init__(self, num_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_head)])

    def forward(self, x):

        return torch.cat([h(x) for h in self.heads], dim=-1)


In [214]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = MultiHeadAttention(4, n_embd//4) #group convolution
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [215]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.20404577255249
4.210916996002197
4.215507984161377
4.202003002166748
4.242471218109131
4.210153579711914
4.257818698883057
4.198739528656006
4.1947340965271
4.241333484649658
4.260656356811523
4.193447589874268
4.181999683380127
4.23184871673584
4.191461086273193
4.223646640777588
4.237530708312988
4.205941200256348
4.188686847686768
4.195237159729004
4.236495018005371
4.234069347381592
4.235219478607178
4.203344345092773
4.202004909515381
4.2177019119262695
4.2579569816589355
4.234684467315674
4.21341609954834
4.210850238800049
4.219239234924316
4.221963405609131
4.224623680114746
4.236835956573486
4.198772430419922
4.228184700012207
4.209569931030273
4.214357376098633
4.20311164855957
4.203883647918701
4.195871353149414
4.231898784637451
4.222784519195557
4.2008819580078125
4.225440979003906
4.199305057525635
4.227909564971924
4.204291343688965
4.225578784942627
4.216580390930176
4.228621959686279
4.240225791931152
4.192161560058594
4.232958793640137
4.232292652130127
4.20242023468

Multihead attention helps in creating multiple channel of communication

## Introduce feed forward layer in decoder block


In [216]:
class FeedForward(nn.Module):
    """"Simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU()
        )
    def forward(self, x):
        return self.net(x)

Add it in the Bigram Model - self attention is like communication to fetch the data once the data is there they need to think about it individually.

In [217]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = MultiHeadAttention(4, n_embd//4) #group convolution
        self.ffwd = FeedForward(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        x = self.ffwd(x) # token level
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [218]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.204038619995117
4.211588382720947
4.199848651885986
4.229249477386475
4.210080623626709
4.191341400146484
4.211798191070557
4.227687835693359
4.2225341796875
4.189696788787842
4.252888202667236
4.193850517272949
4.183999538421631
4.220370292663574
4.211663246154785
4.241042137145996
4.220822811126709
4.223501682281494
4.219017028808594
4.206402778625488
4.187771797180176
4.240336894989014
4.2052130699157715
4.1895365715026855
4.192267894744873
4.200622081756592
4.231438636779785
4.244041919708252
4.211033821105957
4.246546745300293
4.220117092132568
4.227297306060791
4.224756717681885
4.208759307861328
4.2256855964660645
4.205783367156982
4.20211124420166
4.221246242523193
4.197579860687256
4.208597183227539
4.203438758850098
4.227713108062744
4.233147621154785
4.195008754730225
4.223347187042236
4.215117454528809
4.219760894775391
4.227991580963135
4.2010498046875
4.2074055671691895
4.182341575622559
4.200370788574219
4.196925640106201
4.239253044128418
4.203332424163818
4.237870693

## Creating a block excluding the cross head attention

In [219]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        x = self.ffwd(x) # token level
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

To make the deep nn remain optimizable is to use residual/skip connetions

- addition distribute gradients equally to all of its branches

In [None]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x) # forked off do some communicationa and come back
        x = x + self.ffwd(x)
        return x

Add projection in multihead attention

In [221]:
class MultiHeadAttention(nn.Module):

    """ multiple heads of self attention in parallel """

    def __init__(self, num_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_head)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out


Add projection in FFWD

In [223]:
class FeedForward(nn.Module):
    """"Simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

In [224]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.blocks(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [226]:
model = BigramLanguageModel()

In [228]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

AttributeError: 'BigramLanguageModel' object has no attribute 'sa_head'

Another way to optimization deep nn is doing **layer normalization**

    - Row normalization

Add layernorm in Blocks

In [229]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)


    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # forked off do some communicationa and come back
        x = x + self.ffwd(self.ln2(x))
        return x

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.blocks(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [230]:
model = BigramLanguageModel()

In [231]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())


AttributeError: 'BigramLanguageModel' object has no attribute 'sa_head'

Add droput layer