# Building GPT from scratch

We are going to build a PyTorch re-implementation of GPT, covering both training and inference. This code is also hosted on GitHub in the [nanoGPT repository](https://github.com/karpathy/nanoGPT). GPT is not a complicated model, and this implementation is approximately 600 lines of code:
* 300-line boilerplate training loop
* 300-line GPT model definition

All that is going on is that a sequence of indices feeds into a Transformer (Decoder), and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency.

If trained on OpenWebText, this implementation can reproduce the performance of GPT-2:

<table>
    <tr>
        <td><img title="" src="images/gpt2_124M_loss.png" alt="" width="500" data-align="center"></td>
    </tr>
</table>

List of relevant papers:
* [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)
* [Radford et al. (2018)](https://openai.com/blog/language-unsupervised/)
* [Radford et al. (2019)](https://openai.com/blog/better-language-models/)
* [Brown et al. (2020)](https://arxiv.org/abs/2005.14165)

List of relevant videos:
* [Karpathy (2023)](https://www.youtube.com/watch?v=kCc8FmEb1nY). Let's build GPT: from scratch, in code, spelled out

# 1 - Download data

We are going to work with a toy dataset called ["tiny Shakespeare"](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt), which is a concatenation of a variety of Shakespeare's plays, amounting to 40,000 lines of text. It was first featured in Andrej Karpathy's blog post ["The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

In [1]:
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
output_file = "input.txt"

response = requests.get(url)
if response.status_code == 200:
    with open(output_file, "wb") as file:
        file.write(response.content)
    print("File downloaded successfully!")
else:
    print("Failed to download the file.")

File downloaded successfully!


In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [3]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


# 2 - Tokenization

Language models like GPT cannot receive raw strings as input; instead, they assume the text has been *tokenized* and *encoded* as numerical vectors. Tokenization is the step of breaking down a string into the atomic units used in the model. 

There are several tokenization strategies one can adopt, and the optimal splitting of words into subunits is usually learned from the corpus:
* **Character tokenization.** Divides text into individual characters, useful for preserving character-level information but can result in a large number of tokens (i.e., long sequences).
* **Word tokenization.** Splits text into separate words or tokens, capturing semantic meaning and syntactic structures, but can be challenging for languages with complex morphology.
* **Sub-word tokenization.** Breaks text into smaller units (sub-words/morphemes) to handle out-of-vocabulary words and reduce vocabulary size, suitable for languages with rich morphology.

Here, we are going to use character tokenization because it is the easiest one to implement and offers relatively good performance.
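
To make the trade-off between the strategies above concrete, here is a minimal sketch that tokenizes the same short sample at the character level and at the word level. The whitespace split is a deliberately naive stand-in for word tokenization (an assumption for illustration only, not what a real word tokenizer does).

```python
# Compare character-level and (naive) word-level tokenization on a short sample.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# Character tokenization: every character is a token.
char_tokens = list(sample)
char_vocab = sorted(set(char_tokens))

# Naive word tokenization: split on whitespace (illustrative simplification).
word_tokens = sample.split()
word_vocab = sorted(set(word_tokens))

print(len(char_tokens), len(char_vocab))  # long sequence, small vocabulary
print(len(word_tokens), len(word_vocab))  # short sequence, one entry per distinct word
```

On a full corpus the gap widens: the character vocabulary stays tiny while the word vocabulary keeps growing with every new word.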

In [5]:
# create a mapping from characters to integers
char_to_int = { ch:i for i,ch in enumerate(chars) }
int_to_char = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [char_to_int[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([int_to_char[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [6]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

# 3 - Data split

In [7]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# 4 - Prepare chunks of data

It is important to understand that attempting to input the entire text into the Transformer model simultaneously is computationally expensive and impractical. Therefore, when training the Transformer, we only handle the data in smaller chunks. During training, we extract random subsets or portions from the training set, referred to as "chunks," and train the model incrementally on these smaller pieces. The maximum length of these fragments is commonly referred to as the block size or context length.

In [8]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [9]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


There is another dimension that we need to consider: **the batch size**. Every time we feed data into the Transformer, we stack multiple chunks of text into a single tensor, called a batch. That is done for efficiency, so that we can keep the GPUs busy, because they are very good at parallel processing of data. The chunks in a batch are processed independently of each other.

In [10]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # Generate random offsets
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# Example with a single batch
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    print(f"Row {b}")
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
Row 0
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
Row 1
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when inpu

In [11]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


# 5 - Bigram model

A bigram model, also known as a 2-gram model, is a statistical language model that predicts the probability of the next token in a sequence based only on the preceding token. The classic formulation works on words; in this notebook the tokens are characters, so the model predicts the next character from the current one.

Bigram models are relatively simple and computationally efficient language models. However, they have limitations, as they only consider the immediately preceding token and cannot capture longer-range dependencies or contextual information.
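
Before building the neural version, it may help to see the classic count-based form of the same idea. The sketch below estimates P(next | previous) at the character level by counting pairs; the toy corpus and the helper name `next_char_probs` are hypothetical and are not part of the notebook.

```python
from collections import Counter, defaultdict

text_sample = "hello hello help"  # hypothetical toy corpus

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(text_sample, text_sample[1:]):
    counts[prev][nxt] += 1

def next_char_probs(prev):
    """Estimate P(next | prev) by normalizing the counts for `prev`."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

print(next_char_probs("l"))  # 'l' is followed by 'l', 'o' or 'p' in this corpus
```

The neural bigram model below stores a comparable table, but as learnable logits that are fitted by gradient descent instead of being counted.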

#### Loss function

Our loss function for the bigram model should aim to capture the discrepancy between the model's predicted probabilities and the true next word/token/sub-word distribution. Here are a few common loss functions that can be used with a bigram model:

* [**Cross-Entropy Loss:**](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) Cross-entropy loss is commonly used for language modeling tasks, including bigram models. It measures the dissimilarity between the predicted probability distribution over the vocabulary (next-word predictions) and the true distribution (the actual next word). The cross-entropy loss encourages the model to minimize the difference between predicted and actual word probabilities.

* **Negative Log-Likelihood Loss:** Negative log-likelihood (NLL) loss is closely related to cross-entropy loss and is often used interchangeably in language modeling tasks. NLL loss measures the negative logarithm of the predicted probability assigned to the correct next word. Minimizing the NLL loss maximizes the likelihood of the correct next word given the preceding word.

* **Perplexity:** Perplexity is a commonly used evaluation metric for language models, and it can also be used as a loss function during training. Perplexity is the exponentiated average cross-entropy loss per token. Because it is a monotonic function of the cross-entropy, minimizing one minimizes the other, and both correspond to maximizing the likelihood of the observed data.

In this case, we are going to use the [cross-entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).
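
The relationship between the three quantities above can be checked numerically. This is a small sketch on random logits (the shapes and seed are arbitrary choices for illustration): cross-entropy on logits matches NLL on log-probabilities, perplexity is its exponential, and a model that predicts a uniform distribution over 65 tokens has a loss of ln(65).

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(5, 65)            # 5 predictions over a 65-token vocabulary
targets = torch.randint(0, 65, (5,))   # the 5 "true" next tokens

# Cross-entropy on logits equals NLL on log-probabilities.
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, nll))

# Perplexity is the exponential of the average cross-entropy.
perplexity = torch.exp(ce)
print(perplexity.item())

# All-zero logits give a uniform distribution, whose loss is ln(vocab_size).
uniform_loss = F.cross_entropy(torch.zeros(5, 65), targets)
print(uniform_loss.item(), math.log(65))  # both approximately 4.174
```

The value ln(65) ≈ 4.174 is a useful reference point: a loss above it means the model is doing worse than guessing uniformly.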

#### Forward pass

The forward pass of the model takes input indices (`idx`) representing the current context and optionally the target indices (`targets`) for training. It returns the logits for each token in the context and the loss if the targets are provided.

- `idx`: Input indices representing the current context. It is a tensor of shape `(B, T)` where `B` is the batch size and `T` is the sequence length.
- `targets`: Target indices for training. It has the same shape as `idx`.

The forward pass performs the following steps:
1. The token indices are passed through the embedding lookup table. In the bigram model, each row of the table directly holds the logits for the next token.
2. If targets are provided, the logits and targets are flattened to the shapes the cross-entropy loss function expects: `(B*T, C)` and `(B*T,)`.
3. The flattened logits are compared to the targets using the cross-entropy loss. Without targets, the loss is `None`.
4. The logits and loss (if applicable) are returned.
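
The reshaping step exists because `F.cross_entropy` expects the class dimension in position 1. As a sketch of the two options (random tensors, with shapes chosen to match this notebook), we can either flatten batch and time together or move the class dimension, and both give the same loss:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# Option 1 (used in this notebook): flatten batch and time into one dimension.
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Option 2: move the class dimension to position 1, giving (B, C, T).
loss_perm = F.cross_entropy(logits.transpose(1, 2), targets)

print(torch.allclose(loss_flat, loss_perm))
```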

#### Generation

The `generate` method allows generating new sequences of tokens based on an initial context.

- `idx`: Initial input indices representing the current context. It is a tensor of shape `(B, T)` where `B` is the batch size and `T` is the sequence length.
- `max_new_tokens`: The maximum number of new tokens to generate.

The generation process follows these steps:

1. The predictions are obtained by passing the current context through the model.
2. Only the logits of the last time step are considered.
3. Softmax is applied to obtain probability distributions over the vocabulary.
4. Tokens are sampled from the probability distributions using the `torch.multinomial` function.
5. The sampled indices are appended to the running sequence.
6. Steps 1-5 are repeated for the desired number of new tokens.
7. The generated sequence of indices is returned.

-----

<mark><b>Note:</b></mark> As you can see in the `generate()` function, even though the bigram model only uses the last token to make its prediction, we keep feeding it the whole history. We do that to keep the code "stable" so it can be "reused" when defining the following models, which are going to use part of that history in the generation process.

-----
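
The note above can be verified with a small sketch. Here a plain `nn.Embedding` stands in for the model's `token_embedding_table` (an assumption to keep the example self-contained); the logits for the next token are the same whether we pass the full history or only the last token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)  # stands in for token_embedding_table

idx = torch.randint(0, vocab_size, (1, 8))    # a context of 8 tokens

# Logits for the next token using the full history ...
logits_full = table(idx)[:, -1, :]
# ... and using only the last token.
logits_last = table(idx[:, -1:])[:, -1, :]

print(torch.equal(logits_full, logits_last))
```

This holds only for the bigram model. Once attention is introduced, earlier tokens influence the prediction and the history can no longer be dropped.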

In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        # In this case we have a vocabulary size of 65 and an embedding hidden dimension of also 65
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) C is the hidden dimension of the embedding so (B,T,C) = (4,8,65)

        # Introduce a condition to allow forward to be used both in training (i.e., with targets) and inference
        if targets is None:
            loss = None
        else:
            # In here we are transforming the shape of the logits tensors so they 
            # match what the cross-entropy loss function expects, which is basically (minibatch, C)
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # (B*T, C) = (4*8, 65) = (32, 65)
            targets = targets.view(B*T) # (B*T,) = (4*8,) = (32,)
            loss = F.cross_entropy(logits, targets) 

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [13]:
model = BigramLanguageModel(vocab_size)
logits, loss = model(xb, yb)
print(logits.shape)
print(loss)

# Initial context for generation: a tensor of shape (1, 1) holding index 0.
# Index 0 is the newline character (the first entry in `chars`), so it works as a minimal starting context.
idx = torch.zeros((1, 1), dtype=torch.long)

# Generate 100 tokens from this starting context
print(decode(model.generate(idx, max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


#### Training process

Obviously the results are not good because we have not trained our model. Its parameters are completely random. So, our next step is to train this model.

In [14]:
# create a PyTorch optimizer (AdamW, an Adam variant; usually a good default compared to plain SGD)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [15]:
batch_size = 32
for steps in range(100): # increase number of steps for good results (e.g., 10000)...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


4.587916374206543


In [16]:
print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


xiKi-RJ:CgqVuUa!U?qMH.uk!sCuMXvv!CJFfx;LgRyJknOEti.?I&-gPlLyulId?XlaInQ'q,lT$
3Q&sGlvHQ?mqSq-eON
x?SP fUAfCAuCX:bOlgiRQWN:Mphaw
tRLKuYXEaAXxrcq-gCUzeh3w!AcyaylgYWjmJM?Uzw:inaY,:C&OECW:vmGGJAn3onAuMgia!ms$Vb q-gCOcPcUhOnxJGUGSPJWT:.?ujmJFoiNL&A'DxY,prZ?qdT;hoo'dHooXXlxf'WkHK&u3Q?rqUi.kz;?Yx?C&u3Qbfzxlyh'Vl:zyxjKXgC?
lv'QKFiBeviNxO'm!Upm$srm&TqViqiBD3HBP!juEOpmZJyF$Fwfy!PlvWPFC
&WDdP!Ko,px
x
tREOE;AJ.BeXkylOVD3KHp$e?nD,.SFbWWI'ubcL!q-tU;aXmJ&uGXHxJXI&Z!gHRpajj;l.
pTErIBjx;JKIgoCnLGXrJSP!AU-AcbczR?


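This is clearly not Shakespeare, but it is somewhat better than what we had at the beginning: after only 100 steps the loss has dropped from 4.88 to 4.59. As the comment in the training loop suggests, more steps are needed for good results.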

#### Move code to a script

A working script of the bigram model can be found in `./bigram.py`. The code is practically identical, but it includes the option to run on a GPU for faster execution. This means that we need to move the model and the data to the GPU using the `.to(device)` method.
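
As a minimal sketch of what moving things to the GPU looks like (this is not the contents of `./bigram.py`; the embedding layer and the random batch are stand-ins for the real model and for `get_batch`):

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = nn.Embedding(65, 65).to(device)        # stand-in for the bigram model
xb = torch.randint(0, 65, (4, 8)).to(device)   # stand-in for a batch of inputs

logits = model(xb)  # works because model and data live on the same device
print(logits.device, logits.shape)
```

Tensors created inside the training or generation code (for example the initial context) need to be placed on the same device as well, otherwise PyTorch raises a device mismatch error.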

# 6 - Self-attention

## 6.1 - Understanding the matrix multiplication trick

Before fully explaining self-attention, it is useful to understand a small trick that lies at the heart of an efficient implementation of self-attention.

What we want is to prevent the token in the 1st location from "talking" to those in the 2nd, 3rd, 4th, etc. The same goes for the 2nd token: we don't want it to "talk" to those in the 3rd, 4th, 5th, etc. This is important for decoder-only Transformers, so that information only flows from previous context to the current timestep and the model is not able to "see the future" (because the future is what we are going to predict).

**The easiest way for tokens to communicate is to just average the current token with all the preceding ones**. For example, if I am the 5th token, I would take the channels from the 5th (myself), 4th, 3rd, 2nd and 1st steps to generate an average vector that summarizes me in the context of my history. Of course, taking a sum or an average is an extremely weak form of communication and **we lose a lot of information about the spatial arrangements of all those tokens**, but that is okay for now. We can later see how to bring that information back.

In [17]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

### 6.1.1 - For loop version

In [18]:
# Version 1: using a for loop
# We want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0) # average over the time dimension


Let's show the differences between the original tensor `x` and the summary `xbow`. Each output has two columns because there are 2 channels, and each row corresponds to one time step.

In [19]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [20]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

As we can see, the first token has the same values because there are no preceding tokens, but the next ones average information from the "past". For example, the second token has:
* ((0.1808) + (-0.3596))/2 = -0.0894
* ((-0.0700) + (-0.9152))/2 = -0.4926
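
As a quick cross-check of this running average, the same values can be computed with a cumulative sum divided by the number of tokens seen so far. This is only a sanity check under the same seed and shapes as above; it is not the technique used in the rest of the notebook.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Reference: the explicit for loop.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# Running sum along time, divided by the number of tokens seen so far.
counts = torch.arange(1, T + 1).view(1, T, 1)  # 1, 2, ..., T
xbow_cumsum = torch.cumsum(x, dim=1) / counts

print(torch.allclose(xbow, xbow_cumsum, atol=1e-6))
```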

### 6.1.2 - Matrix multiplication version

This is what we want, but **doing it with a for loop is very inefficient**. [To improve it, we can use **matrix multiplication** using the lower triangular matrix via `torch.tril`](https://www.youtube.com/watch?v=kCc8FmEb1nY). Let's show a toy example to understand it:

In [21]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3)) # Create a lower triangular matrix of ones
a = a / torch.sum(a, 1, keepdim=True) # Divide each row by its sum so the product computes an average instead of a sum
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


As you can see, by manipulating the values of the triangular matrix `a` and multiplying it by a specific matrix `b` we can do these averages that are required in self-attention.

This trick is very convenient. Now, **let's apply it to our self-attention method**.

* First, we create the triangular matrix, which we call `wei` for "weights". It is going to correspond to our matrix `a` from the previous example. 
* Then, our `b` matrix is going to be `x`.
* Finally, our `c` matrix is going to be `xbow2`.

In [22]:
# Version 2: using matrix multiply for a weighted aggregation

# First, we create the triangular matrix, which we call "wei" for "weights"
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

-----

Notice that the weights in each row always sum to 1 (i.e., they are normalized).

In addition, note that in the context of causal self-attention, which is commonly used in decoder transformer models (e.g., GPT), tokens only attend to the 'past'. In contrast, an encoder transformer model does not have this restriction (e.g., BERT). **In that case, we would not need to apply the masking.**

-----
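
To illustrate the difference between the causal and the unmasked case, here is a sketch that uses uniform weights in both (an assumption for illustration; real encoders learn their weights). Without the mask, every position ends up with the same sequence-wide average.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Causal (decoder-style): each row averages only the current and past tokens.
causal = torch.tril(torch.ones(T, T))
causal = causal / causal.sum(1, keepdim=True)

# Unmasked (encoder-style): every row averages over all T tokens.
full = torch.ones(T, T) / T

out_causal = causal @ x
out_full = full @ x

# Without the mask, every position receives the same sequence-wide mean.
print(torch.allclose(out_full[:, 0], x.mean(dim=1), atol=1e-6))
# Only the last position of the causal version has seen the full sequence.
print(torch.allclose(out_causal[:, -1], out_full[:, -1], atol=1e-6))
```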

In [23]:
xbow2 = wei @ x # (T, T) @ (B, T, C) --> wei is broadcast to (B, T, T), giving (B, T, C)
torch.allclose(xbow, xbow2) # checks whether two tensors are element-wise equal within a specified tolerance.

True

In [24]:
xbow2[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

### 6.1.3 - Version 3

This third version produces the same result as the previous one; the only difference is how the weights are computed and normalized:
* Version 2 builds the lower triangular matrix of ones and divides each row by its sum.
* Version 3 starts from a matrix of zeros, fills the "future" positions with `-inf`, and applies softmax to each row.

Both approaches apply the matrix multiplication trick that we saw in the previous example.

The resulting `wei` matrix is identical to the one from Version 2. The choice of softmax stems from its applicability in a learning context: the zeros can later be replaced by learned, data-dependent scores, so that the `wei` values more accurately reflect the attention relationship between tokens. **So, it is a first step towards a "real" self-attention.**

In [25]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [26]:

wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Notice that since we apply softmax, the weights in each row always sum to 1 (i.e., they are normalized).

In [27]:
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

### 6.1.4 - Summary

In summary, the concept discussed in this section is the ability to perform weighted aggregations of past elements using matrix multiplication in a lower triangular fashion. The lower triangular part of the matrix indicates the contribution of each element to the corresponding position in the aggregation.
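
To see why the softmax form is the useful one, here is a sketch where the zeros are replaced by arbitrary scores (random numbers standing in for learned affinities, which is an assumption for illustration). The mask still blocks the future and each row still sums to 1, but the weights are no longer uniform.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)                             # stand-in for learned affinities
masked = scores.masked_fill(tril == 0, float('-inf'))  # hide the future
wei = F.softmax(masked, dim=-1)

print(wei)
print(wei.sum(dim=-1))                          # every row still sums to 1
print(torch.triu(wei, diagonal=1).abs().sum())  # total weight on future tokens: 0
```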

**Now, we are going to use what we have learned so far to develop the self-attention block.**

## 6.2 - Implementing the self-attention block

The previous code performs a simple weight-averaging of past tokens and the current token, resulting in a blending of previous and current information through an average calculation.

To achieve this, the code creates a lower triangular structure. Initializing the affinities between tokens as zero yields a structure where each row contains uniform values. This structure facilitates the matrix multiplication operation, enabling a straightforward averaging process.

However, **it is not desirable for all affinities to be uniform**, as different tokens may find varying levels of interest in other tokens. **We want the affinity to be data-dependent**. For instance, if a token is a vowel, it may have a preference for consonants in its past context. We seek to capture this data-dependent relationship, allowing information flow based on individual token characteristics. This is where self-attention comes into play.

Self-attention addresses this issue by introducing the concept of **queries** and **keys**. Each token emits a query vector, indicating what it is looking for, and a key vector, representing its own content. 

**The affinity between tokens in a sequence is then established through a dot product operation between the keys and queries**. This dot product serves as the **affinity weights**, referred to as `wei` in the previous examples.

So, let's implement this now. We are going to implement a **single head of self-attention**.

In [28]:
# version 4: self-attention!

# We assume we already have the batch in embedding form
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16 # hidden dimension of the linear functions that represent the self-attention "head"
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)

Up to this point, no communication has happened, all tokens have simply generated a *key* and a *query*. 

So basically what we want now is for `wei` (i.e., the affinities between these tokens) to be the dot product of the *queries* and the *keys*. But we cannot multiply `q` and `k` directly, since both have shape `(B, T, 16)`; we first have to transpose the last two dimensions of `k`:

In [29]:
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

In [30]:
wei[0] 

tensor([[-1.7629, -1.3011,  0.5652,  2.1616, -1.0674,  1.9632,  1.0765, -0.4530],
        [-3.3334, -1.6556,  0.1040,  3.3782, -2.1825,  1.0415, -0.0557,  0.2927],
        [-1.0226, -1.2606,  0.0762, -0.3813, -0.9843, -1.4303,  0.0749, -0.9547],
        [ 0.7836, -0.8014, -0.3368, -0.8496, -0.5602, -1.1701, -1.2927, -1.0260],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,  0.8638,  0.3719,  0.9258],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,  1.4187,  1.2196],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,  0.8048],
        [-1.8044, -0.4126, -0.8306,  0.5898, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)

So, **in this case `wei` is not full of zeroes**, its values come from this dot product between the keys and the queries.

In [31]:
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

As you can observe, in this scenario, **the attention weights are no longer a uniform average: they depend on the data and can be learned by adjusting the parameters of the *query* and *key* layers** (the *value* layer determines what information each token passes on once it is attended to).

We can see which tokens are most relevant for each token by looking at the attention weights (the higher the weight, the more attention is paid). For example, the fourth token pays the most attention to the first token (see the fourth row).

Notice that since we apply softmax, the attention weights of each token always sum to 1 (i.e., they are normalized).

In [48]:
v = value(x)
out = wei @ v # hidden dim = self-attention head dimension
# out = wei @ x # hidden dim = embedding dimension

out.shape

torch.Size([4, 8, 16])

## 6.3 - Notes

* Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information, via a weighted sum with data-dependent weights, from all the nodes that point to them.
  
* Self-attention operates on a set of vectors without considering their spatial arrangement, so positional encoding is necessary to provide spatial information to the model. This is different from, for example, convolution, which has an intrinsic notion of space.
  
* Each example in a batch is processed independently, with no interaction between examples. In this case of `(4,8,16)`, it is as if we simultaneously compute attention for `4` independent blocks, where each block contains `8` tokens.
  
* In encoder attention blocks, there is no masking, allowing all tokens to communicate. Decoder attention blocks have triangular masking and are typically used in autoregressive settings like language modeling.
  
* **"Self-attention"** just means that the keys and values are produced from the same source as queries. In **"cross-attention"**, the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
  
* **Attention is often scaled to help stabilize the gradients during training**. Scaled attention multiplies the attention scores (the pre-softmax `wei`) by `1/sqrt(head_size)` (i.e., $\frac{1}{\sqrt{d_{k}}}$). This scaling ensures that when the inputs Q and K have unit variance, the scores also have roughly unit variance instead of a variance that grows with `head_size`. The scaling is beneficial for three main reasons:
    * **Prevents large dot products** between query and key vectors, which can cause unstable gradients during training
    * **Controls the softmax output**. When the dot products are large, the resulting softmax output can be very sharp, with most weights concentrated on a few elements.
    * **Preserves the importance of information**. Scaling ensures that the attention weights are not biased by the magnitude of the vectors.


$$
\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V
$$
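
The effect of the scaling described in the notes above can be checked numerically. The following is a minimal sketch, assuming unit-variance random tensors as stand-ins for the outputs of the query and key layers. The batch and context sizes are larger than in the cells above purely to make the variance estimates stable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Larger B and T than in the notebook, chosen only for stable statistics
B, T, head_size = 64, 32, 16

# Unit-variance stand-ins for query(x) and key(x) (illustrative assumption)
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)

raw = q @ k.transpose(-2, -1)      # (B, T, T), variance close to head_size
scaled = raw * head_size ** -0.5   # (B, T, T), variance close to 1

print("variance without scaling:", raw.var().item())
print("variance with scaling:   ", scaled.var().item())

# Average size of the largest softmax weight per row: larger scores
# push the softmax toward a one-hot distribution
peak_raw = F.softmax(raw, dim=-1).max(dim=-1).values.mean()
peak_scaled = F.softmax(scaled, dim=-1).max(dim=-1).values.mean()
print("mean peak weight without scaling:", peak_raw.item())
print("mean peak weight with scaling:   ", peak_scaled.item())
```

Without scaling, the peak weight is noticeably larger, meaning each token would attend almost exclusively to a single other token at initialization.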

# 7 - Positional encodings

While self-attention is a powerful mechanism for capturing **relationships between tokens**, positional embeddings are essential for providing the model with information about the **order and structure of the input sequence**. 

Combining self-attention with positional embeddings is crucial in Transformer-based models for Natural Language Processing (NLP) and other sequence-related tasks for several reasons:

1. **Encoding Sequential Information**: Self-attention mechanisms are powerful for capturing relationships between words or tokens in a sequence, but on their own they are insensitive to order (strictly speaking, permutation-equivariant): shuffling the input tokens merely shuffles the outputs in the same way. Positional embeddings provide a way to encode the sequential order of words in the input sequence. Without positional embeddings, the model would have no inherent understanding of word order.

2. **Handling Variable-Length Sequences:** In NLP, sequences can vary in length. Positional embeddings are indexed by position rather than by sequence, so the same embeddings serve sequences of different lengths while still conveying their structure. Note that a learned position table, like the one used in this notebook, caps the input length at the number of positions it contains (`block_size`).

3. **Resolving Ambiguity:** For tasks like machine translation or text generation, the same words can have very different meanings based on their position in a sentence. Positional embeddings help disambiguate such cases by providing context about where a word occurs in a sentence.

4. **Capturing Dependencies Across Positions:** Some relationships between words depend on their absolute or relative positions. For instance, in English, the subject of a sentence often appears near the beginning, and the object appears near the end. Positional embeddings help the model capture these long-range dependencies.

5. **Flexibility:** Positional embeddings are not tied to any specific language or task. They can be easily adapted to different languages or types of sequences, making Transformer models versatile.
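
To illustrate the first point, the sketch below runs a single unmasked attention head on token embeddings with no positional information and shows that reordering the input tokens only reorders the output rows. The head itself (`attend`, with freshly initialized `key`, `query`, and `value` layers) is a hypothetical stand-in, not the model used elsewhere in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, C, head_size = 8, 32, 16

# Hypothetical single head, without masking or positional information
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

def attend(x):
    # x: (T, C) -> (T, head_size)
    q, k, v = query(x), key(x), value(x)
    wei = F.softmax(q @ k.T * head_size ** -0.5, dim=-1)  # (T, T)
    return wei @ v

x = torch.randn(T, C)      # token embeddings only, no positions added
perm = torch.randperm(T)   # a random reordering of the tokens

out = attend(x)
out_shuffled = attend(x[perm])

# Shuffling the input merely shuffles the output rows in the same way
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))
```

In other words, each token ends up with the same representation wherever it sits in the sequence. Adding a position-dependent vector to each token embedding breaks this symmetry.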

I have updated the bigram language model with positional embeddings and a linear output layer (which, to be honest, should already have been present):

In [None]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, block_size, embed_dim):
        super().__init__()
        # each token index is mapped to an embedding vector via a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, embed_dim)
        # each position in the context (0..block_size-1) gets its own embedding vector
        self.position_embedding_table = nn.Embedding(block_size, embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # idx and targets are both (B, T) tensors of integers
        tok_embed = self.token_embedding_table(idx) # (B,T,C)
        # build the position indices on the same device as the input
        pos_embed = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, C)

        x = tok_embed + pos_embed # (B,T,C)

        logits = self.lm_head(x) # (B,T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
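            # crop the context to the most recent tokens, since the position
            # embedding table only has entries for block_size positions
            idx_cond = idx[:, -self.position_embedding_table.num_embeddings:]
            # get the predictions
            logits, loss = self(idx_cond)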
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In case we want to take a more detailed look at positional embeddings:

In [42]:
batch_size = 4 # B
block_size = 8 # T
embed_dim = 32 # C
device = "cpu"

token_embedding_table = nn.Embedding(vocab_size, embed_dim)
position_embedding_table = nn.Embedding(block_size, embed_dim)

In [43]:
# the upper bound of torch.randint is exclusive, so this samples from 0..vocab_size-1
idx = torch.randint(0, vocab_size, size=(batch_size, block_size), dtype=torch.int32)
tok_emb = token_embedding_table(idx)
pos_emb = position_embedding_table(torch.arange(block_size, device=device))

In [46]:
tok_emb.shape

torch.Size([4, 8, 32])
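
The sum of the two embeddings relies on broadcasting: the `(T, C)` positional embeddings are added to every sequence in the `(B, T, C)` batch. Here is a minimal, self-contained sketch of that step. The value `vocab_size = 65` is an assumption for illustration, and the other sizes mirror the cell above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65  # assumed value, for illustration only
batch_size, block_size, embed_dim = 4, 8, 32  # B, T, C

token_embedding_table = nn.Embedding(vocab_size, embed_dim)
position_embedding_table = nn.Embedding(block_size, embed_dim)

idx = torch.randint(0, vocab_size, size=(batch_size, block_size))
tok_emb = token_embedding_table(idx)                          # (B, T, C)
pos_emb = position_embedding_table(torch.arange(block_size))  # (T, C)

# (T, C) is broadcast across the batch dimension to (B, T, C)
x = tok_emb + pos_emb

print(tok_emb.shape, pos_emb.shape, x.shape)

# The same positional vector is added at position t in every sequence
same_offset = torch.allclose(x[0] - tok_emb[0], x[1] - tok_emb[1], atol=1e-6)
print(same_offset)
```

The positional contribution depends only on where a token sits in the context, not on which token it is or which sequence it belongs to.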

# Multi-head attention

# Regularization