# Building GPT

> Recreating [nanoGPT](https://github.com/karpathy/nanoGPT) from Andrej Karpathy

<font color="purple">We'll train a character-level GPT on the works of Shakespeare.</font>

# 1 - Shakespeare Dataset

<hr>

This section deals with preparing the dataset that will be used for training our gpt model. The dataset is the "tiny shakespeare dataset," a relatively small file of about 1MB, containing approximately ~ 1.1M characters from various works of William Shakespeare.

In [1]:
import requests

url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
response = requests.get(url)
if response.status_code == 200:
    with open('data/tinyshakespeare.txt', 'wb') as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print("Failed to download the file. Status code:", response.status_code)

File downloaded successfully.


In [2]:
with open('data/tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
print("length of dataset in characters: ", len(text))
print('-'*50)
print(text[:500]) # let's look at the first 500 characters

length of dataset in characters:  1115394
--------------------------------------------------
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


### 1.1 - Unique Characters

To preprocess this text for a machine learning model, it's crucial to understand the unique characters present in the dataset because these characters will form the vocabulary that our model will learn to generate text.

- `set(text)` converts the text into a set, thereby removing any duplicate characters
- `list(set)` then converts this set into a list, which doesn't have a specific order
- `sorted(list)` sorts this list in a standard alphabetical order, making it easier to index each character
- `vocab_size` refers to the total count of these unique characters, which is crucial for defining the input layer of our model

In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars)) # note the space character at the start
print("Vocabulary size:", vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65


## 2 - Tokenization

<hr>

Tokenization is the process of converting text into a format that can be understood by machine learning models. This involves breaking down text into "tokens," which can represent individual characters, sub-words, or entire words.


### 2.1 - Character-level vs. Sub-word Tokenization

- **Character-level Encoding:** This is a simple form of tokenization where each unique character in the text is assigned a unique integer. This method is particularly useful for languages with a small set of characters (like English) or for applications like character-level language models where the nuances of individual characters are important.
    - For the Shakespeare dataset, a character-level tokenizer would encode each of the ~65 unique characters into a unique integer.

- **Sub-word Tokenization:** This method breaks text into pieces that are larger than individual characters but smaller than entire words. It strikes a balance by reducing the vocabulary size compared to character-level encoding without losing the granularity of meaning that word-level encoding might miss.
    - Tokenizers like OpenAI's `tiktoken` and Google's `SentencePiece` are examples of sub-word tokenization systems. They analyze the corpus to find the optimal way to break down words into sub-word units.
    - This approach is beneficial when dealing with large vocabularies or languages with complex morphology.

For instance, the phrase "hi there" (8 characters) could be encoded into:
- 8 integers (character-level encoding)
- 2 integers (word-level encoding)
- 3 integers (sub-word encoding)

<br>

**Example of Tokenization with tiktoken**

```python
import tiktoken
enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
# Output: 50257 (the size of the vocabulary)
print(enc.encode("hi there"))
# Output: [71, 4178, 612] (sub-word level encoding)
print(enc.decode([71, 4178, 612]))
# Output: 'hi there' (decoding back to text)
```

- This example shows how `tiktoken` with a GPT-2 encoding model breaks down "hi there" into three integers, each representing a sub-word unit. This is compared to what would have been eight integers for character-level encoding or two for word-level encoding.

- The `n_vocab` value of 50257 indicates the size of the vocabulary that `tiktoken` uses, which is significantly larger than the vocabulary size for character-level encoding in our Shakespeare dataset example. This large vocabulary allows for encoding a vast array of words and sub-words into a relatively small number of integers, making the model more efficient in processing and generating text.


In summary, you can have:
- A large sequence of integers with a very small vocabulary
- A short sequence of integers with a very large vocabulary

In [4]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

encode = lambda s: [stoi[c] for c in s]          # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### 2.2 - Encoding Data into a Torch Tensor

Now that we've our character-level tokenizer, encode the entire Shakespeare dataset into a `torch.Tensor`.

In [5]:
import torch

In [6]:
data = torch.tensor(encode(text), dtype=torch.long)

In [7]:
print(data.shape, data.dtype)
print('-'*50)
print(data[:500]) # the 500 characters we looked at earier will look like this to the GPT

torch.Size([1115394]) torch.int64
--------------------------------------------------
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,

## 3 - Train/Val Split & Data Loader

<hr>

This section focuses on preparing the dataset for training by splitting it into training and validation sets and then setting up a data loader to feed data into the model efficiently.

**Train/Val Split:**

The dataset is divided into two parts: 90% for training and 10% for validation. This split allows the model to learn from the majority of the data while also having a separate set to evaluate its performance on unseen data. Here's how the split is implemented:

In [8]:
n = int(0.9 * len(data))  # Calculate the splitting point
train_data = data[:n]     # Assign the first 90% of data to training
val_data = data[n:]       # Assign the remaining 10% to validation

### 3.1 - Data Loader

Due to computational constraints, data is not fed into the model all at once. Instead, it's broken into manageable chunks that can be processed efficiently.


#### 3.1.1 - Block Size / Context Length

- The `block_size` or `context_length` sets the maximum length for these data chunks, influencing how much context the model considers when making predictions.
- Each chunk results in several input-output pairs, where each input (except the last one) is paired with the next character/token as the target output. For a block size of 9, there would be 8 such pairs.


#### 3.1.2 - Dimensions in Training

- **Time Dimension:** This represents the sequence of tokens in the text, showing the progression over time.
- **Batch Dimension:** Data is organized in batches to allow parallel processing, enhancing efficiency.

**Example of Creating Training Pairs:**

Given a tensor representing a chunk of text:  

```python
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])
```

This tensor can generate training pairs where each context is paired with the next token as the target, simulating real-time prediction as the model processes the text.

- in the context of [18], 47 comes next.
- in the context of [18, 47], 56 comes next.
- in the context of [18, 47, 56], 57 comes next, and so on.

**Why do we do this?**

- This approach ensures that the transformer is exposed to contexts ranging from very small (a single integer) to the full length of the `block_size`, allowing it to learn and understand text in varying lengths effectively.
- If the input text exceeds the set `block_size`, the transformer model truncates the excess, focusing only on the text within the defined limit. This process ensures computational efficiency and relevance in training.

In [9]:
block_size = 8

In [10]:
x = train_data[:block_size]      # inputs to the transformer: first block_size characters
y = train_data[1:block_size+1]   # targets for each input position: off-set by 1

for t in range(block_size):
    context = x[:t+1]            # all chars up to and including t
    target = y[t]                # t-th char in the y array
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


### 3.2 - Data Batches

The data loader creates batches of data for training or validation. The batch size and block size are key parameters:

- `batch_size`: Determines how many sequences are processed in parallel.
- `block_size`: Limits the maximum context length the model will use for predictions.

In [11]:
torch.manual_seed(1337)
batch_size = 4            # how many independent sequences will we process in parallel?
block_size = 8            # what is the maximum context length for predictions?

In [12]:
def get_batch(split):
    """This function generates batches by randomly selecting starting points in the data, 
    then extracting sequences of length block_size for inputs and their corresponding next 
    tokens as targets."""
    
    # Choose the dataset based on the split
    data = train_data if split == 'train' else val_data
    
    # Randomly select starting points for each sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # Extract sequences of length block_size for inputs (x)
    x = torch.stack([data[i:i+block_size] for i in ix])
    
    # Extract the next token for each input as targets (y)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

In [13]:
xb, yb = get_batch('train')

Here, `xb` is a single batch of 4x8 or 32 independent examples sampled from the training dataset and `yb` contains the corresponding target labels (for loss computations later on).

In [14]:
print('inputs:')
print(xb.shape)
print(xb)
print('-'*50)

print('targets:')
print(yb.shape)
print(yb)
print('-'*50)

for b in range(batch_size):     # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
--------------------------------------------------
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
--------------------------------------------------
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44,

## 4 - Bigram Language Model

<hr>

The bigram language model is the simplest form of neural network for language modeling. It predicts the next token in a sequence based solely on the current token, without considering any broader context. This model is a foundational concept in language modeling, demonstrating the basic principle of predicting subsequent elements in a sequence.

**Core Components:**

- Vocabulary Size (`vocab_size`): It defines the size of the model's input and output layers. Each unique token in our dataset contributes to the total `vocab_size`.
- Embedding Table (`self.token_embedding_table`): Maps each token to a vector of logits representing the probabilities of subsequent tokens. This table has dimensions `[vocab_size, vocab_size]`, enabling each token to have a distinct probability distribution over the next token.

<br>
<div style="align:center">
    <img src="images/embedding_matrix.png" width=400>
    <center><caption><font color="purple"><strong><u>Figure 1:</u></strong> Example of a possible embedding matrix</font></caption></center>
</div>

In [15]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x1f95aa2ced0>

In this model, `idx` and `targets` are tensors of integers representing sequences of tokens. The model directly retrieves a vector of `logits` for the next token based on the current token's index, without accounting for any wider context.

In [16]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Initialize an embedding table with dimensions [vocab_size, vocab_size]
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets):
        # idx: Input tokens
        # targets: Expected subsequent tokens
        # Returns logits with shape (B, T, C) indicating token predictions
        logits = self.token_embedding_table(idx)
        return logits

In [17]:
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape) # (B,T,C) = (4,8,65)

torch.Size([4, 8, 65])


### 4.1 - Loss Calculation

To evaluate the model's predictions, we utilize the negative log likelihood, effectively implemented as `cross-entropy` loss in PyTorch. This metric compares the predicted probabilities (logits) with the actual subsequent tokens (targets) to quantify the model's performance.

So we'd write something like this:

```python
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets):
        logits = self.token_embedding_table(idx) # (B,T,C)
        loss = F.cross_entropy(logits, targets)
        return logits
```

However, this will give us an error because torch expects the channels dimension to be the second dimension, i.e., instead of (B, T, C), we want (B, C, T).

<br>

**Adjusting Tensor Dimensions:**

PyTorch expects the logits tensor to be in the shape of (B, C, T) for computing cross-entropy loss. Since our model outputs logits in the shape (B, T, C), we need to reshape them along with the targets tensor to align with PyTorch's requirements.

This adjustment flattens the batch and time dimensions, ensuring each prediction is paired with its corresponding target in a one-dimensional array, i.e., (B, T, C) into (B\*T, C).

In [18]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets):
        logits = self.token_embedding_table(idx) # (B,T,C)
        
        B, T, C = logits.shape
        logits = logits.view(B*T, C) # 2D array
        targets = targets.view(B*T)  # 2D array
        loss = F.cross_entropy(logits, targets)

        return logits, loss

In [19]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape) # (B*T, C) = (4*8, 65) = (32, 65)
print(loss)

torch.Size([32, 65])
tensor(4.6630, grad_fn=<NllLossBackward0>)


We got a loss of 4.6630. Notice that the model is randomly predicting the next token based off of the current token, and since we've 65 tokens in total, the expected loss would be $-ln\left(\frac{1}{65}\right) \approx 4.1217$ but we're getting 4.49 which is telling us that the initial predictions are not super diffused - they've got a little bit of entropy - and so we're guessing wrong.

### 4.2 - Token Generation

The process of token generation in a bigram language model is intriguing because it allows the model to produce new text sequences based on a given context. This model, despite its simplicity, can generate sequences one token at a time by leveraging the learned distribution over the vocabulary.

**Prediction Process:** In each iteration of token generation, the model:
- Predicts logits based on the current sequence.
- Focuses on the logits corresponding to the last predicted token.
- Applies softmax to these logits to obtain a probability distribution.
- Samples a new token from this distribution.
- Concatenates the newly predicted token to the existing sequence.

This process iterates `max_new_tokens` times, progressively building a longer text sequence.

In [20]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        """
        targets has to be None by default because in the generate function below, 
        we call the forward method without targets `self(idx)` 
        """

        logits = self.token_embedding_table(idx) # (B,T,C)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        
        for _ in range(max_new_tokens):
            # Predict the next token's logits given the current sequence
            logits, _ = self(idx)  # No need for targets during generation
            logits = logits[:, -1, :]  # Focus on the last predicted token's logits
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities: (B, C)
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample a new token: (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # Append the new token to the sequence: (B, T+1)
        return idx

**Generating New Tokens:**

To initiate token generation, start with a minimal context, typically a single token that the model can "understand" as a starting point. For example, using a newline character (often encoded as 0 in many datasets) is a common choice for initiating text generation as it's akin to starting a new sentence or line.

```python
idx = torch.zeros((1, 1), dtype=torch.long)  # Starting token
```

Following this initialization, we can call the `generate` function to extend this sequence by a specified number of tokens `(max_new_tokens)`, in this case, 100 tokens.


**Converting Tokens to Text:**

After generating a sequence of tokens, the final step involves converting these numeric tokens back into human-readable text. This requires a decoding function that maps each token ID back to its corresponding character or word in the vocabulary.

```python
generated_sequence = m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)

# unplug single batch dimension to retrieve time steps as python list
decoded_text = decode(generated_sequence[0].tolist())  
print(decoded_text)
```

In [21]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

print(logits.shape)
print(loss)
print('-'*50)
print("Here's some generated text:\n")
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8430, grad_fn=<NllLossBackward0>)
--------------------------------------------------
Here's some generated text:


eKugsuRNC!T3b,jqDNMhsHAJSOWYvkZlA'wjtw3IzUltSG:rX;UOIp:RQ!:KU
eRyE-
QZtjcOaCx qUOM.pq?kTTtjACpKJ.EHB


Initially, the model's predictions are based on a randomly initialized embedding table, leading to nonsensical text. The goal of training the model is to refine these embeddings so that the model can learn meaningful patterns and dependencies between tokens, thereby improving the quality of text generation.

**NOTE:**

- The bigram model, in its current form, uses all previous tokens to predict the next one but only leverages the immediately preceding token due to its design. This results in inefficient computation and is not ideal for a simple bigram approach.
- The decision to feed the entire sequence into the model, even though only the last token is used for prediction, is made with future scalability in mind. As our model evolves to consider more context, this approach will allow for seamless integration of broader contextual understanding without significant architectural changes.

### 4.3 - Model Training

Training the model involves adjusting the embeddings in the `token_embedding_table` to minimize prediction errors. This process helps the model learn the probability distribution of the dataset, enabling it to make more accurate predictions based on the preceding token.

In [22]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

batch_size = 32
for steps in range(10000):

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = m(xb, yb)
    
    # Zero-out all gradients from previous step
    optimizer.zero_grad(set_to_none=True)
    
    # Getting gradients from all parameters
    loss.backward()
    
    # Update parameters using gradients
    optimizer.step()

print(loss.item())

2.5270330905914307


In [23]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=400)[0].tolist()))


Shyecer, INoosider hengh frn, obe rewougRxime.
I.
ARe st'LBEx uin, thertr bl.
Cave OMESinod INROMBut.
CONINGeves, having, Bulouar t ma, s s meero t kevy ero wonoud Bu is h l,
Whaimed sounddes, myor l
LAndease Wht ppa thareshot,
WeLfthen?
Hur
Owism. tit eisee e Whentan myo stheerive fand a ts t.
Thooryothan
ATI'd il,

TI:

CAROnonghisil aun,
tou, y.


tthe; re, FOr hen fe,
BEO,'tore ssentonknthes; 


Although not Shakespearean, we observe an improvement in loss, raising hopes for more reasonable outcomes. This basic model operates on individual tokens without interaction. Moving forward, the aim is to enhance this simplicity by enabling tokens to communicate and consider context beyond just the preceding character. By creating an inter-token communication, the model can better predict subsequent elements, marking the transition towards implementing a transformer architecture.

## 5 - Self Attention

<hr>

Before diving into self-attention, here is a mathematical trick that's super useful in understanding the concept of attention.

Imagine a data structure of dimensions (B, T, C) where B, T, and C represent batches, time steps, and channels, respectively. In this scenario, each time step (or token) within a batch holds information but initially doesn't interact with others.

The goal is to enable these tokens to "communicate" with each other, specifically allowing a token to be influenced by tokens from previous time steps while ignoring future ones (preventing data leak from future). The simplest form of such communication could be averaging the information (channels) of all preceding tokens, including the current one, to create a feature vector that captures the essence of a token in the context of its history.

This averaging method, while straightforward, is a rudimentary way to enable interaction among tokens, as it might lose detailed spatial information. However, it sets the stage for more sophisticated mechanisms, like self-attention in transformers, which refine and enhance this concept of inter-token communication, ensuring that a token can effectively integrate and leverage past information to make predictions.

In [24]:
torch.manual_seed(1337)

# Assume x is our input matrix of shape (B, T, C)
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [25]:
# VERSION 1: here is an inefficient double for-loop implementation of the idea

xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0) # row-mean

In [26]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [27]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

Notice that the first row of `x[0]` and `xbow[0]` are the same because the average of a single row is the row itself. The second row of `xbow[0]` is the average of the first two rows of `x[0]`, and so on.

In [28]:
# VERSION 2: efficient implementation using matrix multiplication

# consider the following matrix multiplication
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('-'*10)
print('b=')
print(b)
print('-'*10)
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
----------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
----------
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


We want the $i^{\text{th}}$ row of matrix C to be the average of rows from 0 to $i$ in matrix B. Pytorch has a function called `tril` which returns a lower triangular matrix.

In [29]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print('a=')
print(a)
print('-'*10)
print('b=')
print(b)
print('-'*10)
print('c=')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
----------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
----------
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


Notice that the $i^{\text{th}}$ row in matrix C is the sum of rows 0 to $i$ in matrix B. Since we achieved the addition, now we can do the average.

In [30]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, dim=1, keepdim=True)   # sum across columns
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print('a=')
print(a)
print('-'*10)
print('b=')
print(b)
print('-'*10)
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
----------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
----------
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Here, the $i^{\text{th}}$ row of matrix C is the average of 0 to $i$ rows in matrix B.

In [31]:
# VERSION 2: using matrix multiply for a weighted aggregation

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2, atol=1e-6, rtol=1e-4)  # tolerance to resist floating point arithmetic

True

In [32]:
xbow2[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [33]:
# VERSION 3: using Softmax

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))

# masked fill replaces zeros with -inf in the wei matrix
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [34]:
# softmax normalizes wei across rows
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

This is the same exact matrix: the $i^{\text{th}}$ row is the average of rows from 0 to $i$. 

**Why is this the case?**

When we apply softmax, the 0 entries become $e^{0}=1$ and the $e^{-\infty} = 0$. Therefore, the software operation translates the matrix into a normalized lower-triangular matrix. Interesting!!

In [35]:
xbow3 = wei @ x
torch.allclose(xbow, xbow3, atol=1e-6, rtol=1e-4)

True

In [36]:
xbow3[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

<font color="purple">Why is the softmax method (verion 3) important?</font>

This method introduces a foundational concept for understanding self-attention mechanisms in neural networks. 
- It begins with assigning initial zero weights to token interactions, representing the strength or affinity between tokens. 
- The use of masking ensures that future tokens do not influence past tokens, maintaining the chronological integrity of the sequence. 
- The application of softmax normalizes these weights, allowing for a contextual aggregation that considers the relevance of each token up to the current point. 
- By multiplying the normalized weights with the data $(x)$, it computes affinities that reflect how tokens influence each other based on their positions. 

However, this method only averages these interactions; the goal is to evolve these interactions into data-dependent relationships. This is achieved through the use of **keys, queries, and values,** which enable the model to dynamically adjust affinities based on the content and context of the tokens, moving beyond simple averaging to a more nuanced, content-aware interaction mechanism.

### 5.1 - Key, Query, Value in Self-Attention

We don't want all the affinities to be uniform because some tokens will find others more or less interesting, so we want the affinities to be data-dependent. For example, if I'm a vowel, then maybe I'm looking for consonants in my past and maybe I want to know what those consonants are and I want that information to flow to me. So, I want to now gather information from my past, but I want to do it in a data-dependent way - which is exactly the problem that self-attention solves in the following manner.

Every node, or every single token, at each position will emit two vectors: the query $(Q)$ and the key $(K)$ vector. The query vector, roughly speaking, is what am I looking for; and the key vector, roughly speaking, is what do I contain. And then the way we get affinities between these tokens in a sequence is a dot product between the queries and keys. Intuitively, if a query matches a key, the product product will be high and thus the two tokens are more likely to have a higher affinity.

---

- **Key (K):** Each token generates a Key vector that represents what information it holds. This vector can be thought of as the token's "identity" in the context of the sequence.

- **Query (Q):** Each token also generates a Query vector that represents what information it is seeking from other tokens. This vector signifies the "question" or the type of information the token is looking to gather from its context.

- **Value (V):** The Value vector represents the actual information that a token can provide to others. This is what will be communicated if the token is deemed relevant by another token's query.

The affinity or relevance between two tokens is computed using the dot product of their Query and Key vectors. A higher dot product indicates a stronger relevance or affinity, suggesting that the information held by the tokens is closely related or important to each other.

---

In the provided code, a simple self-attention mechanism (single-head attention) is implemented, showcasing how Keys, Queries, and Values are generated from the input sequence $x$ using linear transformations.

- `head_size`: The head size in self-attention mechanisms refers to the dimensionality of the Key, Query, and Value vectors, determining the size of the subspace they occupy for computing affinities between tokens.

In [37]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B,T,C)

head_size = 16

key = nn.Linear(C, head_size, bias=False) # matrix mult with fixed weights, therefore no bias
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
v = value(x) # (B, T, 16)

# Compute affinities between tokens
# Transpose only the last two dimensions and not the batch dimension
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

# Mask future tokens to prevent information leakage
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))

# Normalize affinities to probabilities
wei = F.softmax(wei, dim=-1)

In [38]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In [39]:
wei.shape

torch.Size([4, 8, 8])

In [40]:
# Aggregate information based on computed affinities
out = wei @ v

In [41]:
out.shape

torch.Size([4, 8, 16])

So, you can think of $x$ as kind of like a private information to a token. If I'm a fifth token and I have some identity and my information is kept in vector $X$ and now for the purposes of attention, what i'm interested in is $(Q)$, what I have is $(K)$. If you find me interesting, then what I will communicate to you is $(V)$.

**NOTES:**

- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.

```python
wei = wei.masked_fill(tril == 0, float('-inf'))
# remove this line to enable future nodes to talk to past nodes as well
```

- **self-attention** just means that the keys and values are produced from the same source as queries. In **cross-attention**, the queries still get produced from $x$, but the keys and values come from some other, external source (e.g. an encoder module)
- **Scaled attention** additional divides `wei` by $\frac{1}{\sqrt{\text{head_size}}}$. This makes it so when input Q,K are unit variance, `wei` will be unit variance too and Softmax will stay diffused and not saturate too much.

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

The only thing we're missing is the division by $\sqrt{d_k}$ where $d_k$ is the head size. Why do we do this?

In [42]:
# naively multiplying keys and queries

k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1)

In [43]:
print(k.var())
print(q.var())
print(wei.var())

tensor(1.0449)
tensor(1.0700)
tensor(17.4690)


In [44]:
# scale by head_size

k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [45]:
print(k.var())
print(q.var())
print(wei.var())

tensor(0.9006)
tensor(1.0037)
tensor(0.9957)


Scaling the variance for `wei` in self-attention is crucial because it ensures the initial distribution of affinities is sufficiently diffused. Since `wei` is inputted into softmax, extremely high or low values would lead softmax to produce nearly one-hot vectors, overly concentrating on a single token and neglecting the rest. By controlling the variance, particularly at initialization, we prevent softmax from becoming too peaky, fostering a more balanced and effective aggregation of information across multiple tokens.

In [46]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [47]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

### 5.2 - Multi-Head Attention

Multi-head attention enhances single-head attention by running it in parallel across multiple "heads," each focusing on different aspects of the information, enabling a more comprehensive and nuanced understanding of the input sequence.

### 5.3 - Feed Forward Layer

The feed-forward layer in the Transformer architecture, positioned right after multi-head attention, acts as a fully connected neural network (MLP) that processes each token independently. It introduces additional computational depth, allowing each token to further analyze and integrate the information gathered from the attention mechanism, enhancing the model's ability to make more informed predictions.

> While computing logits for the tokens, we went too fast: the tokens looked at each other but didn't really have a lot of time to think on what they found from the other tokens; the MLP enables the tokens or nodes to think further on the data they've collected so far from the attention mechanism before moving forward.

### 5.4 - Layer Norm

Layer normalization, as opposed to batch normalization, normalizes the inputs across the features for each data point in a batch. It is designed for stabilizing and accelerating the training of deep neural networks. This normalization is done by subtracting the mean and dividing by the standard deviation of the features, then scaling and shifting the result with learnable parameters, gamma and beta. Layer normalization is particularly useful in sequence models like Transformers, where it is applied at each sub-layer of the model's architecture.

In [48]:
class LayerNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps  # A small value added for numerical stability during division
        self.gamma = torch.ones(dim)  # Learnable scale parameters
        self.beta = torch.zeros(dim)  # Learnable shift parameters
    
    def __call__(self, x):
        # Perform normalization for each input
        xmean = x.mean(1, keepdim=True) # Calculate the mean of each input
        xvar = x.var(1, keepdim=True) # Calculate the variance of each input
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # Normalize inputs to have zero mean and unit variance
        self.out = self.gamma * xhat + self.beta # Scale and shift normalized inputs
        return self.out
    
    def parameters(self):
        # Return the learnable parameters of the layer
        return [self.gamma, self.beta]

In [49]:
torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [50]:
x[:,0].mean(), x[:,0].std() # mean, std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [51]:
x[0,:].mean(), x[0,:].std() # mean, std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

## 6 - Full Code Review

<hr>

Using all of this knowledge, we implement below a "decoder-only" transformer model adapted from the "Attention Is All You Need" paper. 

**Key components:** 
- multi-head self-attention
- position embeddings
- feed-forward layers
- layer normalization

In [52]:
import torch
import torch.nn as nn
from torch.nn import functional as F

We want to move beyond the straightforward approach where token embeddings directly produce logits (raw preditions or scores for each vocabulary token). Instead, we want to introduce an intermediate layer or "level of abstraction" between the token embeddings and the logits.

- **Embedding Dimension (`n_embed`):** : Specifies the size of the embedding vectors for each token and position within the sequence. By increasing the number of dimensions, we provide a richer, more nuanced representation of each token's features and positional information, creating a more detailed "space" where relationships between tokens can be learned and exploited by the model
- **Number of Layers (`n_layer`):** Determines the depth of the model by specifying how many transformer blocks or layers are stacked. Each layer includes mechanisms like self-attention and feed-forward networks, allowing the model to process information at multiple levels of abstraction. More layers enable the model to capture more complex dependencies and relationships within the data, improving its ability to understand context and generate coherent text.

### 6.1 - Define Hyperparameters

In [53]:
batch_size = 64         # how many independent sequences will we process in parallel?
block_size = 256        # what is the maximum context length for predictions?
n_embd = 384            # size of the embedding dimension (features + positional encoding)
n_head = 6              # number of attention heads
n_layer = 6             # number of transformer layers
dropout = 0.2           # dropout rate

max_iters = 5000        # how many training iterations
eval_interval = 500     # how often to evaluate the model
learning_rate = 3e-4    # learning rate
eval_iters = 200        # how many iterations to average the loss over
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [54]:
torch.manual_seed(1337)

<torch._C.Generator at 0x1f95aa2ced0>

### 6.2 - Load and Preprocess Data

In [55]:
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('data/tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text))) # unique characters
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}      # char : int mapping
itos = {i: ch for i, ch in enumerate(chars)}      # int : char mapping
encode = lambda s: [stoi[c] for c in s]           # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string

### 6.3 - Split Data into Training and Validation Sets

In [56]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # 90-10 train/test split
train_data = data[:n]
val_data = data[n:]

### 6.4 - Data Loader Function

In [57]:
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

### 6.5 - Estimate Model Loss for Evaluation

The context manager `torch.no_grad()` is used to prevent the computation graph from storing the gradients of the operations inside of it. This is to tell pytorch that we will not call `loss.backward()` and hence it does not need to store the gradients of the operations inside.

In [58]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

### 6.6 - Self-Attention Head

`tril` is not a parameter of the module, so in pytorch naming convention, it is called a buffer. Buffers are added to the state dict of the module by `register_buffer`.

In [59]:
class Head(nn.Module):
    """One head of self-attention"""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x) # (B,T,C)
        
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5                       # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1)                                 # (B, T, T)
        wei = self.dropout(wei)
        # Perform the weighted aggregation of the values
        out = wei @ v                                                # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

### 6.7 - Multi-Head Self-Attention

- **Multiple Attention Heads:** The model initializes multiple `Head` instances, allowing it to capture various aspects of the input data in parallel. Each head focuses on different parts of the sequence, potentially learning to attend to unique features or patterns.

- **Projection Layer:** After processing the input through multiple heads, the outputs are concatenated. However, this concatenated output has a larger dimension `(num_heads * head_size)` than the input embedding dimension `(n_embd)`. The projection layer (`self.proj`) maps this higher-dimensional space back to the original embedding dimension. This step is crucial for maintaining a consistent dimensionality across the network, allowing the multi-head attention output to be seamlessly integrated into subsequent parts of the Transformer architecture. Additionally, this projection step can mix information from all heads, enabling the model to leverage the diverse perspectives captured by individual heads.

- **Use of Dropout:** The dropout layer applied after the projection helps prevent overfitting by randomly zeroing parts of the output, encouraging the model to learn more robust features that do not depend too heavily on specific paths through the network.

In [61]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd) # Projection layer to combine head outputs
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the outputs from all attention heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)  
        # Project the concatenated output back to the original embedding dimension
        out = self.dropout(self.proj(out))  
        return out

### 6.8 - Feed-Forward Network

In the original "Attention Is All You Need" paper, the feedforward network within each Transformer block expands the internal representation dimensionality by a factor of 4 before applying a non-linearity (`ReLU` in this case) and then compresses it back to the original dimension.

- Expansion by 4x (`4 * n_embd`): The input embeddings are first linearly projected to a higher-dimensional space (4 times the size of the embedding dimension, `n_embd`). This expansion increases the model's capacity and provides more room for the network to generate internal representations that capture complex patterns and relationships within the data.

In [62]:
class FeedFoward(nn.Module):
    """A simple linear layer followed by a non-linearity"""
    
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # the GPT-3 model uses 4x the input dimension here
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # compress back down to the input dimension
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

### 6.9 - Transformer Block

- **Residual Connections:** By adding the output of the self-attention and feed-forward networks back to their respective inputs `(x + self.sa(...)` and `x + self.ffwd(...))`, the model utilizes residual connections. These connections help mitigate the vanishing gradient problem and allow deeper networks by promoting more effective backpropagation of gradients. 

- **Layer Normalization:** Layer normalization is applied before each sub-layer (self-attention and feed-forward networks) through `self.ln1(x)` and `self.ln2(x)`. It normalizes the inputs across the features for each data point, stabilizing the training process and improving convergence.

- **Head Size Calculation (`head_size = n_embd // n_head`):** This formula divides the embedding dimension by the number of attention heads to determine the dimensionality of each head. By distributing the embedding dimension across multiple heads, the model can attend in parallel to different subspace representations of the input, enabling it to capture a wide variety of information from different perspectives.

<br>
<div style="align:center">
    <img src="images/decoder.png" width=200>
    <br><center><caption><font color="purple"><strong><u>Figure 2:</u></strong> Decoder-Only Transformer</font></caption></center>
</div>

In [63]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head                    # 64 = 384 / 6 (original GPT-3 model uses 64)
        self.sa = MultiHeadAttention(n_head, head_size) # self-attention module
        self.ffwd = FeedFoward(n_embd)                  # feedforward module
        self.ln1 = nn.LayerNorm(n_embd)                 # layernorms
        self.ln2 = nn.LayerNorm(n_embd)                 # layernorms

    def forward(self, x):                       # x is the input tensor
        x = x + self.sa(self.ln1(x))            # add skip connection & apply self-attention
        x = x + self.ffwd(self.ln2(x))          # add skip connection & apply feedforward
        return x

### 6.10 - GPT Model

In [64]:
class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # Embedding layer for tokens
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # Embedding layer for positions
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])  # Transformer blocks
        self.ln_f = nn.LayerNorm(n_embd)  # Final layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size)  # Output layer to predict next token

        self.apply(self._init_weights)  # Apply custom weights initialization
        
    def _init_weights(self, module):
        # Initialize weights for linear and embedding layers
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)  # Normal initialization for linear layer weights
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)  # Zero initialization for linear layer biases
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)  # Normal initialization for embedding weights

    def forward(self, idx, targets=None):
        B, T = idx.shape  # Batch size (B) and sequence length (T)

        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb  # Combine token and position embeddings: (B,T,C)
        x = self.blocks(x)  # Pass through Transformer blocks: (B,T,C)
        x = self.ln_f(x)  # Apply final layer normalization: (B,T,C)
        logits = self.lm_head(x)  # Generate logits for next token prediction: (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # Reshape logits for loss computation
            targets = targets.view(B*T)  # Flatten targets to match logits shape
            loss = F.cross_entropy(logits, targets) # Cross-entropy loss between logits and targets

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # Use only the most recent tokens
            logits, _ = self(idx_cond)  # Get logits for the current sequence
            logits = logits[:, -1, :]  # Use logits for the last token only: (B,C)
            probs = F.softmax(logits, dim=-1)  # Softmax to get probabilities: (B,C)
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample next token: (B,1)
            idx = torch.cat((idx, idx_next), dim=1)  # Append to sequence: (B,T+1)
        return idx

In [65]:
model = GPTLanguageModel()
m = model.to(device)

In [66]:
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

10.788929 M parameters


Given the model size of approximately 10.79 million parameters and aiming to demo the model on a local setup with an NVIDIA RTX 3060 GPU, I'm reducing the model size to ensure it fits comfortably within the GPU's memory limits while leaving room for other computational overheads.

In [67]:
batch_size = 32   # Reduce from 64 to 32
block_size = 128  # Reduce from 256 to 128
n_embd = 64       # Reduce from 384 to 64
n_head = 6        # Keep the same
n_layer = 6       # Keep the same

In [68]:
model = GPTLanguageModel()
m = model.to(device)

In [69]:
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.309313 M parameters


In [70]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1958, val loss 4.1954
step 500: train loss 2.4412, val loss 2.4419
step 1000: train loss 2.2352, val loss 2.2550
step 1500: train loss 2.0054, val loss 2.0670
step 2000: train loss 1.8422, val loss 1.9513
step 2500: train loss 1.7431, val loss 1.8774
step 3000: train loss 1.6803, val loss 1.8480
step 3500: train loss 1.6343, val loss 1.8068
step 4000: train loss 1.5905, val loss 1.7738
step 4500: train loss 1.5708, val loss 1.7501
step 4999: train loss 1.5446, val loss 1.7215


In [71]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
# open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))


TYBERY:
Annd Was'rwicher I twich them good. Heer be this slist,
Frears and runt you I not all wint
We lie beger,' good husal though went and Cateived peies,
Now teell would eath, I should weep,
For an eat of soreld them unwain's to law his
Meath'd leattugh in ont, and theu now, deat to Psoddue us.
Which I of no' merch.

CESCINLUS:
An odour at of trunge.
Go to go? this out need rabuse in, netter as those
timpe his wearn, and youghnar layst upa with I willl.
Lord beilfess contoumser and me rathing
