# Building a GPT from Scratch - With Explanations

This notebook provides a step-by-step explanation of Andrej Karpathy's GPT implementation from his [Zero To Hero](https://karpathy.ai/zero-to-hero.html) series. We'll break down each component of the transformer architecture and explain the code in detail.

Developed by Eiliya Mohebi for educational purposes.

## What is GPT?

GPT (Generative Pre-trained Transformer) is an autoregressive language model that uses the transformer architecture to generate text. It predicts the next token in a sequence given the previous tokens, and can be used for a variety of natural language processing tasks.

This implementation is a minimal version that demonstrates the core concepts of transformer-based models.



## 1. Data Acquisition

We start by downloading a dataset to train our model. We'll use the "Tiny Shakespeare" dataset - a collection of Shakespeare's works that is commonly used for text generation tasks because it's relatively small but contains enough structure for interesting results.

In [18]:
# Download the tiny Shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

'wget' is not recognized as an internal or external command,
operable program or batch file.


The command above downloads a text file containing Shakespeare's works. If you're running this locally and `wget` isn't available, you can download the file manually from the URL.
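If `wget` is unavailable (as in the error shown above), the file can also be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical, and the local filename is an assumption that should match whatever name the later cells open.

```python
import os
import urllib.request

# URL taken from the wget command above
URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

def download_if_missing(url, path):
    """Download url to path unless something already exists there (hypothetical helper)."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Usage (needs network access the first time it runs):
# download_if_missing(URL, "input.txt")
```

Skipping the download when the file already exists makes the cell safe to re-run.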

## 2. Data Exploration

Let's load the data and explore its contents to understand what we're working with.

In [19]:
# Read the text file (input.txt is the filename produced by the download above)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

We open the file with UTF-8 encoding to properly handle all characters, then read the entire text into memory. This is feasible because the dataset is small (about 1MB), but for larger datasets, you might need to process the data in chunks.
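For files too large to hold in memory, one option is to read fixed-size pieces instead of calling `f.read()` once. This is a rough sketch rather than part of the original notebook; the generator name `read_in_chunks` and the default chunk size are assumptions.

```python
def read_in_chunks(path, chunk_chars=1024 * 1024):
    """Yield successive pieces of a text file, at most chunk_chars characters each."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:  # an empty string signals end of file
                break
            yield chunk

# Usage: count characters without loading the whole file at once
# total = sum(len(chunk) for chunk in read_in_chunks("some_large_file.txt"))
```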

Let's check the size of our dataset:

In [20]:
print("Length of dataset in characters: ", len(text))

Length of dataset in characters:  1115394


This shows us the total number of characters in our dataset. For language models, size matters - larger datasets generally lead to better models, but they also require more computational resources to train.

Let's see what the data looks like by printing the first 1000 characters:

In [21]:
# Look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



We can see that the text is formatted as a play script, with character names followed by their lines. Our model may learn to mimic this structure.

## 3. Tokenization - Creating a Vocabulary

Before we can process text with a neural network, we need to convert it to numbers. We'll use a simple character-level tokenization scheme, where each unique character is assigned an integer ID.

In [22]:
# Get all unique characters to create our vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print("Vocabulary size:", vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65


Here we:
1. Use Python's `set()` function to find all unique characters in the text
2. Convert the set to a sorted list to ensure the character order is consistent across runs
3. Count them to get our vocabulary size
4. Print all characters to see what's in our vocabulary

Character-level tokenization is simple and works well for small datasets. Larger models such as GPT-3 use subword tokenization methods like Byte-Pair Encoding (BPE), which sit between character-level and word-level tokenization: the vocabulary is larger than a character set, but the resulting sequences are much shorter.

Now, let's create mappings to convert between characters and their IDs:

In [23]:
# Create mappings between characters and their IDs
stoi = { ch:i for i,ch in enumerate(chars) }  # string to integer
itos = { i:ch for i,ch in enumerate(chars) }  # integer to string

# Define encode and decode functions
encode = lambda s: [stoi[c] for c in s]  # Convert string to list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # Convert list of integers back to string

# Test our encoding/decoding
print(encode("hello world"))
print(decode(encode("hello world")))

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world


We've created two dictionaries:
- `stoi` (string to integer): Maps each character to its numeric ID
- `itos` (integer to string): Maps each ID back to its character

We then define two lambda functions:
- `encode`: Converts a string to a list of integer IDs
- `decode`: Converts a list of integer IDs back to a string

The test at the end confirms that we can encode and decode correctly - we should get back exactly what we put in.
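One limitation worth seeing in code: a character that never appeared in the training text has no ID, so encoding it fails. The 65-character vocabulary printed above, for instance, contains `3` but no other digit. The sketch below uses a tiny stand-in vocabulary; the `toy_*` names are hypothetical and not part of the notebook.

```python
# Toy vocabulary standing in for `chars` above (assumption: only the letters of one phrase)
toy_chars = sorted(set("hello world"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    return "".join(toy_itos[i] for i in ids)

# The round trip is lossless for in-vocabulary text
assert toy_decode(toy_encode("hello")) == "hello"

# A character outside the vocabulary has no ID, so the dictionary lookup fails
try:
    toy_encode("hello!")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("'!' encodable with the toy vocabulary:", unknown_ok)
```

Subword tokenizers avoid this failure mode by falling back to smaller units, which is one reason larger models prefer them.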

## 4. Converting Text to Tensors

Now we'll convert our entire text dataset into PyTorch tensors, which are multidimensional arrays optimized for neural network operations.

In [24]:
# Convert text to tensor
import torch  # Import PyTorch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])  # Print the first 100 tokens

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Here, we encode the entire text into a list of integers and convert it to a PyTorch tensor. We specify the data type as `torch.long` (a 64-bit integer) to ensure compatibility with PyTorch's embedding layers later.

The `.shape` attribute tells us the dimensions of our tensor (in this case, a 1D array with length equal to the number of characters in our text), and `.dtype` confirms the data type.

## 5. Train/Validation Split

A standard practice in machine learning is to split your data into training and validation sets. We'll use the training set to update our model's weights, and the validation set to evaluate the model's performance on unseen data.

In [25]:
# Split into training and validation sets
n = int(0.9 * len(data))  # Use 90% for training, 10% for validation
train_data = data[:n]
val_data = data[n:]
print(f"Training data length: {len(train_data)}")
print(f"Validation data length: {len(val_data)}")

Training data length: 1003854
Validation data length: 111540


We allocate 90% of our data for training and the remaining 10% for validation. This split is important for monitoring overfitting - if the model performs much better on the training data than on the validation data, it may be memorizing the training data rather than learning general patterns.

## 6. Context Windows and Training Examples

Language models work by predicting the next token given a context of previous tokens. We need to define how large this context window should be.

In [26]:
# Define context size
block_size = 8  # Maximum context length for predictions
print(train_data[:block_size+1])  # Show an example context + target

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])


Here we set `block_size = 8`, which means our model will use at most 8 previous tokens as context when predicting the next token. This parameter is often called the "context window" or "sequence length" and is an important hyperparameter in transformer models.

Let's visualize how we create training examples using this context window:

In [27]:
# Demonstrate how training examples are created
x = train_data[:block_size]  # Input context
y = train_data[1:block_size+1]  # Target (next token for each position)

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context.tolist()} the target is: {target.item()} ('{decode([target.item()])}')")

When input is [18] the target is: 47 ('i')
When input is [18, 47] the target is: 56 ('r')
When input is [18, 47, 56] the target is: 57 ('s')
When input is [18, 47, 56, 57] the target is: 58 ('t')
When input is [18, 47, 56, 57, 58] the target is: 1 (' ')
When input is [18, 47, 56, 57, 58, 1] the target is: 15 ('C')
When input is [18, 47, 56, 57, 58, 1, 15] the target is: 47 ('i')
When input is [18, 47, 56, 57, 58, 1, 15, 47] the target is: 58 ('t')


This example demonstrates how we create training examples for our language model:

1. For each position `t` in our sequence, we take all tokens from the beginning up to position `t` as our context
2. The target to predict is the token at position `t+1`

For example:
- Given context "F", predict "i"
- Given context "Fi", predict "r"
- Given context "Fir", predict "s"
- And so on...

This is how autoregressive language models work - they predict one token at a time, using all previous tokens as context.

## 7. Batch Processing

To train efficiently, we process multiple sequences in parallel rather than one at a time. Let's implement a function to generate batches of data.

In [28]:
# Function to get random training batches
batch_size = 4  # Number of sequences processed in parallel

def get_batch(split):
    # Choose the appropriate data source
    data = train_data if split == 'train' else val_data

    # Generate random starting indices, one per sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # Create batch tensors for inputs and targets
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y

# Test our batch generation
xb, yb = get_batch('train')
print('Input shapes:', xb.shape, yb.shape)
print('Input examples:')
for b in range(batch_size):
    print(xb[b].tolist(), '→', yb[b].tolist())

Input shapes: torch.Size([4, 8]) torch.Size([4, 8])
Input examples:
[34, 43, 56, 63, 1, 61, 43, 50] → [43, 56, 63, 1, 61, 43, 50, 50]
[56, 42, 52, 43, 57, 57, 10, 1] → [42, 52, 43, 57, 57, 10, 1, 58]
[0, 26, 53, 61, 1, 19, 53, 42] → [26, 53, 61, 1, 19, 53, 42, 1]
[21, 33, 31, 10, 0, 25, 39, 49] → [33, 31, 10, 0, 25, 39, 49, 43]


This `get_batch` function generates training or validation batches:

1. It selects data from either the training or validation set based on the `split` parameter
2. It randomly samples one starting index per sequence in the batch (4 here) using `torch.randint`
3. For each starting index `i`, it creates:
   - An input sequence `x` containing tokens from position `i` to `i+block_size-1`
   - A target sequence `y` containing tokens from position `i+1` to `i+block_size`
4. It stacks these sequences into batches using `torch.stack`

The result is two tensors with shape (batch_size, block_size), where:
- `batch_size = 4` (the number of sequences processed in parallel)
- `block_size = 8` (the length of each sequence)

This batched processing is crucial for training efficiency. Rather than processing one sequence at a time, we process multiple sequences in parallel, which leverages modern hardware capabilities.
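The upper bound `len(data) - block_size` passed to `torch.randint` is what keeps the target slice inside the data. The following plain-Python sketch checks the largest possible starting index on a made-up 20-token dataset; `toy_data` and `max_start` are illustrative names, and `random.randrange` stands in for `torch.randint` because both exclude the upper bound.

```python
import random

block_size = 8
toy_data = list(range(20))  # pretend dataset of 20 tokens (assumption for illustration)

# The upper bound is exclusive, so this is the largest index that can be drawn
max_start = len(toy_data) - block_size - 1

# Any random draw stays within [0, max_start]
i = random.randrange(len(toy_data) - block_size)
assert 0 <= i <= max_start

# Even at the largest start, both slices are full length
x = toy_data[max_start : max_start + block_size]
y = toy_data[max_start + 1 : max_start + block_size + 1]
print(len(x), len(y))          # both equal block_size
print(y[-1] == toy_data[-1])   # the last target is the final token in the data
```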

## 8. Model Architecture - The Bigram Language Model

We'll start with the simplest possible language model - a bigram model. This model only considers the immediate previous token when predicting the next one.

In [29]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Simple bigram language model
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # Reshape logits for the loss function
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            
            # Calculate cross entropy loss
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # Get predictions
            logits, _ = self(idx)
            
            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, C)
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
            
        return idx

This `BigramLanguageModel` class defines our initial model architecture using PyTorch's neural network framework. Let's break it down:

### Architecture Components

1. **Embedding Layer**:
   ```python
   self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
   ```
   
   This creates a lookup table where each token ID maps to a vector of size `vocab_size`. In a bigram model, row `i` of this table holds the logits (unnormalized log-probabilities) for every token that could follow token `i`; softmax turns them into probabilities. For a more typical language model, the embedding dimension would be smaller than the vocabulary size and followed by additional layers.

2. **Forward Pass**:
   The `forward` method defines what happens when we pass inputs through the model:
   - Inputs `idx` (shape `[B,T]`) are passed through the embedding layer to get `logits` (shape `[B,T,C]`)
   - If targets are provided, we calculate the loss using cross-entropy
   - Cross-entropy loss measures how well our predictions match the true next tokens

   The shapes represent:
   - `B`: Batch size (number of sequences processed in parallel)
   - `T`: Time/sequence length (context window size)
   - `C`: Channel dimension (vocabulary size in this case)

3. **Generation Method**:
   The `generate` method allows the model to create new text:
   - It starts with an initial context `idx`
   - For each new token:
     - It gets predictions (logits) from the model
     - It focuses on the last token's predictions
     - It converts logits to probabilities using softmax
     - It randomly samples the next token based on these probabilities
     - It appends this token to the context
   - This process repeats until `max_new_tokens` are generated

This is the essence of autoregressive text generation - using the model's predictions to generate each token, then including that token in the context for the next prediction.
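The generation loop described above can be sketched without PyTorch to make each step explicit. This is an illustrative stand-in, not the notebook's model: the 3-token `table` of logits and the helper names `softmax`, `sample_next`, and `generate` are all assumptions.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token ID in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(next_logits, start_id, max_new_tokens, rng):
    """Autoregressive loop: sample a token, append it, repeat."""
    ids = [start_id]
    for _ in range(max_new_tokens):
        logits = next_logits(ids[-1])  # a bigram model looks only at the last token
        ids.append(sample_next(logits, rng))
    return ids

# Hypothetical 3-token "model": a fixed table of logits, one row per previous token
table = [[2.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 2.0]]
rng = random.Random(0)
out = generate(lambda prev: table[prev], start_id=0, max_new_tokens=10, rng=rng)
print(out)
```

Sampling rather than always taking the most likely token is what lets repeated calls produce different text from the same starting context.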

Let's instantiate our model and try generating some text before training:

In [30]:
# Create a model instance
model = BigramLanguageModel(vocab_size)

# Test the loss calculation
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
print(f"Batch shape: {xb.shape}")
print(f"Logits shape: {logits.shape}")
print(f"Initial loss: {loss.item():.4f}")

Batch shape: torch.Size([4, 8])
Logits shape: torch.Size([32, 65])
Initial loss: 4.7265


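The loss value measures how well the model is doing. A model that assigns equal probability to every token would have a loss of `ln(vocab_size)`, about 4.17 for our 65-token vocabulary. Our untrained model scores somewhat worse (4.73) because `nn.Embedding` initializes its weights from a standard normal distribution, so the initial logits are random rather than uniform. Note also that the logits come back with shape `(32, 65)` rather than `(4, 8, 65)`: `forward` flattens the batch and time dimensions (B*T = 32) whenever targets are supplied.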
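The uniform-guess baseline can be checked directly. The short calculation below assumes only the vocabulary size of 65 reported earlier.

```python
import math

vocab_size = 65  # from the vocabulary printed earlier

# Cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # about 4.17
```

Any trained model should end up well below this value; a loss near it means the model has learned essentially nothing yet.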

Let's try generating some text with our untrained model:

In [31]:
# Generate from the untrained model
context = torch.zeros((1, 1), dtype=torch.long)  # Start with a single token [0]
generated_text = model.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_text))


UNDQtYi V
 p-PPidfSV!UVK:'TXVqoIT:,;wvTd?DCAD!bmzZLsjN; VfHoI uaJ3aevq&wTHEvAkse!rOhFBykdRrLeIzkdKHIJeoO;McjTd3F V-PgXGv'tRSZHRZCFd.WcGyJ.3vqkl!YcAeHTq!rG;LZMGgy.ZQMCvxEEbh$okOBdGbfOf
oD-hk?SWSRcj;mfAE.bIGY&,-xobfwvMmpu3FvqMhW$yUlAfzt$a3vH!KtMGzULD oem:$sg3!KQnDynQY.bVDsik;M-ZVjT!fTV:ADgMesHXb':YT&PCCeM&:Sd.yCpum3u3vkqpuJ:N.zUct
fcA,XCwk
 JGdCH?-rXV!yMoDCF pet$yMnDCJLiMtRAL3LqFP DD.3 M L:yrMAUX!Ks$WfCV BVd zfZK3Qcz&OZKopwHwpCv'sCw;w
kdUNqZLsN.bQ3jT?HOy,hZKQ:h
RefcNw?HtJFiKXW!OYqHM;rN;jTfnuq-aL:3


Since the model is untrained, the generated text looks like random characters. Each token is sampled from a distribution that depends only on the immediately preceding token, and the lookup table still holds its random initial values, so the result is gibberish.

## 9. Training Loop

Now let's train our model using gradient descent. We'll use the AdamW optimizer, which is a common choice for transformer models.

In [32]:
# PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Training loop
batch_size = 32  # Increase batch size for training
for steps in range(10000):
    # Sample a batch of data
    xb, yb = get_batch('train')
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    
    # Zero the gradients from the previous step
    optimizer.zero_grad(set_to_none=True)
    
    # Backward pass to calculate gradients
    loss.backward()
    
    # Update model parameters
    optimizer.step()
    
    # Print progress occasionally
    if steps % 1000 == 0:
        print(f"Step {steps}: Loss {loss.item():.4f}")

Step 0: Loss 5.0017
Step 1000: Loss 4.4967
Step 2000: Loss 3.6508
Step 3000: Loss 3.2955
Step 4000: Loss 2.7469
Step 5000: Loss 2.6691
Step 6000: Loss 3.1093
Step 7000: Loss 2.5971
Step 8000: Loss 2.7279
Step 9000: Loss 2.7960


This training loop implements the basic procedure for training neural networks:

1. **Batch Sampling**: Get a random batch of inputs (`xb`) and targets (`yb`)
2. **Forward Pass**: Feed the inputs through the model to get predictions and calculate the loss
3. **Gradient Calculation**: Use `loss.backward()` to compute gradients of the loss with respect to model parameters
4. **Parameter Update**: Apply the gradients to update model parameters using the optimizer

The `optimizer.zero_grad()` call is necessary to clear the gradients from the previous step, as PyTorch accumulates gradients by default.

The training process iteratively minimizes the loss function, which means the model's predictions get closer to the actual next tokens in our text.
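To see the forward / backward / update cycle with nothing hidden, here is a hand-written version for the smallest possible case: three logits and one fixed target. It is a sketch under simplifying assumptions (plain gradient descent instead of AdamW, an analytic gradient instead of `loss.backward()`), and all names in it are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

logits = [0.0, 0.0, 0.0]  # the only "parameters" in this toy model
target = 2                # the correct next token
lr = 0.5                  # learning rate (arbitrary choice)

losses = []
for step in range(20):
    # Forward pass: compute the loss
    losses.append(cross_entropy(logits, target))
    # Backward pass: d(loss)/d(logit_i) = softmax_i - 1[i == target]
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Parameter update: plain gradient descent (the notebook uses AdamW instead)
    logits = [w - lr * g for w, g in zip(logits, grads)]

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The gradients here are recomputed from scratch each step, which is the behavior `optimizer.zero_grad()` restores in PyTorch.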

Let's check our progress by evaluating the model on the validation set and generating some text:

In [33]:
# Evaluate loss on validation set
@torch.no_grad()  # Disable gradient tracking for efficiency
def estimate_loss():
    out = {}
    model.eval()  # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(100)
        for k in range(100):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()  # Set model back to training mode
    return out

# Check loss before further training
print(estimate_loss())

{'train': tensor(2.5267), 'val': tensor(2.5597)}


The `estimate_loss` function computes the average loss over multiple batches for both the training and validation sets. This gives us a more reliable estimate of how well our model is performing.

We use several important PyTorch features here:

1. `@torch.no_grad()` decorator: Disables gradient computation, which saves memory and speeds up evaluation
2. `model.eval()`: Sets the model to evaluation mode (disables dropout, etc.)
3. `model.train()`: Sets the model back to training mode

Now let's generate some text with our trained model:

In [17]:
# Generate from the trained model
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))


Ansh w,

bl:3That?y at bullacam,

PEREFcofare lsth a, on.
AROP thed Ed,
Tho;mave.
Hos INasithy mpitheere,
Wan th henanoer f GBunau:
WBO:
TIAr d?
Sis try word le, ave;

MOMPARaukiad brwhandst'T:
A! ghedspzcllimukenouthivef w, che hisot.yc ther tetVMan u oore. sghe, rkidld mf! ftheckerin our ounr?
BN:
FY' an.

mwnnd ICHAs ps RSfrisonduredsly pad PEXYgorse h sstorothefot ifot r Peth purllor Re, IDke blrHRS:
IOf;
So t t?ghay hurst-&G;
O:

Aly nd ofurinknowif,
NTinghorew malour winn romfe thame;BRot 


The generated text should now show some patterns typical of Shakespeare's writing, though it will still be quite simple and might not make much sense. This is because our model is a bigram model that only considers the previous token when predicting the next one.

To create a more capable model, we need to incorporate a longer context - that's where the transformer architecture comes in!

## 10. Introduction to Self-Attention

The key innovation in transformers is the self-attention mechanism, which allows the model to consider all previous tokens in the sequence, not just the most recent one. Let's explore how self-attention works through some examples.

### The Mathematical Trick in Self-Attention

Self-attention is a mechanism that allows a model to focus on different parts of the input sequence when generating an output. It's one of the core innovations that made transformers so effective for natural language processing tasks.

### Understanding Self-Attention

In traditional sequence models like RNNs and LSTMs, information flows sequentially through the network, making it difficult to capture long-range dependencies. Self-attention solves this problem by directly connecting each position in the sequence with every other position, allowing information to flow more freely.

The key insight is that each token in the sequence can "attend" to all previous tokens, giving more weight to some and less to others based on relevance. This weighted aggregation lets the model capture complex patterns and relationships in the data.

Some key advantages of self-attention:
1. It captures long-range dependencies without the vanishing gradient problems of RNNs
2. It can be parallelized, making training much faster
3. It provides interpretable attention weights that show which inputs the model is focusing on

Let's explore self-attention through a series of examples, starting with a simple demonstration of weighted aggregation using matrix multiplication.

In [18]:
# Toy example of matrix multiplication for weighted aggregation
import torch

# Create a "weights" matrix (lower triangular to represent causal attention)
a = torch.tril(torch.ones(3, 3))
# Normalize the weights (for each row, divide by the sum of weights in that row)
a = a / a.sum(1, keepdim=True)

# Create a "values" matrix with random data
b = torch.randint(0, 10, (3, 2)).float()

# Perform weighted aggregation using matrix multiplication
c = a @ b  # Matrix multiplication: (3,3) @ (3,2) -> (3,2)

print("Matrix A (weights):")
print(a)
print("\nMatrix B (values):")
print(b)
print("\nMatrix C (weighted aggregation of B):")
print(c)

Matrix A (weights):
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])

Matrix B (values):
tensor([[7., 7.],
        [7., 6.],
        [5., 4.]])

Matrix C (weighted aggregation of B):
tensor([[7.0000, 7.0000],
        [7.0000, 6.5000],
        [6.3333, 5.6667]])


### Matrix Multiplication for Weighted Aggregation

The example above demonstrates the core mathematical operation in self-attention: weighted aggregation through matrix multiplication. Let's break it down:

1. **Matrix A (Weights)**:
   - We created a lower triangular matrix using `torch.tril`, so each position can only attend to itself and earlier positions (causal attention)
   - We normalized the rows so each row sums to 1, turning them into proper attention weights
   - Shape: (3, 3) - each row represents the attention weights for a position

2. **Matrix B (Values)**:
   - Random values representing the information to be aggregated
   - Shape: (3, 2) - each row represents a position's value vector (with 2 features)

3. **Matrix C (Result of Attention)**:
   - Computed as A @ B (matrix multiplication)
   - Each row of C is a weighted average of rows in B, with weights from the corresponding row in A
   - For example, the first row of C only considers the first row of B (with weight 1)
   - The second row of C is a weighted average of the first and second rows of B
   - The third row of C is a weighted average of all three rows of B

This simple weighted averaging is the foundation of attention mechanisms. In a full transformer model, the weights in matrix A aren't fixed - they're computed from the input itself through learned query and key projections, allowing the model to dynamically focus on relevant parts of the input.
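The same aggregation can be reproduced with plain Python lists, which makes the arithmetic behind each output row visible. The helper names `causal_average_weights` and `matmul` are illustrative; the values matrix is copied from the output above.

```python
def causal_average_weights(n):
    """Row i puts weight 1/(i+1) on positions 0..i and 0 on later positions."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

A = causal_average_weights(3)
B = [[7.0, 7.0], [7.0, 6.0], [5.0, 4.0]]  # the values matrix from the output above
C = matmul(A, B)
for row in C:
    print([round(v, 4) for v in row])
```

The last row works out to (7 + 7 + 5) / 3 and (7 + 6 + 4) / 3, matching the third row of Matrix C.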

## 11. Self-Attention Implementation

Let's explore a simple implementation of self-attention, starting with a basic "bag of words" approach and building up to the full self-attention mechanism used in transformers.

### Toy Example Setup

Let's create a toy example to demonstrate attention in a practical context. We'll create a tensor `x` with the following dimensions:
- Batch size (B): 4 examples
- Sequence length (T): 8 tokens per example
- Channel dimension (C): 2 features per token

This tensor could represent a batch of token embeddings in a transformer model, where each token has been mapped to a 2-dimensional vector.

In [19]:
# Toy example setup
torch.manual_seed(1337)  # For reproducibility
B, T, C = 4, 8, 2  # Batch, Time (sequence length), Channels
x = torch.randn(B, T, C)

print("Input shape:", x.shape)
print("First batch item:")
print(x[0])

Input shape: torch.Size([4, 8, 2])
First batch item:
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])


### Bag of Words (Mean) Attention - Manual Implementation

The simplest form of attention is just averaging previous tokens. This is sometimes called a "bag of words" approach. Let's implement it manually first to understand the process step by step.

In [20]:
# Manually implement "averaging" attention
xbow = torch.zeros_like(x)
for b in range(B):  # For each batch item
    for t in range(T):  # For each position in the sequence
        # Average all tokens seen so far (causal attention)
        xbow[b, t] = torch.mean(x[b, :t+1], dim=0)

print("First batch item after attention:")
print(xbow[0])

First batch item after attention:
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


In this implementation, we're creating a new tensor `xbow` with the same shape as `x`. For each position `t` in each batch item `b`, we compute the average of all tokens from position 0 to position `t` (inclusive). This is a simple form of causal attention, where each position only considers tokens it has seen before (including itself).

This nested loop approach works but is inefficient for large tensors. Let's implement the same operation using matrix multiplication, which is much faster, especially on GPUs.

### Matrix Multiplication for Attention

We can express the same "averaging" attention using matrix multiplication, which is much more efficient.

In [21]:
# Create a weight matrix for averaging attention
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
print("Attention weight matrix:")
print(wei)

# Perform batch matrix multiplication
xbow2 = wei @ x  # (T,T) @ (B,T,C) -> PyTorch broadcasts to (B,T,T) @ (B,T,C) -> (B,T,C)

# Verify that our matrix implementation matches the manual one
print("\nDo both implementations match?", torch.allclose(xbow, xbow2))

Attention weight matrix:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Do both implementations match? False


This approach creates a lower triangular weight matrix `wei` where each row `i` represents the attention weights for position `i`. After normalizing, each row sums to 1, representing a proper weighted average.

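The matrix multiplication `wei @ x` efficiently computes the weighted average for all positions at once. PyTorch handles the batch dimension via broadcasting, making this operation very fast on GPUs. The check above prints `False` even though both methods compute the same averages. The most likely cause is that `torch.mean` and the matrix product round differently in 32-bit floating point, while `torch.allclose` defaults to a very tight absolute tolerance (`atol=1e-8`). Comparing with a looser tolerance, for example `torch.allclose(xbow, xbow2, atol=1e-6)`, should report a match.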

This matrix-based approach is much more efficient than nested loops and is how attention is implemented in practice. However, in a real transformer, the weights aren't fixed averaging weights - they're computed dynamically based on the content, which brings us to the next implementation.
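Because the loop and the matrix product add and scale numbers in a different order, their results can differ in the last few bits. The sketch below shows the two routes for a single average and compares them with a tolerance. It uses Python floats and `math.isclose` as a stand-in for the tensor comparison, and the three values are the first column of `x[0]` from the output above.

```python
import math

values = [0.1808, -0.3596, 0.6258]  # first column of x[0], rows 0 to 2

# Route 1: sum, then divide (conceptually what torch.mean does)
mean_direct = sum(values) / len(values)

# Route 2: multiply each value by the weight 1/3, then sum (what wei @ x does)
w = 1.0 / len(values)
mean_weighted = sum(w * v for v in values)

# The two routes may round differently, so compare with a tolerance rather than ==
print(math.isclose(mean_direct, mean_weighted, rel_tol=1e-9, abs_tol=1e-12))
```

Both routes give roughly 0.1490, the value shown in row 2 of `xbow[0]`.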

### Softmax Attention

In transformers, attention weights are normalized using softmax rather than simple averaging. Let's implement attention using softmax normalization.

In [22]:
# Create a lower triangular mask (tril) for causal attention
tril = torch.tril(torch.ones(T, T))

# Create random weights
wei = torch.rand(T, T)
# Apply mask: set weights to -inf where tril is 0 (future positions)
wei = wei.masked_fill(tril == 0, float('-inf'))
# Apply softmax to get normalized weights
wei = F.softmax(wei, dim=-1)

print("Softmax attention weights:")
print(wei)

# Perform weighted aggregation
xbow3 = wei @ x
print("\nOutput shape:", xbow3.shape)

Softmax attention weights:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3610, 0.6390, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4389, 0.2931, 0.2680, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1660, 0.2797, 0.1742, 0.3800, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2604, 0.2253, 0.1294, 0.2615, 0.1234, 0.0000, 0.0000, 0.0000],
        [0.1976, 0.1973, 0.1642, 0.1374, 0.1453, 0.1583, 0.0000, 0.0000],
        [0.1887, 0.0885, 0.1245, 0.2019, 0.1046, 0.0984, 0.1933, 0.0000],
        [0.1043, 0.1008, 0.0831, 0.1372, 0.1738, 0.0869, 0.1365, 0.1774]])

Output shape: torch.Size([4, 8, 2])


This implementation introduces two key changes:

1. We use random weights instead of fixed ones, simulating the learned weights in a transformer
2. We use softmax normalization instead of simple averaging

The `masked_fill` operation sets the weights for future positions (where `tril == 0`) to negative infinity, which gives them exactly zero probability after softmax. This enforces the causal nature of the attention, ensuring that a token only attends to itself and earlier tokens.

The softmax function converts the raw weights into a probability distribution (each row sums to 1), but unlike simple averaging, it can give more weight to some tokens and less to others based on their values.

This approach is closer to the attention mechanism used in transformers, but it's still missing the core component: content-based attention where the weights are determined by the similarity between queries and keys.
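The effect of the `-inf` mask can be traced by hand on a small example. The following sketch applies a causal mask and softmax to one row at a time in plain Python; the function name `masked_softmax_row` and the raw scores are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def masked_softmax_row(scores, n_visible):
    """Softmax over a row where positions >= n_visible are masked out."""
    masked = [s if j < n_visible else NEG_INF for j, s in enumerate(scores)]
    m = max(masked)  # finite, because at least one position is visible
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.2, 0.9, 0.4], [0.5, 0.1, 0.7], [0.3, 0.8, 0.6]]  # arbitrary raw scores
# Row i may see positions 0..i, so i + 1 positions are visible
weights = [masked_softmax_row(row, i + 1) for i, row in enumerate(scores)]
for row in weights:
    print([round(w, 4) for w in row])
```

Masking before softmax, rather than zeroing weights afterwards, means each row still sums to 1 without a separate renormalization step.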

## Full Self-Attention Implementation

Let's now implement the complete self-attention mechanism used in transformers, where the attention weights are computed based on the similarity between query and key vectors.

In [23]:
# Hyperparameters
head_size = 16  # Dimension of query/key/value projections

# Input preparation
B, T, C = 4, 8, 32  # Larger channel dimension
x = torch.randn(B, T, C)

# Linear projections for query, key, and value
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Re-initialize the projections with small random weights (seeded for reproducibility)
torch.manual_seed(1337)
query.weight.data = torch.randn_like(query.weight.data) * 0.1
key.weight.data = torch.randn_like(key.weight.data) * 0.1
value.weight.data = torch.randn_like(value.weight.data) * 0.1

# Generate query, key, and value projections
q = query(x)  # (B,T,head_size)
k = key(x)    # (B,T,head_size)
v = value(x)  # (B,T,head_size)

# Compute attention scores ("affinities")
wei = q @ k.transpose(-2, -1)  # (B,T,head_size) @ (B,head_size,T) -> (B,T,T)

# Scale the attention scores
wei = wei / (head_size ** 0.5)  # Scale by sqrt(head_size)

# Causal mask to ensure tokens only attend to themselves and previous tokens
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))

# Normalize with softmax
wei = F.softmax(wei, dim=-1)  # (B,T,T)

# Weighted aggregation of values
out = wei @ v  # (B,T,T) @ (B,T,head_size) -> (B,T,head_size)

print("Attention output shape:", out.shape)

Attention output shape: torch.Size([4, 8, 16])


This implementation represents the full self-attention mechanism as used in transformers. Let's break down the key components:

1. **Linear Projections**:
   - We project the input `x` into three different spaces: query (q), key (k), and value (v)
   - These projections are learnable, allowing the model to adapt what to focus on

2. **Attention Scores**:
   - We compute attention scores as the dot product between queries and keys: `q @ k.transpose(-2, -1)`
   - This measures the similarity between each token's query and all tokens' keys
   - High scores indicate high relevance/attention

3. **Scaling**:
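   - We scale the scores by `1/sqrt(head_size)` so their variance stays near 1 regardless of `head_size`. Without this, large scores would push softmax toward a one-hot distribution, where gradients become very small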

4. **Causal Masking**:
   - We apply a mask to ensure each token attends only to itself and previous tokens (autoregressive property)

5. **Softmax Normalization**:
   - We normalize the scores with softmax to get proper attention weights

6. **Weighted Aggregation**:
   - We use these weights to compute a weighted sum of the value vectors

The key difference from our previous examples is that now the attention weights are content-based - they're determined by the similarity between token representations, allowing the model to focus on relevant parts of the input based on the content itself.
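
The effect of the scaling step can be checked numerically. The sketch below assumes unit-variance queries and keys (roughly the situation at initialization) and uses an illustrative `head_size` of 64 rather than the 16 used above.

```python
import torch

torch.manual_seed(0)
head_size = 64  # illustrative value, chosen to make the effect easy to see

# Unit-variance queries and keys
q = torch.randn(10000, head_size)
k = torch.randn(10000, head_size)

raw = (q * k).sum(dim=-1)          # 10,000 independent dot products
scaled = raw / head_size ** 0.5    # same scores after 1/sqrt(head_size) scaling

print(f"variance of raw scores:    {raw.var().item():.2f}")     # close to head_size
print(f"variance of scaled scores: {scaled.var().item():.2f}")  # close to 1

# Large-magnitude scores make softmax much peakier
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = torch.softmax(logits, dim=-1)
peaky = torch.softmax(logits * 8, dim=-1)
print(diffuse)
print(peaky)
```

When softmax is this peaky, each token effectively copies from a single other token, which is not a useful starting point for training.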

## 12. The Complete Transformer Architecture

Now that we understand self-attention, let's implement a complete transformer-based language model. We'll follow the decoder side of the architecture described in the "Attention is All You Need" paper, with a few simplifications: there is no encoder and no cross-attention, and layer normalization is applied before each sub-layer.

In [24]:
class Head(nn.Module):
    """One head of self-attention"""
    
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)    # (B,T,head_size)
        q = self.query(x)  # (B,T,head_size)
        
        # Compute attention scores
        wei = q @ k.transpose(-2, -1) * (k.shape[-1]**-0.5)  # (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B,T,T)
        wei = F.softmax(wei, dim=-1)  # (B,T,T)
        wei = self.dropout(wei)
        
        # Weighted aggregation of values
        v = self.value(x)  # (B,T,head_size)
        out = wei @ v  # (B,T,head_size)
        return out

The `Head` class encapsulates a single attention head as we implemented above, with the addition of dropout for regularization. It takes the following parameters:
- `head_size`: Dimensionality of the query/key/value projections
- `n_embd`: Dimensionality of the input embeddings
- `block_size`: Maximum sequence length (for the causal mask)
- `dropout`: Dropout probability for regularization
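
One detail in `Head` worth a closer look is `register_buffer`, which stores the causal mask on the module without making it a trainable parameter. The sketch below uses a hypothetical toy module, `MaskHolder`, that is not part of the notebook; it exists only to show how a buffer behaves.

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):
    """Hypothetical toy module: one linear layer plus a causal-mask buffer."""

    def __init__(self, block_size):
        super().__init__()
        self.proj = nn.Linear(4, 4, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

m = MaskHolder(block_size=8)

# The buffer is not a parameter, so the optimizer never updates it
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
print(param_names)
print(buffer_names)

# It is still saved in the state dict and follows the module in .to(device)
print('tril' in m.state_dict())

# Slicing to the current sequence length, as Head.forward does
T = 5
print(m.tril[:T, :T].shape)
```

Because the mask is built once for `block_size` and sliced with `[:T, :T]`, the same module handles any sequence length up to `block_size`.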

Next, let's create a multi-head attention module, which runs several attention heads in parallel and concatenates their outputs:

In [25]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""
    
    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, n_embd, block_size, dropout) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # Concatenate outputs from all heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Project back to original dimension
        out = self.proj(out)
        out = self.dropout(out)
        return out

The `MultiHeadAttention` module:
1. Creates multiple attention heads, each operating independently
2. Concatenates their outputs along the feature dimension
3. Projects the concatenated output back to the original embedding dimension using a linear layer

This allows each head to focus on different aspects of the input, enhancing the model's representation power.
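
As a shape check on the concatenation step, the following sketch uses random tensors as stand-ins for the per-head outputs. The sizes match the hyperparameters used later in this notebook (`n_embd = 384`, 6 heads).

```python
import torch

B, T = 4, 8
n_embd, num_heads = 384, 6
head_size = n_embd // num_heads  # 64 channels per head

# Stand-ins for the per-head outputs, each of shape (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenating along the last dimension restores the embedding width
out = torch.cat(head_outputs, dim=-1)
print(out.shape)  # (B, T, num_heads * head_size)

# Head i occupies channels [i * head_size, (i + 1) * head_size)
print(torch.equal(out[..., head_size:2 * head_size], head_outputs[1]))
```

Since each head writes to its own slice of channels, the final linear projection is what lets information from different heads mix.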

Next, let's implement the feed-forward network that follows the attention layer in each transformer block:

In [26]:
class FeedForward(nn.Module):
    """Simple feed-forward network with ReLU activation"""
    
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # Expand to 4x dimension
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # Project back to original dimension
            nn.Dropout(dropout),
        )
        
    def forward(self, x):
        return self.net(x)

The feed-forward network is a simple two-layer network with a ReLU activation function. It expands the input dimension by a factor of 4 in the hidden layer, following the architecture described in the original transformer paper.
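
A property of this network that is easy to miss is that it acts on each token position independently: no information moves between positions here, only inside attention. The sketch below checks this on a small stand-in network, with dropout left out so the comparison is deterministic and an illustrative `n_embd` of 32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32  # small illustrative size

# Same layer structure as FeedForward, without dropout
net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(2, 5, n_embd)   # (B, T, C)
full = net(x)                   # applied to the whole sequence
single = net(x[:, 3, :])        # applied to position 3 alone

print(full.shape)
print(torch.allclose(full[:, 3, :], single, atol=1e-5))
```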

Now, let's implement a single transformer block, which combines multi-head attention, feed-forward network, and layer normalization:

In [29]:
class Block(nn.Module):
    """Transformer block: communication followed by computation"""
    
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        
    def forward(self, x):
        # Self-attention with residual connection
        x = x + self.sa(self.ln1(x))
        # Feed-forward with residual connection
        x = x + self.ffwd(self.ln2(x))
        return x

The `Block` class implements a complete decoder block with:
1. Layer normalization before each sub-layer (pre-norm variant)
2. Multi-head self-attention
3. Feed-forward network
4. Residual connections around each sub-layer

This follows the modern transformer architecture used in models like GPT.
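
To make the layer normalization step concrete, here is a minimal sketch comparing `nn.LayerNorm` at its default initialization (scale 1, shift 0) with the same computation written by hand. The tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8, 32)  # (B, T, C)

ln = nn.LayerNorm(32)
y = ln(x)

# Normalize each token's feature vector to zero mean and unit variance
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, manual, atol=1e-5))
print(y.mean(dim=-1).abs().max().item())  # per-token means are close to 0
```

The statistics are computed per token over the channel dimension, so normalization never mixes information across batch elements or time steps.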

Finally, let's implement the full GPT language model:

In [30]:
class GPTLanguageModel(nn.Module):
    """GPT Language Model with transformer architecture"""
    
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.block_size = block_size
        
        # Token and position embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # Decoder blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)])
        
        # Final layer normalization and output projection
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Token embeddings
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        
        # Position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # (T)
        pos_emb = self.position_embedding_table(pos)  # (T,C)
        
        # Combine token and position embeddings
        x = tok_emb + pos_emb  # (B,T,C)
        
        # Apply decoder blocks
        x = self.blocks(x)  # (B,T,C)
        
        # Final layer norm and output projection
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)
        
        # Loss calculation (if targets provided)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """Generate text by sampling from the model distribution"""
        for _ in range(max_new_tokens):
            # Crop input to block_size
            idx_cond = idx[:, -self.block_size:]
            
            # Get predictions
            logits, _ = self(idx_cond)
            
            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, C)
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
            
        return idx

The `GPTLanguageModel` class implements the complete transformer-based GPT model with the following components:

1. **Embeddings**:
   - Token embeddings: Map each token ID to a vector
   - Position embeddings: Provide position information since the transformer has no inherent notion of sequence order

2. **Transformer Blocks**: A series of transformer blocks, each containing:
   - Multi-head self-attention
   - Feed-forward network
   - Layer normalization
   - Residual connections

3. **Output Layer**:
   - Final layer normalization
   - Linear projection to vocabulary size

4. **Generation Function**:
   - Autoregressive text generation using sampling
   - Crops the context to the last `block_size` tokens, since the position embedding table has no entries beyond that
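
Two indexing details in this class are easy to verify on toy tensors: the broadcast when adding position embeddings, and the context cropping in `generate`. The values below are stand-ins chosen so the result is readable; they are not real embeddings.

```python
import torch

B, T, C, block_size = 2, 5, 4, 8

# Stand-in embeddings: tokens are all zeros, and row t of pos_emb is filled with t
tok_emb = torch.zeros(B, T, C)
pos_emb = torch.arange(T, dtype=torch.float32).unsqueeze(1).expand(T, C)

# (B,T,C) + (T,C): the position rows are broadcast across the batch
x = tok_emb + pos_emb
print(x.shape)
print(x[1, :, 0])  # every batch element sees positions 0..T-1

# Context cropping: keep only the last block_size tokens
idx = torch.arange(12).unsqueeze(0)   # a (1, 12) sequence longer than block_size
idx_cond = idx[:, -block_size:]
print(idx_cond)

# Sequences shorter than block_size pass through unchanged
short = torch.arange(3).unsqueeze(0)
print(short[:, -block_size:].shape)
```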

Let's initialize and train our GPT model:

In [31]:
# Model hyperparameters
n_embd = 384    # Embedding dimension
n_head = 6      # Number of attention heads
n_layer = 6     # Number of transformer blocks
dropout = 0.2   # Dropout probability

# Training parameters
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500
batch_size = 64
block_size = 256  # Maximum context length

# Initialize model
model = GPTLanguageModel(vocab_size, n_embd, n_head, n_layer, block_size, dropout)
model.to(device)
# Calculate the number of parameters
print(f"Number of parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Number of parameters: 10,788,929
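
The total can be reproduced by hand from the layer definitions. The helper below, `count_params`, is a hypothetical function written for this check, and it assumes `vocab_size = 65`, the vocabulary size of the Tiny Shakespeare character set reported later in this notebook.

```python
def count_params(vocab_size, n_embd, n_head, n_layer, block_size):
    """Count trainable parameters from the layer shapes defined above."""
    head_size = n_embd // n_head

    # Token and position embedding tables
    embeddings = vocab_size * n_embd + block_size * n_embd

    # Key/query/value projections (no bias) for every head
    attention = n_head * 3 * n_embd * head_size
    # Output projection after concatenation (with bias)
    attention += (n_head * head_size) * n_embd + n_embd

    # Two linear layers with biases: n_embd -> 4*n_embd -> n_embd
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    # Two LayerNorms per block, each with a scale and a shift vector
    layer_norms = 2 * 2 * n_embd

    per_block = attention + feed_forward + layer_norms

    # Final LayerNorm plus the language-model head (with bias)
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)

    return embeddings + n_layer * per_block + final

total = count_params(vocab_size=65, n_embd=384, n_head=6, n_layer=6, block_size=256)
print(f"{total:,}")  # 10,788,929
```

The causal masks do not appear in the count because they are registered as buffers, not parameters. Almost all of the weights sit in the six transformer blocks, with the feed-forward layers accounting for roughly two thirds of each block.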


## 13. Training the GPT Model

Now let's train our full GPT model using the same approach as before. We'll need to update our batch generation function to handle the larger context size.

In [None]:
# Updated batch function for larger context size
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # Move the batch to the same device as the model
    x, y = x.to(device), y.to(device)
    return x, y

# Initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for iter in range(max_iters):
    
    # Evaluate loss periodically
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"Step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train')
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    
    # Backpropagation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

This training loop is similar to our earlier one, but now we're training the full transformer-based GPT model. The training process will take longer due to the increased model size and complexity.

Let's generate some text with our trained model to see the results:

In [None]:
# Generate text
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # Start token on the model's device
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))

## 14. Training the GPT Model On GPU

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1. Check for GPU and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Load and process data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Create vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")

# Create mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encoding/decoding functions
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Convert text to tensor and move to GPU
data = torch.tensor(encode(text), dtype=torch.long)
data = data.to(device)  # Move data to GPU

# Split into train and validation sets
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# 3. Define model architecture
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        
        # Token and position embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # Transformer blocks
        self.blocks = nn.Sequential(*[
            TransformerBlock(n_embd, n_head, block_size, dropout) 
            for _ in range(n_layer)
        ])
        
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Get token and position embeddings
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # (T)
        pos_emb = self.position_embedding_table(pos)  # (T,C)
        
        # Add embeddings
        x = tok_emb + pos_emb  # (B,T,C)
        
        # Apply transformer blocks
        x = self.blocks(x)  # (B,T,C)
        
        # Apply final layer norm and projection
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)
        
        # Calculate loss if targets provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """Generate text by sampling from the model distribution"""
        for _ in range(max_new_tokens):
            # Crop input to block_size
            idx_cond = idx[:, -self.block_size:]
            
            # Get predictions
            logits, _ = self(idx_cond)
            
            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, C)
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
            
        return idx

# Simple transformer block
class TransformerBlock(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        
    def forward(self, x):
        # Self-attention with residual connection
        x = x + self.sa(self.ln1(x))
        # Feed-forward with residual connection
        x = x + self.ffwd(self.ln2(x))
        return x

# Multi-head attention
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([
            Head(head_size, n_embd, block_size, dropout) 
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # Concatenate outputs from all heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Project back to original dimension
        out = self.proj(out)
        out = self.dropout(out)
        return out

# Single attention head
class Head(nn.Module):
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B,T,head_size)
        q = self.query(x)  # (B,T,head_size)
        
        # Compute attention scores
        wei = q @ k.transpose(-2, -1) * (k.shape[-1]**-0.5)  # (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B,T,T)
        wei = F.softmax(wei, dim=-1)  # (B,T,T)
        wei = self.dropout(wei)
        
        # Weighted aggregation of values
        v = self.value(x)  # (B,T,head_size)
        out = wei @ v  # (B,T,head_size)
        return out

# Feed-forward network
class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        
    def forward(self, x):
        return self.net(x)

# 4. Hyperparameters
block_size = 256  # Context length
batch_size = 64   # Batch size
n_embd = 384      # Embedding dimension
n_head = 6        # Number of attention heads
n_layer = 6       # Number of transformer blocks
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500
dropout = 0.2

# 5. Initialize model and move to GPU
model = GPTLanguageModel(vocab_size, n_embd, n_head, n_layer, block_size, dropout)
model = model.to(device)  # Move model to GPU

# 6. Define batch generation function
def get_batch(split):
    # Data is already on GPU
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,), device=device)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# 7. Define evaluation function
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(100, device=device)
        for k in range(100):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# 8. Initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# 9. Training loop
print("Starting training...")
for iter in range(max_iters):
    # Evaluate loss periodically
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"Step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train')
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    
    # Backpropagation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# 10. Generate sample text
print("\nGenerating sample text...")
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_ids = model.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_ids))

print("\nTraining complete!")

Using device: cuda
Vocabulary size: 65
Starting training...
Step 0: train loss 4.3603, val loss 4.3642
Step 500: train loss 1.8925, val loss 1.9951
Step 1000: train loss 1.5392, val loss 1.7200
Step 1500: train loss 1.3957, val loss 1.6133
Step 2000: train loss 1.3169, val loss 1.5514
Step 2500: train loss 1.2630, val loss 1.5193
Step 3000: train loss 1.2096, val loss 1.4831
Step 3500: train loss 1.1685, val loss 1.5005
Step 4000: train loss 1.1324, val loss 1.4870
Step 4500: train loss 1.0966, val loss 1.4758

Generating sample text...


ROMEO:
O'er his hand; he that's answer'd for 'scent
I am honesty, marquised, how thy hands
Who cannot in his house to revenge him:
But sufficility shows it is strike and before him,
If that do that pluck him than henced my sweet,
For hath me to bitter on fourth. Nay, his liege,
But that 'twas time to with inking him, I'll heark.
And so to you, my lord, lord, a prince is stoop.
Than you show therefore's mother i' the time
Against them, with treason sho
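
As a sanity check on the training log above, the loss at step 0 can be compared with the loss of a model that guesses uniformly over the 65-character vocabulary, which is `ln(65)`. This is a small side calculation, not part of the training script.

```python
import math

vocab_size = 65  # from the "Vocabulary size" line in the log above

# Cross-entropy of a uniform guess over the vocabulary
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")  # 4.1744

# Perplexity implied by the last reported validation loss
final_val_loss = 1.4758
perplexity = math.exp(final_val_loss)
print(f"validation perplexity: {perplexity:.2f}")
```

The untrained model starts at 4.36, slightly above the uniform baseline, which is typical for random initial logits. The final validation loss corresponds to a perplexity of roughly 4.4, meaning the model is about as uncertain as a uniform choice among four or five characters at each step.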

## 15. Conclusion and Next Steps

In this notebook, we've built a GPT model from scratch, starting with the simplest bigram model and progressively adding complexity until we reached a full transformer-based architecture. Along the way, we've explored the key components that make transformers so effective:

1. **Self-Attention**: The core mechanism that allows models to focus on relevant parts of the input sequence
2. **Multi-Head Attention**: Running multiple attention operations in parallel to capture different types of relationships
3. **Positional Embeddings**: Learned vectors that provide sequence order information to the model
4. **Layer Normalization**: Stabilizing training by normalizing activations
5. **Residual Connections**: Helping with gradient flow during training

Our model is a miniature, character-level version of OpenAI's GPT models, and it follows the same architectural principles.

### Possible Next Steps

1. **Experiment with Hyperparameters**: Try different embedding dimensions, numbers of heads, layers, etc.
2. **Use a Different Dataset**: Train the model on a different text corpus
3. **Implement Subword Tokenization**: Switch from character-level to subword tokenization for better performance
4. **Add Optimizations**: Implement techniques like mixed-precision training for faster steps, or gradient accumulation for larger effective batch sizes on limited memory
5. **Fine-Tuning**: Train the model on a specific task after pre-training
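
As a starting point for the gradient accumulation idea, here is a minimal sketch on a toy linear model standing in for the GPT model. It only demonstrates the principle that accumulating scaled micro-batch gradients reproduces the full-batch gradient; the model, data, and `accum_steps` value are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for the GPT model
x, y = torch.randn(16, 8), torch.randn(16, 1)
loss_fn = nn.MSELoss()

# Reference: gradient from one full batch of 16 examples
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 4 examples each
accum_steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    # Divide by accum_steps so the summed gradients match the full-batch mean
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up in .grad until zero_grad is called
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

In a real training loop, `optimizer.step()` and `optimizer.zero_grad()` would run once per `accum_steps` micro-batches instead of once per batch.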

Understanding how these models work from the ground up gives you a solid foundation for working with modern language models like GPT, BERT, and others. Happy exploring!