# Frosty: a GPT trained on Robert Frost Poems

This outlines our mini (character level) GPT, modeled after Karpathy’s implementation. Think of it as a tiny version of ChatGPT — but without the *chatting* fine-tuning. In other words, it won’t answer questions, but it can generate text — specifically, poetry in the style of Robert Frost.

We call it **Frosty**, since we trained it on a collection of Robert Frost’s poems.

Let’s quickly break down what “GPT” means:
> - **Character-level**: the vocabulary of tokens is simply the set of individual letters and symbols in the text file.
> - **G**enerative: the model can generate text, sampling tokens one by one
> - **P**re-trained: it first learns statistical patterns from a dataset (here: Robert Frost's poetry)
> - **T**ransformer: it’s built using the Transformer architecture from ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)

So: **a small Transformer trained to continue Robert Frost’s poetic style**. You can think of it as a probabilistic “document completer” — a model that sees the first few words of a line and tries to finish it, having only been exposed to poems by Frost. 

Here’s a sample from the original dataset — one of my favorite poems, and the first I ever read from Frost:



> ```
> STOPPING BY WOODS ON A 
> SNOWY EVENING 
> 
> Whose woods these are I think I know. 
> His house is in the village though; 
> 
> He will not see me stopping here 
> To watch his woods fill up with snow. 
> 
> My little horse must think it queer 
> To stop without a farmhouse near 
> Between the woods and frozen lake 
> The darkest evening of the year. 
> 
> He gives his harness bells a shake 
> To ask if there is some mistake. 
> 
> The only other sound’s the sweep 
> Of easy wind and downy flake. 
> 
> The woods are lovely, dark and deep. 
> But I have promises to keep, 
> 
> And miles to go before I sleep, 
> 
> And miles to go before I sleep. 
> 
> 
> [275]
> ```



## From Bigram to Transformer

We begin with a simple bigram model (from Part 2), which learns to predict the next character based only on the current one, i.e., with a **context window** of size 1. We outline all the necessary components of the Transformer architecture below. For reference, we also include the architecture diagram from the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762):
<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/eriktholmes/zero_to_hero_course/refs/heads/main/gpt/files/transformer.jpe" alt="transformer" width="500" style="margins:auto">
</div>
In fact, this notebook focuses only on the right-hand portion of the diagram (the decoder). We now outline (most of) the improvements needed to turn the bigram model into a GPT:

###  0. **Longer contexts**. 
Instead of seeing just one token at a time, the model can now attend to multiple past tokens — which unlocks both richer predictions and a more expressive model architecture. 

> The bigram model has a window of 1 so it learns patterns like “S → T”. But what if we let the model look back 8 tokens?
> - This gives us two key benefits:
>     - The model sees more history, so it can make more informed predictions.
>     - We extract **multiple training examples** from each sequence. For example, with context size 8:
> 
>     ```
>     Sequence:   "STOPPING"
>     
>     Context     → Target
>     S           → T
>     ST          → O
>     STO         → P
>     STOP        → P
>     STOPP       → I
>     STOPPI      → N
>     STOPPIN     → G
>     STOPPING    → _
>     ```
> 
>     Each pair becomes one training example: the context on the left is fed into the model, and it’s trained to predict the next character on the right.
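
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch: `context_pairs` and `block_size` are illustrative names rather than anything from the original notebook, and the text must contain `block_size + 1` characters so that the last context still has a target (here, the space that follows "STOPPING").

```python
def context_pairs(text, block_size):
    """Return (context, target) pairs for the first block_size + 1 characters."""
    chunk = text[: block_size + 1]
    # Each prefix of the chunk predicts the character that follows it
    return [(chunk[:t], chunk[t]) for t in range(1, len(chunk))]


pairs = context_pairs("STOPPING BY WOODS", block_size=8)
for context, target in pairs:
    print(f"{context:<10} -> {target!r}")
```

A single chunk of 9 characters therefore yields 8 training examples, one per context length from 1 to 8.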


### 1. **Token embeddings**: 
We begin with one-hot encodings of each token (vectors of length equal to the vocabulary size). These are mapped into a dense vector space via a trainable embedding matrix of shape `(vocab_size × embedding_dim)`. In word-level models this space is much lower-dimensional than the vocabulary; in a character-level model like ours the embedding dimension can exceed the vocabulary size. Either way, the resulting vectors can capture structure that one-hot vectors cannot. In word-level language models, this is where the well-known vector arithmetic examples come from:

  - $\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} + \vec{v}_{\text{Woman}} \approx \vec{v}_{\text{Queen}}$
  - This kind of arithmetic emerges from training and reflects how the model encodes semantic relationships.
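
To make the "one-hot vector times embedding matrix" picture concrete, here is a small sketch showing that an `nn.Embedding` lookup gives the same result as multiplying a one-hot vector by the embedding matrix. The sizes (`vocab_size = 5`, `embedding_dim = 3`) are toy values chosen purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 5, 3  # toy sizes, for illustration only

embedding = nn.Embedding(vocab_size, embedding_dim)  # weight: (vocab_size, embedding_dim)
tokens = torch.tensor([2, 0, 4])

# Route 1: direct lookup of rows 2, 0 and 4 of the embedding matrix
looked_up = embedding(tokens)

# Route 2: one-hot vectors multiplied by the same matrix
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()  # (3, vocab_size)
multiplied = one_hot @ embedding.weight                        # (3, embedding_dim)

print(torch.allclose(looked_up, multiplied))
```

In practice the lookup is used because it avoids building the one-hot vectors at all.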


### 2. **Positional embeddings**: 
Token embeddings alone don’t encode *order*: the model sees a bag of vectors without knowing which came first. To fix this, we add positional embeddings, which assign a unique trainable vector to each position (e.g., positions 0 through 7 for a context window of size 8). These are added to the token embeddings, nudging them based on their position in the sequence.
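
A minimal sketch of how the two embeddings combine, assuming learned positional embeddings stored in a second `nn.Embedding` table. The names and sizes below are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, embedding_dim = 65, 8, 16  # toy sizes, for illustration only

token_embedding = nn.Embedding(vocab_size, embedding_dim)
position_embedding = nn.Embedding(block_size, embedding_dim)

idx = torch.randint(0, vocab_size, (4, block_size))  # (batch, time) token indices
tok = token_embedding(idx)                            # (batch, time, embedding_dim)
pos = position_embedding(torch.arange(block_size))   # (time, embedding_dim)

x = tok + pos  # pos is broadcast across the batch dimension
print(x.shape)
```

The same character therefore gets a different input vector depending on where it appears in the window.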

### 3. **Attention**: 
Attention is the mechanism that lets tokens at different positions 'attend', or talk, to one another and gather information from each other. The basic idea: for each token, we compute three vectors:

- **Query**: what the token is looking for
- **Key**: what the token *is* or contains
- **Value**: what the token will contribute to others that attend to it

We compute attention scores via dot products of queries and keys, $\text{score}_{i,j} = q_i \cdot k_j$, for every pair of tokens in the context window. Each score is a single number that tells us roughly how much token *i* should pay attention to token *j*. The scores are divided by $\sqrt{d_K}$, where $d_K$ is the dimension of the key vectors, to keep them in a stable range.

  - We then enforce the autoregressive property of the model, which prevents current tokens from talking to future tokens.
      - We apply a **causal mask**, setting all scores where $j > i$ (future tokens) to $-\infty$ before applying the softmax.
      - The softmax turns each row of masked scores into a probability distribution over the current and past tokens; all entries where $j > i$ become $0$.
      - The result is a matrix `wei`, where each row contains the attention distribution (the **weights**) for a given token, and each token only attends to itself and earlier tokens.

  - Finally, we multiply this weight matrix by the value matrix, whose rows are the value vectors of the tokens in the sequence. The result is a matrix whose $i$-th row is a weighted sum of the value vectors, where the weight on $\vec{v}_j$ is how much attention token $i$ pays to token $j$.

$$  \text{Attention}(Q,K, V) =  \text{softmax} \left( \frac{Q \cdot K^\top}{\sqrt{d_K}}\right) \cdot V $$

> Let's consider a quick example. Suppose that our attention matrix is given by:
> $$ \text{softmax} \left( \frac{Q \cdot K^\top}{\sqrt{d_K}}\right) = \begin{pmatrix}  1   &  0  &  0  \\ .6  &  .4  &  0  \\  .5  &  .3  &  .2   \end{pmatrix}$$
> then the `Value` matrix gets multiplied by this which gives us:
>  $$ \text{Attention}(Q,K, V) = \text{softmax} \left( \frac{Q \cdot K^\top}{\sqrt{d_K}}\right) \cdot  \begin{pmatrix}  \vec{v_1} \\ \vec{v_2} \\ \vec{v_3} \end{pmatrix} = \begin{pmatrix}  \vec{v_1} \\ .6 \vec{v_1} + .4 \vec{v_2}\\ .5 \vec{v_1} + .3 \vec{v_2} + .2 \vec{v_3} \end{pmatrix} $$
> and, to be really pedantic, if we let $ \vec{v_1} = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix}, \; \vec{v_2} = \begin{pmatrix} 4 & 5 & 6 \end{pmatrix}, \; \vec{v_3} = \begin{pmatrix} 7 & 8 & 9 \end{pmatrix}$ then this gives us:
> $$ \text{Attention}(Q,K, V) = \begin{pmatrix} 1 & 2 & 3 \\ .6\cdot 1 + .4 \cdot 4  & .6\cdot 2 + .4 \cdot 5 & .6\cdot 3 + .4 \cdot 6  \\ .5\cdot 1 + .3 \cdot 4 + .2\cdot 7 & .5\cdot 2 + .3 \cdot 5 + .2\cdot 8 & .5\cdot 3 + .3 \cdot 6 + .2\cdot 9\end{pmatrix} $$
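
The arithmetic in this example can be checked numerically. The snippet below is only a verification sketch: the weights and value vectors are copied from the example above, and nothing here is learned.

```python
import torch

# Attention weights and value vectors copied from the example above
weights = torch.tensor([[1.0, 0.0, 0.0],
                        [0.6, 0.4, 0.0],
                        [0.5, 0.3, 0.2]])
values = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]])

output = weights @ values  # row i is a weighted average of the value vectors
print(output)
```

Up to floating-point rounding, the rows come out as $(1, 2, 3)$, $(2.2, 3.2, 4.2)$ and $(3.1, 4.1, 5.1)$.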

### 3.1. A code sketch so far

> ```python
> import torch
> import torch.nn as nn
> import torch.nn.functional as F
> 
> torch.manual_seed(314159)
> 
> # Let's say we have 4 tokens (e.g., in a sentence) and we want an embedding size of 6
> context_window = 4
> embed_dim = 6
> 
> # Each token is represented by its index in the vocab (toy example)
> tokens = torch.arange(context_window)  # tensor([0, 1, 2, 3])
> 
> # Random embeddings for Q, K, V (we're skipping projection layers for now... those will come later)
> Query = nn.Embedding(context_window, embed_dim)
> Key   = nn.Embedding(context_window, embed_dim)
> Value = nn.Embedding(context_window, embed_dim)
> 
> # Get Q, K, V vectors for each token
> q = Query(tokens)  # shape: [4, 6]
> k = Key(tokens)    # shape: [4, 6]
> v = Value(tokens)  # shape: [4, 6]
> 
> # Compute raw attention scores via dot product: [4 × 6] @ [6 × 4] → [4 × 4]
> # Each row i gives the score of how much token i attends to token j
> wei = q @ k.T
> 
> # Optional: scale scores by sqrt(d_k) for stability (standard in attention)
> wei = wei / (embed_dim ** 0.5)
> 
> # Causal mask (lower-triangular): prevents token i from attending to future tokens (j > i)
> mask = torch.tril(torch.ones(context_window, context_window))  # ones on and below the diagonal
> wei = wei.masked_fill(mask == 0, float('-inf'))
> 
> # Softmax over each row to get attention weights
> attention_weights = F.softmax(wei, dim=-1)
> print(f' The weight matrix indicated above as the output of softmax is given by: \n\n {attention_weights}\n\n')
> 
> # Apply attention weights to the value vectors: [4 × 4] @ [4 × 6] → [4 × 6]
> attention_output = attention_weights @ v
> print(f'The Value matrix is given by: \n\n {v} \n\n')
> 
> print(f'The output of attention is given by:\n\n {attention_output}')
> ```

This code produces the output below, which is meant to mirror a context window of size 4.
- Take, for example, the text "STOP". As we saw above, the model gets 4 training examples from this:

> ```
> Sequence:   "STOP"
> Context     → Target
> S           → T
> ST          → O
> STO         → P
> STOP        → P
> ```



- The weight matrix indicated above as the output of softmax is given by:
$$
\text{softmax} \left( \frac{Q \cdot K^\top}{\sqrt{d_K}}\right) = 
\begin{pmatrix}
    1.0000 & 0.0000 & 0.0000 & 0.0000 \\
    0.6225 & 0.3775 & 0.0000 & 0.0000 \\
    0.0930 & 0.7260 & 0.1810 & 0.0000 \\
    0.5831 & 0.1036 & 0.2475 & 0.0658 
\end{pmatrix}
$$

- The Value matrix is given by:
$$ 
V =
\begin{pmatrix}
 0.5769 & -0.1299 & -0.6200 & -0.4805 &  0.6406 & -1.7443 \\
-1.3280 &  0.3363 &  0.8598 & -0.2628 & -0.0388 &  0.6033 \\
-2.1684 & -1.1359 & -0.2304 & -0.5904 &  2.0315 &  0.0827 \\
 0.7663 & -1.0580 &  0.8923 &  0.7705 & -1.3446 &  0.0131 \\
\end{pmatrix}
$$

- The output of attention is:
$$ AV =
\begin{pmatrix}
 0.5769 & -0.1299 & -0.6200 & -0.4805 &  0.6406 & -1.7443 \\
-0.1422 &  0.0461 & -0.0613 & -0.3983 &  0.3841 & -0.8580 \\
-1.3029 &  0.0265 &  0.5249 & -0.3423 &  0.3991 &  0.2907 \\
-0.2874 & -0.3916 & -0.2707 & -0.4028 &  0.7837 & -0.9332 \\
\end{pmatrix}
$$

> Notes:
> 1)  All of these values are randomly initialized; as the model learns, these entries will change.
> 2)  The way this is set up ensures that there are as many training examples as there are tokens in the sequence, as in the "STOPPING" example above. The first token 'S' can only attend to itself (this is why the first row of $AV$ equals the value vector of 'S'). As the model trains, its prediction at that position will be steered towards 'T'. Without the mask, 'S' could simply look ahead at the answer, hence the need for a mask!
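
The toy code above skips the projection layers. A minimal sketch of a single head that includes them might look like the following; the class name `Head`, the argument names, and the sizes are assumptions made for illustration, not the notebook's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head with learned Q, K, V projections."""

    def __init__(self, embed_dim, head_size, block_size):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, head_size, bias=False)
        # Lower-triangular mask, stored as a buffer (not a trainable parameter)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        q, k, v = self.query(x), self.key(x), self.value(x)   # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                           # rows sum to 1
        return wei @ v                                         # (B, T, head_size)


torch.manual_seed(0)
head = Head(embed_dim=6, head_size=6, block_size=4)
x = torch.randn(2, 4, 6)  # (batch, time, embed_dim)
out = head(x)
print(out.shape)
```

Because of the mask, changing a later token never changes the output at an earlier position, and the output at position 0 is exactly that token's value vector.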

### So far...
We have now covered the first few portions of the diagram above. The text is fed into an embedding space, and we then add a positional embedding, which shifts each vector from its base position. We have a single attention head (we get more by running several smaller heads side by side and concatenating the results, but more on that below). Next, we address the arrow that skips around the attention block and enters the `Add & Norm` box. This is called a **residual connection**.


### 4. **Residual Connections**:
Residual connections — often called the residual stream in interpretability literature — are a simple idea: we let the input flow straight through, and then add the result of the transformation (like attention or MLP) on top of it.

This additive structure helps the model:

- Preserve original information

- Accumulate useful transformations

- Prevent vanishing gradients in deep networks

A (potentially) helpful way to think about it is like this:

> Imagine passing a sketch around a classroom. Each student adds a few new details — but no one starts over. Instead, they draw on top of the original.
> The final image is the cumulative result of all those modifications, layered on the same canvas.

This makes training more stable as the model gets deeper. In code, it is simply `x = x + attention(x)`.
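
A minimal sketch of the idea, using a linear layer as a stand-in for the attention or MLP transformation (an assumption made purely for illustration). Zeroing the stand-in shows the "input flows straight through" behavior: when the sublayer contributes nothing, the output is exactly the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 6
sublayer = nn.Linear(embed_dim, embed_dim)  # stand-in for attention or the MLP

# Assumption for illustration: zero the sublayer so it contributes nothing yet
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(4, embed_dim)
out = x + sublayer(x)  # residual connection: input plus the sublayer's update

print(torch.equal(out, x))
```

Each sublayer therefore only has to learn an update to the running representation rather than rebuild it from scratch.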

### 5. **Layer Norm**:
The next step, also crucial for training deeper networks, is LayerNorm. We’ve already seen the add component in the residual connection; now we apply the normalization step: `x = LayerNorm(x + attention(x))`.

This operation normalizes the features within each token’s vector, adjusting them to have mean 0 and variance 1. Unlike BatchNorm, which normalizes across a batch of inputs, LayerNorm works within each token, independently of batch size.

> In the metaphor from above (sketch passed around the classroom):
LayerNorm is like wiping the smudges and adjusting the contrast after each student adds their details. It doesn’t erase what was added — it just ensures the image doesn’t get too dark, bright, or blurry to work with.
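
The per-token behavior can be checked directly with `nn.LayerNorm`. In this sketch the input is deliberately shifted and scaled so the effect is visible; the sizes are toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 6
layer_norm = nn.LayerNorm(embed_dim)

x = torch.randn(2, 4, embed_dim) * 5 + 3  # (batch, time, embed_dim), deliberately off-center
y = layer_norm(x)

# Statistics are computed over the last dimension, i.e. separately for each token
print(y.mean(dim=-1))                  # close to 0 for every token
print(y.var(dim=-1, unbiased=False))   # close to 1 for every token
```

Note that `nn.LayerNorm` also carries a learnable scale and shift, which start at 1 and 0, so the statistics above hold at initialization.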



### 6. **Feedforward**:
The tokens have had time to talk to one another — to pass information, share context, and attend to what matters. Now comes the feedforward block, a simple MLP applied to the result of the attention block. 

> This is where the model, as Karpathy puts it, adds computation to the network and lets each token think about what it found from the other tokens.

The feedforward layer processes each token’s vector independently, applying two linear layers with a nonlinearity in between:
$$ \text{FFN}(x) = W_2 \, \text{GELU}(W_1 x + b_1) + b_2 $$
where
- $W_1$ projects up to a higher dimension ($4 \times$ the embedding dimension in the paper)
- $\text{GELU}$ (or ReLU, as in the paper) provides the nonlinearity
- $W_2$ projects back down.

This expands the model’s representational capacity and allows it to:
- Recombine and reshape the attended features
- Introduce nonlinearity for richer representations
- Learn more abstract patterns and interactions between features
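
The formula above translates almost line for line into PyTorch. The following is a minimal sketch with toy sizes; the class name `FeedForward` is an assumption for illustration.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise MLP: expand to 4x the width, apply GELU, project back."""

    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # W_1, b_1
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),  # W_2, b_2
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
ffn = FeedForward(embed_dim=6)
x = torch.randn(2, 4, 6)  # (batch, time, embed_dim)
out = ffn(x)
print(out.shape)
```

Since the linear layers act on the last dimension only, each token is processed independently: running the MLP on a single token gives the same result as running it on the whole sequence and picking out that token.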


---

### Steps **0-6** combine to give us a single head transformer block! Now, to add heads and blockify things.

---

### 7. **Multiheaded attention**:
The single attention head we described above uses one set of query, key, and value matrices, $(Q, K, V)$, to learn contextual relationships across the token sequence. But there is no need to stop at one! Instead of having a single head that learns a single pattern, we stack together multiple attention heads that run in parallel, each with its own set of query, key, and value matrices:

$$ (Q_i, K_i, V_i) $$

Each head can learn to focus on different aspects of the sequence: some may specialize in local patterns (for example, "previous-token heads" that attend to the token just before), others in broader dependencies, syntax, or punctuation.

Here’s the general flow:
- The input is passed into multiple attention heads
- Each head computes its own attention output
- These outputs are concatenated
- A final linear layer projects the result back to the original embedding dimension

This gives the model multiple ways of "seeing" each token — expanding the representation before passing it into the next layer.

> Imagine giving the same sketch to several students at once, and asking each one to focus on something different — shading, color, composition, geometry. Each student makes their changes independently and when they’re done, we combine their work into one unified version, layering their perspectives onto the same canvas.
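
The flow described above (several heads, concatenation, final projection) can be sketched as follows. This is a self-contained illustration with toy sizes; the class names and the choice `head_size = embed_dim // num_heads` are assumptions, not the notebook's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head."""

    def __init__(self, embed_dim, head_size, block_size):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        q, k, v = self.query(x), self.key(x), self.value(x)
        wei = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v


class MultiHeadAttention(nn.Module):
    """Run several heads in parallel, concatenate, then project."""

    def __init__(self, embed_dim, num_heads, block_size):
        super().__init__()
        head_size = embed_dim // num_heads  # assumption: split the width evenly
        self.heads = nn.ModuleList(
            [Head(embed_dim, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_size, embed_dim)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)  # (B, T, num_heads * head_size)
        return self.proj(out)                                       # (B, T, embed_dim)


torch.manual_seed(0)
mha = MultiHeadAttention(embed_dim=12, num_heads=3, block_size=8)
x = torch.randn(2, 8, 12)  # (batch, time, embed_dim)
out = mha(x)
print(out.shape)
```

Looping over a `ModuleList` keeps the code close to the description; faster implementations usually compute all heads in one batched matrix multiplication.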

### 8. **Transformer block**:
Next, we package everything we have above into a block:
```
block:
- Multiheaded attention
- Residual + LayerNorm
- Feedforward MLP
- Residual + LayerNorm
```
Each block allows the model to:
- Attend to relevant parts of the sequence
- Refine and transform those relationships
- Normalize and stabilize the internal flow of information

We then stack these blocks — typically anywhere from 6 to 96 times — to build up depth and complexity. More blocks give the model more opportunities to process and reprocess information, layer by layer.
> For reference: GPT-3 uses 96 transformer blocks.

In our mini GPT-Frosty model, we use 6: enough to capture nontrivial patterns from Robert Frost’s poems while keeping things lightweight and interpretable.
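
The block structure listed above can be sketched as follows. This is a hedged illustration rather than the notebook's actual code: it swaps in PyTorch's built-in `nn.MultiheadAttention` for hand-written heads, follows the post-norm ordering listed above (`LayerNorm(x + sublayer(x))`), and uses toy sizes. Many GPT-style implementations instead apply LayerNorm before each sublayer (pre-norm).

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Post-norm transformer block: attention, Add & Norm, MLP, Add & Norm."""

    def __init__(self, embed_dim, num_heads, dropout=0.0):
        super().__init__()
        # Assumption: PyTorch's built-in attention stands in for hand-written heads
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        T = x.shape[1]
        # Boolean mask: True marks positions a token may NOT attend to (the future)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + attn_out)     # residual + LayerNorm
        x = self.ln2(x + self.ffn(x))  # residual + LayerNorm
        return x


torch.manual_seed(0)
blocks = nn.Sequential(*[Block(embed_dim=12, num_heads=3) for _ in range(6)])
x = torch.randn(2, 8, 12)  # (batch, time, embed_dim)
out = blocks(x)
print(out.shape)
```

Every block maps a `(batch, time, embed_dim)` tensor to one of the same shape, which is what makes stacking them with `nn.Sequential` possible.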


### 8.5 **Dropout:**

While not part of the transformer’s core architecture, **dropout** is an important component for training stability and generalization.

Dropout is typically applied:
- After the multi-head attention output
- After the MLP output
- Sometimes on embeddings

During training, dropout randomly zeroes out elements of the input with some probability (e.g., `p = 0.2`). This helps prevent the model from **overfitting** the training data.

We ran our GPT on several different text files. On a small dataset of `yoda` quotes, the model memorized the training data quite quickly, and its results on the test data started to diverge from its results on the training data, a typical sign of overfitting. Dropout proved itself immediately in this experiment: once added, it helped the model **generalize** well, even with a small dataset.
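
The difference between training and evaluation behavior is easy to see on a vector of ones. This is a small sketch using `nn.Dropout` with the same `p = 0.2` mentioned above; the vector length is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()           # training mode: elements are randomly zeroed
y_train = dropout(x)
kept = y_train != 0

dropout.eval()            # evaluation mode: dropout is a no-op
y_eval = dropout(x)

print(kept.float().mean())      # roughly 0.8 of the elements survive
print(y_train[kept].unique())   # survivors are scaled by 1 / (1 - p) = 1.25
print(torch.equal(y_eval, x))
```

The rescaling of the surviving elements keeps the expected value of each activation the same in training and evaluation, which is why `model.eval()` should be called before sampling or measuring test loss.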

### 9. **Final output: projection and softmax**:
After the input has passed through all the transformer blocks, we’re left with a list of contextualized token vectors — one for each position in the input sequence.

To turn these vectors into predictions, we map each one back to the vocabulary using a linear projection (a matrix of shape `[embedding_dim × vocab_size]`) and apply a softmax to get a probability distribution over the next token. This distribution is the model’s best guess at what comes next, and it is what we sample from when generating text.

> Just like in a bigram model, we can generate new text by feeding in some context (tokens), sampling the next token from the predicted distribution, appending it, and repeating. 
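
The sampling loop described above can be sketched as follows. `TinyModel` is a hypothetical stand-in for the trained GPT (it only needs to map token indices to logits), and `generate`, `block_size`, and the toy sizes are likewise assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyModel(nn.Module):
    """Hypothetical stand-in for the trained GPT: (B, T) indices -> (B, T, vocab_size) logits."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)  # projection back to the vocabulary

    def forward(self, idx):
        return self.lm_head(self.embed(idx))


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                      # crop to the context window
        logits = model(idx_cond)[:, -1, :]                   # logits at the last position: (B, vocab_size)
        probs = F.softmax(logits, dim=-1)                    # distribution over the next token
        next_idx = torch.multinomial(probs, num_samples=1)   # sample one token: (B, 1)
        idx = torch.cat([idx, next_idx], dim=1)              # append and repeat
    return idx


torch.manual_seed(0)
vocab_size, block_size = 10, 8
model = TinyModel(vocab_size, embed_dim=16)
start = torch.zeros((1, 1), dtype=torch.long)  # start from token 0
generated = generate(model, start, max_new_tokens=20, block_size=block_size)
print(generated.shape)
```

Sampling with `torch.multinomial` (rather than always taking the most likely token) is what makes each generated poem different.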

---

### 10. **Let's hear it Frosty**:
After all the theory and architecture, it’s time to let our model speak. We trained our GPT (Frosty) on a collection of Robert Frost poems with the following parameters:
- Batch size: 32
- Context window: 256
- Training steps: 5000
- Learning rate: 3e-4
- Embedding dimension: 384
- Number of heads: 6
- Number of transformer blocks: 6
- Dropout: 0.2
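
For convenience, these settings can be collected in one place. The dataclass below is only a sketch: the values come from the list above, while the class name, the field names, and the derived `head_size` (which assumes the embedding is split evenly across heads) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FrostyConfig:
    """Hyperparameters from the list above; the field names are illustrative."""

    batch_size: int = 32
    block_size: int = 256        # context window
    max_steps: int = 5000        # training steps
    learning_rate: float = 3e-4
    embed_dim: int = 384
    num_heads: int = 6
    num_blocks: int = 6
    dropout: float = 0.2

    @property
    def head_size(self) -> int:
        # Assumes the embedding is split evenly across heads
        return self.embed_dim // self.num_heads


config = FrostyConfig()
print(config.head_size)  # 384 // 6 = 64
```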

#### 10.1. Frosty's first poem
> ONENT 
> 
> 
> And flower what checks a smile darled out end 
> dows for a farmica cellar to fice. 
> 
> The fame needn’t know that’s have left. 
> 
> 
> Neither expered. His earn one. 
> 
> 
> [ 275 ] 