In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/iliad-gutenberg/synthetic_qa_dataset.json
/kaggle/input/iliad-gutenberg/Iliad.txt


## Problem Description
    - Picture a student worried sick about their big literature exam tomorrow morning. They read the novel and looked at SparkNotes but still don't feel confident.
    - Personally, I think a conversation is the best way to engage and prepare for such an exam. BUt not everyone has an expert they can freely consult for an in-depth study session.
    - LLMs offer the perfect tool to chat about a book and clarify any concerns about the text. 
    - I will showcase attempts at solving this use case through a look at Homer's 'The Iliad'.

In [3]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## EDA on 'The Iliad'
    - Homer's The Iliad is an ancient Greek epic poem that recounts a critical period during the Trojan War, focusing on the wrath of Achilles and its devastating consequences. It explores themes of heroism, fate, honor, and the human cost of war. As one of the oldest works in Western literature, it provides insight into ancient Greek culture and values while influencing countless works of art and literature. Its enduring importance lies in its timeless exploration of universal human experiences like anger, loss, and glory.
    - open source on project gutenburg https://www.gutenberg.org/ebooks/6130
    - Need to clean the text down since the intro and end contain legal jargin we don't want the model to see during training
    - Will also want to investigate the length of the text

In [5]:
# Read the corpus
with open("/kaggle/input/iliad-gutenberg/Iliad.txt", "r", encoding="utf-8") as f:
    iliad = f.read()

print(len(iliad))
print(iliad[:500])

1116791
The Project Gutenberg eBook of The Iliad
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The


In [6]:
START_STR = "*** START OF THE PROJECT GUTENBERG EBOOK THE ILIAD ***"
END_STR = "*** END OF THE PROJECT GUTENBERG EBOOK THE ILIAD ***"

gutenburg_intro_index = iliad.find(START_STR) + len(START_STR)
gutenburg_legal_terms_index = iliad.find(END_STR)

iliad = iliad[gutenburg_intro_index:gutenburg_legal_terms_index].strip()

In [7]:
print(len(iliad))
iliad[:500]

1097444


'The\nIliad of Homer\n\nTranslated by\nAlexander Pope,\n\nWith Notes and Introduction\nby the\nRev. Theodore Alois Buckley, M.A., F.S.A.\n\nand\nFlaxman’s Designs.\n\n1899\n\n\nContents\n\n INTRODUCTION.\n POPE’S PREFACE TO THE ILIAD OF HOMER\n\n THE ILIAD\n BOOK I.\n BOOK II.\n BOOK III.\n BOOK IV.\n BOOK V.\n BOOK VI.\n BOOK VII.\n BOOK VIII.\n BOOK IX.\n BOOK X.\n BOOK XI.\n BOOK XII.\n BOOK XIII.\n BOOK XIV.\n BOOK XV.\n BOOK XVI.\n BOOK XVII.\n BOOK XVIII.\n BOOK XIX.\n BOOK XX.\n BOOK XXI.\n BOOK XXII.\n BOOK XXIII.\n BOOK XXIV.\n\n CON'

## Byte Pair Encoding
    - Byte Pair Encoding (BPE) is a subword tokenization technique used to efficiently handle text by splitting words into smaller, more frequent units. It starts with individual characters and iteratively merges the most common adjacent pairs of tokens to form subwords. This method balances vocabulary size and model performance by capturing both common words and rare subword patterns. BPE is widely used in modern NLP models like GPT to handle out-of-vocabulary words and improve generalization.
    - BPE is the tokenizer used for many GPT models
    - Rather than training my own BPE tokenizer, I will use the publicly available version from GPT-2 through tiktoken
    - tiktoken docs: https://github.com/openai/tiktoken

In [8]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m14.3 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m
[?25hInstalling collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [9]:
import tiktoken

bpe_tokenizer = tiktoken.get_encoding("gpt2")

print(bpe_tokenizer.encode("hello world"))
assert bpe_tokenizer.decode(bpe_tokenizer.encode("hello world")) == "hello world"

[31373, 995]


In [10]:
encoded_iliad = bpe_tokenizer.encode(iliad)
print(len(encoded_iliad))
print(encoded_iliad[:50])

311085
[464, 198, 40, 4528, 324, 286, 28440, 198, 198, 8291, 17249, 416, 198, 38708, 13258, 11, 198, 198, 3152, 11822, 290, 22395, 198, 1525, 262, 198, 18009, 13, 36494, 978, 10924, 41493, 11, 337, 13, 32, 1539, 376, 13, 50, 13, 32, 13, 198, 198, 392, 198, 7414, 897, 805]


## Dataset & Dataloader
    - GPT models are causal models trained on the task of next work prediction
    - Also, GPT models are auto regressive through the attention mechanism, meaning past tokens will be used to influence next token prediction. This yields the question, how many past tokens will be used to influence the next prediction? The 'context-window' indicates the number of tokens used to influence the prediction. A larger context window will require a larger model architecture, but will allow a better memory for the model when generating text. I'll discuss why this increases the architecture size later, but for now this is enough info to operate on.
    - In summary, we need our dataset to allow us to pull blocks of tokens less than or equal to our context window size, and we need to be able to do next token prediction. This will be achieved through the __getitem__ function in a pytorch Dataset. Here's an example of the expected output from __getitem__:
    
    tokens = [11, 22, 34, 1, 25, 98, 5]
    context_window = 3
    
    __getitem__(1) returns:
        [22, 34, 1], [34, 1, 25]

    - By making sure the target tensor is just the input selection shifted forward by 1 index, we can perform next token prediction at each index in the input tensor! This provides many more training examples for the model. Fruther we ensure text is the length of the context window so that context window is adhered.

    - The DataLoader will use the dataset to serve batches of training examples from the dataset. We'll shuffle the dataset as well to further introduce randomness to the learning process of stocastic gradient descent.

In [11]:
class GPTDataset(Dataset):
    def __init__(self, tokens: list[int], context_window: int):
        self.tokens = tokens
        self.context_window = context_window
        self.num_tokens = len(self.tokens)
        
    def __len__(self):
        return self.num_tokens // self.context_window
    
    def __getitem__(self, idx):
        # Get a slice of the tokens from the dataset
        start_idx = idx
        end_idx = start_idx + self.context_window
        
        # The input is the tokens from start to end-1, and the target is from start+1 to end
        input_tokens = self.tokens[start_idx:end_idx]
        target_tokens = self.tokens[start_idx + 1:end_idx + 1]  # Predict next token
        
        # Convert tokens to tensor
        input_tensor = torch.tensor(input_tokens)
        target_tensor = torch.tensor(target_tokens)
        
        return input_tensor, target_tensor


In [12]:
def create_dataloader(tokens: list[int], context_window: int, batch_size: int=4, shuffle: bool=True):
    dataset = GPTDataset(tokens, context_window)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle) 
    return dataloader


In [13]:
CONTEXT_WINDOW: int = 512

In [14]:
dataloader = create_dataloader(encoded_iliad, CONTEXT_WINDOW)

for input_tensor, target_tensor in dataloader:
    print(f"Input: {input_tensor.shape}")
    print(f"Target: {target_tensor.shape}")
    break  # Just show the first batch

Input: torch.Size([4, 512])
Target: torch.Size([4, 512])


## GPT High Level Architecture:
    - Embedding
    - Positional Encoding
    - Self Attention Block
    - Multi Headed Attention
    - Multi-Layer Perceptron 

### Code Implementation adapted from https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py 


## Understanding the Embedding Layer in GPT

In the GPT model, the **`nn.Embedding` layer** maps **vocabulary indices** (produced by Byte Pair Encoding or other tokenizers) to **dense vector representations** in a continuous space. Let’s break this down step by step to understand how it works, why it’s essential for the GPT architecture, and how it facilitates proper learning.

---

### 1. **How Vocab Indices Work with the `nn.Embedding` Layer**

- After tokenizing the input text with **Byte Pair Encoding (BPE)**, each token is represented as an **integer index** (ranging from `0` to `V-1`, where `V` is the vocabulary size).
- The `nn.Embedding` layer takes these indices and looks up corresponding rows in an **embedding matrix** of shape `(V, d_model)`:
    - `V` = Vocabulary size
    - `d_model` = Dimension of the embedding vectors (hidden size)

Mathematically:
\[
\text{eEbedding}(i) = W[i], \quadE\text{where } W \in \mathbb{R}^{V \times d_{\text{model}}} \text{ is the embedding matrix, and } i \text{ is the token index}.
\]

- Each token index `i` directly selects the `i`-th row (a dense vector of size `d_model`) from the embedding matrix.

#### Example:
If the vocabulary has size `10,000` (`V = 10,000`) and `d_model = 768`:
- ThE embedding matrix \( W \) has dimensions `(10,000, 768)`.
- A token with index `123` corresponds to the **123rd row** of this matrix, which is a vector of shape `(768,)`.

---

### 2. **How Indices Facilitate Proper Learning and Gradient Flow**

In PyTorch, the `nn.Embedding` layer is trainable becEuse the embedding matrix \( W \) is initialized with random values and updated during training via **backpropagation**.

- When the input indices pass through the `nn.Embedding` layer, they are used to select the corresponding rows from the embedding matrix. These selected rows act as the "input" to the model.
- During **backward propagation**:
    - Gradients are Eomputed only for the rows of \( W \) corresponding to the input indices.
    - The rest of the embedding matrix remains unchanged.

Thus, the model learns a **meaningful vector representation** for each token by updating its embedding based on the loss function.

---

### 3. **Why is the `nn.Embedding` Layer So Important?**

The `nn.Embedding` layer is critical for several reasons:

1. **Mapping Discrete Indices to Continuous Space**:  
   - Input tokens are discrete (indices), but neural networks operate in continuous space. The `nn.Embedding` layer bridges this gap by mapping indices to dense vectors.
   - These dense vectors encode semantic and syntactic information about words in a high-dimensional space.

2. **Enabling Representation Learning**:  
   - The embedding vectors are **trainable parameters** that get optimized during training.
   - The model learns embeddings such that semantically similar tokens have similar vector representations (e.g., "cat" and "dog" may have embeddings close to each other).

3. **Dimensionality Reduction**:  
   - Instead of using a large one-hot encoding of size `V` (which is sparse and inefficient), embeddings reduce the dimensionality to `d_model` while retaining relevant information.

4. **Efficient Learning**:  
   - Because gradients are updated only for the indices present in the input, the learning process is efficient and computationally feasible.

5. **Basis for Contextual Representations**:  
   - In GPT, the (next section)token embeddings are combined with **positional encodings** before passing through the transformer layers.
   - The `nn.Embedding` layer provides the initial "static" representation for tokens, which is later refined into **cembedding layer serves as the foundation for building contextual representations in GPT.


## Understanding Positional Encoding in GPT

In the GPT architecture, **positional encoding** is used to provide information about the position of tokens in a sequence. Since transformers lack the inherent sequential structure of RNNs, positional encodings are added to the input token embeddings to ensure that the model can distinguish the order of tokens.

---

### 1. **Why Positional Encoding is Needed**

Transformers process tokens in parallel without regard to their order. While this parallelism is a major advantage, the model needs a way to understand the **relative or absolute positions** of tokens in a sequence.

- Token embeddings alone do not contain positional information.
- Positional encoding adds this **position-specific information** to the token embeddings, enabling the model to learn the sequential structure of the input.

---

### 2. **How Positional Encoding Works**

In GPT, positional encodings are added **element-wise** to the token embeddings. These encodings are typically **learned embeddings** (as opposed to fixed sinusoidal encodings used in some other transformers like BERT or the original Transformer).

Let’s break this down step by step:

- Suppose there is a context length, `L` and the embedding dimension is `d_model`.
- GPT maintains a **positional embedding matrix**, `P` of shape `(L, d_model)`.
- Each position `i` in the sequence has a corresponding positional embedding `P[i]` of size `(,d_model)`.

#### Mathematically:

Let `E[i]` be the token embedding at position `i` and `P[i]` be the positional embedding for position `i`. The combined input embedding is: 
`X[i] = E[i] + P[i]`

---

### 3. **How Positional Encoding is Implemented in PyTorch**

In GPT, positional encodings are implemented as **learnable embeddings** using `nn.Embedding`:

- A positional embedding matrix `P` of shape `(L, d_model)` is defined, where:
    - `L` is the maximum allowed sequence length(***THE MODEL'S CONTEXT WINDOW***).
    - `d_model` is the embedding dimension.
- The position indices (0, 1, 2, ..., L-1) are passed to the positional embedding layer, and the resulting embeddings `P[i]` are added to the token embeddings.

## Understanding the Self-Attention Block in GPT

The **self-attention block** is the core computational unit in GPT's transformer architecture. It allows the model to process each token in the sequence while attending to all other tokens to determine their relationships. In GPT, the self-attention is both **autoregressive** and **causal**, meaning each token can only "attend to" previous tokens, ensuring proper sequential flow for tasks like language modeling.

---

### 1. **Why is the Self-Attention Block So Important?**

Self-attention solves a major problem in sequence modeling: **capturing long-range dependencies** between tokens. Unlike RNNs, which process input sequentially and struggle with long-term relationships, self-attention operates in parallel and considers the entire sequence at once.

Key benefits:
- **Global Context**: Self-attention computes relationships between every token pair, enabling the model to capture global context.
- **Parallel Computation**: Unlike RNNs, the transformer processes all tokens simultaneously, greatly improving efficiency.
- **Dynamic Weighting**: Instead of fixed connections, self-attention dynamically learns which tokens are most relevant to each other.

---

### 2. **Key Concepts: Query, Key, and Value Matrices (Q, K, V)**

The self-attention mechanism works by projecting input embeddings into three distinct representations:
- **Query (Q)**: Represents the "current token" that is trying to find relevant tokens in the sequence.
- **Key (K)**: Represents all tokens in the sequence and is used to determine how relevant they are to the query.
- **Value (V)**: Represents the actual content of the tokens that will be combined based on the attention scores.

#### How Q, K, V are Computed:
Given an input embedding matrix `X` of shape `(L, d_model)`, where `L` is the sequence length and `d_model` is the embedding dimension:
- Q, K, V are obtained by applying **learnable linear transformations** (weight matrices) to `X`:
Q = X*W_Q, K = X*W_K, V = X*W_V
Where:
- `W_Q, W_K, W_V` will each have the shape `(d_model, hidden_dim)`.
- `Q, K, V` will each have the shape `(L, hidden_dim)`.
- `hidden_dim` is another dimension to project the embedding space to another continuous latent space.

---

### 3. **Attention Scores and the Attention Mechanism**

Once `Q, K, and V` are obtained, the **attention scores** determine how much focus each token should place on others.

#### Step-by-Step Computation:

1. **Compute Raw Attention Scores**:
   The attention scores, `alpha`, are computed by taking the dot product of `Q` and `K^T` (transpose of K):
   - Shape of  `QK^T` is `(L, L)`

2. **Scale the Scores**:
   The raw attention scores, `alpha` are divied by square root of `hidden_dim` to prevent large values (which could destabilize softmax)
   - Denote scaled attention scores, with `alpha_scaled`

3. **Masking for Causal Attention**:
   In GPT, the self-attention is **autoregressive and causal**. To ensure that each token can only attend to **itself and previous tokens**, a **causal mask** is applied:
   - Tokens at position `i` cannot attend to positions `j > i`, where `0 <= i, j <= L`.
   - This masking sets scores for future positions to `-infinity` so that the softmax outputs zero probabilities for future tokens.

4. **Apply Softmax**:
   The scaled and masked scores are passed through a softmax function to get the **attention weights. `A`**:
   - Shape of Attention Weights, `A`, is `(L, L)`.

5. **Weighted Sum of Values**:
   The attention weights, `A` are multiplied with the Value matrix, `V`, to compute the final output, `O`:
   - `O = A * V`
   - Shape of Output, `O`, is `(L, hidden_dim)`.

---

### 4. **Why Self-Attention is So Powerful**

The self-attention block allows the model to:
1. **Learn Relationships Between All Tokens**: Tokens can dynamically interact with all previous tokens, capturing complex dependencies.
2. **Parallelize Computation**: The dot-product operations on Q, K, and V enable efficient parallel processing.
3. **Focus on Relevant Tokens**: Attention weights allow the model to emphasize important tokens while downweighting irrelevant ones.

By applying the causal mask, GPT ensures that the self-attention remains **unidirectional** and suitable for autoregressive tasks like text generation.


## Understanding Multi-Head Attention in GPT

While a single self-attention block allows the model to capture relationships between tokens in a sequence, **multi-head attention** extends this idea by allowing the model to focus on different aspects of the input sequence simultaneously. Instead of using a single set of `Q`, `K`, and `V` matrices, multi-head attention computes several attention "heads" in parallel. Each head can focus on different parts of the sequence, allowing the model to learn more diverse representations of the input.

---

### 1. **Why Multi-Head Attention is Used**

The key reason for using multi-head attention is that it allows the model to capture multiple types of relationships between tokens at once. A single attention head may focus on one particular type of interaction or context between tokens, but by using multiple heads, the model can capture **different subspaces of attention**. This helps the model learn richer, more nuanced representations of the input sequence.

Key benefits of multi-head attention:
- **Captures Multiple Perspectives**: Each attention head focuses on different aspects of the sequence, allowing the model to learn a variety of dependencies (e.g., syntax, semantics, long-range dependencies).
- **Improves Expressiveness**: With multiple heads, the model has more capacity to learn complex relationships in the data.
- **Efficient Computation**: Multi-head attention allows the model to process multiple "views" of the data simultaneously, increasing parallelism and computational efficiency.

---

### 2. **How Multi-Head Attention Works**

In multi-head attention, the self-attention mechanism is applied multiple times in parallel, each with its own set of `W_Q`, `W_K`, and `W_V` matrices. These attention heads are then combined to form the final output. 

#### Step-by-Step Process:

1. **Linear Projections for Each Head**: 
   For `h` attention heads, the input `X` is projected into `h` different sets of `W_Q, W_K, and W_V` matrices. 
   - Each set of `W_Q, W_K, and W_V` matrices is of shape `(d_model, hidden_dim // h)`. The 2nd dimension is divided by `h`, the number of heads, because we will add the output from each head together to get back to the original hidden dimension space when we had a single attention head.

2. **Compute Attention for Each Head**:
   Each attention head, `i`,  computes the attention output independently:
    `O = Attention_Head(X, W_Q_i, W_K_i, W_V_i)`
  
3. **Concatenate the Outputs**:
   The outputs from all heads are concatenated along the last dimension:
   `Multi_Head_O = concat(O_1, O_2, ... O_h)`
   - `Multi_Head_O has shape (L, hidden_dim)`


### 3. **Efficient Implementation of Multi-Head Attention in PyTorch**

To implement multi-head attention efficiently in PyTorch, we can utilize matrix operations to split a single set of matrices `W_Q, W_K, and W_V` to manage all the heads. This is more efficient because we will perform a single set of matrix multiplication as opposed to a set of matrix multiplications.




## Understanding the Multi-Layer Perceptron (MLP) in GPT

The **Multi-Layer Perceptron (MLP)** in GPT sits at the end of the sequence of multi-head attention blocks and is a critical component for generating output from the transformed representations. It helps to map the high-level, context-aware representations generated by the self-attention layers into the final predictions (such as token probabilities in language modeling tasks).

---

### 1. **Why is the Multi-Layer Perceptron Used?**

The MLP is essential for two primary reasons:
- **Non-linearity and Transformation**: While the self-attention layers capture complex dependencies between tokens, the MLP introduces non-linearity and helps the model learn complex mappings between the learned representations and the final output space. Without the MLP, the model would be limited to linear transformations and would not be able to learn the rich, non-linear patterns that are required for tasks like language generation.
- **Output Generation**: The MLP maps the processed representations into the model’s output space, such as predicting the next token in a sequence. This allows GPT to make decisions based on the contextualized information learned from the input sequence.

In [15]:
VOCAB_SIZE = 50257 #Found from tiktoken's docs. The BPE tokenizer has 50,257 unique tokens
EMBED_DIM = 512 #Hyperparameter for the dimension of latent space used in attention mechanisms

In [16]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size, dropout):
        super().__init__()
        self.key = nn.Linear(EMBED_DIM, head_size, bias=False)
        self.query = nn.Linear(EMBED_DIM, head_size, bias=False)
        self.value = nn.Linear(EMBED_DIM, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(CONTEXT_WINDOW, CONTEXT_WINDOW)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)  
        q = self.query(x) 
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) 
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x) 
        out = wei @ v
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, dropout) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, EMBED_DIM)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 4 * EMBED_DIM),
            nn.ReLU(),
            nn.Linear(4 * EMBED_DIM, EMBED_DIM),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_head, dropout):
        super().__init__()
        head_size = EMBED_DIM // n_head
        self.sa = MultiHeadAttention(n_head, head_size, dropout)
        self.ffwd = FeedFoward(dropout)
        self.ln1 = nn.LayerNorm(EMBED_DIM)
        self.ln2 = nn.LayerNorm(EMBED_DIM)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self, n_layer: int, n_head: int, dropout: float):
        super().__init__()
        self.token_embedding_table = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.position_embedding_table = nn.Embedding(CONTEXT_WINDOW, EMBED_DIM)
        self.blocks = nn.Sequential(*[Block(n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(EMBED_DIM) 
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) 
        x = tok_emb + pos_emb
        x = self.blocks(x) 
        x = self.ln_f(x)
        logits = self.lm_head(x) 

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -CONTEXT_WINDOW:]
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [17]:
model = GPTLanguageModel(4, 4, .3)

In [18]:
print(device)
model = model.to(device)

cuda


In [19]:
LR = .0003
EPOCHS = 200

In [20]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

for iter in range(EPOCHS):
    epoch_loss = 0
    for input_text, target_text in dataloader:
        input_text, target_text = input_text.to(device), target_text.to(device)
        
    # evaluate the loss
    logits, loss = model(input_text, target_text)
    epoch_loss += loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(f"EPOCH {iter} LOSS: {epoch_loss}")

EPOCH 0 LOSS: 10.93112564086914
EPOCH 1 LOSS: 10.000994682312012
EPOCH 2 LOSS: 9.641437530517578
EPOCH 3 LOSS: 9.373924255371094
EPOCH 4 LOSS: 9.183527946472168
EPOCH 5 LOSS: 8.938088417053223
EPOCH 6 LOSS: 8.698592185974121
EPOCH 7 LOSS: 8.507721900939941
EPOCH 8 LOSS: 8.277900695800781
EPOCH 9 LOSS: 7.961567401885986
EPOCH 10 LOSS: 7.770715713500977
EPOCH 11 LOSS: 7.54622220993042
EPOCH 12 LOSS: 7.37234354019165
EPOCH 13 LOSS: 7.11744499206543
EPOCH 14 LOSS: 6.867424488067627
EPOCH 15 LOSS: 6.679500579833984
EPOCH 16 LOSS: 6.509734630584717
EPOCH 17 LOSS: 6.226180553436279
EPOCH 18 LOSS: 5.9952921867370605
EPOCH 19 LOSS: 5.9894537925720215
EPOCH 20 LOSS: 5.7114996910095215
EPOCH 21 LOSS: 5.5000152587890625
EPOCH 22 LOSS: 5.403318881988525
EPOCH 23 LOSS: 5.1641669273376465
EPOCH 24 LOSS: 5.266387462615967
EPOCH 25 LOSS: 5.0275492668151855
EPOCH 26 LOSS: 4.850244045257568
EPOCH 27 LOSS: 4.701484680175781
EPOCH 28 LOSS: 4.434895038604736
EPOCH 29 LOSS: 4.493546485900879
EPOCH 30 LOSS: 4

In [21]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(bpe_tokenizer.decode(model.generate(context, max_new_tokens=500)[0].tolist()))

!IVustTI sugar cemeteryS sound ResY OF ACHILLES
 THE HIS SPEARATH OF H present18III.
 VESSELS
 JUPITER
 THE CHIDING HISOS CASE
 THE MEETING OF HECTOR ANDROMACHE
 IRIS
QUÆCI� freshOMER
prisingTAR
 HECTOR AND AND DEATH OF PARIS
 HECTOR ANDANCE OF HECTOR AND AJAX SES CAPTUNE ORDERING PARIS
 GREEKS
 THE HOURS TAKING THE HERALDS
 DIOMED AND AJAX SEPARATED BY THE HORSES FROM JUNO’S CAR
 PLUTO
 PLUTO’ FUNERAL OF ACHILLES
 PLUTO
 THE EMBASSY TO ACHILLES
 GREEK GALLEY
 GREEK GALLEY
 GREEK GREEK GREEK GREEK GREEKCont
 DIOMED AND POLLUX
 DI nar AND ULYSSES RETUR ResponsibilityASSY TO RHESUS
 PROSERPINE
 ACHILLESUS
 DIOMED AND ULYANCE OF RHESUS
 DIOMED AND ULYSSESoras RISING FROM THE DESCENT OF SLEEPTUNE RISING THE SPOILS OF DISCORD
 IACCHILLESUS
 DIANA
 AJAX SEPARATED BY THE SPOILS OF DISCULESUS
 SLEEP ESCAPING THE SEA
 HERC NovemberED AND ULYSSES
 BACCHUS
 SLEEPTAR
 GALLEY
 BACCHUS
 AJAX DEFENDING THE DESCENT OF exploitationUS
 CASTOR AND POLLUX
 ÆSCULAPIUS
 JUD centuryEDON TO BODY OF SLEEP ESCA

## Issue with Pretraining on the Iliad

I trained a G-2 model on the plain text of *The Iliad* by Homer. After pretraining, the output of 500 tokens was nonsensical and repetitive, showing a lack of meaningful context or coherence. The output consisteofnd random references to characters and events in the text, indicating that the model had simply memorized blocks o*T the Ili*ad rather than learning useful patterns of language.

### Why This Happened
The core issue is that *The Iliad* is a relatively small text, with a limited vocabulary and context. The small corpus means that the model cannot generalize well and has mostly memorized specific passages. Additionally, the limited vocabulary prevents the model from learning the complex emergent behaviors seen in large language models (LLMs) that allow for creativity, generalization, and coherence.

### Solution: Fine-tuning on Pretrained GPT-2 Weights
To overcome this, I plan to use GPT-2 weights from HuggingFace's Transformers library. The GPT-2 model has already been pretrained on a vast text corpus, so it has learned general language patterns and emergent behaviors. By fine-tuning the model on *The Iliad*, I can quickly adapt the model to be an expert in the text of *The Iliad*, while retaining the benefits of large-scale pretraining on diverse data. This approach strikes a balance between efficient fine-tuning and leveraging the powerful capabilities of a pretrained model.


## Still Can't Just Finetune on Plaintext 'The Iliad'
 - If I finetune gpt-2 on *The Iliad*, the finetuned model would just know blocks of text from the book. That does not provide value to any user.
 - Building a chatbot requires finetuning a foundation model, like GPT-2, on a question answer dataset.
 - How can the plaintext book be used to build a domain specific Q&A dataset? What if I split the plaintext book into chunks, ask a LLM to generate a question on the chunks, then feed the question, passage pairs to another LLM and ask for the answer. This pipeline could yield a massive Q&A dataset about *The Iliad*.
 - After building this Q&A dataset, I will finetune GPT-2 on the dataset and the result will be an extremely useful Iliad-GPT chatbot!

In [51]:
def split_text_into_chunks(text, chunk_size=700, overlap=50):
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        
        # Ensure the chunk does not start or end mid-word
        if end < len(text):
            last_space = chunk.rfind(' ')
            if last_space != -1:
                chunk = chunk[:last_space]
                end = start + len(chunk)
                
        if len(chunk) >= chunk_size // 2:
            chunks.append(chunk.strip())
            
        start = end - overlap
    
    return chunks


In [52]:
chunks = split_text_into_chunks(iliad)

In [53]:
print(len(chunks))

1700


In [54]:
#Remove first several chunks as they just provide meta data about the actual text
chunks = chunks[10:]
print(len(chunks))

1690


In [55]:
print(chunks[711])

ce a traitor, thou betray’st no more.”

Sternly he spoke, and as the wretch prepared
With humble blandishment to stroke his beard,
Like lightning swift the wrathful falchion flew,
Divides the neck, and cuts the nerves in two;
One instant snatch’d his trembling soul to hell,
The head, yet speaking, mutter’d as it fell.
The furry helmet from his brow they tear,
The wolf’s grey hide, the unbended bow and spear;
These great Ulysses lifting to the skies,
To favouring Pallas dedicates the prize:

“Great queen of arms, receive this hostile spoil,
And let the Thracian steeds reward our toil;
Thee, first of all the heavenly host, we praise;
O speed our labours, and direct our ways!”
This said, the


In [56]:
BATCH_SIZE = 8

## Q&A LLMs.
    - The Question Generation model is a finetuned version of the t5 foundation model. This LLM is specialized in generating questions about a context.
    - The Answer Question model is t5 from google. The answering is done using Transformers' text2text generation pipeline as well.

In [85]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

qg_model_name = "ThomasSimonini/t5-end2end-question-generation"
qg_tokenizer = AutoTokenizer.from_pretrained("t5-base")
qg_model = T5ForConditionalGeneration.from_pretrained(qg_model_name).to(device)

question_mark_id = qg_tokenizer.encode("?", add_special_tokens=False)[-1]

In [89]:
from transformers import pipeline

qa_model_name = "google/flan-t5-large"
qa_pipeline_device = 0 if torch.cuda.is_available() else -1
qa_pipeline = pipeline("text2text-generation", model=qa_model_name, tokenizer=qa_model_name, device=qa_pipeline_device)

In [90]:
def generate_questions(passages, max_input_length=1000, max_new_tokens=50):
    questions = []

    for i in range(0, len(passages), BATCH_SIZE):
        batch = passages[i:i+BATCH_SIZE]
        
        # Create input prompts for T5
        prompts = [f"Here is a passage from The Iliad by Homer': {passage[:max_input_length]}" for passage in batch]
        
        # Tokenize prompts
        inputs = qg_tokenizer(prompts, return_tensors="pt", truncation=True, padding=True, max_length=1024).to(device)

        # Generate questions
        with torch.no_grad():
            outputs = qg_model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=max_new_tokens,
                eos_token_id=question_mark_id, 
                pad_token_id=qg_tokenizer.pad_token_id,
                no_repeat_ngram_size=2
            )

        for j in range(outputs.size(0)):
            question = qg_tokenizer.decode(outputs[j], skip_special_tokens=True)
            questions.append(question)

    return questions


In [98]:
def generate_answers(passages, questions):
    assert len(passages) == len(questions), "Passages and questions must have the same length"
    answers = []

    for passage, question in zip(passages, questions):
        prompt = f"Answer the question about this passage of The Iliad: PASSAGE:{passage} QUESTION:{question} ANSWER:"
        try:
            result = qa_pipeline(prompt, max_length=100, num_return_sequences=1, do_sample=False, truncation=True)
            answer = result[0]['generated_text'].split("ANSWER:")[-1].strip()
        except Exception as e:
            answer = "Unable to generate an answer."

        answers.append(answer)

    return answers

In [100]:
qa_pairs = []

batch_size = 8
for i in range(0, len(chunks), batch_size):
    batch_chunks = chunks[i:i+batch_size]

    questions = generate_questions(batch_chunks)
    answers = generate_answers(batch_chunks, questions)
    
    for question, answer in zip(questions, answers):
        qa_pairs.append({"question": question, "answer": answer})
    
    if i % 80 == 0:
        print(f"Chunks complete: {i}/{len(chunks)}")
        print(f"Q:\n {questions[-1]}\nA:\n {answers[-1]}\n{'-'*40}")


Chunks complete: 0/1690
Q:
 What did Homer say to the Colophomans?
A:
 the inhabitants showed the place where he used to sit when giving a recitation of his verses, and they greatly honoured the spot
----------------------------------------
Chunks complete: 80/1690
Q:
 What is the title of the passage from The Iliad by Homer?
A:
 The Iliad
----------------------------------------
Chunks complete: 160/1690
Q:
 Who was one of the first to favour me?
A:
 The Earl of Halifax
----------------------------------------
Chunks complete: 240/1690
Q:
 What is the name of the passage from The Iliad by Homer?
A:
 all the thronging train
----------------------------------------
Chunks complete: 320/1690
Q:
 What is the name of the passage from The Iliad by Homer?
A:
 Troy possess her fertile fields in peace
----------------------------------------
Chunks complete: 400/1690
Q:
 What is the title of the passage from The Iliad by Homer?
A:
 to the navy borne
----------------------------------------
Chu

In [103]:
import json

with open("synthetic_qa_dataset.json", 'w') as json_file:
    json.dump({"dataset":qa_pairs}, json_file, indent=4)

In [4]:
import json 

with open("/kaggle/input/iliad-gutenberg/synthetic_qa_dataset.json", "rb") as qa_file:
    text = qa_file.read()
    qa_pairs = json.loads(text)["dataset"]

## Pull GPT-2 Foundation model from HuggingFace

In [6]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [8]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
foundation_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

foundation_model.eval()

# Generate text using the model (inference)
input_text = "In the land of the Greeks, Achilles stood tall"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Ensure that no gradient calculation is performed during inference
with torch.no_grad():
    outputs = foundation_model.generate(
        inputs["input_ids"], 
        max_length=100,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        top_k=50,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode the generated token IDs back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

Generated Text:  In the land of the Greeks, Achilles stood tall and could speak; and what he found there was the strength and the beauty of life that gave his countrymen strength. So, too, in the northern part of Egypt they held to their sacred law by the law of Moses, and all their people by that old law. This was called the Gath.

And so the Egyptians and their forefathers went back to a place called Phrygia and after a while went with the


## Finetuning GPT-2
    - Use a train and validation dataset to track progress
    - Note that attention mask is used to prevent model from attending to empty space tokens used to allow batch operations.
    - Only training for a few epochs to prevent model from overfitting to this small Q&A dataset and to prevent loss of previous knowledge

In [46]:
class QADataset(Dataset):
    def __init__(self, inputs):
        self.inputs = inputs
        
    def __len__(self):
        return self.inputs["input_ids"].shape[0]

    def __getitem__(self, idx):
        ids = self.inputs["input_ids"][idx].unsqueeze(0)
        attention_mask = self.inputs["attention_mask"][idx].unsqueeze(0)

        return ids, attention_mask

In [47]:
from sklearn.model_selection import train_test_split

train_data = []
for qa in qa_pairs:
    prompt = f"QUESTION: {qa['question']} ANSWER: {qa['answer']}"
    train_data.append(prompt)

train_qa, val_qa = train_test_split(qa_pairs, test_size=0.1, random_state=11)

def create_prompts(qa_data):
    return [f"QUESTION: {qa['question']} ANSWER: {qa['answer']}" for qa in qa_data]

train_prompts = create_prompts(train_qa)
val_prompts = create_prompts(val_qa)

train_inputs = tokenizer(train_prompts, padding=True, truncation=True, return_tensors="pt", max_length=1024)
val_inputs = tokenizer(val_prompts, padding=True, truncation=True, return_tensors="pt", max_length=1024)

train_dataset = QADataset(train_inputs)
val_dataset = QADataset(val_inputs)

print(len(train_dataset))

1521


In [48]:
optimizer = torch.optim.AdamW(foundation_model.parameters(), lr=5e-5)

FT_EPOCHS = 3

In [49]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, pin_memory=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, pin_memory=True)

In [52]:
from tqdm import tqdm

for epoch in range(FT_EPOCHS):
    foundation_model.train()
    epoch_loss = 0
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}"):
        input_ids, attention_mask = batch
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        optimizer.zero_grad()

        loss = foundation_model(input_ids, attention_mask=attention_mask, labels=input_ids).loss
        
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        
    eval_loss = 0
    foundation_model.eval()
    for batch in val_dataloader:
        input_ids, attention_mask = batch
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        outputs = foundation_model(input_ids, attention_mask=attention_mask, labels=input_ids)
        eval_loss += outputs.loss.item()

    print(f"Epoch {epoch+1}  |  Loss: {epoch_loss / len(train_dataloader)}   |  Validation Loss: {eval_loss / len(val_dataloader)}")


Epoch 1: 100%|██████████| 191/191 [00:24<00:00,  7.67it/s]


Epoch 1  |  Loss: 0.6799877672139263   |  Validation Loss: 0.5888134674592451


Epoch 2: 100%|██████████| 191/191 [00:24<00:00,  7.67it/s]


Epoch 2  |  Loss: 0.560515381434825   |  Validation Loss: 0.5630648230964487


Epoch 3: 100%|██████████| 191/191 [00:24<00:00,  7.68it/s]


Epoch 3  |  Loss: 0.5037320243750567   |  Validation Loss: 0.5529023788192056


In [53]:
foundation_model.save_pretrained("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

('./gpt2-finetuned/tokenizer_config.json',
 './gpt2-finetuned/special_tokens_map.json',
 './gpt2-finetuned/vocab.json',
 './gpt2-finetuned/merges.txt',
 './gpt2-finetuned/added_tokens.json')

In [77]:
test_sentences = ["QUESTION: What is the Iliad about? ANSWER:", 
                  "QUESTION: Who are the characters in the Iliad? ANSWER:", 
                  "QUESTION:Who is Achilles? ANSWER:", 
                  "QUESTION:Who is Achilles' wife? ANSWER:"]

In [82]:
test_inputs = tokenizer(test_sentences, return_tensors="pt", padding=True, truncation=True).to(device)

generated_ids = foundation_model.generate(input_ids=test_inputs['input_ids'], max_length=150)

generated_responses = tokenizer.batch_decode(generated_ids[:, test_inputs['input_ids'].shape[-1]:], skip_special_tokens=True)

for i, sentence in enumerate(test_sentences):
    print(f"Input:\n{sentence}")
    print(f"Response:\n{generated_responses[i]}")
    print("-" * 50)
    print("\n")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Input:
QUESTION: What is the Iliad about? ANSWER:
Response:
 The Iliad by Homeric is a passage from The Iliad by Homeric, and is a passage from The Iliad by Homeric by Homeric.
--------------------------------------------------


Input:
QUESTION: Who are the characters in the Iliad? ANSWER:
Response:
 Homeric, Homeric, and the Iliad by Homeric. Homeric is the only known writer who has written a passage of the Iliad by Homeric. Homeric is the only known writer who has written a passage by Homeric by Homeric. Homeric is the only known writer who has written a passage by Homeric by Homeric. Homeric is the only known writer who has written a passage by Homeric by Homeric by Homeric. Homeric is the only known writer who has written a passage by Homeric by Homeric by Homeric by Homeric. Homeric is the only known writer who has written a passage
--------------------------------------------------


Input:
QUESTION:Who is Achilles? ANSWER:
Response:
 Achilles
-----------------------------------

## Conclusion
    - In this notebook I showcased why foundation models pretrained on a large corpus are so important. If we just train the GPT model on the domain specific dataset we don't get the chatbot behavior. All downstream specialized tasks require pretraining so several teams have complete the undifferentiated pretraining for the community.
    - Finetuned models are available on HuggingFace and are excellent ways to generate synthetic data for NLP tasks.
    - My finetuned Iliad-GPT still didn't quite have the behavior I would like. Just look at the answer to 'Who is Achilles' Wife?'. In reality, it would require better data from the Q&A pipeline. If I had more compute, I could use models with larger context windows and provide better questions. Further, I could do more hyperparameter tuning with the extra compute.
    - I think other finetuning frameworks like LoRA would have been useful to experiment with as well if time permitted.