# Training rapGPT: A Visually Friendly Guide

This notebook walks through the process of training rapGPT in a visually friendly way.

## Purpose of This File
It explains each step of the training process in detail and shows intermediate outputs so you can see how each step works.

If you are looking for a script without the explanations and intermediate outputs, please refer to the corresponding script file: `train.py`

**Imports**

In [2]:
import pandas as pd
import re
#import tiktoken
import torch
import torch.nn as nn
from torch.nn import functional as F

#custom functions
from scripts import utils, train_tokens

**Hyperparameters**

These are the **final** hyperparameters used in the model.

They were adjusted to maximize model performance.

In [3]:
batch_size = 16  # how many independent sequences will be processed in parallel
block_size = 512  # maximum context length (tokens)
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

# Processing Eminem Lyrics Dataset from Kaggle

## Overview

The dataset contains information about Eminem's songs. The data consists of the following 6 columns:

1. **Album Name**: The name of the album the song belongs to.
2. **Song Name**: The name of the song.
3. **Song Lyrics**: The lyrics of the song.
4. **Album URL**: The URL of the album.
5. **Song Views**: The number of views the song has received.
6. **Release Date**: The date when the song was released.

For our purpose, we will focus on the **Song Lyrics** column and ignore the other columns.

## Dataset Link

You can access the dataset [here](https://www.kaggle.com/datasets/aditya2803/eminem-lyrics/data).

## Steps for Processing the Dataset

We will be using **Pandas** for data manipulation and extraction of song lyrics.

In [4]:
PATH = "Raw Data/Eminem_Lyrics.csv"
data = pd.read_csv(PATH, sep='\t', comment='#', encoding = "ISO-8859-1")
data.head(5)

Unnamed: 0,Album_Name,Song_Name,Lyrics,Album_URL,Views,Release_date,Unnamed: 6
0,Music To Be Murdered By: Side B,Alfred (Intro),"[Intro: Alfred Hitchcock]\nThus far, this albu...",https://genius.com/albums/Eminem/Music-to-be-m...,24.3K,"December 18, 2020",
1,Music To Be Murdered By: Side B,Black Magic,"[Chorus: Skylar Grey & Eminem]\nBlack magic, n...",https://genius.com/albums/Eminem/Music-to-be-m...,180.6K,"December 18, 2020",
2,Music To Be Murdered By: Side B,Alfred's Theme,"[Verse 1]\nBefore I check the mic (Check, chec...",https://genius.com/albums/Eminem/Music-to-be-m...,285.6K,"December 18, 2020",
3,Music To Be Murdered By: Side B,Tone Deaf,"[Intro]\nYeah, I'm sorry (Huh?)\nWhat did you ...",https://genius.com/albums/Eminem/Music-to-be-m...,210.9K,"December 18, 2020",
4,Music To Be Murdered By: Side B,Book of Rhymes,"[Intro]\nI don't smile, I don't frown, get too...",https://genius.com/albums/Eminem/Music-to-be-m...,193.3K,"December 18, 2020",


## Extracting Lyrics to a Text File
Intermediate files are saved in case they are needed later.

In [5]:
output_file_path = 'Text File/'
lyrics_file_name = 'eminem_lyrics.txt'
lyrics = data['Lyrics']

# Write lyrics to the text file, each lyric on a new line
with open(output_file_path + lyrics_file_name, 'w', encoding='utf-8') as f:
    for lyric in lyrics:
        f.write(lyric + '\n')

print(f"Lyrics have been written to {output_file_path + lyrics_file_name}")

Lyrics have been written to Text File/eminem_lyrics.txt


The lyrics are divided into sections such as Intro, Outro, Chorus, and Verse. <br><br>
**We are only interested in the [Verse] sections of the lyrics, since they contain the 'rap' portion.**

In [6]:
#open lyrics text file 
with open(output_file_path + lyrics_file_name, 'r', encoding="utf-8") as file:
    text = file.read()
# Use regex to capture everything after '[Verse ...]' and before the next section
verse_only = re.findall(r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)', text, re.DOTALL)
# Join the found text into a single string
verse_only = '\n'.join(verse_only)

verse_file_name = 'verse_only.txt'
# Output the result
with open(output_file_path+verse_file_name, "w", encoding="utf-8") as f:
    f.write(verse_only)
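
To see what the regular expression above captures, here is a minimal sketch that runs the same pattern on a made-up sample (the sample text is hypothetical, not taken from the dataset). It keeps the body of each `[Verse ...]` section and stops at the next bracketed header or at the end of the text.

```python
import re

# Hypothetical sample laid out like the lyrics file: bracketed section headers
# followed by the lines of that section.
sample = (
    "[Intro]\n"
    "Yeah, check\n"
    "[Verse 1]\n"
    "First bar of the verse\n"
    "Second bar of the verse\n"
    "[Chorus: Guest]\n"
    "Hook line\n"
    "[Verse 2: Eminem]\n"
    "Last verse line"
)

# Same pattern as in the cell above: lazily capture everything after a
# '[Verse ...]' header up to the next '\n[' header or the end of the string.
pattern = r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)'
verses = re.findall(pattern, sample, re.DOTALL)

for verse in verses:
    print(repr(verse))
```

The Intro and Chorus lines are dropped, and headers with extra text such as `[Verse 2: Eminem]` still match because of the lazy `.*?` before the closing bracket.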

## Normalize Text
1. Remove unwanted characters but keep newlines
2. Normalize multiple spaces to a single space
3. Remove trailing spaces before newlines
4. Normalize multiple newlines to a single newline
5. Convert to lower case
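
The five steps above could be sketched as follows. This is only an illustration: `normalize_lyrics` is a hypothetical stand-in for `utils.preprocess_text_with_newlines`, and the set of characters it keeps (letters, digits, apostrophes, spaces, newlines) is an assumption, since the real helper is not shown here.

```python
import re

def normalize_lyrics(text: str) -> str:
    """Hypothetical stand-in for utils.preprocess_text_with_newlines."""
    # 1. Remove unwanted characters but keep newlines (assumed whitelist).
    text = re.sub(r"[^A-Za-z0-9'\n ]", "", text)
    # 2. Normalize multiple spaces to a single space.
    text = re.sub(r" +", " ", text)
    # 3. Remove trailing spaces before newlines.
    text = re.sub(r" +\n", "\n", text)
    # 4. Normalize multiple newlines to a single newline.
    text = re.sub(r"\n+", "\n", text)
    # 5. Convert to lower case.
    return text.lower()

raw = "Yeah,  I'm   BACK!  \n\n\nCheck (check) one two  \n"
print(repr(normalize_lyrics(raw)))
```

Line breaks survive every step, which is what lets the model see where one bar ends and the next begins.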

**We keep newlines because doing so:**

1. **Preserves Structure and Rhythm:**
   - Rap lyrics are often structured in lines with rhymes, rhythms, and pauses. Keeping newlines helps the model learn this structure, making the generated lyrics feel more natural and rhythmic.
2. **Improves Readability:**
   - If the model generates lyrics with line breaks, it will be easier to read and evaluate during testing or usage.
3. **Captures Line-Level Context:**
   - By retaining newlines, the model can learn dependencies between consecutive lines without treating them as a continuous block of text.
4. **Helps During Post-Processing:**
   - You can always remove or modify newlines later if needed, but adding them back after training might be harder since the original structure would have been lost.

In [7]:
cleaned_verse_only = utils.preprocess_text_with_newlines(verse_only)
cleaned_verse_only[:100]
cleaned_verse_file_name = 'cleaned_verse_only.txt'
# Output the result
with open(output_file_path+cleaned_verse_file_name, "w", encoding="utf-8") as f:
    f.write(cleaned_verse_only)
    
words = cleaned_verse_only.split()
# Get the number of words
num_words = len(words)
print(f"Number of words: {num_words}")

Number of words: 180104


## GPT-2 BPE Tokenizer (Not Used for Now)

In [8]:
"""
# Load the GPT-2 tokenizer
gpt_tokenizer = tiktoken.get_encoding("gpt2")
# Tokenize the text
tokens = gpt_tokenizer.encode(cleaned_verse_only)

# Decode the tokens back to text
#decoded_text = tokenizer.decode(tokens[:10])
#print("Decoded text:", decoded_text)
"""


## Tokenizer Training Plan

- **Tokenizer Choice**: 
  - A tokenizer trained on this corpus will be used, with a vocab size of **30,000**, chosen as a starting point for this **small corpus**.
- **Corpus Size**:
  - The corpus that will be used for training has a size of **180,104 words**.
- **Tokenizer Types**:
  - The tokenizer will be trained using **BPE (Byte Pair Encoding)**, since the model architecture will be based on the GPT model.
- **File Location**:
  - The **train_tokenizer** script is saved in the `scripts` folder.
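
For intuition about what BPE training does, here is a toy sketch of the core idea: start from single characters and repeatedly merge the most frequent adjacent pair. It is a simplified illustration with hypothetical helper names, not the code in `train_tokens`, which is not shown here.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Toy corpus: each line starts out as a list of single characters.
lines = ["rap rap", "rat"]
seqs = [list(line) for line in lines]

# Two merge rounds; a real tokenizer keeps merging until the target vocab size is reached.
for _ in range(2):
    pair = most_frequent_pair(seqs)
    seqs = [merge_pair(seq, pair) for seq in seqs]

print(seqs)
```

The first round merges `r` + `a`, the second merges `ra` + `p`, so frequent fragments become single tokens while rare ones stay split.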

In [9]:
#create tokenizer
bpe_tokenizer = train_tokens.train_tokenizer(input_files=["Text File/cleaned_verse_only.txt"], vocab_size=30000, tokenizer_type="bpe")






Encode **cleaned_verse_only** using the BPE tokenizer

In [10]:
# Tokenize the rap lyrics using the trained tokenizer
bpe_tokenized_output = bpe_tokenizer.encode(cleaned_verse_only)
# Print the tokenized output
print("BPE Tokens:", bpe_tokenized_output.tokens[:10])  # Prints the list of token strings
print("Len of Tokens:",  len(bpe_tokenized_output.ids))

BPE Tokens: ["we're ", 'volatile ', "i can't call it ", 'though\n', "it's like ", 'too ', 'large ', 'a ', 'pe', 'g and ']
Len of Tokens: 116686


Try decoding the first 10 ids to verify that the decoder works properly

In [11]:
#get the numerical ids of the encoded tokens
bpe_ids = bpe_tokenized_output.ids
#get tokenized lyrics
tokenized_lyrics = bpe_tokenized_output.tokens
#try decoding first 10 ids
output = bpe_tokenizer.decode(bpe_ids[:10])
#collapse runs of whitespace (including newlines) into single spaces
cleaned_output = re.sub(r'\s+', ' ', output).strip()
#print output
print(cleaned_output)

we're volatile i can't call it though it's like too large a pe g and


## Splitting the data into training and validation sets
90% of the data will be used for training, 10% for validation

In [12]:
train_data, val_data = utils.train_test_split(tokenizer_ids = bpe_ids, device= device)
train_data.shape, val_data.shape

(torch.Size([105017]), torch.Size([11669]))
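
The sizes above are consistent with a simple sequential 90/10 split of the 116,686 token ids. The sketch below reproduces that arithmetic with a plain list. `split_ids` is a hypothetical name, and that the real `utils.train_test_split` slices sequentially is an assumption (it also returns tensors on `device`, which is omitted here).

```python
def split_ids(ids, train_fraction=0.9):
    """Hypothetical sketch: the first 90% of ids for training, the rest for validation."""
    n_train = int(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Stand-in for bpe_ids with the same length as the notebook's token list.
ids = list(range(116_686))
train_ids, val_ids = split_ids(ids)
print(len(train_ids), len(val_ids))
```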

## Training Setup for rapGPT

We will be creating batches so that multiple sequences are trained on in parallel:

- **Block size** = 512 (each sequence in a batch contains 512 tokens)
- **Batch size** = 16 (how many independent sequences are processed in parallel)

(A batch size of 16 was chosen based on the maximum my GPU can handle: an RTX 4080 with 16GB VRAM)

This setup allows efficient training by processing multiple sequences simultaneously, taking advantage of parallelization, while keeping the block size manageable for memory usage.

In [13]:
X_train, y_train = utils.get_batch(data = train_data, block_size = block_size, batch_size = batch_size, device= device)
X_val, y_val = utils.get_batch(data = val_data, block_size = block_size, batch_size = batch_size, device= device)
print("batch size of: ", X_train.shape[0])
print("block size of: ", X_train.shape[1])

batch size of:  16
block size of:  512


Ensure that the labels match the training inputs shifted by one position (the label at index i is the input token at index i + 1)

In [15]:
torch.equal(X_train[0][1:], y_train[0][:-1])

True
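
A minimal sketch of how such batches can be built is shown below. `get_batch_sketch` is a hypothetical stand-in for `utils.get_batch` (whose source is not shown here). It assumes random window sampling and leaves out the move to `device`.

```python
import torch

def get_batch_sketch(data, block_size, batch_size):
    """Hypothetical stand-in for utils.get_batch: random windows plus shifted targets."""
    # Random start offsets; each window needs block_size + 1 tokens in total.
    starts = torch.randint(0, len(data) - block_size, (batch_size,))
    # Inputs: block_size tokens starting at each offset.
    x = torch.stack([data[s : s + block_size] for s in starts])
    # Targets: the same windows shifted one token to the right.
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x, y

data = torch.arange(1000)  # toy token ids 0..999
x, y = get_batch_sketch(data, block_size=8, batch_size=4)
print(x.shape, y.shape)
print(torch.equal(x[:, 1:], y[:, :-1]))
```

Because the targets are the inputs shifted by one, every position in a window provides a next-token training example.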

# **Creating the rapGPT model**

## Overview
This model uses a **decoder-only Transformer** architecture designed for autoregressive language modeling, where the model **generates text one token at a time** based on previously seen tokens. It consists of **stacked decoder blocks** with **multi-head self-attention** and **feedforward layers**.


## Model Pipeline
1. Input tokenized text → Embedded + Positional Encoding
2. Pass through **N stacked decoder blocks**
3. Final linear projection → Softmax → Next token prediction

## Key Features
- **Causal Masking** prevents future token access.
- **Scales with depth (N layers) and attention heads**.

## Summary of Transformer Decoder Block
```plaintext
Input Embeddings → LayerNorm → Masked Multi-Head Self-Attention  →
LayerNorm → Feedforward Network → Output
```
## Model Script
refer to **gpt.py**

## Architecture Components

The following explains the architecture components of the model

### (a) Token and Positional Embeddings
Each token is converted into a dense vector using an **embedding layer**, and a learned **positional embedding** is added to capture the order of tokens:

In [18]:
%%script false --no-raise-error
# create token embeddings for each sample in the batch and block
token_embedding = self.token_embeddings_table(input_tokens)  # (B, T, n_embd)
# create positional embeddings for each token in the block
positional_embedding = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
x = token_embedding + positional_embedding  # (B, T, n_embd)
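
The shapes in the snippet above can be checked with a small standalone sketch. The table names and the toy sizes below are illustrative assumptions, not the values used in `gpt.py`.

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 100, 8, 16  # toy sizes for illustration
B, T = 2, 5                                  # batch of 2 sequences, 5 tokens each

token_table = nn.Embedding(vocab_size, n_embd)     # one vector per token id
position_table = nn.Embedding(block_size, n_embd)  # one vector per position

tokens = torch.randint(0, vocab_size, (B, T))
tok_emb = token_table(tokens)               # (B, T, n_embd)
pos_emb = position_table(torch.arange(T))   # (T, n_embd)
x = tok_emb + pos_emb                       # broadcast over the batch -> (B, T, n_embd)
print(x.shape)
```

The positional embedding has no batch dimension, so the same position vectors are added to every sequence in the batch.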



### (b) Transformer Decoder Block
Each decoder block consists of:

1. **Masked Multi-Head Self-Attention**
2. **Layer Normalization**
3. **Feedforward Network (FFN)**
4. **Residual Connections**

In [21]:
%%script false --no-raise-error
class Block(nn.Module):
    """Transformer Block: communication(multihead attention) followed by computation(FeedForward)"""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head  # divide channel (feature embd) by num of heads
        # self attention step
        self.self_attention = MultiHeadAttention(n_head, head_size)
        self.feedforward = FeedForward(n_embd)  # feedforward step
        self.ln1 = nn.LayerNorm(normalized_shape=n_embd)  # first layer norm
        self.ln2 = nn.LayerNorm(normalized_shape=n_embd)  # second layer norm



### (c) Multi-Head Self-Attention
Each token attends to **previous** tokens (causal attention), meaning it cannot see future tokens. The self-attention mechanism computes:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} + M \right) V $$

where:
- **Q** (queries), **K** (keys), **V** (values) are projections of the input.
- **M** is a masking matrix that prevents attending to future tokens.
- The softmax ensures attention weights sum to 1.

In [28]:
%%script false --no-raise-error
def forward(self, x):  # x: input for the model
        B, T, C = x.shape
        k = self.key(x)  # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # compute attention score
        # (B, T, head_size) @ (B, head_size, T) -> (B, T, T), scaled by 1/sqrt(d_k) with d_k = head_size
        attn_score = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # mask the upper-right triangle (future positions) with -inf so softmax gives them zero weight
        attn_score = attn_score.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        attn_score = F.softmax(attn_score, dim=-1)  # normalize using softmax

        # apply weighted aggregation of values
        v = self.value(x)  # (B, T, head_size)
        out = attn_score @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
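
To make the effect of the causal mask concrete, here is a small standalone sketch with random tensors. It is not part of `gpt.py`; the sizes are toy values chosen only for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, head_size = 1, 4, 8
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)

# Scaled dot-product scores: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
scores = q @ k.transpose(-2, -1) * head_size ** -0.5

# Lower-triangular mask: position t may only attend to positions <= t.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)  # each row sums to 1, future positions get 0
out = weights @ v                    # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
print(weights[0])
```

The first row of the printed matrix puts all of its weight on the first token, since that token has nothing earlier to attend to.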



### (d) Position-Wise Feedforward Network
Each token is passed through a **2-layer MLP** with non-linearity:

$$ FFN(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2 $$
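
A minimal sketch of such a network is shown below. The 4x hidden expansion and the trailing dropout are common choices assumed here for illustration; the actual `FeedForward` class in `gpt.py` may differ.

```python
import torch
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """Illustrative 2-layer MLP applied independently at every token position."""

    def __init__(self, n_embd, dropout=0.2, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, expansion * n_embd),  # x W_1 + b_1
            nn.ReLU(),                              # non-linearity
            nn.Linear(expansion * n_embd, n_embd),  # (...) W_2 + b_2
            nn.Dropout(dropout),                    # assumed regularization step
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForwardSketch(n_embd=16)
ffn.eval()  # disable dropout for a deterministic check
x = torch.randn(2, 5, 16)  # (B, T, n_embd)
y = ffn(x)
print(y.shape)
```

The output keeps the `(B, T, n_embd)` shape, which is what allows it to be added back to its input through a residual connection.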

### (e) Layer Normalization & Residual Connections
Each sublayer (self-attention and FFN) applies **LayerNorm to its input** (pre-norm) and then adds a **residual connection** around the sublayer:

$$ X' = X + \text{Self-Attention}(\text{LayerNorm}(X)) $$
$$ X'' = X' + \text{FFN}(\text{LayerNorm}(X')) $$

In [32]:
%%script false --no-raise-error
def forward(self, x):
        # pre-layer norm: LayerNorm is applied to the input of each sublayer
        # residual connections: each sublayer's output is added back to its input
        # x = x + sublayer(LayerNorm(x))
        """
        Input -> [LayerNorm] -> [Self-Attention] -> + (Residual Connection)
        -> [LayerNorm] -> [Feedforward Network] -> + (Residual Connection) -> Output
        """
        x = x + self.self_attention(self.ln1(x))
        x = x + self.feedforward(self.ln2(x))
        return x



### (f) Output Projection (Logits & Softmax)
The final hidden states are projected back to vocabulary size:

$$ \text{logits} = X W_o $$

A **softmax function** then converts logits into probabilities.

In [35]:
%%script false --no-raise-error
def generate(self, idx, max_new_tokens):
        # input is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crops input to get the last 'block size' tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions, loss will be ignored (uses forward function)
            logits, loss = self(idx_cond, targets=None)
            # focus only on the last time step: (B, T, C) -> (B, C), the prediction for the next token
            logits = logits[:, -1, :]
            # apply softmax
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from distribution, (B, 1) single prediction for what comes next
            next_token = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, next_token), dim=1)  # (B, T+1)
        return idx




# **Starting the Training Loop**

## **Overview**
The training loop is responsible for optimizing the model using **gradient-based learning**. This section explains the setup using the following hyperparameters:

- **Batch Size**: `16` (number of sequences processed in parallel)
- **Block Size**: `512` (maximum token context length)
- **Optimizer**: `AdamW` (Adaptive optimization method)
- **Learning Rate**: `3e-4`
- **Vocab Size**: `30000`
- **Number of Iterations**: `5000`
- **Evaluation Interval**: Every `500` steps
- **Dropout**: `0.2` (to prevent overfitting)
- **Transformer Parameters**:
  - Embedding Dimension: `512`
  - Number of Attention Heads: `8`
  - Number of Layers: `8`

- **Optimizer**: Uses `AdamW` with a learning rate of `3e-4`.
- **Training Loop (5000 iterations)**:
   - Loads a batch from `train_data`.
   - Computes the forward pass.
   - Computes **loss** and backpropagates.
   - Updates model parameters using `optimizer.step()`.
   - Evaluates every `500` steps using `evaluate_loss()`.

## Evaluation Process
- Runs on `200` batches.
- Computes the **average training and validation loss**.
- Helps monitor model performance over time.
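
The loop and evaluation steps described above can be sketched end to end on toy data. Everything below is an illustrative assumption: `TinyLM` is a tiny stand-in model rather than rapGPT, the token ids are random, and the sizes are shrunk so the example runs in seconds. See `train.py` for the real script.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 50, 8, 4
max_iters, eval_interval, eval_iters = 100, 50, 5

data = torch.randint(0, vocab_size, (2000,))  # random stand-in token ids
train_data, val_data = data[:1800], data[1800:]

def get_batch(split):
    """Sample random windows and their one-step-shifted targets."""
    source = train_data if split == "train" else val_data
    starts = torch.randint(0, len(source) - block_size, (batch_size,))
    x = torch.stack([source[s : s + block_size] for s in starts])
    y = torch.stack([source[s + 1 : s + block_size + 1] for s in starts])
    return x, y

class TinyLM(nn.Module):
    """Toy model standing in for rapGPT: embedding followed by a linear head."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 32)
        self.head = nn.Linear(32, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.head(self.embed(idx))  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss

@torch.no_grad()
def evaluate_loss(model):
    """Average the loss over eval_iters batches for both splits."""
    model.eval()
    averages = {}
    for split in ("train", "val"):
        losses = [model(*get_batch(split))[1].item() for _ in range(eval_iters)]
        averages[split] = sum(losses) / len(losses)
    model.train()
    return averages

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
history = []
for step in range(max_iters):
    if step % eval_interval == 0:
        history.append(evaluate_loss(model))  # periodic train/val loss estimate
    xb, yb = get_batch("train")               # load a batch
    _, loss = model(xb, yb)                   # forward pass and loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                           # backpropagate
    optimizer.step()                          # update parameters
print(history)
```

Averaging over several batches in `evaluate_loss` gives a less noisy estimate than a single batch, which makes the training and validation curves easier to compare.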

## Training Script
refer to **train.py**

# Results of the Initial Run

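Calculate the expected initial loss. An untrained model should spread probability roughly uniformly over the vocabulary, so its cross-entropy loss should be close to `-ln(1 / vocab_size)`: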

In [38]:
import math
vocab_size = 30000
initial_loss = -math.log(1/vocab_size)
print("initial loss should be around: ", initial_loss)

initial loss should be around:  10.308952660644293
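
The same number can be cross-checked with PyTorch's own loss function: identical logits for every token give a uniform softmax, so the cross-entropy equals `ln(vocab_size)` whatever the targets are. This is a small sanity check added for illustration. The batch of 4 random targets is arbitrary.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 30000
logits = torch.zeros(4, vocab_size)           # identical logits -> uniform probabilities
targets = torch.randint(0, vocab_size, (4,))  # arbitrary target ids
loss = F.cross_entropy(logits, targets)
print(loss.item(), math.log(vocab_size))
```

If the first logged training loss is far above this value, the initialization is worth a second look.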


## Overfitting Detected: Training vs. Validation Performance

The model is performing **extremely well** on the training set but **worse** on the validation set.  
This is a clear indication of **overfitting**, where the model memorizes training data but fails to generalize to unseen data.

### Steps to Improve Generalization:
- **Reduce Vocab Size** 
- **Increase Regularization** (e.g., L2 weight decay, dropout)
- **Reduce Model Complexity** (e.g., fewer parameters or layers)

By applying these techniques, we aim to achieve **better generalization** and improved validation performance. 🚀

In [32]:
unique_elements = torch.unique(X_val)
num_unique = unique_elements.numel()  # Get count of unique elements

print("Unique elements:", unique_elements)
print("Number of unique elements:", num_unique)

Unique elements: tensor([    4,     6,     7,  ..., 29986, 29995, 29998], device='cuda:0')
Number of unique elements: 3065


In [30]:
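# Caution: iterating a tensor yields 0-d tensors, which hash by identity, so this set
# keeps every element of y_train (16 * 512 = 8192) rather than the distinct token ids.
# torch.unique(y_train).numel() would count the distinct ids instead.
unique_tokens = set(token for seq in y_train for token in seq)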
vocab_size = len(unique_tokens)
print("Vocab size:", vocab_size)

Vocab size: 8192
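
Note that 8192 equals 16 × 512, the total number of elements in `y_train`. The toy sketch below shows why a Python `set` of tensor elements does not deduplicate by value, and two ways to count distinct token ids instead. The small `batch` tensor is made up for illustration.

```python
import torch

batch = torch.tensor([[1, 2, 2], [3, 1, 2]])

# Iterating a tensor yields 0-d tensors, which hash by object identity,
# so the set keeps every element instead of deduplicating by value.
by_identity = set(token for seq in batch for token in seq)

# Converting to Python ints, or using torch.unique, counts distinct ids.
by_value = set(token.item() for seq in batch for token in seq)
unique_ids = torch.unique(batch)

print(len(by_identity), len(by_value), unique_ids.numel())
```

On the real batch, `torch.unique(y_train).numel()` would report how many distinct token ids it contains, the same kind of count computed for `X_val` earlier.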
