# Training rapGPT: A Visually Friendly Guide

This file provides a visually friendly walkthrough of the rapGPT training process.

## Purpose of This File
The purpose of this file is to offer detailed explanations of the training process, along with intermediate outputs to help understand how each step works. 

If you are looking for a script without the explanations and intermediate outputs, please refer to the corresponding script file: train.py

**Imports**

In [1]:
import pandas as pd
import re
#import tiktoken
import torch
import torch.nn as nn
from torch.nn import functional as F

#custom functions
from scripts import utils, train_tokens

**Set Hyperparameters**

In [2]:
batch_size = 16  # how many independent sequences will be processed in parallel
block_size = 512  # maximum context length (tokens)
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
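
A few quantities follow directly from these settings. The sketch below is plain arithmetic on the values above. The derived names `head_size` and `tokens_per_batch` are illustrative and do not appear in the notebook, and the even split of the embedding across heads is an assumption based on the usual multi-head attention convention.

```python
# Values copied from the hyperparameter cell above
batch_size = 16
block_size = 512
n_embd = 384
n_head = 6

# Assumption: the embedding is divided evenly across attention heads,
# as in standard multi-head attention.
assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
head_size = n_embd // n_head

# Number of token positions processed in one forward pass
tokens_per_batch = batch_size * block_size

print("head_size:", head_size)                # 64
print("tokens_per_batch:", tokens_per_batch)  # 8192
```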

# Processing Eminem Lyrics Dataset from Kaggle

## Overview

The dataset contains information about Eminem's songs. The data consists of the following 6 columns:

1. **Album Name**: The name of the album the song belongs to.
2. **Song Name**: The name of the song.
3. **Song Lyrics**: The lyrics of the song.
4. **Album URL**: The URL of the album.
5. **Song Views**: The number of views the song has received.
6. **Release Date**: The date when the song was released.

For our purpose, we will focus on the **Song Lyrics** column and ignore the other columns.

## Dataset Link

You can access the dataset [here](https://www.kaggle.com/datasets/aditya2803/eminem-lyrics/data).

## Steps for Processing the Dataset

We will be using **Pandas** for data manipulation and extraction of song lyrics.

In [3]:
PATH = "Raw Data/Eminem_Lyrics.csv"
data = pd.read_csv(PATH, sep='\t', comment='#', encoding = "ISO-8859-1")
data.head(5)

Unnamed: 0,Album_Name,Song_Name,Lyrics,Album_URL,Views,Release_date,Unnamed: 6
0,Music To Be Murdered By: Side B,Alfred (Intro),"[Intro: Alfred Hitchcock]\r\nThus far, this al...",https://genius.com/albums/Eminem/Music-to-be-m...,24.3K,"December 18, 2020",
1,Music To Be Murdered By: Side B,Black Magic,"[Chorus: Skylar Grey & Eminem]\r\nBlack magic,...",https://genius.com/albums/Eminem/Music-to-be-m...,180.6K,"December 18, 2020",
2,Music To Be Murdered By: Side B,Alfredï¿½s Theme,"[Verse 1]\r\nBefore I check the mic (Check, ch...",https://genius.com/albums/Eminem/Music-to-be-m...,285.6K,"December 18, 2020",
3,Music To Be Murdered By: Side B,Tone Deaf,"[Intro]\r\nYeah, I'm sorry (Huh?)\r\nWhat did ...",https://genius.com/albums/Eminem/Music-to-be-m...,210.9K,"December 18, 2020",
4,Music To Be Murdered By: Side B,Book of Rhymes,"[Intro]\r\nI don't smile, I don't frown, get t...",https://genius.com/albums/Eminem/Music-to-be-m...,193.3K,"December 18, 2020",


## Extracting Lyrics to a Text File
Intermediate files are saved in case they are needed later.

In [4]:
output_file_path = 'Text File/'
lyrics_file_name = 'eminem_lyrics.txt'
lyrics = data['Lyrics']

# Write lyrics to the text file, each lyric on a new line
with open(output_file_path + lyrics_file_name, 'w', encoding='utf-8') as f:
    for lyric in lyrics:
        f.write(lyric + '\n')

print(f"Lyrics have been written to {output_file_path + lyrics_file_name}")

Lyrics have been written to Text File/eminem_lyrics.txt


Lyrics are separated into Intro, Outro, Chorus, Verse, etc. <br><br>
**We are only interested in the [Verse] part of the lyrics since it contains the 'rap' portion**

In [5]:
#open lyrics text file 
with open(output_file_path + lyrics_file_name, 'r', encoding="utf-8") as file:
    text = file.read()
# Use regex to capture everything after '[Verse ...]' and before the next section
verse_only = re.findall(r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)', text, re.DOTALL)
# Join the found text into a single string
verse_only = '\n'.join(verse_only)

verse_file_name = 'verse_only.txt'
# Output the result
with open(output_file_path+verse_file_name, "w", encoding="utf-8") as f:
    f.write(verse_only)
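
To see what the pattern above keeps and what it drops, it can help to run it on a tiny input. The sample text below is made up for illustration and only mimics the `[Section]` layout of the lyrics file. The pattern itself is copied unchanged from the cell above.

```python
import re

# Hypothetical sample in the same "[Section]" layout as the lyrics file
sample = (
    "[Intro]\n"
    "yeah, uh\n"
    "[Verse 1]\n"
    "first verse line one\n"
    "first verse line two\n"
    "[Chorus]\n"
    "hook line\n"
    "[Verse 2: Eminem]\n"
    "second verse line"
)

# Same pattern as in the cell above:
#   \[Verse.*?\]\n      matches a verse header such as "[Verse 1]" or "[Verse 2: Eminem]"
#   (.*?)               lazily captures the verse body (re.DOTALL lets "." span newlines)
#   (?=\n\[\w|\Z)       stops right before the next "[Section" header or at end of text
pattern = r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)'
verses = re.findall(pattern, sample, re.DOTALL)

for verse in verses:
    print(repr(verse))
```

The intro and chorus lines are skipped, and each verse body comes back as one string with its internal newlines intact. Note that the lookahead only ends a verse at a `[` followed by a word character, so a bracketed line starting with punctuation would stay inside the captured verse.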

## Normalize Text
1. Remove unwanted characters but keep newlines
2. Normalize multiple spaces to a single space
3. Remove trailing spaces before newlines
4. Normalize multiple newlines to a single newline
5. Convert to lower case

**We keep newlines because doing so:**

1. **Preserves Structure and Rhythm:**
   - Rap lyrics are often structured in lines with rhymes, rhythms, and pauses. Keeping newlines helps the model learn this structure, making the generated lyrics feel more natural and rhythmic.
2. **Improves Readability:**
   - If the model generates lyrics with line breaks, it will be easier to read and evaluate during testing or usage.
3. **Captures Line-Level Context:**
   - By retaining newlines, the model can learn dependencies between consecutive lines without treating them as a continuous block of text.
4. **Helps During Post-Processing:**
   - You can always remove or modify newlines later if needed, but adding them back after training might be harder since the original structure would have been lost.
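
The five normalization steps listed above can be sketched with a few regular expressions. This is a minimal, hypothetical stand-in: `utils.preprocess_text_with_newlines` is not shown in this file, so the function name `normalize_lyrics` and the choice of which characters count as "unwanted" (everything except letters, digits, apostrophes, spaces, and newlines) are assumptions.

```python
import re

def normalize_lyrics(text: str) -> str:
    """Hypothetical stand-in for utils.preprocess_text_with_newlines."""
    # 1. Drop unwanted characters but keep newlines.
    #    Assumption: keep only letters, digits, apostrophes, spaces, newlines.
    text = re.sub(r"[^A-Za-z0-9' \n]", "", text)
    # 2. Collapse runs of spaces into a single space
    text = re.sub(r" {2,}", " ", text)
    # 3. Remove spaces that sit directly before a newline
    text = re.sub(r" +\n", "\n", text)
    # 4. Collapse runs of newlines into a single newline
    text = re.sub(r"\n{2,}", "\n", text)
    # 5. Convert to lower case
    return text.lower()

sample = "We're  volatile,   \n\n\nI can't call it, though!\n"
print(repr(normalize_lyrics(sample)))
```

Running step 3 before step 4 matters in this sketch: stripping trailing spaces first turns lines that contain only spaces into empty lines, which the newline collapse then removes.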

In [6]:
cleaned_verse_only = utils.preprocess_text_with_newlines(verse_only)
cleaned_verse_only[:100]
cleaned_verse_file_name = 'cleaned_verse_only.txt'
# Output the result
with open(output_file_path+cleaned_verse_file_name, "w", encoding="utf-8") as f:
    f.write(cleaned_verse_only)
    
words = cleaned_verse_only.split()
# Get the number of words
num_words = len(words)
print(f"Number of words: {num_words}")

Number of words: 180104


## GPT-2 BPE Tokenizer (Not Used for Now)

In [7]:
"""
# Load the GPT-2 tokenizer
gpt_tokenizer = tiktoken.get_encoding("gpt2")
# Tokenize the text
tokens = gpt_tokenizer.encode(cleaned_verse_only)

# Decode the tokens back to text
#decoded_text = gpt_tokenizer.decode(tokens[:10])
#print("Decoded text:", decoded_text)
"""

'\n# Load the GPT-2 tokenizer\ngpt_tokenizer = tiktoken.get_encoding("gpt2")\n# Tokenize the text\ntokens = gpt_tokenizer.encode(cleaned_verse_only)\n\n# Decode the tokens back to text\n#decoded_text = gpt_tokenizer.decode(tokens[:10])\n#print("Decoded text:", decoded_text)\n'

## Tokenizer Training Plan

- **Tokenizer Choice**: 
  - A tokenizer trained on this corpus will be used, with a vocab size of **30,000**, a size typically chosen for a model with a **small corpus**.
- **Corpus Size**:
  - The corpus used for training contains **180,104 words**.
- **Tokenizer Type**:
  - The tokenizer will be trained with **BPE (Byte Pair Encoding)**, since the model architecture will be based on the GPT model.
- **File Location**:
  - The **train_tokenizer** function is defined in the `train_tokens` module in the `scripts` folder.
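
For intuition about what BPE training does, here is a toy, pure-Python sketch of the merge loop. It is not the implementation behind `train_tokens.train_tokenizer`, which is not shown in this file. All function names are hypothetical, and splitting the text into lines before training is an assumption made to keep the example small. In real BPE training the loop runs until the vocabulary reaches the target size (here that would be 30,000) rather than for a fixed number of merges.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the most common."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    # Assumption: each line is its own sequence, so merges never cross newlines
    seqs = [list(line) for line in text.split("\n") if line]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        seqs = [merge_pair(seq, pair) for seq in seqs]
    return merges, seqs

merges, seqs = train_toy_bpe("aaab\naab", num_merges=1)
print(merges)  # [('a', 'a')]
print(seqs)    # [['aa', 'a', 'b'], ['aa', 'b']]
```

Because spaces are treated like any other character in this sketch, repeated merges can grow tokens that span several words.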

In [8]:
#create tokenizer
bpe_tokenizer = train_tokens.train_tokenizer(input_files=["Text File/cleaned_verse_only.txt"], vocab_size=30000, tokenizer_type="bpe")

Encode **cleaned_verse_only** using the BPE tokenizer

In [9]:
# Tokenize the rap lyrics using the trained tokenizer
bpe_tokenized_output = bpe_tokenizer.encode(cleaned_verse_only)
# Print the tokenized output
print("BPE Tokens:", bpe_tokenized_output.tokens[:10])  # Prints the list of token strings
print("Len of Tokens:",  len(bpe_tokenized_output.ids))

BPE Tokens: ['\n', "we're ", 'volatile ', "i can't call it ", 'though', '\n', "it's like ", 'too ', 'large ', 'a ']
Len of Tokens: 155430


Decode the first 10 ids to verify that the decoder works properly

In [10]:
#get the numerical ids of the encoded tokens
bpe_ids = bpe_tokenized_output.ids
#get tokenized lyrics
tokenized_lyrics = bpe_tokenized_output.tokens
#try decoding first 10 ids
output = bpe_tokenizer.decode(bpe_ids[:10])
#collapse runs of whitespace (including newlines) into single spaces
cleaned_output = re.sub(r'\s+', ' ', output).strip()
#print output
print(cleaned_output)

we're volatile i can't call it though it's like too large a


## Splitting the data into training and validation sets
90% of the data will be used for training and 10% for validation

In [11]:
train_data, val_data = utils.train_test_split(tokenizer_ids = bpe_ids, device= device)
train_data.shape, val_data.shape

(torch.Size([139887]), torch.Size([15543]))
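
The sizes above are what a sequential 90/10 split of 155,430 token ids produces. The sketch below checks that arithmetic with plain Python lists. `split_ids` is a hypothetical helper, and the assumption that `utils.train_test_split` cuts the ids in order (rather than shuffling first) is not confirmed by this file. The notebook's helper also returns tensors on `device`, which is left out here.

```python
def split_ids(ids, train_fraction=0.9):
    """Hypothetical helper: first `train_fraction` of ids for training, rest for validation."""
    n_train = int(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Stand-in for bpe_ids: only the length matters for this check
fake_ids = list(range(155430))
train_ids, val_ids = split_ids(fake_ids)

print(len(train_ids), len(val_ids))  # 139887 15543
```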

## Training Setup for rapGPT

We will create batches so that multiple sequences are trained on in parallel:

- **Block size** = 512 (each sequence in a batch contains 512 tokens)
- **Batch size** = 16 (the number of independent sequences processed in parallel)

(A batch size of 16 was chosen based on the maximum my GPU could handle: an RTX 4080 with 16 GB of VRAM.)

This setup allows efficient training by processing multiple sequences simultaneously, taking advantage of parallelization, while keeping the block size manageable for memory usage.
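
A common way to build such batches is to pick random start positions in the token stream and cut one window per sequence, with the targets being the same window shifted one token to the right. The sketch below shows that pattern on toy data. `sample_batch` is a hypothetical name and `utils.get_batch` may be implemented differently.

```python
import torch

def sample_batch(data, block_size, batch_size, device="cpu"):
    """Hypothetical sketch of a batch sampler; utils.get_batch may differ."""
    # Random start offsets, leaving room for block_size tokens plus one target
    starts = torch.randint(len(data) - block_size, (batch_size,))
    # Inputs: block_size consecutive tokens from each start
    x = torch.stack([data[s : s + block_size] for s in starts])
    # Targets: the same windows shifted one token to the right
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x.to(device), y.to(device)

toy_data = torch.arange(1000)  # stand-in for train_data
xb, yb = sample_batch(toy_data, block_size=8, batch_size=4)

print(xb.shape, yb.shape)
# Each target row is the input row shifted by one position
print(torch.equal(xb[:, 1:], yb[:, :-1]))
```

With the notebook's settings the same call would return two tensors of shape `(16, 512)`.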

In [None]:
X_train, y_train = utils.get_batch(data = train_data, block_size = block_size, batch_size = batch_size, device= device)
X_train.shape, y_train.shape

(tensor([ 7211,  3014,  1572, 13526,  2069,  3196,   184,    42,    36,     4],
        device='cuda:0'),
 tensor([ 3014,  1572, 13526,  2069,  3196,   184,    42,    36,     4,   616],
        device='cuda:0'))

In [19]:
torch.equal(X_train[1:31], y_train[:30])

True

In [22]:
print("Min input index:", X_train.min().item())
print("Max input index:", X_train.max().item())
print("Vocab size:", 30000)


Min input index: 4
Max input index: 29968
Vocab size: 30000
