# Processing Eminem Lyrics Dataset from Kaggle

## Overview

The dataset contains information about Eminem's songs. The data consists of the following 6 columns:

1. **Album Name**: The name of the album the song belongs to.
2. **Song Name**: The name of the song.
3. **Song Lyrics**: The lyrics of the song.
4. **Album URL**: The URL of the album.
5. **Song Views**: The number of views the song has received.
6. **Release Date**: The date when the song was released.

For our purpose, we will focus on the **Song Lyrics** column and ignore the other columns.

## Dataset Link

You can access the dataset [here](https://www.kaggle.com/datasets/aditya2803/eminem-lyrics/data).

## Steps for Processing the Dataset

We will be using **Pandas** for data manipulation and extraction of song lyrics.

**Imports**

In [41]:
import pandas as pd
import re
from scripts import normalize
import tiktoken

In [27]:
PATH = "Raw Data/Eminem_Lyrics.csv"
data = pd.read_csv(PATH, sep='\t', comment='#', encoding = "ISO-8859-1")
data.head(5)

| Unnamed: 0 | Album_Name | Song_Name | Lyrics | Album_URL | Views | Release_date | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | Music To Be Murdered By: Side B | Alfred (Intro) | [Intro: Alfred Hitchcock]\nThus far, this albu... | https://genius.com/albums/Eminem/Music-to-be-m... | 24.3K | December 18, 2020 | |
| 1 | Music To Be Murdered By: Side B | Black Magic | [Chorus: Skylar Grey & Eminem]\nBlack magic, n... | https://genius.com/albums/Eminem/Music-to-be-m... | 180.6K | December 18, 2020 | |
| 2 | Music To Be Murdered By: Side B | Alfred's Theme | [Verse 1]\nBefore I check the mic (Check, chec... | https://genius.com/albums/Eminem/Music-to-be-m... | 285.6K | December 18, 2020 | |
| 3 | Music To Be Murdered By: Side B | Tone Deaf | [Intro]\nYeah, I'm sorry (Huh?)\nWhat did you ... | https://genius.com/albums/Eminem/Music-to-be-m... | 210.9K | December 18, 2020 | |
| 4 | Music To Be Murdered By: Side B | Book of Rhymes | [Intro]\nI don't smile, I don't frown, get too... | https://genius.com/albums/Eminem/Music-to-be-m... | 193.3K | December 18, 2020 | |


## Extracting Lyrics to a Text File
Intermediate files are saved in case they are needed later.

In [23]:
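import os

output_file_path = 'Text File/'
lyrics_file_name = 'eminem_lyrics.txt'
# Drop rows with missing lyrics so that every entry is a string
lyrics = data['Lyrics'].dropna()

# Create the output directory if it does not exist yet
os.makedirs(output_file_path, exist_ok=True)

# Write the lyrics to the text file, ending each song's lyrics with a newline
with open(output_file_path + lyrics_file_name, 'w', encoding='utf-8') as f:
    for lyric in lyrics:
        f.write(lyric + '\n')

print(f"Lyrics have been written to {output_file_path + lyrics_file_name}")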

Lyrics have been written to Text File/eminem_lyrics.txt


The lyrics are divided into sections such as Intro, Outro, Chorus, and Verse. <br><br>
**We are only interested in the [Verse] sections, since they contain the rap portion of each song.**

In [24]:
# Open the lyrics text file
with open(output_file_path + lyrics_file_name, 'r', encoding="utf-8") as file:
    text = file.read()
# Capture everything after a '[Verse ...]' header, up to the next section header or the end of the text
verse_only = re.findall(r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)', text, re.DOTALL)
# Join the captured verses into a single string
verse_only = '\n'.join(verse_only)

verse_file_name = 'verse_only.txt'
# Write the verses to a text file
with open(output_file_path + verse_file_name, "w", encoding="utf-8") as f:
    f.write(verse_only)
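
To see what the verse regex keeps and drops, it can help to run the same pattern on a tiny hand-written sample. The sample text and the names `VERSE_PATTERN` and `sample` below are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
VERSE_PATTERN = re.compile(r'\[Verse.*?\]\n(.*?)(?=\n\[\w|\Z)', re.DOTALL)

# Hypothetical miniature "song" with several section headers
sample = (
    "[Intro]\n"
    "intro line\n"
    "[Verse 1]\n"
    "first verse line one\n"
    "first verse line two\n"
    "[Chorus]\n"
    "chorus line\n"
    "[Verse 2: Guest]\n"
    "second verse line"
)

# findall returns only the captured group, i.e. the body of each verse
verses = VERSE_PATTERN.findall(sample)
for verse in verses:
    print(repr(verse))
```

The Intro and Chorus bodies are skipped, and each verse body is returned without its header. Because the lookahead requires a word character right after `[`, a line such as `[?]` inside a verse would not end the capture.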

## Normalize Text
1. Remove unwanted characters but keep newlines
2. Normalize multiple spaces to a single space
3. Remove trailing spaces before newlines
4. Normalize multiple newlines to a single newline
5. Convert to lower case
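
The `normalize` module in `scripts` is not shown in this notebook, so the following is only a rough sketch of how the five steps above could be written with `re`. The function name `normalize_keep_newlines` and the set of characters that are kept (letters, digits, apostrophes, spaces, and newlines) are assumptions, not the actual `preprocess_text_with_newlines` implementation.

```python
import re

def normalize_keep_newlines(text: str) -> str:
    """Hypothetical stand-in for the five normalization steps listed above."""
    # 1. Remove unwanted characters but keep newlines
    #    (assumption: keep letters, digits, apostrophes, spaces, newlines)
    text = re.sub(r"[^A-Za-z0-9'\n ]", "", text)
    # 2. Normalize multiple spaces to a single space
    text = re.sub(r" {2,}", " ", text)
    # 3. Remove trailing spaces before newlines
    text = re.sub(r" +\n", "\n", text)
    # 4. Normalize multiple newlines to a single newline
    text = re.sub(r"\n{2,}", "\n", text)
    # 5. Convert to lower case
    return text.lower()

# Made-up input with punctuation, double spaces, and blank lines
raw = "We're  volatile, I can't call it though \n\n\nIt's like... yeah"
print(repr(normalize_keep_newlines(raw)))
```

The order of the steps matters in this sketch: trailing spaces are stripped before newlines are collapsed, so lines that contain only spaces also disappear.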

**We keep newlines because doing so:**

1. **Preserves Structure and Rhythm:**
   - Rap lyrics are often structured in lines with rhymes, rhythms, and pauses. Keeping newlines helps the model learn this structure, making the generated lyrics feel more natural and rhythmic.
2. **Improves Readability:**
   - If the model generates lyrics with line breaks, it will be easier to read and evaluate during testing or usage.
3. **Captures Line-Level Context:**
   - By retaining newlines, the model can learn dependencies between consecutive lines without treating them as a continuous block of text.
4. **Helps During Post-Processing:**
   - You can always remove or modify newlines later if needed, but adding them back after training might be harder since the original structure would have been lost.

In [43]:
cleaned_verse_only = normalize.preprocess_text_with_newlines(verse_only)
cleaned_verse_only[:100]

"we're volatile i can't call it though\nit's like too large a peg and too small a hole yeah\nbut she ch"

## Encoding the Text with the GPT-2 BPE Tokenizer

In [48]:
# Load the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# Tokenize the text
tokens = tokenizer.encode(cleaned_verse_only)

# Decode the tokens back to text
#decoded_text = tokenizer.decode(tokens[:10])
#print("Decoded text:", decoded_text)
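
For intuition about what "BPE" (byte-pair encoding) means, the toy below repeatedly merges the most frequent adjacent pair of symbols. It is a simplified sketch only: it works on characters rather than bytes, learns its merges from the input string itself, and is not how `tiktoken` or the GPT-2 vocabulary is implemented. The function names `most_frequent_pair`, `merge_pair`, and `toy_bpe` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("aaabdaaabac", 1))
```

Joining the resulting tokens always reproduces the input string, which mirrors the encode/decode round trip hinted at by the commented-out lines above.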