# Pre-Processing Data
This notebook takes the large song-lyrics dataset from [this Kaggle repo](https://www.kaggle.com/datasets/nikhilnayak123/5-million-song-lyrics-dataset?resource=download), extracts the country songs, standardizes their format, and writes a CSV file containing ~100k songs.

### Import Packages

In [10]:
import pandas as pd
import re

### Import Data

In [2]:
data = pd.read_csv('ds2.csv')
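
If the full file does not fit comfortably in memory, one option is to read it in chunks and keep only the rows of interest. The sketch below assumes the file has a `tag` column (as used later in this notebook). The helper name `read_tag_in_chunks`, the toy rows, and the `title` column are hypothetical and only illustrate the idea.

```python
import io
import pandas as pd

def read_tag_in_chunks(source, tag, chunksize=100_000):
    """Return only the rows whose `tag` column equals `tag`, reading `source` chunk by chunk."""
    matches = [
        chunk[chunk["tag"] == tag]
        for chunk in pd.read_csv(source, chunksize=chunksize)
    ]
    return pd.concat(matches, ignore_index=True)

# Toy stand-in for ds2.csv (hypothetical rows, not real data).
toy_csv = io.StringIO(
    "title,tag,lyrics\n"
    "a,rap,one\n"
    "b,country,two\n"
    "c,pop,three\n"
    "d,country,four\n"
)
country_only = read_tag_in_chunks(toy_csv, "country", chunksize=2)
print(country_only)
```

With the real file, `source` would be the path `'ds2.csv'`; whether chunking is needed depends on the available memory.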

For each tag (rap, rb, rock, pop, misc, country), write a separate CSV file containing only that tag's rows.

In [7]:
for tag in data.tag.unique():
    print(tag)
    # Select the rows for this tag and write them to "<tag>_data.csv".
    subset = data[data.tag == tag]
    subset.to_csv(f"{tag}_data.csv", encoding='utf-8', index=False)

rap
rb
rock
pop
misc
country
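
The per-tag split could also be expressed with `groupby`, which partitions the frame in a single pass instead of filtering once per tag. This is a minimal sketch on toy data. The helper name `split_by_tag` is hypothetical, and note that `groupby` skips rows whose tag is missing by default.

```python
import pandas as pd

def split_by_tag(df):
    """Map each distinct value of the `tag` column to the sub-frame of its rows."""
    return {tag: group for tag, group in df.groupby("tag")}

# Toy data standing in for the full dataset.
toy = pd.DataFrame({
    "tag": ["rap", "country", "rap"],
    "lyrics": ["x", "y", "z"],
})
parts = split_by_tag(toy)

# Each part could then be written out as in the cell above, e.g.:
# parts[tag].to_csv(f"{tag}_data.csv", encoding="utf-8", index=False)
print(sorted(parts))
```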


Import the country data.

In [8]:
country = pd.read_csv('country_data.csv')
country.lyrics.head()

0    O Death, where is thy sting?\nO Grave, where i...
1    [Verse 1]\nThey used to call me lightning\nI w...
2    [Verse 1]\nYou were in college, working part-t...
3    [Verse 1]\nHe was born in the summer of his 27...
4    [Verse 1: Kid Rock]\nA shimmy shimmy go go mot...
Name: lyrics, dtype: object

### Cleaning Data

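First, we define a cleaning function that lowercases the lyrics, removes bracketed section tags (e.g. `[Verse 1]`) and leftover section labels, collapses blank lines, strips punctuation, and pads each newline with spaces.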

In [81]:
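def clean(lyrics):
    # Lowercase, then drop bracketed section tags such as "[Verse 1: Kid Rock]".
    # The non-greedy ".*?" stops at the first "]", so text between two tags on one line survives.
    j = re.sub(r"\[.*?\]", "", lyrics.lower())
    # Drop section labels that appear without brackets.
    # Note: this also removes these words when they occur inside the lyrics themselves.
    j = re.sub(r"verse [1-4]|chorus|outro", "", j)
    # Strip leading newlines and collapse runs of newlines into a single newline.
    j = re.sub(r"\n+", "\n", j.lstrip("\n"))
    # Keep only letters, digits, underscores, spaces, and newlines.
    j = re.sub(r"[^0-9a-zA-Z_ \n]+", "", j)
    # Pad every newline with a space on each side.
    return j.replace("\n", " \n ")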

Example of cleaning to make sure it works:

In [82]:
clean(country.lyrics[1])

'they used to call me lightning \n i was always quick to strike \n had everything i own \n in the saddles on my bike \n i had a reputation \n for never staying very long \n just like a wild and restless drifter \n like a cowboy in a song \n i met a dark haired beauty \n where they lay the whiskey down \n in southern arizona \n in a little border town \n she had to dance for money \n in that dusty old saloon \n i dropped a dollar in the jukebox \n played that girl a tune yea \n never see it coming \n it just hits you by surprise \n its that cold place in your soul \n that fire in her eyes \n makes you come together \n like wild horses when they run \n now the cards are on the table and \n the bullets in the gun \n she was sitting on my lap \n we still had shots to kill \n when a man pulled up \n who owned the bar \n in a cadillac deville \n he grabbed her by her raven hair \n and threw her on the floor \n said no free rides for the cowboys \n that aint what i pay you for no \n she jumpe
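
One thing worth checking in a bracket-removal regex is greedy versus non-greedy matching. The toy line below (made up for illustration, not taken from the dataset) has two tags on the same line: a greedy `.*` swallows everything between the first `[` and the last `]`, while a non-greedy `.*?` removes each tag separately.

```python
import re

# Hypothetical line with two bracketed tags.
line = "[intro] hello [bridge] world"

# Greedy: matches from the first "[" to the last "]" on the line.
greedy = re.sub(r"\[.*\]", "", line)

# Non-greedy: matches each "[...]" tag on its own.
lazy = re.sub(r"\[.*?\]", "", line)

print(repr(greedy))  # ' world'
print(repr(lazy))    # ' hello  world'
```

Because `.` does not match a newline by default, either pattern only ever removes text within a single line.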

In [83]:
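# Clean every string entry (skipping non-strings such as NaN) and keep only results longer than 200 characters.
cleaned = [text for text in (clean(lyric) for lyric in country.lyrics if isinstance(lyric, str)) if len(text) > 200]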
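
The same clean-then-filter step could also be written with pandas Series operations. The sketch below is one possible formulation, not the notebook's method: the helper name `keep_long_texts` is hypothetical, and `str.lower` stands in for the cleaning function so that the example is self-contained.

```python
import pandas as pd

def keep_long_texts(lyrics, clean_fn, min_chars=200):
    """Clean the string entries of `lyrics` and return those longer than `min_chars`."""
    # Keep only real strings (drops NaN and other non-string values).
    texts = lyrics[lyrics.map(lambda value: isinstance(value, str))]
    cleaned = texts.map(clean_fn)
    return cleaned[cleaned.str.len() > min_chars].tolist()

# Toy data: one long entry, one short entry, one missing value.
toy_lyrics = pd.Series(["A" * 300, "short", float("nan")])
result = keep_long_texts(toy_lyrics, str.lower)
print(len(result))
```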

### Export
Export the cleaned lyrics as a CSV file with UTF-8 encoding, one song per row, without a header or index column.

In [84]:
clean_pd = pd.DataFrame(cleaned)
clean_pd.to_csv("country_data_cleaned_spaces.csv",encoding="utf-8",index=False,header=False)
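
Since the cleaned lyrics contain newline characters, it can be worth confirming that a headerless export reads back unchanged. The sketch below round-trips two made-up strings through an in-memory CSV rather than a file. pandas quotes fields containing newlines by default, so each song should stay in a single row.

```python
import io
import pandas as pd

# Made-up cleaned strings; the first contains a padded newline.
toy_cleaned = ["hello \n world", "one line only"]

# Write without index or header, mirroring the export options above.
csv_text = pd.DataFrame(toy_cleaned).to_csv(index=False, header=False)

# header=None tells read_csv that the first row is data, not column names.
restored = pd.read_csv(io.StringIO(csv_text), header=None)[0].tolist()
print(restored == toy_cleaned)
```

For the real output, the same check would read `"country_data_cleaned_spaces.csv"` with `header=None`.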