In this Jupyter Notebook we will only clean and prepare the data for an analysis.

The following lines suppress the warning messages (not errors) that can appear when using pandas.

In [1]:
import warnings
warnings.filterwarnings('ignore')
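
The filter above silences every warning for the rest of the session. A narrower alternative, sketched here with the standard library only, is to suppress one warning category inside a `with` block. The function `noisy_step` is a hypothetical stand-in for any call that emits a `FutureWarning`.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a call that emits a FutureWarning."""
    warnings.warn("this behaviour will change", FutureWarning)
    return "done"


# Suppress only FutureWarning, and only inside this block;
# the previous filter state is restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()
```

Other warning categories still surface this way, which can help when a later step misbehaves.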

Now we will use pandas to load our CSV file so that we can work with the data. 'lyrics.csv' is an original file from Kaggle.

In [2]:
import pandas as pd
df = pd.read_csv('lyrics.csv')

Here we will look at the different genres that exist in the dataset.

In [3]:
df.genre.unique()

array(['Pop', 'Hip-Hop', 'Not Available', 'Other', 'Rock', 'Metal',
       'Country', 'Jazz', 'Electronic', 'Folk', 'R&B', 'Indie'],
      dtype=object)

We don't want to use _Not Available_ or _Other_, so we delete the rows that have one of these two genres.

In [4]:
df = df[(df.genre != 'Not Available') & (df.genre != 'Other')]
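
The same row filter can be written with `isin`, which scales better if more genres need to be excluded later. This is a sketch on a toy frame; the song names are made up and only the `genre` column mirrors the real data.

```python
import pandas as pd

# Toy frame standing in for the real data; the song names are made up.
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "genre": ["Pop", "Other", "Not Available", "Rock"],
})

unwanted = ["Not Available", "Other"]
# isin() marks rows whose genre is in the list; ~ inverts the mask.
kept = toy[~toy["genre"].isin(unwanted)]
kept_genres = kept["genre"].tolist()
```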

Now we want to remove the rows where the years are not reasonable.

In [5]:
years = df.year.unique().tolist()

In [6]:
for year in years:
    if year < 1900:
        df = df[df.year != year]

We don't want lyrics from _2038_ either, so we'll delete these as well.

In [7]:
df = df[df.year != 2038]
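
Both year filters can also be expressed as a single boolean mask instead of a Python loop. The sketch below uses a toy frame with made-up values. One difference to be aware of: a row with a missing year fails the `>=` comparison and is dropped here, whereas the loop above keeps it.

```python
import pandas as pd

# Toy frame; the years are invented to cover each case.
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d", "e"],
    "year": [1968, 702, 2038, 2006, 112],
})

# Keep years from 1900 onwards, except the implausible 2038.
mask = (toy["year"] >= 1900) & (toy["year"] != 2038)
plausible = toy[mask]
plausible_years = plausible["year"].tolist()
```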

We also want to delete all rows that have _NaN_ as lyrics, because the lyrics are the only input we use to build our classifier.

In [8]:
df = df[df.lyrics.notnull()]
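
An equivalent way to drop the missing lyrics is `dropna` with a `subset`. The sketch below also removes lyrics that consist only of whitespace; that second step is an assumption on our part and is not done in this notebook.

```python
import pandas as pd

# Toy frame with one missing and one whitespace-only entry.
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "lyrics": ["la la la", None, "   ", "hey jude"],
})

# Same effect as filtering with notnull() on the lyrics column.
with_lyrics = toy.dropna(subset=["lyrics"])

# Extra step (assumption): also drop lyrics that are only whitespace.
non_empty = with_lyrics[with_lyrics["lyrics"].str.strip() != ""]
remaining_songs = non_empty["song"].tolist()
```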

The data is now cleaned. Next we will make it ready to be analyzed.
<br>
We are going to do the following:<br>
Lowercasing: Tomato → tomato<br>
Normalization: normalisation → normalization<br>
Stemming or lemmatization: dogs → dog
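
The three steps listed above can be illustrated on single tokens. This is a toy sketch only: the `SPELLING` map is hypothetical, and the plural rule is a crude stand-in for a real stemmer such as the Porter stemmer used later in this notebook.

```python
# Hypothetical spelling map; a real one would be much larger.
SPELLING = {"normalisation": "normalization", "colour": "color"}


def normalize_token(token):
    """Lowercase a token, unify its spelling, and strip a plural 's'."""
    token = token.lower()
    token = SPELLING.get(token, token)
    # Toy plural rule standing in for a real stemmer.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


examples = [normalize_token(t) for t in ["Tomato", "normalisation", "dogs"]]
```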

First we load the list of English stop words, which we will remove from the lyrics in a later step.

In [12]:
from nltk.corpus import stopwords
# Needs the NLTK stop word corpus (one-time nltk.download('stopwords')).
stop = stopwords.words('english')

Next we strip line breaks, tabs, and common punctuation marks (! ? , .) from the lyrics:<br>

In [13]:
df['lyrics'] = [x.strip().replace('\n', ' ')
                     .replace('!', '')
                     .replace('?', '')
                     .replace(',', '')
                     .replace('.', '')
                     .replace('\t', '')
                      for x in df['lyrics']]
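
The chain of `replace` calls can be collapsed into one pass with `str.translate`. The sketch below mirrors the mapping above character for character; the helper name `clean_text` is our own.

```python
# One translation table: newline becomes a space, the rest is deleted.
CLEAN_TABLE = str.maketrans({
    "\n": " ",
    "!": None,
    "?": None,
    ",": None,
    ".": None,
    "\t": None,
})


def clean_text(text):
    """Strip surrounding whitespace, then apply the table in a single pass."""
    return text.strip().translate(CLEAN_TABLE)


cleaned = clean_text("  Hello,\nworld!\tAgain?  ")
```

Because tabs are deleted rather than replaced by a space, words separated only by a tab are joined ("world" and "Again" above). Mapping `"\t"` to `" "` avoids this if it is not intended.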


Now we will see how our lyrics look.

We only want to use English lyrics, so we will delete all rows whose lyrics are not in English.
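
The notebook does not show how the non-English rows were found. One rough heuristic, given here purely as an assumption, is to keep a text only if enough of its words are very common English words. The word set and the threshold are hand-picked for illustration.

```python
# Small hand-picked set of common English words (an assumption, not a standard list).
COMMON_ENGLISH = {
    "the", "and", "you", "i", "to", "a", "of",
    "in", "it", "my", "me", "is", "that", "on",
}


def looks_english(text, threshold=0.15):
    """Return True if at least `threshold` of the words are common English words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in COMMON_ENGLISH)
    return hits / len(words) >= threshold


english = looks_english("I love the way you look at me in the morning")
french = looks_english("Je ne sais pas pourquoi tu es partie ce soir")
```

A dedicated language detection library would be more reliable, especially for short or mixed-language lyrics.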

The lyrics look good. Now we need to split each lyrics string into a list of tokens so that our classifier can be trained on it (it only accepts tokens). Stop words are dropped in the same step.

In [15]:
df['lyrics'] = df['lyrics'].apply(lambda x: [item for item in x.split(' ') if item.lower() not in stop])
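
Splitting on a single space leaves empty strings wherever two spaces follow each other, and those empty tokens pass the stop word filter. A variant that avoids this, sketched with a tiny hypothetical stop word set in place of the NLTK list, uses `split()` without an argument.

```python
# Tiny stand-in for the NLTK English stop word list (assumption).
STOP = {"the", "a", "in", "and"}


def tokenize(text, stop_words=STOP):
    """Split on any run of whitespace and drop stop words (case-insensitive)."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]


text = "Dancing in  the Moonlight"
tokens = tokenize(text)

# For comparison: splitting on ' ' yields an empty token for the double space.
naive_tokens = text.split(" ")
```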

In [16]:
import nltk

In [17]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

In [18]:
def stemSentence(sentence):
    # Lowercase and stem every token in a list of tokens.
    return [porter.stem(word.lower()) for word in sentence]

In [None]:
df['lyrics'] = [stemSentence(x) for x in df['lyrics']]

In [22]:
# Note: the file is tab-separated despite the .csv extension.
df.to_csv('./cleaned.csv', sep='\t')
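
When the file is read back, each token list arrives as its string representation, not as a list. The sketch below shows one way to restore the lists with `ast.literal_eval`. It writes to an in-memory buffer instead of a file and passes `index=False`; both are assumptions made to keep the example self-contained.

```python
import ast
import io

import pandas as pd

# Toy frame with one tokenized, stemmed lyric.
toy = pd.DataFrame({"genre": ["Pop"], "lyrics": [["danc", "moonlight"]]})

# Write tab-separated text to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# The separator must match the one used when writing.
loaded = pd.read_csv(buffer, sep="\t")
raw_value = loaded["lyrics"][0]  # a string such as "['danc', 'moonlight']"

# Parse the string representation back into a Python list.
loaded["lyrics"] = loaded["lyrics"].apply(ast.literal_eval)
restored = loaded["lyrics"][0]
```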