In this Jupyter Notebook we only clean and prepare the data for later analysis.

The following lines suppress the warning messages that pandas can emit.

In [1]:
import warnings
warnings.filterwarnings('ignore')
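
Silencing every warning can also hide messages that matter, such as deprecation notices. A minimal sketch of a narrower filter follows. It assumes the noise comes from `FutureWarning`, which the notebook does not confirm, and the helper `emit_and_collect` is hypothetical and exists only to show the effect.

```python
import warnings

def emit_and_collect():
    """Emit one FutureWarning and one UserWarning; return the categories that get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # record everything by default
        # Assumption: only FutureWarning should be silenced; adjust the category as needed.
        warnings.filterwarnings('ignore', category=FutureWarning)
        warnings.warn('old API', FutureWarning)
        warnings.warn('check your data', UserWarning)
    return [w.category.__name__ for w in caught]

print(emit_and_collect())
```

Only the `UserWarning` is recorded, so other warnings stay visible while the chosen category is ignored.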

Now we will use pandas to load our CSV file so that we can work with the data.

In [2]:
import pandas as pd
df = pd.read_csv('lyrics.csv')

Here we will look at the different genres that exist in the dataset.

In [3]:
df.genre.unique()

array(['Pop', 'Hip-Hop', 'Not Available', 'Other', 'Rock', 'Metal',
       'Country', 'Jazz', 'Electronic', 'Folk', 'R&B', 'Indie'],
      dtype=object)

We don't want to use _Not Available_ or _Other_, so we remove the rows that have one of these two genres.

In [4]:
df = df[(df.genre != 'Not Available') & (df.genre != 'Other')]
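
The same filter could also be written with `Series.isin`, which scales better when more genres need to be excluded. This is a sketch on a made-up toy frame; the names `toy`, `unwanted` and `toy_kept` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the lyrics data (values are made up).
toy = pd.DataFrame({'genre': ['Pop', 'Other', 'Rock', 'Not Available'],
                    'song': ['a', 'b', 'c', 'd']})

unwanted = ['Not Available', 'Other']
# Keep every row whose genre is NOT in the unwanted list.
toy_kept = toy[~toy['genre'].isin(unwanted)]
print(toy_kept['genre'].tolist())
```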

Now we want to remove the rows where the years are not reasonable.

In [5]:
df.year.unique()

array([2009, 2007, 2013, 2010, 2012, 2006, 2016, 2011, 2015, 2008, 2014,
       1998, 2002, 1995, 2004, 1972, 2005, 1978, 1970, 1981, 1994, 1997,
       2003, 1983, 1987, 1993, 1982, 1986, 1992, 1984, 1977, 1989, 1979,
       1996, 2001, 1999, 1990, 1974, 1975, 1973, 1991, 1985, 1971, 2000,
       1980, 1976,  702, 1988,  112, 2038, 1968,   67], dtype=int64)

In [6]:
# Keep only rows from 1900 onwards; this drops the years 702, 112 and 67.
df = df[df.year >= 1900]

The year _2038_ lies in the future and cannot be correct, so we delete those rows as well.

In [8]:
df = df[df.year != 2038]
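
Both year checks could be combined into a single range filter with `Series.between`, which is inclusive on both ends by default. This is a sketch on toy data. The lower bound is the 1900 cutoff used above; the upper bound of 2016 is an assumption based on the latest plausible year in the listing.

```python
import pandas as pd

# Toy frame with a mix of plausible and implausible years.
toy = pd.DataFrame({'year': [2009, 702, 1968, 2038, 112, 67, 2016]})

# Assumed bounds: 1900 as above, 2016 as the latest plausible year in the data.
MIN_YEAR, MAX_YEAR = 1900, 2016
toy_valid = toy[toy['year'].between(MIN_YEAR, MAX_YEAR)]
print(toy_valid['year'].tolist())
```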

We also delete all rows whose lyrics are _NaN_, because we need the lyrics to build our classifier.

In [9]:
df = df[df.lyrics.notnull()]

The data is now cleaned. Next we prepare it for analysis.
<br>
We are going to do the following:<br>
Lowercasing: Tomato → tomato<br>
Normalization: normalisation → normalization<br>
Stemming or lemmatization: dogs → dog

First we load the list of English stop words. They are removed later, when the lyrics are split into tokens.

In [10]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast

We strip surrounding whitespace, replace newlines with spaces, and remove tabs and a few punctuation marks (`!`, `?`, `,`, `.`):

In [11]:
df['lyrics'] = [x.strip().replace('\n', ' ')
                     .replace('!', '')
                     .replace('?', '')
                     .replace(',', '')
                     .replace('.', '')
                     .replace('\t', '')
                      for x in df['lyrics']]
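
If more punctuation needs to go, chaining `replace` calls gets long. The sketch below uses `str.translate` from the standard library instead. It deliberately differs from the cell above in two labeled ways: it removes all ASCII punctuation except the apostrophe (an assumption, so that contractions such as "I'm" survive), and it turns tabs into spaces rather than deleting them so that neighbouring words are not merged. The names `table` and `clean_text` are hypothetical.

```python
import string

# Map every ASCII punctuation character except the apostrophe to None (deleted),
# and map newlines and tabs to a space.
table = str.maketrans({**{c: None for c in string.punctuation if c != "'"},
                       '\n': ' ', '\t': ' '})

def clean_text(text):
    """Remove punctuation and collapse all runs of whitespace into single spaces."""
    return ' '.join(text.translate(table).split())

print(clean_text("Hello, world!\nIt's  me?\t(yes)"))
```

It could be applied to the column with `df['lyrics'].apply(clean_text)`.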


Now we will see how our lyrics look.

In [12]:
df['lyrics'][0]

"Oh baby how you doing You know I'm gonna cut right to the chase Some women were made but me myself I like to think that I was created for a special purpose You know what's more special than you You feel me It's on baby let's get lost You don't need to call into work 'cause you're the boss For real want you to show me how you feel I consider myself lucky that's a big deal Why Well you got the key to my heart But you ain't gonna need it I'd rather you open up my body And show me secrets you didn't know was inside No need for me to lie It's too big it's too wide It's too strong it won't fit It's too much it's too tough He talk like this 'cause he can back it up He got a big ego such a huge ego I love his big ego it's too much He walk like this 'cause he can back it up Usually I'm humble right now I don't choose You can leave with me or you could have the blues Some call it arrogant I call it confident You decide when you find on what I'm working with Damn I know I'm killing you with them

We only want to use English lyrics, so we delete all rows whose lyrics are not in English.

In [13]:
df.shape

(237421, 6)

In [14]:
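from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    # Lyrics that langdetect cannot classify are treated as non-English.
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False

# Build a boolean mask once instead of dropping rows one by one.
df = df[df['lyrics'].apply(is_english)]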

In [15]:
df.shape

(216614, 6)

The lyrics look good. Now we split each lyrics string into a list of tokens and drop the stop words, because the function we train with only accepts tokens.

In [16]:
# split() without arguments splits on any whitespace and never yields empty tokens.
df['lyrics'] = df['lyrics'].apply(lambda x: [item for item in x.split() if item.lower() not in stop])
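
Splitting on a single space, as in `'a  b'.split(' ')`, leaves empty strings in the result when the text contains consecutive spaces, whereas `str.split()` without arguments does not. A small stand-alone sketch of the tokenization step follows. The stop list `toy_stop` is a made-up miniature, not NLTK's English list, and `tokenize` is a hypothetical helper.

```python
# Hypothetical miniature stop list; the notebook uses NLTK's English list instead.
toy_stop = {'the', 'a', 'to', 'i'}

def tokenize(text, stop_words):
    """Split on any whitespace and drop stop words (case-insensitive)."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]

print('a  b'.split(' '))                       # consecutive spaces give an empty token
print(tokenize('I  like to think', toy_stop))  # no empty tokens, stop words removed
```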

In [18]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

In [19]:
def stemSentence(sentence):
    # Lowercase and stem every token in the list.
    return [porter.stem(word.lower()) for word in sentence]

In [20]:
df['lyrics'] = [stemSentence(x) for x in df['lyrics']]
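
To see what the stemming step does to individual tokens, here is a short sketch with NLTK's `PorterStemmer`. The sample tokens are illustrative only and are not taken from the dataset.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative tokens only; they are not taken from the dataset.
sample_tokens = ['Dogs', 'running', 'dog']
stemmed = [porter.stem(tok.lower()) for tok in sample_tokens]
print(stemmed)
```

Inflected forms such as "Dogs" and "dog" end up as the same token, which reduces the vocabulary the classifier has to deal with.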

In [22]:
# The file is tab-separated despite its .csv extension; pass sep='\t' when reading it back.
df.to_csv('./cleaned.csv', sep='\t')