### Cleaning the dataframe

In this notebook, I am going to clean up my dataframe, mainly by using regex to get rid of unwanted characters and by removing duplicates.

In [594]:
import pandas as pd
from difflib import SequenceMatcher as sm # For comparing similarity of lyrics
import regex as re
from nltk.tokenize import RegexpTokenizer

Reading in my newly constructed dataframe

In [595]:
df = pd.read_csv('./Rappers.csv', index_col=0, parse_dates=['year'])

In [598]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34109 entries, 0 to 99
Data columns (total 4 columns):
lyrics    32570 non-null object
year      34109 non-null object
title     34108 non-null object
artist    34109 non-null object
dtypes: object(4)
memory usage: 1.3+ MB


Dropping rows where there are nulls in the lyrics column. I lose $1539$ rows after doing this.

In [599]:
len(df[df['lyrics'].isnull()].index)

1539

In [600]:
df = df[df['lyrics'].notnull()]

In [601]:
df.reset_index(inplace=True)

Getting rid of words/characters from lyrics that are contained inside brackets, curly braces, or parentheses. I'm also replacing each run of newline characters with a single space.

In [602]:
# Match the shortest run of text enclosed in [], () or {}
pat = r"\[.+?\]|\(.+?\)|\{.+?\}"
re_pat = re.compile(pat)
df['lyrics'] = df['lyrics'].apply(lambda text: re_pat.sub("", text))

In [603]:
# Collapse each run of newlines into a single space
pat = r"\n+"
re_pat = re.compile(pat)
df['lyrics'] = df['lyrics'].apply(lambda text: re_pat.sub(" ", text))
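As a quick check of what the two substitutions above do, here is a minimal sketch on a made-up lyric string. It uses the standard-library `re` module rather than `regex` (these particular patterns behave the same in both), and `strip_annotations` is a hypothetical helper name that does not appear in the notebook.

```python
import re

# Same patterns as in the cells above, written as raw strings.
ANNOTATION = re.compile(r"\[.+?\]|\(.+?\)|\{.+?\}")
NEWLINES = re.compile(r"\n+")

def strip_annotations(text):
    """Remove bracketed annotations, then turn each newline run into one space."""
    text = ANNOTATION.sub("", text)
    return NEWLINES.sub(" ", text)

# Made-up sample, not taken from the dataset.
sample = "[Verse 1]\nHello (yeah) world\n\n{x2} again"
cleaned = strip_annotations(sample)
print(repr(cleaned))
print(cleaned.split())  # ['Hello', 'world', 'again']
```

Two things to keep in mind: `.` does not match a newline by default, so an annotation that spans several lines is left in place, and the removals can leave double spaces behind.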

In [604]:
df.drop('index',axis=1,inplace=True)

In [605]:
df.reset_index(inplace=True)

In [606]:
df.drop('index',axis=1,inplace=True)

Dropping rows whose lyrics are an empty string

In [607]:
# Keep only the rows whose lyrics are not an empty string
df = df[df['lyrics'] != '']

In [608]:
df.reset_index(inplace=True)

In [623]:
df.drop('index',inplace=True,axis=1)
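The `reset_index` / `drop('index')` pairs above can be folded into one call, since `reset_index(drop=True)` discards the old index instead of adding it as a column. A minimal sketch on a toy frame (the data below is invented for illustration), combining the null and empty-string filters:

```python
import pandas as pd

# Toy frame standing in for the real data: one null and one empty lyric.
toy = pd.DataFrame({
    "lyrics": ["some words", None, "", "more words"],
    "title": ["a", "b", "c", "d"],
})

cleaned = toy.dropna(subset=["lyrics"])      # drop rows with null lyrics
cleaned = cleaned[cleaned["lyrics"] != ""]   # drop rows with empty lyrics
cleaned = cleaned.reset_index(drop=True)     # renumber without an 'index' column

print(cleaned)
```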

Creating a function that makes use of `SequenceMatcher` from the `difflib` library. This function compares two strings and computes a ratio indicating how similar they are. If the ratio is above $.3$, it returns `True`. This function will be used to remove duplicates

In [609]:
def songsAreSame(s1, s2):
    # ratio() can depend on argument order, so check both directions
    seqA = sm(None, s1, s2)
    seqB = sm(None, s2, s1)
    return seqA.ratio() > 0.3 or seqB.ratio() > 0.3
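To illustrate why the function checks both argument orders, here is a small sketch using the example from the `difflib` documentation, where swapping the arguments changes the ratio. `best_ratio` is a hypothetical helper introduced only for this illustration.

```python
from difflib import SequenceMatcher

def best_ratio(s1, s2):
    """Return the larger of the two ratios obtained by swapping the arguments."""
    return max(SequenceMatcher(None, s1, s2).ratio(),
               SequenceMatcher(None, s2, s1).ratio())

print(SequenceMatcher(None, "tide", "diet").ratio())  # 0.25
print(SequenceMatcher(None, "diet", "tide").ratio())  # 0.5
print(best_ratio("tide", "diet"))                     # 0.5
```

The ratio is `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings, so it always falls between 0 and 1.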

Creating a similar function that compares titles to see if they are the same. The threshold in this function is $.5$

In [611]:
def titlesAreSame(s1, s2):
    # Same two-direction check as for lyrics, with a stricter threshold
    seqA = sm(None, s1, s2)
    seqB = sm(None, s2, s1)
    return seqA.ratio() > 0.5 or seqB.ratio() > 0.5

Dropping rows that are duplicates by comparing the titles and lyrics of rows that are next to each other (all duplicates appear next to each other). A row is flagged if either its lyrics or its title is similar to that of the row before it. Dropping duplicates results in losing $6569$ rows.

In [612]:
to_drop = []
for i in range(len(df)-1):
    if songsAreSame(df.loc[i,'lyrics'],df.loc[i+1,'lyrics']) or titlesAreSame(df.loc[i, 'title'],df.loc[i+1,'title']):
        to_drop.append(i+1)

In [613]:
len(to_drop)

6569

In [614]:
df.drop(index=to_drop, inplace=True)
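The adjacent-row comparison can be sketched without pandas on a toy list of `(title, lyrics)` pairs. The helper names `similar` and `adjacent_duplicates` and the placeholder strings are assumptions made for this illustration; only the thresholds ($.3$ for lyrics, $.5$ for titles) come from the notebook.

```python
from difflib import SequenceMatcher

def similar(s1, s2, threshold):
    """True if either argument order gives a ratio above the threshold."""
    return (SequenceMatcher(None, s1, s2).ratio() > threshold
            or SequenceMatcher(None, s2, s1).ratio() > threshold)

def adjacent_duplicates(rows, lyric_threshold=0.3, title_threshold=0.5):
    """Return positions of rows that resemble the row directly before them.

    rows is a list of (title, lyrics) pairs.
    """
    to_drop = []
    for i in range(len(rows) - 1):
        (title_a, lyrics_a), (title_b, lyrics_b) = rows[i], rows[i + 1]
        if (similar(lyrics_a, lyrics_b, lyric_threshold)
                or similar(title_a, title_b, title_threshold)):
            to_drop.append(i + 1)
    return to_drop

# Toy data: placeholder strings stand in for real lyrics.
rows = [
    ("Juicy", "aaaa bbbb"),
    ("Juicy", "aaaa bbbb"),
    ("Hypnotize", "cccc dddd"),
]
flagged = adjacent_duplicates(rows)
kept = [row for i, row in enumerate(rows) if i not in set(flagged)]
print(flagged)  # [1]
```

Each row is compared with the row directly before it even when that row is itself flagged, so a run of three near-identical rows loses its second and third entries and keeps the first.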

Manually replacing the names of some of the artists with the correctly written and punctuated versions

In [615]:
df.replace('J Cole', 'J. Cole', inplace=True)

In [616]:
df.replace('TI', 'T.I.', inplace=True)

In [617]:
df.replace('The Notorious BIG', 'Notorious B.I.G.', inplace=True)

In [618]:
df.replace('DMC', 'Run DMC', inplace=True)

In [619]:
df.replace('NWA', 'N.W.A', inplace=True)

In [620]:
df.replace('BoB', 'B.o.B', inplace=True)

In [621]:
df.replace('RA The Rugged Man', 'R.A. The Rugged Man', inplace=True)

In [622]:
df.replace('Sir Mix-a-Lot', 'Sir Mix-A-Lot', inplace=True)
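`df.replace` looks at every column, so a title that happens to be exactly `'TI'` or `'DMC'` would be rewritten as well. A possible alternative, sketched here on an invented toy frame, is to collect the fixes in one dict and apply them to the `artist` column only. The name `artist_fixes` is hypothetical; the pairs are the ones used in the cells above.

```python
import pandas as pd

# Old name -> corrected name, taken from the replace calls above.
artist_fixes = {
    "J Cole": "J. Cole",
    "TI": "T.I.",
    "The Notorious BIG": "Notorious B.I.G.",
    "DMC": "Run DMC",
    "NWA": "N.W.A",
    "BoB": "B.o.B",
    "RA The Rugged Man": "R.A. The Rugged Man",
    "Sir Mix-a-Lot": "Sir Mix-A-Lot",
}

# Toy frame: the first title deliberately collides with an artist name.
toy = pd.DataFrame({
    "artist": ["J Cole", "TI", "Nas"],
    "title": ["TI", "song b", "song c"],
})

# Exact-match replacement, restricted to the artist column.
toy["artist"] = toy["artist"].replace(artist_fixes)
print(toy)
```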

In [623]:
df.drop('index',inplace=True,axis=1)

In [627]:
df.reset_index(inplace=True)

Checking my artists to make sure their names are spelled correctly and with the correct punctuation / characters

In [632]:
df['artist'].unique()

array(['MF Doom', 'XXXTentacion', 'A$AP Rocky', 'Chance the Rapper',
       '2 Chainz', "Cam'ron", 'Pimp C', 'Raekwon', 'Nelly', 'J Dilla',
       'Vic Mensa', 'Lil Dicky', 'D12', 'Trick Daddy', 'Lil Wayne',
       'MC Lyte', 'Slick Rick', 'Talib Kweli', 'Nas', 'Joey Bada$$',
       'Kendrick Lamar', 'Biz Markie', 'Lisa Lopes', 'Brother Ali',
       'Scarface', 'André 3000', 'Wyclef Jean', 'Run DMC', 'Lauryn Hill',
       'Eminem', 'Heavy D', 'Tyler, the Creator', 'Joell Ortiz',
       'Sir Mix-A-Lot', 'Chuck D', 'Method Man', 'Mac Miller', 'G-Eazy',
       'Gucci Mane', 'Joe Budden', 'Sean Combs', 'Chamillionaire',
       'Waka Flocka Flame', 'Danny Brown', 'Lupe Fiasco', 'Kid Rock',
       'Phife Dawg', 'J. Cole', 'Lil Boosie', 'DMX', 'Notorious B.I.G.',
       'Kid Ink', 'Nicki Minaj', 'Cypress Hill', 'Snoop Dogg', 'The Game',
       'Big Daddy Kane', 'Young Buck', "Ol' Dirty Bastard", 'N.W.A',
       'Rae Sremmurd', 'Del tha Funkee Homosapien', 'Kid Cudi', 'Common',
       'Westsid

In [635]:
df.to_csv('./clean_df')