# Data Cleaning 

The following code combines the 1990-1999 and 2000-2009 songs from the Kaggle dataset with the Billboard Top Songs dataset, assigning label 1 to Billboard top songs and 0 to all other songs.

In [1]:

import pandas as pd 
import re

## Top 100 Songs Dataset

In [3]:

top_100= pd.read_csv('top_100_final.csv')
top_100

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,1960,Theme from A Summer Place,Percy Faith,1,There's a summer placeWhere it may rain or sto...,https://genius.com/percy-faith-theme-from-a-su...,easy listening
1,1960,He'll Have to Go,Jim Reeves,1,Put your sweet lips a little closer to the pho...,https://genius.com/jim-reeves-hell-have-to-go-...,country
2,1960,Cathy's Clown,The Everly Brothers,1,Don't want your love anymoreDon't want your ki...,https://genius.com/the-everly-brothers-cathys-...,rock
3,1960,Running Bear,Johnny Preston,1,*vocalizations*On the bank of the riverStood R...,https://genius.com/johnny-preston-running-bear...,pop
4,1960,Teen Angel,Mark Dinning,1,Teen AngelTeen AngelTeen AngelThat fateful nig...,https://genius.com/mark-dinning-teen-angel-lyrics,pop
...,...,...,...,...,...,...,...
6221,2022,Flower Shops,Ernest featuring Morgan Wallen,1,"It's a beautiful day, she's been cryin' all ni...",https://genius.com/ernest-flower-shops-lyrics,
6222,2022,To the Moon,Jnr Choi and Sam Tompkins,1,"Sit by myselfTalking to the moonTeh, ha, pull ...",https://genius.com/jnr-choi-to-the-moon-lyrics,
6223,2022,Unholy,Sam Smith and Kim Petras,1,Mummy don't know Daddy's getting hot At the Bo...,,pop
6224,2022,One Mississippi,Kane Brown,1,You and IHad this off and on so longYou've bee...,https://genius.com/kane-brown-one-mississippi-...,country


## Labels for 1990-1999 songs

### Top 1990-1999 songs

In [4]:
top_90_99= top_100[(top_100['Year']>=1990 ) & (top_100['Year']<=1999)].reset_index(drop=True)
top_90_99

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,1990,Hold On,Wilson Phillips,1,I know this pain Why do you lock yourself up i...,https://genius.com/wilson-phillips-hold-on-lyrics,pop
1,1990,It Must Have Been Love,Roxette,1,Must have been loveBut it's over nowLay a whis...,https://genius.com/roxette-it-must-have-been-l...,rock
2,1990,Nothing Compares 2 U,Sinéad O'Connor,1,It's been seven hours and fifteen daysSince yo...,https://genius.com/sinead-oconnor-nothing-comp...,pop
3,1990,Poison,Bell Biv DeVoe,1,"Yeah, Spyderman and Freeze in full effectUh-hu...",https://genius.com/bell-biv-devoe-poison-lyrics,rb
4,1990,Vogue,Madonna,1,Strike a poseStrike a poseVogue Vogue Look aro...,https://genius.com/madonna-vogue-lyrics,pop
...,...,...,...,...,...,...,...
992,1999,Better Days (And the Bottom Drops Out),Citizen King,1,In my shoes my toes are bustedMy kitchen says ...,https://genius.com/citizen-king-better-days-an...,other
993,1999,Music of My Heart,NSYNC and Gloria Estefan,1,You'll never know what you've done for meWhat ...,https://genius.com/nsync-music-of-my-heart-lyrics,pop
994,1999,Write This Down,George Strait,1,I never saw the end in sightFools are kind of ...,https://genius.com/george-strait-write-this-do...,country
995,1999,When You Believe,Whitney Houston and Mariah Carey,1,Many nights we've prayedWith no proof anyone c...,https://genius.com/whitney-houston-when-you-be...,rb


In [5]:
df_9= top_90_99.copy()
df_9

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,1990,Hold On,Wilson Phillips,1,I know this pain Why do you lock yourself up i...,https://genius.com/wilson-phillips-hold-on-lyrics,pop
1,1990,It Must Have Been Love,Roxette,1,Must have been loveBut it's over nowLay a whis...,https://genius.com/roxette-it-must-have-been-l...,rock
2,1990,Nothing Compares 2 U,Sinéad O'Connor,1,It's been seven hours and fifteen daysSince yo...,https://genius.com/sinead-oconnor-nothing-comp...,pop
3,1990,Poison,Bell Biv DeVoe,1,"Yeah, Spyderman and Freeze in full effectUh-hu...",https://genius.com/bell-biv-devoe-poison-lyrics,rb
4,1990,Vogue,Madonna,1,Strike a poseStrike a poseVogue Vogue Look aro...,https://genius.com/madonna-vogue-lyrics,pop
...,...,...,...,...,...,...,...
992,1999,Better Days (And the Bottom Drops Out),Citizen King,1,In my shoes my toes are bustedMy kitchen says ...,https://genius.com/citizen-king-better-days-an...,other
993,1999,Music of My Heart,NSYNC and Gloria Estefan,1,You'll never know what you've done for meWhat ...,https://genius.com/nsync-music-of-my-heart-lyrics,pop
994,1999,Write This Down,George Strait,1,I never saw the end in sightFools are kind of ...,https://genius.com/george-strait-write-this-do...,country
995,1999,When You Believe,Whitney Houston and Mariah Carey,1,Many nights we've prayedWith no proof anyone c...,https://genius.com/whitney-houston-when-you-be...,rb


### Titles and Artists Cleaning

This function standardizes the input text by converting it to lowercase, removing punctuation and connector words, and collapsing repeated whitespace.
It is applied to titles and artist names in both the Kaggle dataset and the Billboard Top Songs dataset, because they come from two different sources (e.g., artist collaborations may be reported with "and", "featuring", "feat", etc.). Standardizing both sides makes it possible to match songs across the two datasets.

In [6]:

def clean_text(text):
    #Convert text to lowercase and remove any character that is not an ASCII letter, a digit, or whitespace
    #(accented letters are dropped too, e.g. 'Sinéad' becomes 'sinad')
    c_text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    
    #Remove the whole words 'and', 'featuring', and 'feat'
    #(the dot in 'feat.' has already been removed by the previous step)
    c_text2 = re.sub(r'\b(and|featuring|feat)\b', '', c_text)
    
    #Replace multiple spaces with a single space and strip leading and trailing spaces
    c_text3 = re.sub(r'\s+', ' ', c_text2).strip()
    
    return c_text3
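
As a quick illustration of what the standardization does, the sketch below applies a copy of `clean_text` (with an equivalent, slightly simplified pattern) to a few spellings. The "feat." spelling is a made-up variant used only for this example.

```python
import re

def clean_text(text):
    # Lowercase and keep only ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    # Remove the connector words as whole words
    text = re.sub(r'\b(and|featuring|feat)\b', '', text)
    # Collapse whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

# Two spellings of the same collaboration collapse to the same key
a = clean_text('Puff Daddy featuring R. Kelly')
b = clean_text('Puff Daddy feat. R. Kelly')  # hypothetical alternative spelling
print(a, '|', b)

# Punctuation and accented letters are dropped, so keys are not always readable
print(clean_text("Sinéad O'Connor"))
print(clean_text('JAY-Z'))

# 'and' is also removed inside titles, which is harmless as long as
# both datasets are cleaned with the same function
print(clean_text('Down and Out'))
```

Because both datasets go through the same function, lossy steps such as dropping accented letters do not prevent a match. They can, however, merge names that differ only in the removed characters.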
    

In [7]:
titles= top_90_99['Song Title'].tolist()
titles= [clean_text(title) for title in titles]
print(titles)

authors= top_90_99['Artist'].tolist()
authors= [clean_text(author) for author in authors]
print(authors)



artist_songs_dict = {}
for author, title in zip(authors, titles):
    if author in artist_songs_dict:
        artist_songs_dict[author].append(title)
    else:
        artist_songs_dict[author] = [title]


print(artist_songs_dict)

['wilson phillips', 'roxette', 'sinad oconnor', 'bell biv devoe', 'madonna', 'mariah carey', 'phil collins', 'en vogue', 'billy idol', 'jon bon jovi', 'bell biv devoe', 'michael bolton', 'technotronic', 'paula abdul', 'janet jackson', 'heart', 'maxi priest', 'alannah myles', 'wilson phillips', 'linda ronstadt aaron neville', 'lisa stansfield', 'calloway', 'johnny gill', 'glenn medeiros bobby brown', 'sweet sensation', 'snap', 'nelson', 'taylor dayne', 'jane child', 'seduction', 'linear', 'poison', 'new kids on the block', 'roxette', 'billy joel', 'james ingram', 'rod stewart', 'janet jackson', 'tommy page', 'the b52s', 'jody watley', 'soul ii soul caron wheeler', 'luther vandross', 'janet jackson', 'vanilla ice', 'milli vanilli', 'mc hammer', 'taylor dayne', 'janet jackson', 'michelle', 'george michael', 'michael bolton', 'phil collins', 'after 7', 'mc hammer', 'phil collins', 'lou gramm', 'phil collins', 'janet jackson', 'after 7', 'aerosmith', 'digital underground', 'taylor dayne', '

After standardizing the titles and artist names, we create a dictionary where the key is the artist's name and the value is a list of the associated song titles.
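
The same artist-to-songs mapping can be sketched with `collections.defaultdict`, which removes the explicit `if`/`else` branch. The short `authors` and `titles` lists below are illustrative stand-ins, not the full cleaned lists.

```python
from collections import defaultdict

# Illustrative stand-ins for the cleaned artist and title lists
authors = ['janet jackson', 'phil collins', 'janet jackson']
titles = ['escapade', 'another day in paradise', 'if']

# Missing keys start as an empty list, so append always works
artist_songs = defaultdict(list)
for author, title in zip(authors, titles):
    artist_songs[author].append(title)

print(dict(artist_songs))
```

A membership test such as `'madonna' in artist_songs` does not create a new key, so the lookup pattern `artist in d and title in d[artist]` behaves the same as with a plain dictionary.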


### Kaggle Dataset Songs

In [8]:
data_90_99_1= pd.read_parquet('1990_1999_part1.parquet')
data_90_99_2= pd.read_parquet('1990_1999_part2.parquet')

data_90_99= pd.concat([data_90_99_1, data_90_99_2])
data_90_99.reset_index(drop=True)

Unnamed: 0,title,tag,artist,year,views,lyrics
0,Can I Live,rap,JAY-Z,1996,468624,""" Yeah, hah, yeah, Roc-A-Fella We invite you..."
1,Rockin and Rollin,rap,Cam'ron,1998,6399,""" Ay yo you wonder who I are I guzzle up at th..."
2,DEvils,rap,JAY-Z,1996,504959,""" """"Dear God – I wonder"
3,Its Hot Some Like It Hot,rap,JAY-Z,1999,103549,""" Yo, show closer, J-to-the-A-Y-Hovah Place ..."
4,Its Like That,rap,JAY-Z,1998,37692,""" Yeah, uh-huh, watch this y'all Uhh, watch th..."
...,...,...,...,...,...,...
222922,Since 92,pop,Undeclinable Ambuscade,1998,2,We're here since '92 If that's not punk enough...
222923,Horror Story Latnem Version,rap,House Of Krazees,1994,4,Next motherfuckers's gonna get my metal I c...
222924,From Season to Season,pop,Louis Philippe,1990,4,When there's nothing else I can do Except thin...
222925,Long Way From Love,rock,Marcie Free,1993,3,"""Is it tenderness or torture Joy that only end..."


In [9]:
#Initially, we set all the labels to 0
data_90_99['Label']=0


In [10]:
#This function removes double quotes from the lyrics

def clean_song(song): 
    cleaned= re.sub(r'"', '', song)
    return cleaned



In [11]:
data_90_99['lyrics']= data_90_99['lyrics'].apply(clean_song)
data_90_99= data_90_99.reset_index(drop=True)

In [12]:
data_90_99

Unnamed: 0,title,tag,artist,year,views,lyrics,Label
0,Can I Live,rap,JAY-Z,1996,468624,"Yeah, hah, yeah, Roc-A-Fella We invite you ...",0
1,Rockin and Rollin,rap,Cam'ron,1998,6399,Ay yo you wonder who I are I guzzle up at the...,0
2,DEvils,rap,JAY-Z,1996,504959,Dear God – I wonder,0
3,Its Hot Some Like It Hot,rap,JAY-Z,1999,103549,"Yo, show closer, J-to-the-A-Y-Hovah Place s...",0
4,Its Like That,rap,JAY-Z,1998,37692,"Yeah, uh-huh, watch this y'all Uhh, watch thi...",0
...,...,...,...,...,...,...,...
222922,Since 92,pop,Undeclinable Ambuscade,1998,2,We're here since '92 If that's not punk enough...,0
222923,Horror Story Latnem Version,rap,House Of Krazees,1994,4,Next motherfuckers's gonna get my metal I c...,0
222924,From Season to Season,pop,Louis Philippe,1990,4,When there's nothing else I can do Except thin...,0
222925,Long Way From Love,rock,Marcie Free,1993,3,Is it tenderness or torture Joy that only ends...,0


We set the 'Label' column in the Kaggle dataset to 1 for every song that also appears in the Billboard Top Songs dataset. Each matched song is then removed from the Billboard Top Songs dataset, so that we can check whether the top songs in the two datasets add up to 997 (not exactly 1000, since we removed some songs without lyrics).

In [13]:

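for index, row in data_90_99.iterrows():
    #Standardize the Kaggle artist and title in the same way as the Billboard ones
    artist_cleaned = clean_text(row['artist'])
    title_cleaned = clean_text(row['title'])
    
    #A song is a top song if its cleaned title is listed under its cleaned artist
    if artist_cleaned in artist_songs_dict and title_cleaned in artist_songs_dict[artist_cleaned]:
        data_90_99.loc[index, 'Label'] = 1

        #Drop the matched song from the Billboard dataset, so that only unmatched songs remain
        condition = (top_90_99['Artist'].apply(clean_text) == artist_cleaned) & (top_90_99['Song Title'].apply(clean_text) == title_cleaned)
        top_90_99 = top_90_99.drop(top_90_99[condition].index).reset_index(drop=True)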
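
The row-by-row loop re-cleans the whole Billboard table on every match, which can be slow on a dataset of this size. A possible alternative, sketched here on tiny hypothetical tables rather than the real data, is to clean each table once and compare (artist, title) pairs through a set. This is an illustration of the idea, not the procedure used in this notebook.

```python
import re
import pandas as pd

def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    text = re.sub(r'\b(and|featuring|feat)\b', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical stand-ins for the Billboard and Kaggle tables
top = pd.DataFrame({
    'Artist': ['JAY-Z', 'Madonna'],
    'Song Title': ['Hard Knock Life (Ghetto Anthem)', 'Vogue'],
})
kaggle = pd.DataFrame({
    'artist': ['JAY-Z', 'JAY-Z', 'Madonna'],
    'title': ['Hard Knock Life Ghetto Anthem', 'Can I Live', 'Frozen'],
})

# Clean every row once and build (artist, title) keys
top_keys = list(zip(top['Artist'].map(clean_text), top['Song Title'].map(clean_text)))
kaggle_keys = list(zip(kaggle['artist'].map(clean_text), kaggle['title'].map(clean_text)))

# Label Kaggle rows whose key appears among the Billboard keys
top_key_set = set(top_keys)
kaggle['Label'] = [int(key in top_key_set) for key in kaggle_keys]

# Keep only the Billboard rows that found no match in the Kaggle table
kaggle_key_set = set(kaggle_keys)
remaining = top[[key not in kaggle_key_set for key in top_keys]].reset_index(drop=True)

print(kaggle)
print(remaining)
```

Set membership is a constant-time lookup on average, so the cost grows with the number of rows rather than with rows times matches.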

In [14]:
#Check how many top songs are already in the dataset
data_90_99[data_90_99['Label']==1]

Unnamed: 0,title,tag,artist,year,views,lyrics,Label
21,1st of tha Month,rap,Bone Thugs-N-Harmony,1995,266526,"Wake up, wake up, wake up, wake up, wake up I...",1
36,Elevators Me You,rap,OutKast,1996,325019,"One for the money, yes, sir, two for the show...",1
41,I Wish,rap,Skee-Lo,1995,459740,"Hey, this is radio station WSKEE We're taking...",1
50,Hard Knock Life Ghetto Anthem,rap,JAY-Z,1998,489194,"Take the bass line out Uh-huh, Jigga Uh-huh...",1
69,It Was a Good Day,rap,Ice Cube,1992,1463692,Break 'em Yeah Yeah Yeah Uh Just wakin' u...,1
...,...,...,...,...,...,...,...
198791,If,rb,Janet Jackson,1993,88343,"Sittin' over here, starin' in your face With ...",1
198794,Because of Love,rb,Janet Jackson,1993,3761,You got me singing Shoo-doo-doo Shoo-shoo-doo...,1
198894,You Mean The World to Me,rb,Toni Braxton,1994,18124,If you could give me one good reason Why I sh...,1
200696,Dont Cry for Me Argentina,pop,Madonna,1997,23256,"It won't be easy, you'll think it's strange W...",1


In [15]:
top_90_99

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,1990,Another Day in Paradise,Phil Collins,1,"She calls out to the man on the street""Sir, ca...",https://genius.com/phil-collins-another-day-in...,rock
1,1990,How Am I Supposed to Live Without You,Michael Bolton,1,"I could hardly believe it, when I heard the ne...",https://genius.com/michael-bolton-how-am-i-sup...,easy listening
2,1990,Pump Up the Jam,Technotronic,1,Pump up the jamPump it upWhile your feet are s...,https://genius.com/technotronic-pump-up-the-ja...,pop
3,1990,Opposites Attract,Paula Abdul,1,"Baby, it seems we never ever agreeYou like the...",https://genius.com/paula-abdul-opposites-attra...,pop
4,1990,Escapade,Janet Jackson,1,As I was walkin' bySaw you standin' there with...,https://genius.com/janet-jackson-escapade-lyrics,pop
...,...,...,...,...,...,...,...
248,1999,Iris,Goo Goo Dolls,1,And I'd give up forever to touch you'Cause I k...,https://genius.com/goo-goo-dolls-iris-lyrics,rock
249,1999,Satisfy You,Puff Daddy featuring R. Kelly,1,All I want is somebody who's gonna love me for...,https://genius.com/puff-daddy-satisfy-you-lyrics,hip hop
250,1999,Music of My Heart,NSYNC and Gloria Estefan,1,You'll never know what you've done for meWhat ...,https://genius.com/nsync-music-of-my-heart-lyrics,pop
251,1999,When You Believe,Whitney Houston and Mariah Carey,1,Many nights we've prayedWith no proof anyone c...,https://genius.com/whitney-houston-when-you-be...,rb


In [20]:
len(data_90_99[data_90_99['Label']==1])+len(top_90_99)

951

The combined count of top songs is not exactly 997, possibly because some duplicate entries have been dropped.

In [16]:
df1 = df_9[['Artist', 'Song Title']].map(clean_text)
df2= data_90_99[['artist', 'title']].map(clean_text).rename(columns={'artist': 'Artist', 'title': 'Song Title'})
merged_df = pd.merge(df1, df2, on=['Artist', 'Song Title'], how='inner')
merged_df

Unnamed: 0,Artist,Song Title
0,wilson phillips,hold on
1,roxette,it must have been love
2,sinad oconnor,nothing compares 2 u
3,bell biv devoe,poison
4,madonna,vogue
...,...,...
739,fastball,out of my head
740,jayz,hard knock life ghetto anthem
741,joey mcintyre,stay the same
742,citizen king,better days the bottom drops out


In [17]:
merged_df[merged_df.duplicated()].value_counts().sum()


46

Indeed, there are 46 duplicates; after removing them, a total of 698 top songs are already present in the Kaggle dataset.
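
To see where duplicate rows in an inner merge can come from, consider the small sketch below. The two frames are hypothetical: when a key occurs more than once on one side, `pd.merge` emits one row per pairing, and `duplicated()` flags every repeat after the first occurrence.

```python
import pandas as pd

# Hypothetical cleaned keys: the same song is listed twice on the left side
left = pd.DataFrame({'Artist': ['madonna', 'madonna'],
                     'Song Title': ['vogue', 'vogue']})
right = pd.DataFrame({'Artist': ['madonna', 'roxette'],
                      'Song Title': ['vogue', 'joyride']})

merged = pd.merge(left, right, on=['Artist', 'Song Title'], how='inner')

n_rows = len(merged)                      # one row per (left, right) pairing
n_dup = int(merged.duplicated().sum())    # repeats after the first occurrence
n_unique = len(merged.drop_duplicates())  # distinct matched songs

print(n_rows, n_dup, n_unique)
```

By construction, the number of distinct matched songs equals the number of merged rows minus the number of flagged duplicates.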

### Concatenation of the two datasets

Finally, we concatenate the previously obtained dataset with the remaining top songs that were not already present in the Kaggle dataset.

In [167]:

titles= top_90_99['Song Title'].tolist()
tags= top_90_99['Genre'].tolist()
artists= top_90_99['Artist'].tolist()
years= top_90_99['Year'].tolist()
lyrics= top_90_99['Lyrics'].tolist()
label= top_90_99['Label'].tolist()

top_90_99= pd.DataFrame({'title': titles, 'tag': tags, 'artist': artists, 'year': years, 'lyrics': lyrics, 'Label': label})


data_90_99= data_90_99.drop(columns=['views'])


total90_99= pd.concat([data_90_99, top_90_99]).reset_index(drop=True)
total90_99.to_csv('total90_99.csv', index= False)
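
The column alignment above could also be written with `rename` and a column selection instead of rebuilding the frame from lists. The sketch below uses one-row hypothetical tables with placeholder lyrics; the column names are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical one-row stand-ins for the two tables
top = pd.DataFrame({
    'Year': [1999], 'Song Title': ['Iris'], 'Artist': ['Goo Goo Dolls'],
    'Label': [1], 'Lyrics': ['placeholder lyrics'], 'URL': [None], 'Genre': ['rock'],
})
kaggle = pd.DataFrame({
    'title': ['Can I Live'], 'tag': ['rap'], 'artist': ['JAY-Z'], 'year': [1996],
    'views': [468624], 'lyrics': ['placeholder lyrics'], 'Label': [0],
})

# Map Billboard column names onto the Kaggle schema
column_map = {'Song Title': 'title', 'Genre': 'tag', 'Artist': 'artist',
              'Year': 'year', 'Lyrics': 'lyrics'}
target_columns = ['title', 'tag', 'artist', 'year', 'lyrics', 'Label']

# Rename, keep only the shared columns, then stack the two tables
top_renamed = top.rename(columns=column_map)[target_columns]
combined = pd.concat([kaggle.drop(columns=['views']), top_renamed],
                     ignore_index=True)

print(combined)
```

Selecting `target_columns` explicitly drops the `URL` column and makes the final schema visible in one place.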


In [168]:
total90_99

Unnamed: 0,title,tag,artist,year,lyrics,Label
0,Can I Live,rap,JAY-Z,1996,"Yeah, hah, yeah, Roc-A-Fella We invite you ...",0
1,Rockin and Rollin,rap,Cam'ron,1998,Ay yo you wonder who I are I guzzle up at the...,0
2,DEvils,rap,JAY-Z,1996,Dear God – I wonder,0
3,Its Hot Some Like It Hot,rap,JAY-Z,1999,"Yo, show closer, J-to-the-A-Y-Hovah Place s...",0
4,Its Like That,rap,JAY-Z,1998,"Yeah, uh-huh, watch this y'all Uhh, watch thi...",0
...,...,...,...,...,...,...
223175,Iris,rock,Goo Goo Dolls,1999,And I'd give up forever to touch you'Cause I k...,1
223176,Satisfy You,hip hop,Puff Daddy featuring R. Kelly,1999,All I want is somebody who's gonna love me for...,1
223177,Music of My Heart,pop,NSYNC and Gloria Estefan,1999,You'll never know what you've done for meWhat ...,1
223178,When You Believe,rb,Whitney Houston and Mariah Carey,1999,Many nights we've prayedWith no proof anyone c...,1


#### We repeat the exact same procedure for 2000-2009 songs

## Labels for 2000-2009 songs

### Top 2000-2009 songs

In [24]:
top_00_09= top_100[(top_100['Year']>=2000 ) & (top_100['Year']<=2009)].reset_index(drop=True)
top_00_09

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,2000,Breathe,Faith Hill,1,I can feel the magic floating in the airBeing ...,https://genius.com/faith-hill-breathe-lyrics,pop
1,2000,Smooth,Santana featuring Rob Thomas,1,"Man, it's a hot oneLike seven inches from the ...",https://genius.com/santana-smooth-lyrics,rock
2,2000,Maria Maria,Santana featuring The Product G&B,1,"Ladies and gents, turn up your sound systemsTo...",https://genius.com/santana-maria-maria-lyrics,
3,2000,I Wanna Know,Joe,1,"YeahOh-oh, yeahAlright, oh, oh, ohIt's amazing...",https://genius.com/joe-i-wanna-know-lyrics,rb
4,2000,Everything You Want,Vertical Horizon,1,Somewhere there's speakingIt's already coming ...,https://genius.com/vertical-horizon-everything...,rock
...,...,...,...,...,...,...,...
994,2009,Goodbye,Kristinia DeBarge,1,Am I supposed to put my life on holdBecause yo...,https://genius.com/kristinia-debarge-goodbye-l...,pop
995,2009,Say Hey (I Love You),Michael Franti & Spearhead featuring Cherine A...,1,"Comme da selecta Ayy, uh-huh, woo (That's righ...",,hip hop
996,2009,Pop Champagne,Jim Jones featuring Ron Browz and Juelz Santana,1,"Ether boy! Hey!How we ball in the club, I know...",https://genius.com/jim-jones-pop-champagne-lyrics,
997,2009,Pretty Wings,Maxwell,1,Time will bring the real end of our trialOne d...,https://genius.com/maxwell-pretty-wings-lyrics,soul


In [25]:
df_10= top_00_09.copy()
df_10

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,2000,Breathe,Faith Hill,1,I can feel the magic floating in the airBeing ...,https://genius.com/faith-hill-breathe-lyrics,pop
1,2000,Smooth,Santana featuring Rob Thomas,1,"Man, it's a hot oneLike seven inches from the ...",https://genius.com/santana-smooth-lyrics,rock
2,2000,Maria Maria,Santana featuring The Product G&B,1,"Ladies and gents, turn up your sound systemsTo...",https://genius.com/santana-maria-maria-lyrics,
3,2000,I Wanna Know,Joe,1,"YeahOh-oh, yeahAlright, oh, oh, ohIt's amazing...",https://genius.com/joe-i-wanna-know-lyrics,rb
4,2000,Everything You Want,Vertical Horizon,1,Somewhere there's speakingIt's already coming ...,https://genius.com/vertical-horizon-everything...,rock
...,...,...,...,...,...,...,...
994,2009,Goodbye,Kristinia DeBarge,1,Am I supposed to put my life on holdBecause yo...,https://genius.com/kristinia-debarge-goodbye-l...,pop
995,2009,Say Hey (I Love You),Michael Franti & Spearhead featuring Cherine A...,1,"Comme da selecta Ayy, uh-huh, woo (That's righ...",,hip hop
996,2009,Pop Champagne,Jim Jones featuring Ron Browz and Juelz Santana,1,"Ether boy! Hey!How we ball in the club, I know...",https://genius.com/jim-jones-pop-champagne-lyrics,
997,2009,Pretty Wings,Maxwell,1,Time will bring the real end of our trialOne d...,https://genius.com/maxwell-pretty-wings-lyrics,soul


### Titles and Artists Cleaning

This function standardizes the input text by converting it to lowercase, removing punctuation and connector words, and collapsing repeated whitespace.
It is applied to titles and artist names in both the Kaggle dataset and the Billboard Top Songs dataset, because they come from two different sources (e.g., artist collaborations may be reported with "and", "featuring", "feat", etc.). Standardizing both sides makes it possible to match songs across the two datasets.

In [26]:

def clean_text(text):
    #Convert text to lowercase and remove any character that is not an ASCII letter, a digit, or whitespace
    #(accented letters are dropped too)
    c_text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    
    #Remove the whole words 'and', 'featuring', and 'feat'
    #(the dot in 'feat.' has already been removed by the previous step)
    c_text2 = re.sub(r'\b(and|featuring|feat)\b', '', c_text)
    
    #Replace multiple spaces with a single space and strip leading and trailing spaces
    c_text3 = re.sub(r'\s+', ' ', c_text2).strip()
    
    return c_text3

In [27]:

titles= top_00_09['Song Title'].tolist()
titles= [clean_text(title) for title in titles]
print(titles)

authors= top_00_09['Artist'].tolist()
authors= [clean_text(author) for author in authors]
print(authors)



artist_songs_dict = {}
for author, title in zip(authors, titles):
    if author in artist_songs_dict:
        artist_songs_dict[author].append(title)
    else:
        artist_songs_dict[author] = [title]


print(artist_songs_dict)

['breathe', 'smooth', 'maria maria', 'i wanna know', 'everything you want', 'say my name', 'i knew i loved you', 'amazed', 'bent', 'he wasnt man enough', 'higher', 'try again', 'jumpin jumpin', 'thong song', 'kryptonite', 'there you go', 'music', 'doesnt really matter', 'what a girl wants', 'back at one', 'bye bye bye', 'you sang to me', 'i need to know', 'get it on tonite', 'incomplete', 'i try', 'its gonna be me', 'thats the way it is', 'country grammar hot shit', 'bring it all to me', 'show me the meaning of being lonely', 'hot boyz', 'back here', 'it feels so good', 'absolutely story of a girl', 'with arms wide open', 'be with you', 'come on over baby all i want is you', 'no more', 'all the small things', 'the way you love me', 'i turn to you', 'never let you go', 'i need you', 'thank god i found you', 'lets get married', 'my love is your love', 'then the morning comes', 'blue da ba dee', 'desert rose', 'the real slim shady', 'most girls', 'wifey', 'wonderful', 'oops i did it again

After standardizing the titles and artist names, we create a dictionary where the key is the artist's name and the value is a list of the associated song titles.

### Kaggle Dataset Songs

In [28]:
data_00_09_1= pd.read_parquet('2000_2009_part1.parquet')
data_00_09_2= pd.read_parquet('2000_2009_part2.parquet')

data_00_09= pd.concat([data_00_09_1, data_00_09_2])
data_00_09


Unnamed: 0,title,tag,artist,year,views,lyrics
0,Killa Cam,rap,Cam'ron,2004,173166,""" Killa Cam, Killa Cam, Cam Killa Cam, Killa C..."
1,Forgive Me Father,rap,Fabolous,2003,4743,"""Maybe cause I'm eatin And these bastards fien..."
2,Down and Out,rap,Cam'ron,2004,144404,""" Ugh, Killa! Baby! Kanye, this that 1970s H..."
3,Fly In,rap,Lil Wayne,2005,78271,""" So they ask me """"Young boy What you gon' do ..."
4,Lollipop Remix,rap,Lil Wayne,2008,580832,""" Haha Uh-huh No homo (Young Mula, baby!) I sa..."
...,...,...,...,...,...,...
218784,Opposites Detract,rock,Dyscarnate,2008,1,Crimson faced and riddled with scars. Caricatu...
218785,Kid Youll Move Mountains,rock,Junior Astronomers,2009,5,"""(Verse 1) They say, """"Focus!"""" All you are is..."
218786,Attacked by Whales,rock,Junior Astronomers,2009,3,She asked my opinion I told her the truth You'...
218787,Societys Child Live 2001,pop,Janis Ian,2001,4,""" You come to my door, baby Face is clean and ..."


In [29]:
#Initially, we set all the labels to 0
data_00_09['Label']=0


In [30]:
#This function removes double quotes from the lyrics
def clean_song(song): 
    cleaned= re.sub(r'"', '', song)
    return cleaned

In [31]:
data_00_09['lyrics']= data_00_09['lyrics'].apply(clean_song)
data_00_09.reset_index(drop=True)

Unnamed: 0,title,tag,artist,year,views,lyrics,Label
0,Killa Cam,rap,Cam'ron,2004,173166,"Killa Cam, Killa Cam, Cam Killa Cam, Killa Ca...",0
1,Forgive Me Father,rap,Fabolous,2003,4743,Maybe cause I'm eatin And these bastards fiend...,0
2,Down and Out,rap,Cam'ron,2004,144404,"Ugh, Killa! Baby! Kanye, this that 1970s He...",0
3,Fly In,rap,Lil Wayne,2005,78271,So they ask me Young boy What you gon' do the...,0
4,Lollipop Remix,rap,Lil Wayne,2008,580832,"Haha Uh-huh No homo (Young Mula, baby!) I say...",0
...,...,...,...,...,...,...,...
452198,Opposites Detract,rock,Dyscarnate,2008,1,Crimson faced and riddled with scars. Caricatu...,0
452199,Kid Youll Move Mountains,rock,Junior Astronomers,2009,5,"(Verse 1) They say, Focus! All you are is not ...",0
452200,Attacked by Whales,rock,Junior Astronomers,2009,3,She asked my opinion I told her the truth You'...,0
452201,Societys Child Live 2001,pop,Janis Ian,2001,4,"You come to my door, baby Face is clean and s...",0


In [32]:
data_00_09= data_00_09[data_00_09['title'].notnull()].reset_index(drop=True)
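
The notebook does not say why rows with a missing title are filtered out here. A plausible reason (an assumption) is that `clean_text` calls `.lower()` on its input, which fails on missing values. The sketch below shows that failure on a hypothetical two-row table and an equivalent filter written with `dropna`.

```python
import re
import pandas as pd

def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    text = re.sub(r'\b(and|featuring|feat)\b', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical table with one missing title
songs = pd.DataFrame({'title': ['Lose Yourself', None],
                      'artist': ['Eminem', 'Unknown Artist']})

# A missing value has no .lower() method, so cleaning it raises
try:
    clean_text(songs.loc[1, 'title'])
    failed = False
except AttributeError:
    failed = True

# Equivalent to filtering with notnull(): drop rows whose title is missing
songs = songs.dropna(subset=['title']).reset_index(drop=True)
cleaned_titles = songs['title'].map(clean_text).tolist()

print(failed, cleaned_titles)
```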

We set the 'Label' column in the Kaggle dataset to 1 for every song that also appears in the Billboard Top Songs dataset. Each matched song is then removed from the Billboard Top Songs dataset, so that we can check whether the top songs in the two datasets add up to 999 (not exactly 1000, since we removed some songs without lyrics).

In [33]:

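for index, row in data_00_09.iterrows():

    #Standardize the Kaggle artist and title in the same way as the Billboard ones
    artist_cleaned = clean_text(row['artist'])
    title_cleaned = clean_text(row['title'])
    
    #A song is a top song if its cleaned title is listed under its cleaned artist
    if artist_cleaned in artist_songs_dict and title_cleaned in artist_songs_dict[artist_cleaned]:
        data_00_09.loc[index, 'Label'] = 1

        #Drop the matched song from the Billboard dataset, so that only unmatched songs remain
        condition = (top_00_09['Artist'].apply(clean_text) == artist_cleaned) & (top_00_09['Song Title'].apply(clean_text) == title_cleaned)
        top_00_09 = top_00_09.drop(top_00_09[condition].index).reset_index(drop=True)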

In [34]:
#Check how many top songs are already in the dataset
data_00_09[data_00_09['Label']==1]

Unnamed: 0,title,tag,artist,year,views,lyrics,Label
48,A Milli,rap,Lil Wayne,2008,1237174,"Bangladesh Young Money! You dig? Mack, I'm go...",1
59,We Fly High,rap,Jim Jones,2006,63337,I wear a mean dark pair of shades And you can...,1
100,Forever,rap,"Drake, Kanye West, Lil Wayne & Eminem",2009,2061952,It may not mean nothin' to y'all But understa...,1
139,Lose Yourself,rap,Eminem,2002,6263971,"Look, if you had one shot or one opportunity ...",1
163,In da Club,rap,50 Cent,2003,2005139,"Go, go, go, go, go, go Go shawty, it's your...",1
...,...,...,...,...,...,...,...
396397,I Hope You Dance,country,Lee Ann Womack,2000,80415,I hope you never lose your sense of wonder Yo...,1
405253,One Step at a Time,pop,Jordin Sparks,2007,32536,"Hurry up and wait So close, but so far away E...",1
405330,It’s A Great Day To Be Alive,country,Travis Tritt,2000,28018,I got rice cookin' in the microwave Got a thr...,1
405564,Wait for You,pop,Elliott Yamin,2007,44614,I never felt nothing in the world like this b...,1


In [35]:
top_00_09

Unnamed: 0,Year,Song Title,Artist,Label,Lyrics,URL,Genre
0,2000,Breathe,Faith Hill,1,I can feel the magic floating in the airBeing ...,https://genius.com/faith-hill-breathe-lyrics,pop
1,2000,Smooth,Santana featuring Rob Thomas,1,"Man, it's a hot oneLike seven inches from the ...",https://genius.com/santana-smooth-lyrics,rock
2,2000,Maria Maria,Santana featuring The Product G&B,1,"Ladies and gents, turn up your sound systemsTo...",https://genius.com/santana-maria-maria-lyrics,
3,2000,Everything You Want,Vertical Horizon,1,Somewhere there's speakingIt's already coming ...,https://genius.com/vertical-horizon-everything...,rock
4,2000,Say My Name,Destiny's Child,1,"Say my name, say my nameIf no one is around yo...",https://genius.com/destinys-child-say-my-name-...,rb
...,...,...,...,...,...,...,...
384,2009,How Do You Sleep?,Jesse McCartney featuring Ludacris,1,"Oh-oh, oh-oh, oh-oh, oh-ohOh-oh, oh-oh, oh-oh,...",https://genius.com/jesse-mccartney-how-do-you-...,pop
385,2009,I Run to You,Lady Antebellum,1,I run from hateI run from prejudiceI run from ...,https://genius.com/lady-antebellum-i-run-to-yo...,country
386,2009,Green Light,John Legend featuring André 3000,1,Give me the green lightGive me just one nightI...,https://genius.com/john-legend-green-light-lyrics,soul
387,2009,Say Hey (I Love You),Michael Franti & Spearhead featuring Cherine A...,1,"Comme da selecta Ayy, uh-huh, woo (That's righ...",,hip hop


In [37]:
len(data_00_09[data_00_09['Label']==1])+len(top_00_09)

958

The combined count of top songs is not exactly 999, possibly because some duplicate entries have been dropped.

In [38]:
df1 = df_10[['Artist', 'Song Title']].map(clean_text)
df2= data_00_09[['artist', 'title']].map(clean_text).rename(columns={'artist': 'Artist', 'title': 'Song Title'})
merged_df = pd.merge(df1, df2, on=['Artist', 'Song Title'], how='inner')
merged_df

Unnamed: 0,Artist,Song Title
0,joe,i wanna know
1,matchbox twenty,bent
2,toni braxton,he wasnt man enough
3,aaliyah,try again
4,3 doors down,kryptonite
...,...,...
605,zac brown band,whatever it is
606,kelly clarkson,already gone
607,kristinia debarge,goodbye
608,maxwell,pretty wings


In [39]:
merged_df[merged_df.duplicated()].value_counts().sum()

41

Indeed, there are 41 duplicates; after removing them, a total of 569 top songs are already present in the Kaggle dataset.

### Combination of the two datasets 

Finally, we concatenate the previously obtained dataset with the remaining top songs that were not already present in the Kaggle dataset.

In [185]:
titles= top_00_09['Song Title'].tolist()
tags= top_00_09['Genre'].tolist()
artists= top_00_09['Artist'].tolist()
years= top_00_09['Year'].tolist()
lyrics= top_00_09['Lyrics'].tolist()
label= top_00_09['Label'].tolist()

top_00_09= pd.DataFrame({'title': titles, 'tag': tags, 'artist': artists, 'year': years, 'lyrics': lyrics, 'Label': label})


data_00_09= data_00_09.drop(columns=['views'])


total00_09= pd.concat([data_00_09, top_00_09]).reset_index(drop=True)
total00_09.to_csv('total00_09.csv', index= False)


In [186]:
total00_09

Unnamed: 0,title,tag,artist,year,lyrics,Label
0,Killa Cam,rap,Cam'ron,2004,"Killa Cam, Killa Cam, Cam Killa Cam, Killa Ca...",0
1,Forgive Me Father,rap,Fabolous,2003,Maybe cause I'm eatin And these bastards fiend...,0
2,Down and Out,rap,Cam'ron,2004,"Ugh, Killa! Baby! Kanye, this that 1970s He...",0
3,Fly In,rap,Lil Wayne,2005,So they ask me Young boy What you gon' do the...,0
4,Lollipop Remix,rap,Lil Wayne,2008,"Haha Uh-huh No homo (Young Mula, baby!) I say...",0
...,...,...,...,...,...,...
452582,How Do You Sleep?,pop,Jesse McCartney featuring Ludacris,2009,"Oh-oh, oh-oh, oh-oh, oh-ohOh-oh, oh-oh, oh-oh,...",1
452583,I Run to You,country,Lady Antebellum,2009,I run from hateI run from prejudiceI run from ...,1
452584,Green Light,soul,John Legend featuring André 3000,2009,Give me the green lightGive me just one nightI...,1
452585,Say Hey (I Love You),hip hop,Michael Franti & Spearhead featuring Cherine A...,2009,"Comme da selecta Ayy, uh-huh, woo (That's righ...",1
