# Using TF-IDF to classify song lyrics

In this notebook, we'll explore how TF-IDF can be used to classify songs from the Billboard Hot 100 by their lyrics, using k-means clustering on the resulting vectors.

In a nutshell, TF-IDF measures how distinctive a term is for a particular document: a term scores high when it occurs often in that document but rarely in the larger body of documents it is compared against.
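To make the idea concrete, here is a minimal sketch of the textbook TF-IDF weighting computed by hand on three made-up "lyrics". The helper name `tfidf_by_hand` and the toy documents are assumptions for illustration only. Note that sklearn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy documents, not taken from the dataset
toy_docs = ["love me baby", "love the party", "love the money"]
toy_scores = tfidf_by_hand(toy_docs)
print(toy_scores[0])
```

A word that appears in every document ("love") scores zero, while a word unique to one document ("baby") scores highest there.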

Using TF-IDF we can generate a numerical vector for each song's lyrics and then group songs whose vectors lie close together.

In [1]:
import numpy as np
import pandas as pd

Import the dataset (warts 'n all):

In [71]:
hot100 = pd.read_csv('hot_100_with_lyrics.csv')

We've had considerable trouble consistently generating Genius.com URLs to fetch lyrics, so we'll drop every song whose lyrics our current methods could not find. We'll also replace the newlines in the lyrics with spaces and clean up a few other warts: duplicate songs and leftover index columns.

In [72]:
## Drop songs whose lyrics could not be fetched, then drop duplicate songs
hot100 = hot100[hot100['lyrics'] != 'URL-ERROR-LYRICS-NOT-FOUND'].copy()
hot100.drop_duplicates(subset=['title','artists'], inplace=True)
## Remove leftover index columns from earlier CSV round-trips
hot100.drop(labels=['Unnamed: 0', 'Unnamed: 0.1'], axis=1, inplace=True)
## Replace newlines with spaces; keep a plain list of lyrics for the vectorizer
hot100['lyrics'] = hot100['lyrics'].str.replace('\n', ' ', regex=False)
lyrics = hot100['lyrics'].tolist()
hot100

Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd,lyrics
0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,Wake up in the morning feelin' like P. Diddy ...
1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,"""Hey, sorry I missed your call, just leave a ..."
2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,Heyy He-e-e-e-ey He-e-e-e-ey Your lipstick s...
3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,"Greetings, loved ones Let's take a journey I..."
4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,"Oh my gosh Baby let me I did it again, so I'm..."
...,...,...,...,...,...,...,...
1093,More Than My Hometown,Morgan Wallen,96,2020,morgan wallen,more than my hometown,"Girl, our mamas are best friends and so are w..."
1094,Lovin' On You,Luke Combs,97,2020,luke combs,lovin on you,Don't get me wrong I like a bobber on the wat...
1095,Said Sum,Moneybagg Yo,98,2020,moneybagg yo,said sum,"Turn me up, YC Huh? What? Ah, I thought a ..."
1096,Slide,H.E.R. Featuring YG,99,2020,her,slide,You always wearin' them glasses You don't wan...


As the table above shows, the lyrics are in the rightmost column. As transcribed they contained newlines (\n), which do not matter for our purposes, so the cell above replaced them with spaces. One option from here is to write our own function that cleans up the lyrics and returns them as a frequency list (dictionary) of words; an early version is kept under Old Code at the end of this notebook.

Or we could skip all that and just use the [sklearn object TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which tokenizes the lyrics, counts the terms, and applies the TF-IDF weighting automatically. Why not? [This](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py) looks like a reasonable example to follow.

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

## TfidfVectorizer assumes that we have a list of "separate documents".
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lyrics)

In [74]:
## Euclidean distance between two vectors; a secret tool we'll need later
def distance(v, w):
    return np.linalg.norm(np.asarray(v) - np.asarray(w))

In [75]:
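## Now let's try some very simple clustering
## (KMeans starts from random initial centers, so the cluster labels and sizes
## can change between runs unless random_state is fixed)
clusters = 9
kmeans = KMeans(n_clusters=clusters).fit(X)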
hot100['kmeans label'] = kmeans.labels_

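## This calculates the distance from each song to its own cluster center.
## Densify X once up front: calling X.A inside the loop rebuilt the dense matrix
## on every iteration, and the .A shortcut has been deprecated in newer SciPy
## releases in favor of toarray().
centers = kmeans.cluster_centers_
labels = kmeans.labels_
X_dense = X.toarray()
distances = []
for i in range(X_dense.shape[0]):
    distances.append(distance(X_dense[i], centers[labels[i]]))
hot100['distance to cluster center'] = distances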
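The per-song loop above can also be written as a single NumPy expression. The following is a sketch under the assumption that the dense matrix fits in memory; the function name `distances_to_assigned_centers` and the toy arrays are made up for illustration.

```python
import numpy as np

def distances_to_assigned_centers(X_dense, centers, labels):
    """Euclidean distance from each row of X_dense to the center it is assigned to."""
    X_dense = np.asarray(X_dense, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] picks, for every row, the center of that row's cluster
    return np.linalg.norm(X_dense - centers[labels], axis=1)

# Hypothetical toy data: three points, two cluster centers
toy_X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
toy_centers = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_labels = np.array([0, 0, 1])
toy_distances = distances_to_assigned_centers(toy_X, toy_centers, toy_labels)
print(toy_distances)
```

In this notebook the call would presumably be `distances_to_assigned_centers(X.toarray(), kmeans.cluster_centers_, kmeans.labels_)`.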

In [76]:
hot100

Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd,lyrics,kmeans label,distance to cluster center
0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,Wake up in the morning feelin' like P. Diddy ...,5,0.928653
1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,"""Hey, sorry I missed your call, just leave a ...",0,0.887804
2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,Heyy He-e-e-e-ey He-e-e-e-ey Your lipstick s...,0,0.948777
3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,"Greetings, loved ones Let's take a journey I...",5,0.729920
4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,"Oh my gosh Baby let me I did it again, so I'm...",5,0.540497
...,...,...,...,...,...,...,...,...,...
1093,More Than My Hometown,Morgan Wallen,96,2020,morgan wallen,more than my hometown,"Girl, our mamas are best friends and so are w...",0,0.913887
1094,Lovin' On You,Luke Combs,97,2020,luke combs,lovin on you,Don't get me wrong I like a bobber on the wat...,0,0.948515
1095,Said Sum,Moneybagg Yo,98,2020,moneybagg yo,said sum,"Turn me up, YC Huh? What? Ah, I thought a ...",2,0.872687
1096,Slide,H.E.R. Featuring YG,99,2020,her,slide,You always wearin' them glasses You don't wan...,6,0.885830


In [81]:
## generic frequency list maker
def bagger(l):
    frequencies = {}
    for i in l:
        if (i in frequencies):
            frequencies[i] += 1
        else:
            frequencies[i] = 1
    return frequencies

# How many songs correspond to each cluster
bagger(hot100['kmeans label'])

{5: 36, 0: 245, 1: 207, 2: 170, 7: 157, 4: 13, 6: 55, 3: 17, 8: 30}

In [82]:
## The 10 songs closest to the center of cluster 0
closest = hot100[hot100['kmeans label'] == 0].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
221,Drake Featuring Rihanna,Take Care,0,0.780589
636,gnash Featuring Olivia O'Brien,I Hate U I Love U,0,0.792438
672,Tory Lanez,Say It,0,0.809494
708,James Arthur,Say You Won't Let Go,0,0.811815
856,Cardi B,Be Careful,0,0.816651
329,Ariana Grande Featuring Mac Miller,The Way,0,0.821599
895,Dua Lipa,IDGAF,0,0.827402
651,Bryson Tiller,Exchange,0,0.829459
918,Chris Brown Featuring Drake,No Guidance,0,0.832062
6,Eminem Featuring Rihanna,Love The Way You Lie,0,0.832731


In [83]:
## The 10 songs closest to the center of cluster 1
closest = hot100[hot100['kmeans label'] == 1].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
419,Idina Menzel,Let It Go,1,0.923866
79,"Kevin Rudolf Featuring Birdman, Jay Sean, & Li...",I Made It (Cash Money Heroes),1,0.934343
159,Usher,More,1,0.935161
631,P!nk,Just Like Fire,1,0.936723
58,Ke$ha,Take It Off,1,0.937828
953,Blanco Brown,The Git Up,1,0.938263
101,LMFAO Featuring Lauren Bennett & GoonRock,Party Rock Anthem,1,0.939762
178,Zac Brown Band Featuring Jimmy Buffett,Knee Deep,1,0.940163
136,Diddy - Dirty Money Featuring Skylar Grey,Coming Home,1,0.940783
123,OneRepublic,Good Life,1,0.942706


In [84]:
## The 10 songs closest to the center of cluster 2
closest = hot100[hot100['kmeans label'] == 2].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
487,Lil Wayne Featuring Drake,Believe Me,2,0.837851
839,Travis Scott,Sicko Mode,2,0.841434
119,Chris Brown Featuring Lil Wayne & Busta Rhymes,Look At Me Now,2,0.844864
145,"DJ Khaled Featuring Drake, Rick Ross & Lil Wayne",I'm On One,2,0.846194
1002,DaBaby Featuring Roddy Ricch,Rockstar,2,0.846626
89,Lil Wayne Featuring Drake,Right Above It,2,0.851575
597,Drake,Back To Back,2,0.85457
790,Gucci Mane Featuring Migos,I Get The Bag,2,0.858692
549,"Nicki Minaj Featuring Drake, Lil Wayne & Chris...",Only,2,0.86094
1026,DaBaby,BOP,2,0.864153


In [86]:
## The 10 songs closest to the center of cluster 3
closest = hot100[hot100['kmeans label'] == 3].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
278,B.o.B,So Good,3,0.417703
109,Jennifer Lopez Featuring Pitbull,On The Floor,3,0.425842
146,Cobra Starship Featuring Sabi,You Make Me Feel...,3,0.433245
137,Pitbull Featuring T-Pain,Hey Baby (Drop It To The Floor),3,0.486608
552,X Ambassadors,Renegades,3,0.503176
1081,BENEE Featuring Gus Dapperton,Supalonely,3,0.532359
181,Dev,In The Dark,3,0.594459
160,Avril Lavigne,What The Hell,3,0.598365
967,SHAED,Trampoline,3,0.633049
480,Naughty Boy Featuring Sam Smith,La La La,3,0.674802


In [87]:
## The 10 songs closest to the center of cluster 4
closest = hot100[hot100['kmeans label'] == 4].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
451,Trey Songz,Na Na,4,0.298413
331,Selena Gomez,Come & Get It,4,0.322458
1035,BTS,Dynamite,4,0.431864
118,Rihanna Featuring Drake,What's My Name?,4,0.472959
788,Cheat Codes Featuring Demi Lovato,No Promises,4,0.507235
793,Camila Cabello Featuring Young Thug,Havana,4,0.550364
226,Justin Bieber,Boyfriend,4,0.575673
110,Rihanna,S&M,4,0.583892
208,One Direction,What Makes You Beautiful,4,0.604192
24,Iyaz,Replay,4,0.604623


In [88]:
## The 10 songs closest to the center of cluster 5
closest = hot100[hot100['kmeans label'] == 5].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
4,Usher Featuring will.i.am,OMG,5,0.540497
125,Britney Spears,Till The World Ends,5,0.552383
131,Ke$ha,Blow,5,0.588244
483,Shakira Featuring Rihanna,Can't Remember To Forget You,5,0.656982
936,Taylor Swift,You Need To Calm Down,5,0.658111
236,Owl City & Carly Rae Jepsen,Good Time,5,0.672468
161,Tinie Tempah Featuring Eric Turner,Written In The Stars,5,0.673842
324,Swedish House Mafia Featuring John Martin,Don't You Worry Child,5,0.677457
212,fun.,Some Nights,5,0.69412
256,Eric Church,Springsteen,5,0.704834


In [89]:
## The 10 songs closest to the center of cluster 6
closest = hot100[hot100['kmeans label'] == 6].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
315,Miley Cyrus,We Can't Stop,6,0.822243
539,Natalie La Rose Featuring Jeremih,Somebody,6,0.848499
128,Ke$ha,We R Who We R,6,0.855257
1073,Lil Baby,The Bigger Picture,6,0.855925
47,Kris Allen,Live Like We're Dying,6,0.860888
164,The Script,For The First Time,6,0.867077
474,Sam Hunt,Leave The Night On,6,0.870738
230,Snoop Dogg & Wiz Khalifa Featuring Bruno Mars,"Young, Wild & Free",6,0.872539
223,Jason Mraz,I Won't Give Up,6,0.873252
784,Harry Styles,Sign Of The Times,6,0.875181


In [90]:
## The 10 songs closest to the center of cluster 7
closest = hot100[hot100['kmeans label'] == 7].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
543,Sam Hunt,Take Your Time,7,0.854438
847,Kendrick Lamar Featuring Zacari,Love.,7,0.877545
443,Justin Timberlake,Not A Bad Thing,7,0.877695
367,Luke Bryan,Crash My Party,7,0.885434
354,Calvin Harris Featuring Ellie Goulding,I Need Your Love,7,0.888952
913,Ed Sheeran & Justin Bieber,I Don't Care,7,0.889498
37,Timbaland Featuring Justin Timberlake,Carry Out,7,0.890348
983,Kane Brown,Good As You,7,0.892593
698,Ed Sheeran,Shape Of You,7,0.896265
1082,Luke Combs,Even Though I'm Leaving,7,0.898151


In [92]:
## The 10 songs closest to the center of cluster 8
closest = hot100[hot100['kmeans label'] == 8].sort_values('distance to cluster center')
closest[['artists','title','kmeans label','distance to cluster center']].head(10)

Unnamed: 0,artists,title,kmeans label,distance to cluster center
615,Calvin Harris Featuring Rihanna,This Is What You Came For,8,0.554147
430,Tove Lo,Habits (Stay High),8,0.568815
414,Eminem Featuring Rihanna,The Monster,8,0.607375
515,Jason Derulo,Want To Want Me,8,0.622108
216,Maroon 5,One More Night,8,0.68028
114,Enrique Iglesias Featuring Ludacris & DJ Frank E,Tonight (I'm Lovin' You),8,0.681188
547,Rich Homie Quan,Flex (Ooh Ooh Ooh),8,0.692437
794,Maroon 5 Featuring SZA,What Lovers Do,8,0.699204
247,Phillip Phillips,Home,8,0.699988
231,Taylor Swift,We Are Never Ever Getting Back Together,8,0.720246
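The nine nearly identical cells above could be folded into one helper. This is a sketch only: the function name `closest_to_center` is hypothetical, the toy frame is made up, and the column names are the ones used in this notebook.

```python
import pandas as pd

def closest_to_center(df, label, n=10):
    """Return the n songs in cluster `label` that lie closest to its center."""
    cols = ['artists', 'title', 'kmeans label', 'distance to cluster center']
    in_cluster = df[df['kmeans label'] == label]
    return in_cluster.sort_values('distance to cluster center')[cols].head(n)

# Hypothetical toy frame with the same column names as hot100
toy = pd.DataFrame({
    'artists': ['A', 'B', 'C', 'D'],
    'title': ['t1', 't2', 't3', 't4'],
    'kmeans label': [0, 1, 0, 0],
    'distance to cluster center': [0.9, 0.5, 0.3, 0.6],
})
toy_top = closest_to_center(toy, 0, n=2)
print(toy_top)
```

With the real data, something like `for k in range(clusters): display(closest_to_center(hot100, k))` would reproduce the tables above in one cell.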


In [96]:
## This cell prints some information I want to know.
## Note: cluster labels are arbitrary, so consecutive labels are not "neighbors" in any sense.
centers = kmeans.cluster_centers_
print(f"We're clustering in a {centers.shape[1]}-dimensional space.")
def dist_between_centroids(m, n):
    return distance(centers[m], centers[n])
for i in range(clusters-1):
    print(f"The distance between the centers of clusters {i} and {i+1} is {dist_between_centroids(i, i+1)}.")
print("\n")
for i in range(clusters):
    in_cluster = hot100[hot100['kmeans label'] == i]
    print(f"The average rank of the songs in cluster {i} is {in_cluster['rank'].mean()}.")
    print(f"The average year of the songs in cluster {i} is {in_cluster['year'].mean()}.")
    print("\n")

We're clustering in a 11682-dimensional space.
The distance between the centers of clusters 0 and 1 is 0.2843815278803648.
The distance between the centers of clusters 1 and 2 is 0.23878176008233265.
The distance between the centers of clusters 2 and 3 is 0.7658396357076475.
The distance between the centers of clusters 3 and 4 is 1.0472148899635345.
The distance between the centers of clusters 4 and 5 is 0.9462254680827006.
The distance between the centers of clusters 5 and 6 is 0.582536224703529.
The distance between the centers of clusters 6 and 7 is 0.30925253413049275.
The distance between the centers of clusters 7 and 8 is 0.5570979062784898.


The average rank of the songs in cluster 0 is 48.7469387755102.
The average year of the songs in cluster 0 is 2014.926530612245.


The average rank of the songs in cluster 1 is 50.391304347826086.
The average year of the songs in cluster 1 is 2014.502415458937.


The average rank of the songs in cluster 2 is 54.11176470588235.
The average y
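The cell above only compares clusters with consecutive labels, and k-means assigns those labels arbitrarily. A full matrix of pairwise center distances may be more informative. Here is a sketch; the function name `pairwise_center_distances` and the toy centers are assumptions for illustration.

```python
import numpy as np

def pairwise_center_distances(centers):
    """Matrix D with D[m, n] = Euclidean distance between centers m and n."""
    centers = np.asarray(centers, dtype=float)
    # Broadcasting builds all (m, n) difference vectors at once
    diffs = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Hypothetical toy centers in two dimensions
toy_centers = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
toy_D = pairwise_center_distances(toy_centers)
print(toy_D)
```

In this notebook the input would be `kmeans.cluster_centers_`, and the result would be a 9 x 9 symmetric matrix with zeros on the diagonal.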

# Testing junk below

In [139]:
hot100[hot100['kmeans label'] == 0]['rank'].mean()

47.04651162790697

In [164]:
len(hot100['lyrics'])

1022

# Old Code

In [79]:
## Takes in a string of lyrics, cleans them a smidgeon, 
## and returns a frequency list of each word as a dictionary
def lyric_bagger(lyrics):
    word_list = lyrics.lower().split()
    frequencies = {}
    for word in word_list:
        if (word in frequencies):
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

## Convert all the lyrics into frequency dictionaries stored in fmtd_lyrics
lyrics = hot100['lyrics'].copy().tolist()
fmtd_lyrics = []
for l in lyrics:
    fmtd_lyrics.append(lyric_bagger(l))
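For comparison, the same frequency list can be built with `collections.Counter` from the standard library. This is a sketch, not a drop-in replacement: the name `lyric_counter` is hypothetical, and unlike `lyric_bagger` it also strips punctuation, so "Hey," and "hey" count as the same word.

```python
import re
from collections import Counter

def lyric_counter(lyrics):
    """Lowercase the lyrics, keep runs of letters, digits and apostrophes, count them."""
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    return Counter(words)

# Hypothetical example string
toy_counts = lyric_counter("Hey, hey! Sorry I missed your call")
print(toy_counts.most_common(3))
```

A `Counter` behaves like a dictionary, so code that reads `frequencies[word]` should continue to work with it.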