# Using TF-IDF to classify song lyrics

In this notebook, we'll explore how we can use TF-IDF to classify songs in the Billboard Hot 100 playlist.

In a nutshell, TF-IDF (term frequency-inverse document frequency) scores how distinctive a term is for a particular document: the score rises the more often the term appears in that document and falls the more documents in the larger collection contain it.
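
To make that concrete, here is a minimal sketch of the textbook weighting, tf × log(N / df), on three made-up "lyrics". The function name `tf_idf` and the toy documents are ours, for illustration only, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

## Hypothetical toy documents, not taken from the dataset
toy_docs = [
    "baby baby oh baby",
    "oh my truck my dirt road",
    "oh money money cash",
]
toy_weights = tf_idf(toy_docs)
```

Since "oh" appears in every toy document, its weight is zero everywhere, while "baby" (frequent in one document, absent from the others) gets the largest weight in the first one.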

Using TF-IDF we can generate a numeric vector for each song's lyrics, which we can then hand to a clustering algorithm.

In [9]:
import numpy as np
import pandas as pd

Import the dataset (warts 'n all):

In [104]:
hot100 = pd.read_csv('hot_100_with_lyrics.csv')

We've had considerable trouble consistently generating Genius.com URLs to fetch lyrics, so we'll drop every song for which our current methods could not find lyrics. We'll also replace the newlines in each lyric with spaces.

In [114]:
## Keep only the songs whose lyrics were found; .copy() avoids a SettingWithCopyWarning below
hot100 = hot100[hot100['lyrics'] != 'URL-ERROR-LYRICS-NOT-FOUND'].copy()
## Replace newlines with spaces so each lyric is a single line of text
hot100['lyrics'] = hot100['lyrics'].str.replace('\n', ' ', regex=False)
hot100

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,artists,rank,year,first_artist,title_fmtd,lyrics,kmeans label
0,0,0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,Wake up in the morning feelin' like P. Diddy ...,1
1,1,1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,"""Hey, sorry I missed your call, just leave a ...",0
2,2,2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,Heyy He-e-e-e-ey He-e-e-e-ey Your lipstick s...,0
3,3,3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,"Greetings, loved ones Let's take a journey I...",1
4,4,4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,"Oh my gosh Baby let me I did it again, so I'm...",1
...,...,...,...,...,...,...,...,...,...,...
1093,1093,1093,More Than My Hometown,Morgan Wallen,96,2020,morgan wallen,more than my hometown,"Girl, our mamas are best friends and so are w...",2
1094,1094,1094,Lovin' On You,Luke Combs,97,2020,luke combs,lovin on you,Don't get me wrong I like a bobber on the wat...,2
1095,1095,1095,Said Sum,Moneybagg Yo,98,2020,moneybagg yo,said sum,"Turn me up, YC Huh? What? Ah, I thought a ...",4
1096,1096,1096,Slide,H.E.R. Featuring YG,99,2020,her,slide,You always wearin' them glasses You don't wan...,2


As you can see in the table above, the lyrics sit in the `lyrics` column, second from the right. The newlines (\n) from the transcription were already replaced with spaces in the previous cell, since line breaks don't matter for our purposes in this notebook. Here's a function that lowercases a lyric, splits it on whitespace, and returns a frequency list (dictionary) of its words.

In [17]:
## Takes in a string of lyrics, cleans them a smidgeon, 
## and returns a frequency list of each word as a dictionary
def lyric_bagger(lyrics):
    word_list = lyrics.lower().split()
    frequencies = {}
    for word in word_list:
        if (word in frequencies):
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

In [21]:
## Convert all the lyrics into frequency dictionaries stored in fmtd_lyrics
lyrics = hot100['lyrics'].copy().tolist()

fmtd_lyrics = []
for l in lyrics:
    fmtd_lyrics.append(lyric_bagger(l))
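
For comparison, the same bag-of-words idea can be sketched with `collections.Counter` from the standard library. The name `lyric_counter` is ours, and unlike `lyric_bagger` this version also strips punctuation from the ends of each word, which is an extra assumption about what "cleaning" should mean here.

```python
from collections import Counter
import string

def lyric_counter(lyrics):
    """Lowercase, split on whitespace, strip surrounding punctuation, count words."""
    words = (w.strip(string.punctuation) for w in lyrics.lower().split())
    # Skip tokens that were nothing but punctuation
    return Counter(w for w in words if w)

## A made-up line, just to show the shape of the result
example = lyric_counter("Hey, hey! Your lipstick stains")
```

A `Counter` behaves like the dictionary returned by `lyric_bagger`, and adds conveniences such as `example.most_common(5)`.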

Or we could skip all of that and use scikit-learn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which handles the tokenizing, counting, and weighting automatically. Why not? [This document-clustering example](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py) looks like a reasonable one to follow.

In [160]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

## TfidfVectorizer assumes that we have a list of "separate documents".
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lyrics)

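## Now let's try some very simple clustering.
## KMeans starts from random centroids, so the labels (and the numbers printed
## further down) can change between runs; pass random_state to KMeans to pin them.
clusters = 3
kmeans = KMeans(n_clusters=clusters).fit(X)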
hot100['kmeans label'] = kmeans.labels_
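
One common way to get a feel for what each cluster captured is to list the terms with the largest weights in each centroid. Here is a small sketch of that idea using only NumPy; the helper name `top_terms_per_cluster` and the toy arrays are ours, for illustration.

```python
import numpy as np

def top_terms_per_cluster(centers, terms, n_terms=5):
    """For each centroid (row), return the n_terms terms with the largest weights."""
    centers = np.asarray(centers)
    terms = np.asarray(terms)
    # argsort is ascending, so reverse each row to get descending weights
    order = np.argsort(centers, axis=1)[:, ::-1][:, :n_terms]
    return [terms[row].tolist() for row in order]

## Toy centroids over a four-word vocabulary (made-up numbers)
toy_centers = np.array([[0.9, 0.1, 0.0, 0.3],
                        [0.0, 0.2, 0.8, 0.5]])
toy_terms = ["baby", "love", "truck", "road"]
toy_top = top_terms_per_cluster(toy_centers, toy_terms, n_terms=2)
```

With the objects from the cell above, the call would presumably look like `top_terms_per_cluster(kmeans.cluster_centers_, vectorizer.get_feature_names_out())`, assuming a scikit-learn version recent enough to provide `get_feature_names_out`.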

In [161]:
hot100[hot100['kmeans label'] == 0].sample(10)[['artists','title','kmeans label']]

Unnamed: 0,artists,title,kmeans label
1056,Old Dominion,One Man Band,0
577,O.T. Genasis,CoCo,0
784,Harry Styles,Sign Of The Times,0
770,Katy Perry Featuring Skip Marley,Chained To The Rhythm,0
598,Calvin Harris & Disciples,How Deep Is Your Love,0
920,Ava Max,Sweet But Psycho,0
377,Kendrick Lamar,Swimming Pools (Drank),0
185,Jason Derulo,Don't Wanna Go Home,0
1022,Mustard & Roddy Ricch,Ballin',0
705,Sam Hunt,Body Like A Back Road,0


In [162]:
hot100[hot100['kmeans label'] == 1].sample(10)[['artists','title','kmeans label']]

Unnamed: 0,artists,title,kmeans label
862,Charlie Puth,How Long,1
414,Eminem Featuring Rihanna,The Monster,1
936,Taylor Swift,You Need To Calm Down,1
480,Naughty Boy Featuring Sam Smith,La La La,1
208,One Direction,What Makes You Beautiful,1
842,G-Eazy & Halsey,Him & I,1
709,"DJ Khaled Featuring Justin Bieber, Quavo, Chan...",I'm The One,1
3,Katy Perry Featuring Snoop Dogg,California Gurls,1
308,Katy Perry,Roar,1
94,Sara Bareilles,King Of Anything,1


In [163]:
hot100[hot100['kmeans label'] == 2].sample(10)[['artists','title','kmeans label']]

Unnamed: 0,artists,title,kmeans label
229,Katy Perry,Part Of Me,2
975,Jonas Brothers,Only Human,2
172,Taylor Swift,Back To December,2
520,Meghan Trainor,Lips Are Movin,2
901,Billie Eilish,Bad Guy,2
735,Maroon 5 Featuring Kendrick Lamar,Don't Wanna Know,2
367,Luke Bryan,Crash My Party,2
63,Lady Gaga,Paparazzi,2
35,Ludacris,How Low,2
971,Lil Baby,Close Friends,2


In [156]:
## This cell prints the dimensionality, one centroid distance, and per-cluster averages.
centers = kmeans.cluster_centers_
print("We're clustering in a "+str(centers.shape[1])+"-dimensional space.")
def dist_between_centroids(m,n):
    ## Euclidean distance between centroids m and n
    return np.linalg.norm(centers[m]-centers[n])
print("The distance between the centroids of the first two clusters is "+
    str(dist_between_centroids(0,1))+".")
for i in range(clusters):
    in_cluster = hot100[hot100['kmeans label'] == i]
    print("The average rank of the songs in cluster "+str(i)+" is "+str(in_cluster['rank'].mean())+".")
    print("The average year of the songs in cluster "+str(i)+" is "+str(in_cluster['year'].mean())+".")

We're clustering in a 11682-dimensional space.
The distance between the centroids of the first two clusters is 0.43669082236060186.
The average rank of the songs in cluster 0 is 46.24.
The average year of the songs in cluster 0 is 2013.8666666666666.
The average rank of the songs in cluster 1 is 50.96029776674938.
The average year of the songs in cluster 1 is 2014.7766749379653.
The average rank of the songs in cluster 2 is 50.2389705882353.
The average year of the songs in cluster 2 is 2015.0735294117646.
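
The cell above only compares the first two centroids. If we want every pair at once, a sketch along these lines builds the full distance matrix with NumPy broadcasting; the function name `pairwise_centroid_distances` and the toy points are ours.

```python
import numpy as np

def pairwise_centroid_distances(centers):
    """Return a (k, k) matrix of Euclidean distances between k centroids."""
    centers = np.asarray(centers, dtype=float)
    # Broadcasting: (k, 1, d) - (1, k, d) gives every pairwise difference, shape (k, k, d)
    diffs = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

## Toy centroids in two dimensions (made-up numbers)
toy_points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
toy_dist = pairwise_centroid_distances(toy_points)
```

Entry `[m, n]` of the result is the distance between centroids `m` and `n`, so the diagonal is zero and the matrix is symmetric. In this notebook we would pass `kmeans.cluster_centers_`.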


# Testing junk below

In [139]:
hot100[hot100['kmeans label'] == 0]['rank'].mean()

47.04651162790697