# Using TF-IDF to classify song lyrics

In this notebook, we'll explore how we can use TF-IDF to classify songs in the Billboard Hot 100 playlist.

In a nutshell, TF-IDF (term frequency-inverse document frequency) scores how distinctive a term is to a particular document by weighing how often the term appears in that document against how common it is across a larger body of documents.
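
To make the idea concrete, here is a minimal sketch of one common textbook form of the score, tf × log(N / df), in plain Python. The function name `tf_idf` and the toy songs are made up for illustration, and library implementations typically use a smoothed and normalized variant, so the exact numbers will differ from theirs.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Term frequency (share of the document) times inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy "songs", not taken from the Hot 100 dataset
toy_songs = [
    "love me love me",
    "love the night",
    "dance all night",
]
toy_scores = tf_idf(toy_songs)
```

In the first toy song, `me` ends up with a higher score than `love` even though both occur twice, because `love` also appears in the second song and is therefore less distinctive.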

Using TF-IDF we can generate

In [9]:
import numpy as np
import pandas as pd

Import the dataset (warts 'n all):

In [10]:
hot100 = pd.read_csv('hot_100_with_lyrics.csv')

We've had considerable trouble consistently generating Genius.com URLs to fetch lyrics, so we'll drop all the songs for which our current methods could not find lyrics.

In [11]:
hot100 = hot100[(hot100['lyrics'] != 'URL-ERROR-LYRICS-NOT-FOUND')]
len(hot100.index)

1022

As you can see in the table above, the lyrics are in the rightmost column. As transcribed, they still contain newline characters (\n), which do not matter for our purposes in this notebook and need to be removed first. Here's a function that cleans up the lyrics and returns them as a frequency list (dictionary) of words; splitting on whitespace takes care of the newlines.

In [17]:
## Takes in a string of lyrics, cleans them a smidgeon (lowercases and
## splits on whitespace, which also drops the \n characters),
## and returns a frequency list of each word as a dictionary
def lyric_bagger(lyrics):
    word_list = lyrics.lower().split()
    frequencies = {}
    for word in word_list:
        # Start unseen words at 0, then count this occurrence
        frequencies[word] = frequencies.get(word, 0) + 1

    return frequencies
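
Splitting on whitespace leaves punctuation attached to words, so `hello,` and `hello` would be counted separately. As a possible refinement, here is a sketch of a variant that keeps only runs of letters and apostrophes and counts them with `collections.Counter`. The name `lyric_bagger_counter` and the sample string are hypothetical, and the pattern assumes ASCII lyrics (accented characters would be dropped).

```python
import re
from collections import Counter

def lyric_bagger_counter(lyrics):
    """Hypothetical variant of lyric_bagger that also strips punctuation."""
    # Lowercase, then keep only runs of a-z and apostrophes (assumes ASCII lyrics)
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words)

# Made-up sample text, not taken from the dataset
sample = "Hello, hello!\nIs it me you're looking for?"
bag = lyric_bagger_counter(sample)
```

A `Counter` behaves like a dictionary, so it can be used wherever the plain frequency dictionary is expected.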

In [21]:
## Convert all the lyrics into frequency dictionaries stored in fmtd_lyrics
lyrics = hot100['lyrics'].tolist()

fmtd_lyrics = [lyric_bagger(l) for l in lyrics]

Or we could potentially skip all that and just use scikit-learn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which does this more or less automatically. Why not? [This example](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py) looks like a reasonable one to follow.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

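## TfidfVectorizer expects an iterable of strings and treats each element as a separate document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lyrics)

## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())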



In [11]:
## now let's try some very simple clustering

from sklearn.cluster import KMeans

In [12]:
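## Fit k-means on the TF-IDF matrix. predict() needs input with the same number
## of features as the data used for fitting (11682 here, one per vocabulary term),
## so the 2-feature toy points [[0, 0], [12, 3]] raise a ValueError.
kmeans = KMeans(n_clusters=9).fit(X)
kmeans.predict(X)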

In [13]:
## predict() is a method of the fitted instance and takes the data as its argument;
## calling it on the KMeans class without X raises a TypeError.
labels = kmeans.predict(X)