# Naive Bayes for Lyrics

In [1]:
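def read_file(fname='cleaned_nn.csv'):
    """Return (labels, lyrics) lists, one entry per song, skipping the header."""
    all_labels = []
    all_lyrics = []
    # Use a context manager so the file is closed after reading.
    with open(fname, 'r') as fp:
        for line in fp:
            # Everything after the genre column is treated as lyrics text.
            index, genre, *lyrics = line.split(',')
            all_labels.append(genre)
            all_lyrics.append(' '.join(lyrics))

    return all_labels[1:], all_lyrics[1:]  # Remove header row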
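
Splitting each line on every comma also splits quoted lyrics that contain commas, and it folds the trailing columns into the lyrics text. A minimal sketch of an alternative using the standard `csv` module follows. It assumes the columns are index, genre, lyrics, song, artist (inferred from the sample row, not confirmed), and the helper name `read_rows` and the sample rows are made up for illustration.

```python
import csv
import io

def read_rows(fp):
    """Parse genre and lyrics from an open CSV file, honoring quoted fields.

    Assumes columns: index, genre, lyrics, song, artist (an assumption based
    on the sample row), with a header on the first line.
    """
    reader = csv.reader(fp)
    next(reader)  # skip the header row
    labels, lyrics = [], []
    for row in reader:
        labels.append(row[1])
        lyrics.append(row[2])
    return labels, lyrics

# Small in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    'index,genre,lyrics,song,artist\n'
    '0,Rock,"Strip search, busted",some-song,some-artist\n'
    '1,Pop,"La la, la",other-song,other-artist\n'
)
labels, lyrics = read_rows(sample)
print(labels)
print(lyrics)
```

With a real file, the `csv` documentation recommends opening it with `open(fname, newline='')` before passing it to `csv.reader`.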

In [2]:
all_labels, all_lyrics = read_file()

In [3]:
(all_labels[0], all_lyrics[0])

('Rock',
 '"Downtown to the courthouse For some pass-the-buck justice It\'s a knock-down  drag \'em out  beat \'em up  mace the crowd Strip search  busted Can\'t control the fire Coming through barbed wire It\'s a junkyard  live hard  bear the scars of broke and charred Sellers and buyers So when all is said and done Watch out  \'cause here they come Shadows of the night It\'s the last of the old ways It\'s the wave of the new age Computerized  digitized  money markets  mechanized Internet highway Infected by the needy Neglected by the greedy Chaos  blood loss  nickel bagging debutants Looking for a freebie We got radio silence Another act of violence Backstage  spray paint  security in rollerskates Beating on a new dance Going down on a choker Stun gun smoker Open range  who\'s game  shoot until the scream of pain Never getting older" shadows-of-the-night dead-moon\n')

Creating a sparse bag-of-words representation of the lyrics in order to perform text analysis.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(all_lyrics)
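
To illustrate what a sparse bag-of-words representation stores, here is a plain-Python sketch (not scikit-learn's implementation): each document keeps only its nonzero column-to-count entries. The two example documents and the helper name `bag_of_words` are made up, and the regex mirrors the two-or-more-character tokens that `CountVectorizer` keeps by default.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows) where each row maps column index -> count."""
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
    # Sorted vocabulary so that column indices are deterministic.
    terms = sorted({t for toks in tokenized for t in toks})
    vocab = {term: i for i, term in enumerate(terms)}
    # Only nonzero counts are stored, which is what makes the rows sparse.
    rows = [dict(Counter(vocab[t] for t in toks)) for toks in tokenized]
    return vocab, rows

docs = ["shadows of the night", "the night the fire"]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Most songs use only a small fraction of the full vocabulary, so storing the zeros explicitly would waste most of the memory.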

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Using a Multinomial Naive Bayes classifier. Although this model assumes features are conditionally independent given the genre, it is extremely efficient to train. In addition, the impact of preceding words may not be too large in this setting. We could also look at other models, such as HMMs or LSTM neural networks, to determine whether previous and future words are correlated.
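
As a sketch of what the classifier computes: Multinomial Naive Bayes scores each genre by its log prior plus the sum of smoothed log word-likelihoods over the words in a song, then picks the highest-scoring genre. The toy token lists and the function names `train_nb` and `predict_nb` are made up; this is a plain-Python illustration with additive smoothing (`alpha=1.0`), not scikit-learn's code.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Additive smoothing: add alpha to every word count in the class.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((word_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    def score(c):
        return log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(log_prior, key=score)

docs = [["guitar", "fire", "night"], ["fire", "guitar"],
        ["love", "dance"], ["dance", "love", "night"]]
labels = ["Rock", "Rock", "Pop", "Pop"]
log_prior, log_lik = train_nb(docs, labels)
print(predict_nb(["guitar", "night"], log_prior, log_lik))
```

Because each word contributes independently to the score, word order is ignored entirely, which is the independence assumption mentioned above.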

In [6]:
model = MultinomialNB()

In [7]:
model.fit(X_train, all_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [8]:
expected = all_labels
predicted = model.predict(X_train) 

Printing the training accuracy. This number should be relatively high, since the model is scored on the same data it was trained on.

In [9]:
print(metrics.accuracy_score(expected, predicted))

0.811625


Looking at 10-fold cross-validation accuracy without any feature engineering.

In [10]:
from sklearn.model_selection import cross_val_score
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

In [32]:
scores

array([0.6525 , 0.6475 , 0.65125, 0.65   , 0.65625, 0.66625, 0.65875,
       0.69625, 0.66625, 0.6575 ])

In [33]:
scores.mean()

0.66025
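
One caveat with the cells above is that the vectorizer was fit on all lyrics before cross-validation, so each fold's vocabulary includes terms from its held-out songs. A hedged sketch of one way to avoid that is to wrap both steps in a scikit-learn pipeline, so the vectorizer is refit on the training part of every fold. The toy lyrics and labels are invented, and how much this would change the scores reported here is not known.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: four short "songs" per genre.
toy_lyrics = [
    "guitar fire night", "fire guitar road", "loud guitar night", "road fire loud",
    "love dance heart", "dance love night", "heart love baby", "baby dance heart",
]
toy_labels = ["Rock"] * 4 + ["Pop"] * 4

# The pipeline refits CountVectorizer inside every training fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_scores = cross_val_score(pipeline, toy_lyrics, toy_labels, cv=2, scoring='accuracy')
print(toy_scores.mean())
```

On the real data the same call would take the raw text instead of the precomputed matrix, for example `cross_val_score(pipeline, all_lyrics, all_labels, cv=10, scoring='accuracy')`.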

Adding an n-gram range so that sequences of up to three consecutive words are used as features, in addition to single words. Multi-word phrases may capture more about the track than individual words do.

In [43]:
vectorizer = CountVectorizer(ngram_range=(1, 3),token_pattern=r'\b\w+\b', min_df=1)
X_train = vectorizer.fit_transform(all_lyrics)
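
To make `ngram_range=(1, 3)` concrete, the sketch below lists the unigrams, bigrams, and trigrams for one short phrase from the sample lyrics in plain Python. The helper `word_ngrams` is hypothetical and only approximates what the vectorizer extracts.

```python
import re

def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams("Shadows of the night"))
```

Four words already yield nine features, so adding bigrams and trigrams makes the vocabulary much larger than with single words alone.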

In [14]:
model.fit(X_train, all_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [45]:
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

scores.mean()

0.635375

Using a token pattern to match words, and adding a list of stop words to remove so that these words do not drive the genre classification. Generally, stop words add noise because they carry little information about the song or its genre. The n-gram range is also reduced to (1, 2) in this cell.

In [41]:
vectorizer = CountVectorizer(ngram_range=(1, 2),token_pattern=r'\b\w+\b', min_df=1, stop_words = ['the', 'a', 'and', 'an', 'in'])
X_train = vectorizer.fit_transform(all_lyrics)
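
Two details of this cell can be illustrated in plain Python. `CountVectorizer`'s default token pattern only keeps tokens of two or more characters, so the custom pattern `r'\b\w+\b'` is what lets one-letter words such as "a" through, and the stop-word list then removes them again before n-grams are built. The helper `tokens_for` and the example line are hypothetical, and this is a sketch rather than the library's code.

```python
import re

STOP_WORDS = {'the', 'a', 'and', 'an', 'in'}

def tokens_for(text, pattern, stop_words=frozenset()):
    """Lowercase, tokenize with the given regex, then drop stop words."""
    return [t for t in re.findall(pattern, text.lower()) if t not in stop_words]

line = "It's a junkyard in the night"
default_like = tokens_for(line, r'\b\w\w+\b')        # two-or-more-character tokens
custom = tokens_for(line, r'\b\w+\b')                 # also keeps 's' and 'a'
filtered = tokens_for(line, r'\b\w+\b', STOP_WORDS)   # custom pattern + stop list
print(default_like)
print(custom)
print(filtered)
```

Since stop words are dropped before n-grams are formed, a bigram can join two words that were separated by a stop word in the original lyrics.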

In [42]:
model.fit(X_train, all_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [39]:
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

scores.mean()

0.645375

Added another parameter, min_df=2, so that only terms appearing in at least two songs are analyzed. This can be especially useful for removing one-off terms that occur in a single song, which give the model little to generalize from. As a result, restricting the vocabulary to terms shared by at least two songs may boost accuracy.

In [46]:
vectorizer = CountVectorizer(ngram_range=(1, 2),token_pattern=r'\b\w+\b', min_df=2, stop_words = ['the', 'a', 'and', 'an', 'in'])
X_train = vectorizer.fit_transform(all_lyrics)
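
Note that `min_df` is a document-frequency threshold: a term is kept when it occurs in at least that many documents, regardless of how often it is repeated within one song. A plain-Python sketch with invented documents follows (the helper `filter_by_min_df` is hypothetical):

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms that occur in at least min_df distinct documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if df >= min_df)

docs = [
    ["fire", "fire", "fire", "night"],   # 'fire' repeated, but only in this song
    ["night", "love"],
    ["love", "night", "dance"],
]
print(filter_by_min_df(docs, 1))
print(filter_by_min_df(docs, 2))
print(filter_by_min_df(docs, 3))
```

Here "fire" is dropped at `min_df=2` even though it occurs three times, because all three occurrences are in the same document.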

In [47]:
model.fit(X_train, all_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [48]:
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

scores.mean()

0.662

Now trying terms that appear in at least three songs (min_df=3). This, however, may have a negative effect, since the higher threshold also discards rarer terms that could help distinguish genres.

In [49]:
vectorizer = CountVectorizer(ngram_range=(1, 2),token_pattern=r'\b\w+\b', min_df=3, stop_words = ['the', 'a', 'and', 'an', 'in'])
X_train = vectorizer.fit_transform(all_lyrics)

In [50]:
model.fit(X_train, all_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [51]:
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

scores.mean()

0.654875

After feature engineering with the Multinomial Naive Bayes model, we achieved a cross-validated accuracy of roughly 0.662. This is higher than the scores received for Logistic Regression. We did not use Gaussian Naive Bayes because scikit-learn's GaussianNB does not accept sparse matrices, whereas MultinomialNB works directly on the sparse count matrix.