# Final Project Code: Genre and Sentiment Classification with Song Lyrics

Datasets can be found in the /datasets folder.

### Step #0: NOTES FROM FEEDBACK

a. We found a classifier not covered in class, SGD (SGDClassifier), and used it. We were not referring to LinearSVC(); when we tried it, it would not run.
From the scikit-learn website: "This estimator implements regularized linear models with stochastic gradient descent (SGD) learning".

b. We didn't pick a value of k for KNN; we used scikit-learn's KNeighborsClassifier with its default settings (n_neighbors=5).
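
To make note (b) concrete: when no value is passed, scikit-learn's KNeighborsClassifier uses `n_neighbors=5`, so a value of k is still in effect. The sketch below is a minimal pure-Python k-nearest-neighbours vote, not scikit-learn's implementation; the function name `knn_predict` and the toy points are hypothetical and only illustrate what k controls.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # Pair each Euclidean distance with its label, then sort by distance.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the label with the highest vote count.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points standing in for song vectors
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["blues", "blues", "blues", "rock", "rock", "rock"]
print(knn_predict(points, labels, query=(1, 1), k=3))  # prints blues
```

A larger k smooths the vote over more neighbours, while a smaller k follows the closest examples more tightly.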

### Step #1: Mount Google Drive and download libraries

a. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


b. Import Libraries: 

In [None]:
import re
import pandas as pd
import nltk
import gensim

from sklearn.datasets import fetch_openml as fetch_mldata
from sklearn.model_selection import cross_val_score

from numpy import mean
import numpy as np
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

from gensim.models import Word2Vec
bigmodel = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/MyDrive/GoogleNews-vectors-negative300-SLIM.bin", binary=True)

c. Download Datasets from OpenML and store them into dataframes:


In [None]:
genreData =  fetch_mldata(name='Music-Dataset--1950-to-2019', version=1)
gdf = genreData.data
x = gdf["lyrics"]
y = gdf["genre"]
print(y[1])
print(x[1])   

pop
believe drop rain fall grow believe darkest night candle glow believe go astray come believe believe believe smallest prayer hear believe great hear word time hear bear baby touch leaf believe believe believe lord heaven guide sin hide believe calvary die pierce believe death rise meet heaven loud amen know believe


In [None]:
spotifyData = fetch_mldata(name='150K-Lyrics-Labeled-with-Spotify-Valence', version=1)
sdf = spotifyData.data
spotify_x = sdf["seq"]
spotify_y = sdf["label"]
print(spotify_x[1])
print(spotify_y[1])   

The drinks go down and smoke goes up, I feel myself, got to let go
My cares get lost up in that crowd that go up, up and away yo

Slow down the lights, eyes open wide
We live till we die, live till we die, live till we die

Ain't kill my vibe, don't blow my high, don't doubt that he from the band though
I'm listening to this song, now I'm up up and away yo

Slow down the lights, eyes open wide
We live till we die, live till we die, live till we die
Slow down the lights, eyes open wide
We live till we die, live till we die, live till we die
Slow down the lights, eyes open wide
We live till we die, live till we die, live till we die

Nights like this, I go all out, up so high, I can't come down
Let me live just for right now
Yeah, yeah, bite me, I'm so gone

Happier than a motherfucker, I can't feel my fa-ace
No squares in my circle, get up out my way
Happier than a motherfucker, I can't feel my fa-ace
Who you tryna dance? Get up out my way
Happier than a motherfucker, I can't feel my fa

### Step #2: Tokenize Lyrics for both datasets


In [None]:
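nltk.download('punkt')  # tokenizer models required by nltk.word_tokenize

def tokenize(column):
    """Split a lyrics string into word tokens, keeping only alphabetic ones."""
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]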

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
gdf['lyrics_tokenized'] = gdf.apply(lambda x: tokenize(x['lyrics']), axis=1)
gdf[['lyrics_tokenized']].head() 

sdf['lyrics_tokenized'] = sdf.apply(lambda x: tokenize(x['seq']), axis=1)
sdf[['lyrics_tokenized']].head()  


Unnamed: 0,lyrics_tokenized
0,"[No, no, I, ai, ever, trapped, out, the, bando..."
1,"[The, drinks, go, down, and, smoke, goes, up, ..."
2,"[She, do, live, on, planet, Earth, no, more, S..."
3,"[Trippin, off, that, Grigio, mobbin, lights, l..."
4,"[I, see, a, midnight, panther, so, gallant, an..."
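
The `isalpha()` filter in `tokenize` explains tokens such as `ai` in the output above: NLTK's tokenizer typically splits a contraction like "ain't" into `ai` and `n't`, and only the purely alphabetic piece survives. Below is a small sketch of the filter applied to an already-tokenized list. The token list is a hand-written, hypothetical stand-in for `word_tokenize` output, and `keep_alphabetic` is an illustrative name.

```python
def keep_alphabetic(tokens):
    """Keep only tokens made entirely of letters, as tokenize() does."""
    return [t for t in tokens if t.isalpha()]

# Hand-written stand-in for tokenizer output on: No, I ain't ever 2 late
tokens = ["No", ",", "I", "ai", "n't", "ever", "2", "late"]
print(keep_alphabetic(tokens))  # prints ['No', 'I', 'ai', 'ever', 'late']
```

Punctuation, digits, and contraction fragments containing an apostrophe are all dropped by this filter.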


### Spotify data: Bin valences (for baseline data)

Use sdf['valence'] for outputs. Each bin includes its lower bound and excludes its upper bound, except the last bin, which also includes 1.0.

--> 0 - 0.2: Very Negative

--> 0.2 - 0.4: Somewhat Negative

--> 0.4 - 0.6: Neutral

--> 0.6 - 0.8: Somewhat Positive

--> 0.8 - 1.0: Very Positive

In [None]:
valence = [] 
bertvalencetest = []
count = len(sdf) 
vn = 0 
sn = 0 
n = 0 
sp = 0 
vp = 0 
# Bin valences 
for val in spotify_y: 
  if val < 0.2: 
    valence.append('Very Negative')
    vn +=1
  elif val >= 0.2 and val < 0.4: 
    valence.append('Somewhat Negative')
    sn +=1
  elif val >= 0.4 and val < 0.6: 
    valence.append('Neutral')
    n +=1
  elif val >= 0.6 and val < 0.8:
    valence.append('Somewhat Positive')
    sp +=1
  elif val >= 0.8: 
    valence.append('Very Positive')
    vp +=1
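  # Binary label: 0 for valence below 0.5, 1 otherwise (including exactly 1.0),
  # so every song gets exactly one entry and the list length matches sdf
  if val < 0.5:
    bertvalencetest.append(0)
  else:
    bertvalencetest.append(1)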
sdf['valence'] = valence 
sdf['bertvalence'] = bertvalencetest
print('---------------------TRAINING SET VALENCE BASELINE DATA------------------')
print('Very negative songs: ', vn/count)
print('Somewhat negative songs: ', sn/count)
print('Neutral songs: ', n/count)
print('Somewhat positive songs: ', sp/count)
print('Very positive songs: ', vp/count)

---------------------TRAINING SET VALENCE BASELINE DATA------------------
Very negative songs:  0.14866784967761898
Somewhat negative songs:  0.25023207643682155
Neutral songs:  0.2464367583815905
Somewhat positive songs:  0.21503223810095168
Very positive songs:  0.1396310774030173
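
The five valence bins are half-open intervals, so the same mapping can be written as a lookup against the bin edges. The sketch below uses the standard-library `bisect` module as one possible alternative to the if/elif chain; `BIN_EDGES`, `BIN_LABELS`, and `bin_valence` are hypothetical names, not part of the notebook.

```python
from bisect import bisect_right

BIN_EDGES = [0.2, 0.4, 0.6, 0.8]
BIN_LABELS = ["Very Negative", "Somewhat Negative", "Neutral",
              "Somewhat Positive", "Very Positive"]

def bin_valence(val):
    """Map a valence in [0, 1] to its label."""
    # bisect_right counts how many edges are <= val, which is the bin index.
    return BIN_LABELS[bisect_right(BIN_EDGES, val)]

for v in (0.0, 0.2, 0.59, 0.8, 1.0):
    print(v, bin_valence(v))
```

A value that sits exactly on an edge, such as 0.2, lands in the upper bin, which matches the `>=` comparisons in the cell above.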


### Delete the 'pop' genre because pop songs are overrepresented in the dataset

In [None]:
gdf.drop(gdf.loc[gdf['genre'] == 'pop'].index, inplace=True)
print(gdf.shape)

(21330, 32)


### Step #3: Split Training and Testing Data and reassign x and y values

Adapted from this source: https://towardsdatascience.com/how-to-split-a-dataset-into-training-and-testing-sets-b146b1649830


* x1 --> lyrics for Spotify valence, training data
* y1 --> valence for training data
* x1test --> lyrics for Spotify valence, testing data
* y1test --> valence for testing data
* x2 --> lyrics for genre dataset, training data
* y2 --> genre for training data
* x2test --> lyrics for genre dataset, testing data
* y2test --> genre for testing data

In [None]:
## We are using a 80/20 split 
spotify_training_data = sdf.sample(frac=0.8, random_state=25)
spotify_testing_data = sdf.drop(spotify_training_data.index)
print(f"No. of spotify training examples: {spotify_training_data.shape[0]}")
print(f"No. of spotify testing examples: {spotify_testing_data.shape[0]}") 

genre_training_data = gdf.sample(frac=0.8, random_state=25)
genre_testing_data = gdf.drop(genre_training_data.index)

print(f"No. of genre training examples: {genre_training_data.shape[0]}")
print(f"No. of genre testing examples: {genre_testing_data.shape[0]}")  

x1 = spotify_training_data['lyrics_tokenized']  
y1 = spotify_training_data['valence'] 

x1test = spotify_testing_data['lyrics_tokenized']
y1test = spotify_testing_data['valence']

print(x1.shape, y1.shape)
print(x1test.shape, y1test.shape)

x2 = genre_training_data['lyrics_tokenized']
y2 = genre_training_data['genre']
x2test = genre_testing_data['lyrics_tokenized']
y2test = genre_testing_data['genre']

print(x2.shape, y2.shape)
print(x2test.shape, y2test.shape)

No. of spotify training examples: 126682
No. of spotify testing examples: 31671
No. of genre training examples: 17064
No. of genre testing examples: 4266
(126682,) (126682,)
(31671,) (31671,)
(17064,) (17064,)
(4266,) (4266,)
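
The split above draws 80% of the rows at random with a fixed seed and keeps the remaining rows for testing, so the two sets are disjoint and together cover the data. The sketch below shows the same idea on plain indices using only the standard library. `split_indices` is a hypothetical helper, and it will not pick the same rows as pandas' `sample`; only the sizes and the disjointness carry over.

```python
import random

def split_indices(n_rows, train_frac=0.8, seed=25):
    """Return (train, test) index lists that are disjoint and cover range(n_rows)."""
    rng = random.Random(seed)              # fixed seed makes the split repeatable
    n_train = round(n_rows * train_frac)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # prints 8 2
```

For the 21330 genre rows this gives 17064 training and 4266 testing indices, the same sizes as in the output above.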


### Step #4: Baselines for Music Genre Data

Calculate how many songs fall into each genre category  
Genres include: pop, rock, country, blues, jazz, hip hop, and reggae

Should we drop pop songs? They make up the largest share of the songs, so the dataset may be unbalanced...

In [None]:
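## Baselines for genre data
# NOTE: `count` is the row count after 'pop' was dropped, but the loop below
# iterates over `y`, which was captured before the drop and still contains pop.
# The fractions are therefore relative to the non-pop total and sum to more than 1.
count = len(gdf)
# 7 genres in the original data; unique() shows the 6 left after dropping pop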
print(gdf['genre'].unique())
popcount = 0
rockcount = 0
countrycount = 0
bluescount = 0 
jazzcount = 0 
hhcount = 0 
regcount = 0 

for genre in y: 
  if genre == 'pop': 
    popcount += 1 
  elif genre == 'rock': 
    rockcount += 1 
  elif genre == 'country': 
    countrycount += 1
  elif genre == 'blues': 
    bluescount += 1 
  elif genre == 'jazz': 
    jazzcount += 1 
  elif genre == 'reggae': 
    regcount += 1 
  elif genre == 'hip hop': 
    hhcount += 1
print('---------------------GENRE BASELINE DATA------------------')
print('Percentage of pop songs', popcount/count)
print('Percentage of blues songs', bluescount/count)
print('Percentage of rock songs', rockcount/count)
print('Percentage of country songs', countrycount/count)
print('Percentage of jazz songs', jazzcount/count)
print('Percentage of reggae songs', regcount/count)
print('Percentage of hip hop songs', hhcount/count)

['country' 'blues' 'jazz' 'reggae' 'rock' 'hip hop']
---------------------GENRE BASELINE DATA------------------
Percentage of pop songs 0.3301453352086263
Percentage of blues songs 0.21584622597280825
Percentage of rock songs 0.18912330051570558
Percentage of country songs 0.2552742616033755
Percentage of jazz songs 0.1802625410220347
Percentage of reggae songs 0.11711204875761838
Percentage of hip hop songs 0.04238162212845757
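
A class-share baseline can also be computed with `collections.Counter`, dividing by the length of the same label list that was counted so that the shares sum to 1. This is a minimal sketch on hypothetical labels, not the dataset's; `class_shares` is an illustrative name.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the list as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)  # same list that was counted, so the shares sum to 1
    return {label: n / total for label, n in counts.items()}

toy = ["pop", "pop", "rock", "blues"]  # hypothetical labels
shares = class_shares(toy)
print(shares)
```

The largest share is the accuracy a classifier would get by always predicting the most common genre.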


### Step #5: Classification

*   Use word2vec to build a 300-dimensional vector for each song by summing the embeddings of its tokens
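
In this step each song is represented by adding up the embedding of every token found in the pretrained model; tokens missing from the vocabulary are skipped. The sketch below shows that sum with a toy 3-dimensional dictionary standing in for `bigmodel`. The helper name `song_vector` and all values are hypothetical.

```python
def song_vector(tokens, embeddings, dim):
    """Sum the embeddings of all in-vocabulary tokens (lower-cased)."""
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token.lower())
        if vec is None:
            continue  # out-of-vocabulary tokens contribute nothing
        total = [a + b for a, b in zip(total, vec)]
    return total

# Toy stand-in for the 300-dimensional pretrained vectors
toy_embeddings = {"rain": [1.0, 0.0, 2.0], "fall": [0.5, 1.0, 0.0]}
print(song_vector(["Rain", "fall", "zzz"], toy_embeddings, dim=3))  # prints [1.5, 1.0, 2.0]
```

Because this is a sum rather than a mean, longer lyrics produce vectors of larger magnitude. Dividing by the number of matched tokens would be one way to remove that length effect, though the notebook does not do so.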



GENRE CLASSIFICATION

In [None]:
# TRAINING DATA 
genretargets = []
genrevectors = []
for genre in y2: 
  if genre == 'rock': 
    genretargets.append(1)
  elif genre == 'country': 
    genretargets.append(2)
  elif genre == 'blues': 
    genretargets.append(3)
  elif genre == 'jazz': 
    genretargets.append(4)
  elif genre == 'reggae': 
    genretargets.append(5) 
  elif genre == 'hip hop': 
    genretargets.append(6)
for h in x2:
    totvec = np.zeros(300)
    for w in h:
        if w.lower() in bigmodel:
            totvec = totvec + bigmodel[w.lower()]
    genrevectors.append(totvec)

# TESTING DATA 
gtesttargets = []
gtestvectors = []
for genre in y2test: 
  if genre == 'rock': 
    gtesttargets.append(1)
  elif genre == 'country': 
    gtesttargets.append(2)
  elif genre == 'blues': 
    gtesttargets.append(3)
  elif genre == 'jazz': 
    gtesttargets.append(4)
  elif genre == 'reggae': 
    gtesttargets.append(5) 
  elif genre == 'hip hop': 
    gtesttargets.append(6)
for h in x2test:
    totvec = np.zeros(300)
    for w in h:
        if w.lower() in bigmodel:
            totvec = totvec + bigmodel[w.lower()]
    gtestvectors.append(totvec)
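
The genre-to-integer encoding in the cell above can also be expressed as a dictionary lookup. The sketch below uses the same six codes; `GENRE_TO_ID` and `encode_genres` are hypothetical names. One difference in behaviour is worth noting: an if/elif chain silently skips an unexpected label, which would leave the targets shorter than the vectors, while a dictionary lookup raises `KeyError`.

```python
GENRE_TO_ID = {"rock": 1, "country": 2, "blues": 3,
               "jazz": 4, "reggae": 5, "hip hop": 6}

def encode_genres(genres):
    """Map genre names to integer codes; unknown genres raise KeyError."""
    return [GENRE_TO_ID[g] for g in genres]

print(encode_genres(["rock", "hip hop", "blues"]))  # prints [1, 6, 3]
```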


GENRE MODELS

In [None]:
gmodel = GaussianNB()
gmodel.fit(genrevectors, genretargets)

expected = gtesttargets
predicted = gmodel.predict(gtestvectors)
print("GENRE NB CLASSIFIER")
print(metrics.classification_report(expected, predicted))


gmodel2 = SGDClassifier()
gmodel2.max_iter = 100000
gmodel2.fit(genrevectors, genretargets)

expected2 = gtesttargets
predicted2 = gmodel2.predict(gtestvectors)
print("GENRE SGD CLASSIFIER")
print(metrics.classification_report(expected2, predicted2))


gmodel3 = LogisticRegression()
gmodel3.max_iter = 100000
gmodel3.fit(genrevectors, genretargets)

expected3 = gtesttargets
predicted3 = gmodel3.predict(gtestvectors)
print("GENRE LR CLASSIFIER")
print(metrics.classification_report(expected3, predicted3))


gmodel4 = MLPClassifier()
gmodel4.max_iter = 100000
gmodel4.fit(genrevectors, genretargets)

expected4 = gtesttargets
predicted4 = gmodel4.predict(gtestvectors)
print("GENRE MLP CLASSIFIER")
print(metrics.classification_report(expected4, predicted4))


gmodel5 = KNeighborsClassifier()  # default n_neighbors=5; KNN has no max_iter parameter
gmodel5.fit(genrevectors, genretargets)

expected5 = gtesttargets
predicted5 = gmodel5.predict(gtestvectors)
print("GENRE KNN CLASSIFIER")
print(metrics.classification_report(expected5, predicted5))

GENRE NB CLASSIFIER
              precision    recall  f1-score   support

           1       0.29      0.09      0.14       756
           2       0.30      0.77      0.44      1081
           3       0.44      0.05      0.09       955
           4       0.32      0.01      0.03       782
           5       0.25      0.44      0.32       503
           6       0.21      0.30      0.24       189

    accuracy                           0.29      4266
   macro avg       0.30      0.28      0.21      4266
weighted avg       0.32      0.29      0.21      4266

GENRE SGD CLASSIFIER
              precision    recall  f1-score   support

           1       0.38      0.22      0.28       756
           2       0.47      0.50      0.48      1081
           3       0.39      0.18      0.25       955
           4       0.26      0.51      0.34       782
           5       0.31      0.44      0.37       503
           6       0.86      0.03      0.06       189

    accuracy                        
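
In these reports, precision for a class is the share of predictions of that class that were correct, recall is the share of true members of the class that were found, and F1 is the harmonic mean of the two. For example, the SGD row for class 6 (precision 0.86, recall 0.03) gives an F1 of about 0.06. The sketch below computes the three values for one class from toy lists; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(expected, predicted, cls):
    """Return (precision, recall, f1) for a single class."""
    tp = sum(1 for e, p in zip(expected, predicted) if e == cls and p == cls)
    predicted_cls = sum(1 for p in predicted if p == cls)  # tp + false positives
    actual_cls = sum(1 for e in expected if e == cls)      # tp + false negatives
    precision = tp / predicted_cls if predicted_cls else 0.0
    recall = tp / actual_cls if actual_cls else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels, not taken from the dataset
expected = [1, 1, 2, 2, 2, 3]
predicted = [1, 2, 2, 2, 3, 3]
print(precision_recall_f1(expected, predicted, cls=2))
```

A class with high precision but very low recall, like class 6 under SGD, is one the model rarely predicts but usually gets right when it does.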

b. Word2Vec for Spotify Valence

In [None]:
#Training data
spotifytargets = []
spotifyvectors = []
for label in y1:  # y1 holds the binned valence labels
  if label == 'Very Negative':
    spotifytargets.append(0)
  elif label == 'Somewhat Negative':
    spotifytargets.append(1)
  elif label == 'Neutral':   
    spotifytargets.append(2)
  elif label == 'Somewhat Positive':
    spotifytargets.append(3)
  elif label == 'Very Positive': 
    spotifytargets.append(4)
for h in x1:
    totvec = np.zeros(300)
    for w in h:
        if w.lower() in bigmodel:
            totvec = totvec + bigmodel[w.lower()]
    spotifyvectors.append(totvec)

#Testing data
stesttargets = []
stestvectors = []
for label in y1test: 
  if label == 'Very Negative': 
    stesttargets.append(0)
  elif label == 'Somewhat Negative': 
    stesttargets.append(1)
  elif label == 'Neutral':  
    stesttargets.append(2)
  elif label == 'Somewhat Positive': 
    stesttargets.append(3)
  elif label == 'Very Positive': 
    stesttargets.append(4)
for h in x1test:
    totvec = np.zeros(300)
    for w in h:
        if w.lower() in bigmodel:
            totvec = totvec + bigmodel[w.lower()]
    stestvectors.append(totvec)


c. Classification Models for Spotify Valence 

In [None]:
smodel = GaussianNB()
# Fit on the training split so the test-set report measures out-of-sample performance
smodel.fit(spotifyvectors, spotifytargets)

expected = stesttargets
predicted = smodel.predict(stestvectors)
print("SPOTIFY NB CLASSIFIER")
print(metrics.classification_report(expected, predicted))


smodel2 = SGDClassifier()
smodel2.max_iter = 100000
smodel2.fit(spotifyvectors, spotifytargets)

expected2 = stesttargets
predicted2 = smodel2.predict(stestvectors)
print("SPOTIFY SGD CLASSIFIER")
print(metrics.classification_report(expected2, predicted2))


smodel3 = LogisticRegression()
smodel3.max_iter = 100000
smodel3.fit(spotifyvectors, spotifytargets)

expected3 = stesttargets
predicted3 = smodel3.predict(stestvectors)
print("SPOTIFY LR CLASSIFIER")
print(metrics.classification_report(expected3, predicted3))


sgmodel4 = MLPClassifier()
sgmodel4.fit(spotifyvectors, spotifytargets)

expected4 = stesttargets
predicted4 = sgmodel4.predict(stestvectors)
print("SPOTIFY MLP CLASSIFIER")
print(metrics.classification_report(expected4, predicted4))


smodel5 = KNeighborsClassifier()
smodel5.fit(spotifyvectors, spotifytargets)

expected5 = stesttargets
predicted5 = smodel5.predict(stestvectors)
print("SPOTIFY KNN CLASSIFIER")
print(metrics.classification_report(expected5, predicted5))

SPOTIFY NB CLASSIFIER
              precision    recall  f1-score   support

           0       0.19      0.77      0.30      4727
           1       0.25      0.15      0.18      7831
           2       0.24      0.00      0.01      7751
           3       0.30      0.20      0.24      6897
           4       0.21      0.14      0.17      4465

    accuracy                           0.21     31671
   macro avg       0.24      0.25      0.18     31671
weighted avg       0.24      0.21      0.17     31671

SPOTIFY SGD CLASSIFIER
              precision    recall  f1-score   support

           0       0.36      0.01      0.02      4727
           1       0.28      0.82      0.42      7831
           2       1.00      0.00      0.00      7751
           3       0.00      0.00      0.00      6897
           4       0.24      0.48      0.32      4465

    accuracy                           0.27     31671
   macro avg       0.38      0.26      0.15     31671
weighted avg       0.40      0.2



SPOTIFY MLP CLASSIFIER
              precision    recall  f1-score   support

           0       0.40      0.25      0.31      4727
           1       0.32      0.48      0.38      7831
           2       0.27      0.28      0.28      7751
           3       0.31      0.30      0.31      6897
           4       0.35      0.17      0.22      4465

    accuracy                           0.31     31671
   macro avg       0.33      0.30      0.30     31671
weighted avg       0.32      0.31      0.31     31671

SPOTIFY KNN CLASSIFIER
              precision    recall  f1-score   support

           0       0.28      0.36      0.31      4727
           1       0.29      0.36      0.33      7831
           2       0.28      0.30      0.29      7751
           3       0.29      0.23      0.26      6897
           4       0.29      0.14      0.19      4465

    accuracy                           0.29     31671
   macro avg       0.29      0.28      0.27     31671
weighted avg       0.29      0.
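
A useful reference point for the accuracies above is the majority-class baseline, that is, always predicting the most frequent class. From the support column, the largest valence class in the test split has 7831 of 31671 songs, or about 0.25. The sketch below computes that baseline from the support counts; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Support counts for valence classes 0-4, taken from the reports above
support = {0: 4727, 1: 7831, 2: 7751, 3: 6897, 4: 4465}
labels = [cls for cls, n in support.items() for _ in range(n)]
print(round(majority_baseline_accuracy(labels), 3))  # prints 0.247
```

Against this baseline, the reported accuracies of 0.29 (KNN) and 0.31 (MLP) are modest improvements.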

### Step #6: Find a Trend between Genre and Valence

In [None]:
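# We use the KNN classifiers (gmodel5 for genre, smodel5 for valence) to predict
# both labels for every song in the Spotify test set. Model choice was based on the
# accuracies reported above; the MLP classifiers (gmodel4, sgmodel4) can be substituted here.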
predictedtrend1 = gmodel5.predict(stestvectors)
predictedtrend2 = smodel5.predict(stestvectors)

rocklist = []
countrylist = []
blueslist = []
jazzlist = []
reggaelist = []
hiphoplist = []

# predictedtrend1 holds genre labels 1-6
# predictedtrend2 holds valence labels 0-4 (Very Negative, Somewhat Negative, etc.)
# Find the associated genre for every example via predictedtrend1
# Find the associated valence for every example via predictedtrend2

#This for loop matches each example to its genre and associated valence
for i in range(len(predictedtrend1)):
  if predictedtrend1[i] == 1:
    rocklist.append(predictedtrend2[i])
  if predictedtrend1[i] == 2:
    countrylist.append(predictedtrend2[i])
  if predictedtrend1[i] == 3:
    blueslist.append(predictedtrend2[i])
  if predictedtrend1[i] == 4:
    jazzlist.append(predictedtrend2[i])
  if predictedtrend1[i] == 5:
    reggaelist.append(predictedtrend2[i])
  if predictedtrend1[i] == 6:
    hiphoplist.append(predictedtrend2[i])

biglist = []
biglist.extend([rocklist, countrylist, blueslist, jazzlist, reggaelist, hiphoplist])


label = ""
current = 0
for item in biglist:
  counter0,counter1,counter2,counter3,counter4 = 0,0,0,0,0 
  #Count how many of each genre were found to be very negative/somewhat negative/etc
  for i in item:
    if i==0:
      counter0 += 1
    if i==1:
      counter1 += 1
    if i==2:
      counter2 += 1
    if i==3:
      counter3 += 1
    if i==4:
      counter4 += 1
  #These are just used to make everything neat in the output
  if current == 0:
    label = "Rock"
  if current == 1:
    label = "Country"
  if current == 2:
    label = "Blues"
  if current == 3:
    label = "Jazz"
  if current == 4:
    label = "Reggae"
  if current == 5:
    label = "HipHop"

  print(label, "is ", counter0/len(item), "% Very Negative")
  print(label, "is ", counter1/len(item), "% Somewhat Negative")
  print(label, "is ", counter2/len(item), "% Neutral")
  print(label, "is ", counter3/len(item), "% Somewhat Positive")
  print(label, "is ", counter4/len(item), "% Very Positive")
  current +=1

Rock is  0.18205897933957266 % Very Negative
Rock is  0.2661133674730708 % Somewhat Negative
Rock is  0.2825357584319265 % Neutral
Rock is  0.20801695214550592 % Somewhat Positive
Rock is  0.06127494260992407 % Very Positive
Country is  0.27131313131313134 % Very Negative
Country is  0.3515151515151515 % Somewhat Negative
Country is  0.20505050505050504 % Neutral
Country is  0.11494949494949495 % Somewhat Positive
Country is  0.05717171717171717 % Very Positive
Blues is  0.23597577034500922 % Very Negative
Blues is  0.32525678166973926 % Somewhat Negative
Blues is  0.24440347642875954 % Neutral
Blues is  0.13905715038188043 % Somewhat Positive
Blues is  0.05530682117461153 % Very Positive
Jazz is  0.18152828628794143 % Very Negative
Jazz is  0.31404275422227473 % Somewhat Negative
Jazz is  0.2556985945435219 % Neutral
Jazz is  0.1750324790362584 % Somewhat Positive
Jazz is  0.07369788591000355 % Very Positive
Reggae is  0.14629805410536306 % Very Negative
Reggae is  0.29461319411485526 % S
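
The per-genre tallies in this step amount to a cross-tabulation of predicted genre against predicted valence. The sketch below builds the same table with nested counters on hypothetical predictions. `valence_shares_by_genre` and the toy lists are illustrative, while the genre codes 1-6 and valence codes 0-4 follow the encodings used earlier.

```python
from collections import Counter, defaultdict

GENRES = {1: "Rock", 2: "Country", 3: "Blues", 4: "Jazz", 5: "Reggae", 6: "HipHop"}
VALENCES = ["Very Negative", "Somewhat Negative", "Neutral",
            "Somewhat Positive", "Very Positive"]

def valence_shares_by_genre(genre_preds, valence_preds):
    """Return {genre name: {valence name: share}} for the genres that occur."""
    counts = defaultdict(Counter)
    for g, v in zip(genre_preds, valence_preds):
        counts[g][v] += 1
    shares = {}
    for g, counter in counts.items():
        total = sum(counter.values())  # always > 0 for genres that appear
        shares[GENRES[g]] = {VALENCES[v]: n / total for v, n in counter.items()}
    return shares

# Hypothetical predictions, not model output
toy_genres = [1, 1, 1, 1, 2, 2]
toy_valences = [0, 0, 2, 4, 1, 1]
for genre, row in valence_shares_by_genre(toy_genres, toy_valences).items():
    for name, share in row.items():
        print(f"{genre}: {share:.1%} {name}")
```

Genres that are never predicted simply do not appear in the result, which avoids dividing by an empty list, and the `.1%` format prints the shares as true percentages.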