# Musical Emotion Classification using Lyrical Analysis
---
Alejandro Castaneda [cas29@pdx.edu]
Jacob Klusnick [klusnick@pdx.edu]

# **Data Collection**

Due to copyright issues, MoodyLyrics [2] can only provide the song title, artist, and the valence-arousal feedback. **No lyrics are available.** To work around this, we retrieve the lyrics ourselves from the web before performing any analysis.

Thanks to [John W. Miller](https://github.com/johnwmillr), we can use the LyricsGenius library to query the Genius API.

The cell below installs the LyricsGenius library.

In [1]:
!pip install git+https://github.com/johnwmillr/LyricsGenius.git

Collecting git+https://github.com/johnwmillr/LyricsGenius.git
  Cloning https://github.com/johnwmillr/LyricsGenius.git to /tmp/pip-req-build-t3prw_vx
  Running command git clone -q https://github.com/johnwmillr/LyricsGenius.git /tmp/pip-req-build-t3prw_vx
Building wheels for collected packages: lyricsgenius
  Building wheel for lyricsgenius (setup.py) ... done
  Created wheel for lyricsgenius: filename=lyricsgenius-3.0.0-cp37-none-any.whl size=44717 sha256=4d396ad8eb18aa9d0dc056b9e239fb7809b5810f5eb2ad25fc5cfd81e4b1bd59
  Stored in directory: /tmp/pip-ephem-wheel-cache-c8j0guop/wheels/4c/c2/c2/711389881353cc8ef2f0055a712da5db9637132bc10151212e
Successfully built lyricsgenius


In [2]:
# Imports for translating from Genius to a JSON 
from lyricsgenius import Genius
from google.colab import files

import csv
import json

# Imports for the Model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

import numpy as np
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

The code below requires a **client access token** for the Genius API. Replace the placeholder on the first line with your own token.

It wraps the LyricsGenius library in a small interface for our data collection: songs whose lyrics are retrieved go into a data file, and songs that cannot be found go into an error file.

In [3]:
client_access_token = "insert your token here."

def get_lyrics(artist_name, song_title):
  '''
  get_lyrics will take an artist (string) and a song title (string), 
  and return the lyrics for that song (string) (if it is found). 
  '''  
  genius = Genius(client_access_token)
  genius.remove_section_headers = True
  genius.verbose = False
  artist = genius.search_artist(artist_name, max_songs=1, sort="title", include_features=True)
  if artist is None:
    return None

  song = artist.song(song_title)
  if song is None:
    return None

  return song.lyrics

In [4]:
def gatherData(moodyLyricsFile, dataFileName, errorFileName):
  songs = [] 
  invalid = []

  with open(moodyLyricsFile) as csvf:
    reader = csv.reader(csvf, delimiter=',')
    headers = True
    i = 0
    for row in reader:
      if headers:
        headers = False
        continue
      
      artist = row[1]
      song = row[2]
      label = row[3]

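      # Retry a few times on transient errors (timeouts, rate limits), then
      # give up so that a persistent failure cannot loop forever.
      lyrics = None
      for attempt in range(3):  # the retry count is an arbitrary choice
        try:
          lyrics = get_lyrics(artist, song)
          break
        except Exception:
          print("error with " + song)

      if lyrics is None: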
        invalid.append({'artist': artist, 'song': song}) 
      else:
        songs.append({'artist': artist, 'song': song, 'lyrics': lyrics, 'label': label}) 

      i += 1
      print(f"{i} songs processed.")

  with open(dataFileName, 'w') as lyr:
    print(f"{len(songs)} songs were added to the Lyrics.")
    json.dump(songs, lyr)
  with open(errorFileName, 'w') as err:
    print(f"There were {len(invalid)} invalid songs.")
    json.dump(invalid, err)
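
The retry loop in `gatherData` can also be written as a small reusable helper with a bounded number of attempts and a growing delay between them. The following is only a sketch: `call_with_retries`, its parameters, and the `flaky_lookup` stand-in are hypothetical names introduced for illustration and are not part of LyricsGenius.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on Exception up to max_attempts times.

    Waits base_delay * 2**attempt seconds between tries and re-raises
    the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Demo with a stand-in for get_lyrics that fails twice, then succeeds.
calls = {"n": 0}

def flaky_lookup(artist, song):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return f"lyrics for {song}"

result = call_with_retries(flaky_lookup, "artist", "song", base_delay=0.0)
print(result)
```

Inside `gatherData`, the same helper could wrap the real lookup as `call_with_retries(get_lyrics, artist, song)`, with a surrounding `try` that records the song as invalid when all attempts fail.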

## Critical Information:
If you do not already have the lyrics saved as a JSON file (`lyrics.txt`), uncomment lines 4 & 5 of the cell below and run it. Collecting the data takes a couple of hours, so keep the notebook active during that time.

In [5]:
moodyLyricsFile = 'ml_raw.csv'
outputJson = 'lyrics.txt'
# --- Uncomment below ---
#gatherData(moodyLyricsFile, outputJson, 'errors.txt')
#files.download(outputJson)

# The Model

The code below defines a few helper functions that preprocess the data and one-hot encode the labels (hence the MultiLabelBinarizer). The later cells use them to train and test a neural network on the data.

In [6]:
def classify_song(label):
  # Returns the index of a label in the fixed reference ordering below.
  # Note that MultiLabelBinarizer does not use this ordering: it sorts
  # its classes alphabetically.
  labels = ['relaxed', 'angry', 'sad', 'happy']
  return labels.index(label)

def tag_pos(text):
  # Tags each token with NLTK's POS tagger, then joins the tokens
  # back into a single string of 'word_TAG' items.
  token = nltk.word_tokenize(text)
  return ' '.join([w+'_'+t for w, t in nltk.pos_tag(token)])

def preprocess_data(data, mlb):
  # This function gathers all the data and splits it into two 
  # arrays, X (lyrics) and Y (labels).
  lyrics = []
  labels = []
  for song in data:
    # We don't want songs with more than 10k characters in.
    # Sorry musicals...
    if len(song['lyrics']) > 10000: 
      continue
    
    lyrics.append(song['lyrics'])  
    labels.append([song['label']])
  # One-Hot Encoding tutorial taken from [3]
  x = np.array(lyrics, dtype=object)
  y = mlb.fit_transform(labels)
  return x, y
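
To make the label encoding concrete, here is a plain-Python sketch of what the binarizer does to our four mood labels. It is an illustration rather than scikit-learn's implementation; the function names `fit_classes`, `to_indicator`, and `from_indicator` are our own. The detail worth noticing is that the columns follow the sorted class names, not the order in which labels appear in the data.

```python
def fit_classes(label_lists):
    """Collect the distinct labels and sort them, as MultiLabelBinarizer does."""
    return sorted({label for labels in label_lists for label in labels})

def to_indicator(label_lists, classes):
    """Turn each list of labels into a 0/1 row with one column per class."""
    column = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return rows

def from_indicator(rows, classes):
    """Inverse step: recover the label tuple for each 0/1 row."""
    return [tuple(c for c, bit in zip(classes, row) if bit) for row in rows]

labels = [['relaxed'], ['angry'], ['sad'], ['happy']]
classes = fit_classes(labels)
encoded = to_indicator(labels, classes)

print(classes)                                   # ['angry', 'happy', 'relaxed', 'sad']
print(encoded[0])                                # [0, 0, 1, 0]
print(from_indicator(encoded, classes))          # [('relaxed',), ('angry',), ('sad',), ('happy',)]
print(from_indicator([[0, 0, 0, 0]], classes))   # [()]
```

The last line shows that a row of all zeros decodes to an empty tuple, which is what a prediction with no active label looks like after the inverse transform.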

With all the helper functions out of the way, we can load up the data and run it through a POS Tagger, Count Vectorizer, MLP Classifier, and assess the Neural Network. 
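
Before running the pipeline, it helps to see what counting POS-tagged tokens looks like. The sketch below imitates the counting step with the standard library only, using the regular expression that is CountVectorizer's default `token_pattern`. The tagged string is a hand-written stand-in for the output of `tag_pos`, and its tags are illustrative rather than real tagger output.

```python
import re
from collections import Counter

# CountVectorizer's default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(tagged_text):
    """Count the tokens found in a 'word_TAG word_TAG ...' string."""
    return Counter(TOKEN_PATTERN.findall(tagged_text))

# Hand-tagged stand-in for tag_pos output (assumed tags, for illustration).
tagged = "I_PRP love_VBP my_PRP$ love_NN ._. Love_NN"
counts = count_tokens(tagged)

for token, n in sorted(counts.items()):
    print(token, n)
```

A few things follow from this. The underscore is a word character, so `love_VBP` and `love_NN` survive as separate features, which is the point of tagging. Punctuation is lost: `._.` yields no token and `my_PRP$` is counted as `my_PRP`. Finally, `Love_NN` and `love_NN` are counted separately here. As far as we understand the scikit-learn documentation, passing a custom `preprocessor` replaces the built-in lowercasing step, so the same would hold in the pipeline unless `tag_pos` lowercases the text itself.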

In [7]:
with open('lyrics.txt', 'r') as jf:
  data = json.load(jf)

# Create the Multi Label Binarizer (and keep for inverse data later)
# Load the data into x and y arrays
mlb = MultiLabelBinarizer()
x, y = preprocess_data(data, mlb)

# Split and Shuffle the arrays
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Create a pipeline that POS-tags then vectorizes all the words
# then run it through a MLP Classifier
pipeline = Pipeline([
  ('vect', CountVectorizer(preprocessor=tag_pos)),
  ('nn', MLPClassifier())                   
])

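# And perform 10-fold cross-validation on it. The parameter grid is empty,
# so GridSearchCV evaluates only the pipeline's default settings.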
gscv = GridSearchCV(pipeline, { }, cv=10)

# Fit to the training data
gscv.fit(x_train, y_train)

# Test the results.
y_pred = gscv.predict(x_test)
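
For readers unfamiliar with `cv=10`, the sketch below shows how a training set can be cut into ten folds, each of which is held out once while the model is fitted on the other nine. It mirrors a plain, unshuffled K-fold split. Which splitter scikit-learn actually selects depends on the estimator and the label format, so treat `k_fold_indices` (a name of our own) as an illustration only.

```python
def k_fold_indices(n_samples, n_splits=10):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(23, n_splits=10))
print([len(val) for _, val in folds])  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

Every sample lands in exactly one validation fold, so each song in the training set is scored once by a model that did not see it during fitting.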

In [8]:
# MultiLabelBinarizer orders its classes alphabetically, so read the
# row names from mlb.classes_ instead of hard-coding them.
labels = list(mlb.classes_)

print(classification_report(y_test, y_pred, target_names=labels))

print(multilabel_confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

       angry       0.91      0.62      0.74       134
       happy       0.96      0.80      0.87       189
     relaxed       0.94      0.68      0.79       158
         sad       0.93      0.64      0.76       130

   micro avg       0.94      0.70      0.80       611
   macro avg       0.93      0.69      0.79       611
weighted avg       0.94      0.70      0.80       611
 samples avg       0.69      0.70      0.69       611

[[[469   8]
  [ 51  83]]

 [[415   7]
  [ 38 151]]

 [[446   7]
  [ 50 108]]

 [[475   6]
  [ 47  83]]]
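
As a sanity check on the report above, the per-class scores can be recomputed by hand from the confusion matrices. scikit-learn's `multilabel_confusion_matrix` lays each class out as `[[TN, FP], [FN, TP]]`. The helper name `prf` below is ours, for illustration only.

```python
def prf(matrix):
    """Return (precision, recall, f1) from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The four matrices printed above, in the same order.
matrices = [
    [[469, 8], [51, 83]],
    [[415, 7], [38, 151]],
    [[446, 7], [50, 108]],
    [[475, 6], [47, 83]],
]

for m in matrices:
    p, r, f = prf(m)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.91 recall=0.62 f1=0.74
# precision=0.96 recall=0.80 f1=0.87
# precision=0.94 recall=0.68 f1=0.79
# precision=0.93 recall=0.64 f1=0.76

# Micro average: pool the counts across all classes before dividing.
tp = sum(m[1][1] for m in matrices)
fp = sum(m[0][1] for m in matrices)
fn = sum(m[1][0] for m in matrices)
micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
print(f"micro precision={micro_p:.2f} recall={micro_r:.2f}")
# micro precision=0.94 recall=0.70
```

In every class the false negatives far outnumber the false positives, which is why recall trails precision. That pattern would be consistent with the network leaving some songs without any predicted label, although we have not verified this directly.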




The cell below fetches the lyrics for two example songs, runs them through the trained model, and applies the binarizer's inverse transform to turn each prediction back into a mood label.

In [11]:
result = gscv.predict([get_lyrics('The Beatles', 'Happiness is a Warm Gun' )])
print(mlb.inverse_transform(result))

result = gscv.predict([get_lyrics('Ariana Grande', 'pete davidson' )])
print(mlb.inverse_transform(result))

[('angry',)]
[('happy',)]


# References

[1] Agrawal Y., Shanker R. G. R., Alluri V. (2021). Transformer-based approach towards music emotion recognition from lyrics. Institute of Information Technology, Hyderabad, India. 

[2] Çano, E., & Morisio, M. (2017). MoodyLyrics: A Sentiment Annotated Lyrics Dataset. Proceedings of the 2017 International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence - ISMSI '17. doi:10.1145/3059336.3059340



[3] Kite. How to do One Hot Encoding with Numpy in Python. Retrieved from: https://www.kite.com/python/answers/how-to-do-one-hot-encoding-with-numpy-in-python 