# Genre Classification - Lyrical Sentiment Analysis<a id='SA'></a>

## Contents<a id='Contents'></a>
* [Imports](#Imports)
* [Load Data](#LoadData)
* [Data Cleaning](#DataCleaning)
* [Sentiment Analysis](#SentimentAnalysis)
* [Save Data](#SaveData)

### Imports<a id='Imports'></a>

In [1]:
import os
import re
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import torch as pt
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

### Load Data<a id='LoadData'></a>

In [2]:
df = pd.read_csv('data/tracks.csv')
df.head() 

Unnamed: 0,track_id,track,artist,album,release_date,genre,subgenre,duration_ms,popularity,danceability,...,acousticness,instrumentalness,liveness,valence,tempo,mode,release_year,duration_min,duration_minsec,lyrics_raw
0,0prNGof3XqfTvNDxHonvdK,Scars To Your Beautiful,Alessia Cara,Know-It-All (Deluxe),2015-01-01,pop,pop,230226,73,0.573,...,0.0285,0.0,0.111,0.451,97.085,1,2015,3.8371,"3 m, 50 s",[Verse 1]\nShe just wants to be beautiful\nShe...
1,1rfofaqEpACxVEHIZBJe6W,Havana (feat. Young Thug),Camila Cabello,Camila,2018-01-01,pop,pop,217306,80,0.765,...,0.184,3.6e-05,0.132,0.394,104.988,1,2018,3.621767,"3 m, 37 s",[Intro: Pharrell Williams]\nHey\n\n[Chorus: Ca...
2,4l0Mvzj72xxOpRrp6h8nHi,Lose You To Love Me,Selena Gomez,Rare,2020-01-01,pop,pop,206458,83,0.488,...,0.556,0.0,0.21,0.0978,102.819,1,2020,3.440967,"3 m, 26 s",[Verse 1]\nYou promised the world and I fell f...
3,6T6D9CIrHkALcHPafDFA6L,Vibez,ZAYN,Nobody Is Listening,2021-01-01,pop,pop,163346,73,0.643,...,0.241,0.0178,0.12,0.297,96.924,1,2021,2.722433,"2 m, 43 s","[Chorus]\nDon't keep me waitin' (Ooh, ooh)\nI ..."
4,15og0pCEcTFWEXOFKdcJlU,Hate Me,Ellie Goulding,Brightest Blue,2020-01-01,pop,pop,188066,68,0.64,...,0.0875,0.0,0.147,0.762,75.018,1,2020,3.134433,"3 m, 8 s","[Chorus: Ellie Goulding]\nHate me, hate me, st..."


### Data Cleaning<a id='DataCleaning'></a>

Helper functions to preprocess and clean the raw lyrics.

In [3]:
def remove_tags(lyrics):
    '''
    Returns the lyrics of a song with bracketed structure tags such as [Chorus]
    and parenthesized asides such as (Ooh, ooh) removed.
    '''
    if lyrics is None:
        return None
    return re.sub(r'[\(\[].*?[\)\]]', '', str(lyrics))

def split_lines(lyrics):
    '''
    Returns the non-empty lines of a song's lyrics as a list, with tags removed.
    '''
    lyrics = remove_tags(lyrics)
    if lyrics is None:
        return None
    return [line for line in lyrics.splitlines() if len(line) > 0]
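
Removing a parenthesized aside can leave stray whitespace behind, so a cleaned line such as `Don't keep me waitin' ` ends with a space, and a line that held only an indented aside survives as whitespace. A minimal sketch of a stricter variant, assuming surrounding whitespace carries no meaning for the model; `split_lines_stripped` is a hypothetical name and not part of this notebook:

```python
import re

def split_lines_stripped(lyrics):
    """Hypothetical variant of split_lines that also trims whitespace."""
    if lyrics is None:
        return None
    # Same pattern as remove_tags: drop [..] tags and (..) asides.
    cleaned = re.sub(r'[\(\[].*?[\)\]]', '', str(lyrics))
    # Strip each line and keep only the lines that still contain text.
    return [line.strip() for line in cleaned.splitlines() if line.strip()]

# Made-up sample in the same shape as the lyrics_raw column.
sample = "[Chorus]\nDon't keep me waitin' (Ooh, ooh)\n\n(Yeah)\nI been waitin' all night"
print(split_lines_stripped(sample))
```

Whether to trim is a judgment call: the tokenizer is likely to ignore surrounding whitespace anyway, so the main benefit is a cleaner `lyrics_lines` column.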

Splitting each song's lyrics into a list of lines

In [4]:
df['lyrics_lines'] = df['lyrics_raw'].apply(split_lines)

Creating new columns to store the results of the sentiment analysis for each emotion

In [5]:
emotions = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
df = pd.concat([df, pd.DataFrame(0, df.index, emotions)], axis=1)
df.head()

Unnamed: 0,track_id,track,artist,album,release_date,genre,subgenre,duration_ms,popularity,danceability,...,duration_min,duration_minsec,lyrics_raw,lyrics_lines,sadness,joy,love,anger,fear,surprise
0,0prNGof3XqfTvNDxHonvdK,Scars To Your Beautiful,Alessia Cara,Know-It-All (Deluxe),2015-01-01,pop,pop,230226,73,0.573,...,3.8371,"3 m, 50 s",[Verse 1]\nShe just wants to be beautiful\nShe...,"[She just wants to be beautiful, She goes unno...",0,0,0,0,0,0
1,1rfofaqEpACxVEHIZBJe6W,Havana (feat. Young Thug),Camila Cabello,Camila,2018-01-01,pop,pop,217306,80,0.765,...,3.621767,"3 m, 37 s",[Intro: Pharrell Williams]\nHey\n\n[Chorus: Ca...,"[Hey, Havana, ooh na-na , Half of my heart is ...",0,0,0,0,0,0
2,4l0Mvzj72xxOpRrp6h8nHi,Lose You To Love Me,Selena Gomez,Rare,2020-01-01,pop,pop,206458,83,0.488,...,3.440967,"3 m, 26 s",[Verse 1]\nYou promised the world and I fell f...,"[You promised the world and I fell for it, I p...",0,0,0,0,0,0
3,6T6D9CIrHkALcHPafDFA6L,Vibez,ZAYN,Nobody Is Listening,2021-01-01,pop,pop,163346,73,0.643,...,2.722433,"2 m, 43 s","[Chorus]\nDon't keep me waitin' (Ooh, ooh)\nI ...","[Don't keep me waitin' , I been waitin' all ni...",0,0,0,0,0,0
4,15og0pCEcTFWEXOFKdcJlU,Hate Me,Ellie Goulding,Brightest Blue,2020-01-01,pop,pop,188066,68,0.64,...,3.134433,"3 m, 8 s","[Chorus: Ellie Goulding]\nHate me, hate me, st...","[Hate me, hate me, still tryna replace me, Cha...",0,0,0,0,0,0


### Sentiment Analysis<a id='SentimentAnalysis'></a>

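Scoring each line of a song with a pretrained DistilBERT emotion classifier and averaging the per-line probabilities to get one score per emotion for each song. Instrumental tracks and songs with more than 150 lines keep a score of zero.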

In [6]:
checkpoint = 'bhadresh-savani/distilbert-base-uncased-emotion'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# The emotion columns start as integers; cast them so they can hold probabilities.
df[emotions] = df[emotions].astype('float64')
labels = list(model.config.id2label.values())

for i in range(len(df)):
    lyrics = df.loc[i, 'lyrics_lines']
    # Default to zero for every emotion, indexed by label so that the
    # assignment below fills the existing emotion columns.
    sentiment_avg = pd.Series(0.0, index=emotions)
    # Skip missing lyrics, instrumental tracks, and songs with more than 150 lines.
    if lyrics and lyrics != ['Instrumental'] and len(lyrics) <= 150:
        try:
            # Only input_ids are passed on, so the model gets no attention mask
            # and treats padding like any other token.
            tokens = tokenizer(lyrics, padding=True, truncation=True, return_tensors="pt")['input_ids']
            with pt.no_grad():
                outputs = model(tokens)
            # One probability distribution per line, averaged over the song.
            predictions = pt.nn.functional.softmax(outputs.logits, dim=-1)
            sentiment_avg = pd.DataFrame(predictions.numpy(), columns=labels).mean(axis=0)
        except Exception:
            # Keep the zero scores if the model cannot process this song.
            pass

    df.loc[i, sentiment_avg.index] = sentiment_avg

df.head()

Unnamed: 0,track_id,track,artist,album,release_date,genre,subgenre,duration_ms,popularity,danceability,...,love,anger,fear,surprise
0,0prNGof3XqfTvNDxHonvdK,Scars To Your Beautiful,Alessia Cara,Know-It-All (Deluxe),2015-01-01,pop,pop,230226,73,0.573,...,0.012936,0.118524,0.071253,0.003554
1,1rfofaqEpACxVEHIZBJe6W,Havana (feat. Young Thug),Camila Cabello,Camila,2018-01-01,pop,pop,217306,80,0.765,...,0.028685,0.123605,0.049248,0.005481
2,4l0Mvzj72xxOpRrp6h8nHi,Lose You To Love Me,Selena Gomez,Rare,2020-01-01,pop,pop,206458,83,0.488,...,0.049146,0.266661,0.029525,0.00202
3,6T6D9CIrHkALcHPafDFA6L,Vibez,ZAYN,Nobody Is Listening,2021-01-01,pop,pop,163346,73,0.643,...,0.009742,0.13813,0.071696,0.00392
4,15og0pCEcTFWEXOFKdcJlU,Hate Me,Ellie Goulding,Brightest Blue,2020-01-01,pop,pop,188066,68,0.64,...,0.042149,0.627069,0.009353,0.001169
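
To make the aggregation concrete: each lyric line yields one vector of logits, softmax turns it into probabilities over the six emotions, and the song's score is the mean of those probabilities across lines. A minimal pure-Python sketch with made-up logits; the numbers and the helper names `softmax` and `song_scores` are illustrative assumptions, not output from the model used here:

```python
import math

EMOTIONS = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

def softmax(logits):
    """Convert one line's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def song_scores(line_logits):
    """Average the per-line probabilities into one score per emotion."""
    probs = [softmax(logits) for logits in line_logits]
    return {
        emotion: sum(p[k] for p in probs) / len(probs)
        for k, emotion in enumerate(EMOTIONS)
    }

# Made-up logits for a two-line song (not real model output).
line_logits = [
    [2.0, 0.5, 0.1, 0.3, 0.2, -1.0],
    [0.1, 0.2, 0.0, 3.0, 0.4, -0.5],
]
scores = song_scores(line_logits)
print(scores)
```

Because every line's probabilities sum to 1, the six song-level scores of a scored song also sum to 1. Averaging probabilities weights every line equally, so a short ad-lib counts as much as a full verse line.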


### Save Data<a id='SaveData'></a>

In [24]:
df.to_csv('data/tracks.csv', index=False)