## Introduction 

First we'll import our libraries and data for this blog post.

In [25]:
import torch
import numpy as np
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)

I accessed the data on Kaggle [here](https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019). It was originally collected from Spotify by researchers who released it in the following data publication:

> Moura, Luan; Fontelles, Emanuel; Sampaio, Vinicius; França, Mardônio (2020), “Music Dataset: Lyrics and Metadata from 1950 to 2019”, Mendeley Data, V3, doi: 10.17632/3t9vbwxgr5.3

Here’s an excerpt of the data:

In [26]:
df.head()

|   | Unnamed: 0 | artist_name | track_name | release_date | genre | lyrics | len | dating | violence | world/life | ... | sadness | feelings | danceability | loudness | acousticness | instrumentalness | valence | energy | topic | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | mukesh | mohabbat bhi jhoothi | 1950 | pop | hold time feel break feel untrue convince spea... | 95 | 0.000598 | 0.063746 | 0.000598 | ... | 0.380299 | 0.117175 | 0.357739 | 0.454119 | 0.997992 | 0.901822 | 0.339448 | 0.13711 | sadness | 1.0 |
| 1 | 4 | frankie laine | i believe | 1950 | pop | believe drop rain fall grow believe darkest ni... | 51 | 0.035537 | 0.096777 | 0.443435 | ... | 0.001284 | 0.001284 | 0.331745 | 0.64754 | 0.954819 | 2e-06 | 0.325021 | 0.26324 | world/life | 1.0 |
| 2 | 6 | johnnie ray | cry | 1950 | pop | sweetheart send letter goodbye secret feel bet... | 24 | 0.00277 | 0.00277 | 0.00277 | ... | 0.00277 | 0.225422 | 0.456298 | 0.585288 | 0.840361 | 0.0 | 0.351814 | 0.139112 | music | 1.0 |
| 3 | 10 | pérez prado | patricia | 1950 | pop | kiss lips want stroll charm mambo chacha merin... | 54 | 0.048249 | 0.001548 | 0.001548 | ... | 0.225889 | 0.001548 | 0.686992 | 0.744404 | 0.083935 | 0.199393 | 0.77535 | 0.743736 | romantic | 1.0 |
| 4 | 12 | giorgos papadopoulos | apopse eida oneiro | 1950 | pop | till darling till matter know till dream live ... | 48 | 0.00135 | 0.00135 | 0.417772 | ... | 0.0688 | 0.00135 | 0.291671 | 0.646489 | 0.975904 | 0.000246 | 0.597073 | 0.394375 | romantic | 1.0 |


We're going to use Torch to predict the *genre* of the track based on the track's lyrics and engineered features. The lyrics are contained in the `lyrics` column.

It will also be useful to have a list of the engineered features:

In [27]:
engineered_features = [
    'dating', 'violence', 'world/life', 'night/time', 'shake the audience',
    'family/gospel', 'romantic', 'communication', 'obscene', 'music',
    'movement/places', 'light/visual perceptions', 'family/spiritual',
    'like/girls', 'sadness', 'feelings', 'danceability', 'loudness',
    'acousticness', 'instrumentalness', 'valence', 'energy',
]

The features were engineered by teams at Spotify to describe attributes of the tracks.

Let's see what our base classification rate is:

In [28]:
total = len(df)
df.groupby(["genre"]).size() / total

genre
blues      0.162273
country    0.191915
hip hop    0.031862
jazz       0.135521
pop        0.248202
reggae     0.088045
rock       0.142182
dtype: float64

Looks like the most common genre is pop at ~25%, so a model that always guesses pop would be right about 25% of the time. Let's construct some models to try and do better!
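To make the baseline explicit, here is a small sketch that picks the majority genre from the proportions printed above. The names `genre_share`, `majority_genre`, and `base_rate` are just illustrative; on the real data you could get the same numbers from `df["genre"].value_counts(normalize=True)`.

```python
import pandas as pd

# Genre proportions copied from the output above.
genre_share = pd.Series({
    "blues": 0.162273,
    "country": 0.191915,
    "hip hop": 0.031862,
    "jazz": 0.135521,
    "pop": 0.248202,
    "reggae": 0.088045,
    "rock": 0.142182,
})

majority_genre = genre_share.idxmax()  # the most common label
base_rate = genre_share.max()          # accuracy of always guessing that label

print(majority_genre, round(base_rate, 3))
```

Any model we train should beat this number on the validation set to be worth keeping.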

## Constructing Neural Networks

We'll construct three different neural networks with Torch and train them:

1. Using **only** the *lyrics* to classify genre.
2. Using **only** the *engineered features* from Spotify to classify genre.
3. Using both lyrics and engineered features!

Afterwards, we'll also visualize the word embedding learned by the model.

### First Model: Only Lyrics

To use text to predict the genre, we'll use **word embeddings**.

In [29]:
# for embedding visualization later:
import plotly.express as px
import plotly.io as pio

# for VSCode plotly rendering
pio.renderers.default = "notebook"

# for appearance
pio.templates.default = "plotly_white"

# for train-test split
from sklearn.model_selection import train_test_split
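Before moving on, here is a rough sketch of what a word-embedding classifier can look like in Torch. It is only an illustration of the idea, not necessarily the model trained in this post: the class name `LyricsClassifier`, the vocabulary size of 100, and the 3-dimensional embedding are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn

class LyricsClassifier(nn.Module):
    """Hypothetical bag-of-embeddings classifier (names and sizes are assumptions)."""

    def __init__(self, vocab_size, embedding_dim, num_classes):
        super().__init__()
        # one learnable vector per token id; id 0 is reserved for padding
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of integer word ids
        embedded = self.embedding(token_ids)  # (batch, seq_len, embedding_dim)
        pooled = embedded.mean(dim=1)         # average over all positions in each track
        return self.fc(pooled)                # (batch, num_classes) logits

model = LyricsClassifier(vocab_size=100, embedding_dim=3, num_classes=7)
fake_batch = torch.randint(1, 100, (4, 10))   # 4 made-up tracks of 10 word ids each
logits = model(fake_batch)
print(logits.shape)
```

Each word gets its own learned vector, and averaging those vectors gives one fixed-size summary per track that a linear layer can map to the seven genres.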

We're now going to encode the genres as integers:

In [30]:
genres = {
    "blues"     : 0,
    "country"   : 1,
    "hip hop"   : 2,
    "jazz"      : 3,
    "pop"       : 4,
    "reggae"    : 5,
    "rock"      : 6
}

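# copy so that later column assignments modify this frame, not a slice of df
df_lyrics = df[["genre", "lyrics"]].copy()
# keep only rows whose genre is one of the keys of the genres dict
df_lyrics = df_lyrics[df_lyrics["genre"].isin(list(genres))].copy()
df_lyrics.head()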

|   | genre | lyrics |
|---|---|---|
| 0 | pop | hold time feel break feel untrue convince spea... |
| 1 | pop | believe drop rain fall grow believe darkest ni... |
| 2 | pop | sweetheart send letter goodbye secret feel bet... |
| 3 | pop | kiss lips want stroll charm mambo chacha merin... |
| 4 | pop | till darling till matter know till dream live ... |


In [31]:
df_lyrics["genre"] = df_lyrics["genre"].apply(genres.get)
df_lyrics.head()

|   | genre | lyrics |
|---|---|---|
| 0 | 4 | hold time feel break feel untrue convince spea... |
| 1 | 4 | believe drop rain fall grow believe darkest ni... |
| 2 | 4 | sweetheart send letter goodbye secret feel bet... |
| 3 | 4 | kiss lips want stroll charm mambo chacha merin... |
| 4 | 4 | till darling till matter know till dream live ... |


We now need to wrap the Pandas dataframe as a Torch dataset.

In [33]:
from torch.utils.data import Dataset, DataLoader

# create our custom Dataset class (a DataLoader can then batch it)
class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.df = df

    def __getitem__(self, index):
        # returns an item (row) of the dataset as the words then the label
        return self.df.iloc[index, 1], self.df.iloc[index, 0]
    
    def __len__(self):
        return len(self.df)
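To see how this wrapper behaves once it is handed to a `DataLoader`, here is a small sketch on made-up rows. The class is repeated so the snippet runs on its own; `toy_df` and `toy_loader` are hypothetical names, and the lyrics strings are invented.

```python
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class TextDataFromDF(Dataset):
    # same wrapper as above, repeated so this sketch is self-contained
    def __init__(self, df):
        self.df = df

    def __getitem__(self, index):
        # column 1 holds the lyrics, column 0 holds the integer genre label
        return self.df.iloc[index, 1], self.df.iloc[index, 0]

    def __len__(self):
        return len(self.df)

# made-up rows in the same (genre, lyrics) column order as df_lyrics
toy_df = pd.DataFrame({
    "genre": [4, 0, 6, 1],
    "lyrics": ["hold time feel", "morning blues train",
               "guitar loud night", "truck road home"],
})

toy_loader = DataLoader(TextDataFromDF(toy_df), batch_size=2, shuffle=False)
lyrics_batch, genre_batch = next(iter(toy_loader))
print(lyrics_batch)  # the raw strings for the first two rows
print(genre_batch)   # their labels, collected into a tensor
```

Note that the default batching leaves the lyrics as plain strings, so they still need to be converted to integers before a model can use them.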

Now let's perform a train-validation split and make Datasets from each one.

In [34]:
df_lyrics_train, df_lyrics_val = train_test_split(df_lyrics, shuffle=True, test_size=0.2)
lyrics_train_data   = TextDataFromDF(df_lyrics_train)
lyrics_val_data     = TextDataFromDF(df_lyrics_val)

Let's take a look at one element of our train set:

In [45]:
lyrics_train_data[68]

('suffer disease morbid symptoms aren identifiable physicians disagree fight impossible vainly cure pain endure dead leave science hand research lead salvation state suspend animation aneasthesia come pure nitrogen degrees zero icebound human disabuse unknow disease maybe future machine stop freeze blood longer liquid palpitations heart stone cold harden intestines start age cure pain endure add dead leave science hand forever freeze destination state suspend animation',
 3)
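Each element is a `(lyrics, label)` pair, and the lyrics are still a raw string. One possible way to turn such strings into fixed-length integer tensors for an embedding layer is sketched below. It assumes simple whitespace tokenization, and the helpers `build_vocab` and `encode`, the `<pad>`/`<unk>` tokens, and `max_len` are hypothetical names introduced just for this illustration.

```python
from collections import Counter
import torch

def build_vocab(texts, min_freq=1):
    """Map each sufficiently frequent whitespace token to an integer id.
    Ids 0 and 1 are reserved for padding and unknown words."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to a tensor of exactly max_len ids, truncating or padding."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()][:max_len]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids, dtype=torch.long)

sample_texts = ["suffer disease morbid symptoms", "cure pain endure"]
vocab = build_vocab(sample_texts)
encoded = encode("cure pain forever", vocab, max_len=5)
print(encoded)  # tensor([6, 7, 1, 0, 0])
```

Here "forever" is not in the toy vocabulary, so it maps to the unknown id 1, and the two trailing zeros are padding.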

### Second Model: Only Engineered Features

### Third Model: Lyrics + Engineered Features