# Introduction 

First we'll import our libraries and data for this blog post.

In [1]:
import torch
import numpy as np
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)


I accessed the data on Kaggle [here](https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019). The data was originally collected from Spotify by researchers, who described it in the following data publication:

> Moura, Luan; Fontelles, Emanuel; Sampaio, Vinicius; França, Mardônio (2020), “Music Dataset: Lyrics and Metadata from 1950 to 2019”, Mendeley Data, V3, doi: 10.17632/3t9vbwxgr5.3

Here’s an excerpt of the data:

In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


We're going to use Torch to predict the *genre* of the track based on the track's lyrics and engineered features. The lyrics are contained in the `lyrics` column.

It will also be useful to have a list of the engineered features:

In [3]:
engineered_features = [
    'dating', 'violence', 'world/life', 'night/time', 'shake the audience',
    'family/gospel', 'romantic', 'communication', 'obscene', 'music',
    'movement/places', 'light/visual perceptions', 'family/spiritual',
    'like/girls', 'sadness', 'feelings', 'danceability', 'loudness',
    'acousticness', 'instrumentalness', 'valence', 'energy',
]

The features were engineered by teams at Spotify to describe attributes of the tracks.

Let's see what our base classification rate is:

In [4]:
total = len(df)
df.groupby(["genre"]).size() / total

genre
blues      0.162273
country    0.191915
hip hop    0.031862
jazz       0.135521
pop        0.248202
reggae     0.088045
rock       0.142182
dtype: float64

The most common genre is pop, at about 25% of tracks, so a classifier that always guesses pop would be right roughly a quarter of the time. Let's construct some models to try to do better!
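
As a reference point, the majority-class baseline can be computed directly from a list of labels. The sketch below is only an illustration: `majority_baseline` and `toy_labels` are names made up for this example, and the labels are a tiny invented sample rather than the real dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# hypothetical toy labels, for illustration only (not the real data)
toy_labels = ["pop", "pop", "rock", "blues", "pop", "country", "rock", "jazz"]
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

Any model worth keeping should score above this number on held-out data.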

# Constructing Neural Networks

We'll construct three different neural networks with Torch and train them:

1. Using **only** the *lyrics* to classify genre.
2. Using **only** the *engineered features* from Spotify to classify genre.
3. Using both lyrics and engineered features!

We'll also visualize the word embedding learned by the model.

## First Model: Only Lyrics

To use text to predict the genre, we'll use **word embeddings**.

In [5]:
# for embedding visualization later:
import plotly.express as px
import plotly.io as pio

# for VSCode plotly rendering
pio.renderers.default = "notebook"

# for appearance
pio.templates.default = "plotly_white"

# for train-test split
from sklearn.model_selection import train_test_split

We're now going to encode the genres as integers:

In [6]:
genres = {
    "blues"     : 0,
    "country"   : 1,
    "hip hop"   : 2,
    "jazz"      : 3,
    "pop"       : 4,
    "reggae"    : 5,
    "rock"      : 6
}

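# keep only the label and text columns, restricted to the genres encoded above;
# .copy() gives an independent dataframe so the label column can be overwritten safely later
df_lyrics = df[["genre", "lyrics"]]
df_lyrics = df_lyrics[df_lyrics["genre"].apply(lambda x: x in genres.keys())].copy()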
df_lyrics.head()

Unnamed: 0,genre,lyrics
0,pop,hold time feel break feel untrue convince spea...
1,pop,believe drop rain fall grow believe darkest ni...
2,pop,sweetheart send letter goodbye secret feel bet...
3,pop,kiss lips want stroll charm mambo chacha merin...
4,pop,till darling till matter know till dream live ...


In [7]:
df_lyrics["genre"] = df_lyrics["genre"].apply(genres.get)
df_lyrics.head()

Unnamed: 0,genre,lyrics
0,4,hold time feel break feel untrue convince spea...
1,4,believe drop rain fall grow believe darkest ni...
2,4,sweetheart send letter goodbye secret feel bet...
3,4,kiss lips want stroll charm mambo chacha merin...
4,4,till darling till matter know till dream live ...
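
Since a classifier outputs integer class indices, it can be handy to map them back to genre names when inspecting predictions. A minimal sketch, assuming the same `genres` dictionary as above (`genre_names` is a name introduced here for illustration):

```python
genres = {
    "blues": 0, "country": 1, "hip hop": 2, "jazz": 3,
    "pop": 4, "reggae": 5, "rock": 6,
}

# invert the mapping: integer label -> genre name
genre_names = {index: name for name, index in genres.items()}

print(genre_names[4])
```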


We now need to wrap the Pandas dataframe as a Torch dataset.

In [8]:
from torch.utils.data import Dataset, DataLoader

# wrap a dataframe as a Torch Dataset (the DataLoader is built from it later)
class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.df = df

    def __getitem__(self, index):
        # return one row as (lyrics, label): column 1 holds the lyrics, column 0 the genre
        return self.df.iloc[index, 1], self.df.iloc[index, 0]
    
    def __len__(self):
        return len(self.df)

Now let's perform a train-validation split and make Datasets from each one.

In [9]:
df_train, df_val = train_test_split(df_lyrics, shuffle=True, test_size=0.2)
lyrics_train_data   = TextDataFromDF(df_train)
lyrics_val_data     = TextDataFromDF(df_val)

Let's take a look at one element of our train set:

In [10]:
lyrics_train_data[68]

('go girl dream school go girl fool ring curtain certain present future pass know speak tie tie break life wayto break future pass star blue distance encounter resistance help miss arm illusion look heart confusion love live future pass',
 3)

### Text Vectorization

Now we'll vectorize our text using a tokenizer to split sentences into individual words.

In [11]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

tokenized = tokenizer(lyrics_train_data[68][0])
tokenized

['go',
 'girl',
 'dream',
 'school',
 'go',
 'girl',
 'fool',
 'ring',
 'curtain',
 'certain',
 'present',
 'future',
 'pass',
 'know',
 'speak',
 'tie',
 'tie',
 'break',
 'life',
 'wayto',
 'break',
 'future',
 'pass',
 'star',
 'blue',
 'distance',
 'encounter',
 'resistance',
 'help',
 'miss',
 'arm',
 'illusion',
 'look',
 'heart',
 'confusion',
 'love',
 'live',
 'future',
 'pass']

Now we'll start constructing a vocabulary - a mapping from words to integers.

In [12]:
def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

# the corpus contains many distinct words, so keep only those that appear at least 50 times (min_freq)
vocab = build_vocab_from_iterator(yield_tokens(lyrics_train_data), specials=["<unk>"], min_freq = 50)
vocab.set_default_index(vocab["<unk>"])

Here are the first 10 elements from the vocabulary:

In [13]:
vocab.get_itos()[0:10]

['<unk>',
 'know',
 'like',
 'time',
 'come',
 'go',
 'away',
 'heart',
 'feel',
 'yeah']

We can apply it on a list of tokens:

In [14]:
vocab(tokenized)

[5,
 46,
 31,
 395,
 5,
 46,
 98,
 195,
 1480,
 790,
 1423,
 331,
 186,
 1,
 197,
 651,
 651,
 24,
 10,
 0,
 24,
 331,
 186,
 225,
 55,
 910,
 0,
 2869,
 138,
 110,
 129,
 1444,
 26,
 7,
 1184,
 62,
 15,
 331,
 186]
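
To make the mapping concrete, here is a rough plain-Python sketch of what a frequency-thresholded vocabulary does. It is not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, the toy corpus is invented, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq=2, specials=("<unk>",)):
    """Map tokens seen at least min_freq times to integers, specials first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = [tok for tok, c in counts.items() if c >= min_freq]
    # most frequent first; ties broken alphabetically (an assumption of this sketch)
    kept.sort(key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def lookup(tokens, stoi, default_index=0):
    """Encode tokens as integers; unknown tokens fall back to default_index."""
    return [stoi.get(tok, default_index) for tok in tokens]

# hypothetical toy corpus, for illustration only
toy_corpus = [
    ["love", "heart", "love"],
    ["heart", "break", "love"],
    ["night", "heart"],
]
itos, stoi = build_toy_vocab(toy_corpus, min_freq=2)
encoded = lookup(["love", "night", "heart"], stoi)
print(itos, encoded)
```

Rare words such as `night` fall below the threshold and are encoded as index 0, which mirrors how the zeros in the output above correspond to `<unk>`.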

### Batch Collation

Now we’re ready to construct the function that is going to actually pass a batch of data to our training loop. Here are the main steps:

1. We pull some feature data (i.e. a batch of lyrics).
2. We represent lyrics as a sequence of integers using the `vocab`.
3. We pad the lyrics with an unused integer index if necessary so that all lyrics have the same length. This index corresponds to “blank” or “no words in this slot.”
4. We return the batch of lyrics as a consolidated tensor.
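
The padding and truncation in step 3 can be sketched with plain lists. This is only an illustration: `pad_or_truncate` is a hypothetical helper, and the pad index of 5 assumes a toy vocabulary with indices 0 through 4, so that 5 is otherwise unused.

```python
def pad_or_truncate(token_ids, max_len, pad_index):
    """Clip a sequence to max_len, then fill any remaining slots with pad_index."""
    clipped = token_ids[:max_len]
    return clipped + [pad_index] * (max_len - len(clipped))

# a short sequence is padded on the right
short_seq = pad_or_truncate([3, 1, 4], max_len=5, pad_index=5)
# a long sequence is cut off after max_len entries
long_seq = pad_or_truncate([3, 1, 4, 1, 2, 0, 2], max_len=5, pad_index=5)
print(short_seq, long_seq)
```

Every sequence comes out with the same length, which is what allows a batch to be stacked into a single tensor.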

In [15]:
max_len = 500
num_tokens = len(vocab.get_itos())
def text_pipeline(x):
    tokens = vocab(tokenizer(x))
    y = torch.zeros(max_len, dtype=torch.int64) + num_tokens
    if len(tokens) > max_len:
        tokens = tokens[0:max_len]
    y[0:len(tokens)] = torch.tensor(tokens,dtype=torch.int64)
    return y

label_pipeline = lambda x: int(x)

In [16]:
text_pipeline("we can't believe")

tensor([   0,    0,    0,    0,   42, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875, 2875,
        2875, 2875, 2875, 2875, 2875, 28

In [17]:
def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, _label) in batch:

        # add label to list
        label_list.append(label_pipeline(_label))

        # add text (as sequence of integers) to list
        processed_text = text_pipeline(_text)
        text_list.append(processed_text)

    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.stack(text_list)
    return text_list, label_list

In [18]:
train_loader = DataLoader(lyrics_train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(lyrics_val_data, batch_size=8, shuffle=True, collate_fn=collate_batch)

Let's take a look at a batch of data now:

In [19]:
next(iter(train_loader))

(tensor([[  54,   30,    0,  ..., 2875, 2875, 2875],
         [ 604,  158, 2567,  ..., 2875, 2875, 2875],
         [  19,  539,   48,  ..., 2875, 2875, 2875],
         ...,
         [ 803, 2327,  734,  ..., 2875, 2875, 2875],
         [  23,  315,    7,  ..., 2875, 2875, 2875],
         [   0,   60,   36,  ..., 2875, 2875, 2875]]),
 tensor([4, 4, 0, 1, 1, 3, 0, 6]))

## Modeling

### Word Embedding
A word embedding refers to a representation of a word in a vector space. Each word is assigned an individual vector. The general aim of a word embedding is to create a representation such that words with related meanings are close to each other in a vector space, while words with different meanings are farther apart.
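
As a toy illustration of "close" versus "far apart", the sketch below compares hand-picked 3-dimensional vectors. The vectors in `toy_embedding` are invented for this example and are not learned embeddings, and cosine similarity is just one common way to measure closeness, assumed here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical hand-picked vectors, for illustration only
toy_embedding = {
    "love":  [0.9, 0.1, 0.0],
    "heart": [0.8, 0.2, 0.1],
    "gun":   [-0.1, 0.9, 0.3],
}

sim_related = cosine_similarity(toy_embedding["love"], toy_embedding["heart"])
sim_unrelated = cosine_similarity(toy_embedding["love"], toy_embedding["gun"])
print(sim_related, sim_unrelated)
```

In a trained model the vectors are learned from data rather than chosen by hand, but the same kind of comparison applies.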

Before building a model, let's define the training and evaluation loops:

In [20]:
import time

loss_fn = torch.nn.CrossEntropyLoss()

def train(dataloader, model):
    epoch_start_time = time.time()
    # keep track of some counts for measuring accuracy
    total_acc, total_count = 0, 0

    # note: the optimizer is created on every call, so Adam's running
    # statistics are reset at the start of each epoch
    optimizer = torch.optim.Adam(model.parameters(), lr=.1)

    for idx, (text, label) in enumerate(dataloader):
        # zero gradients
        optimizer.zero_grad()
        # form prediction on batch
        predicted_label = model(text)
        # evaluate loss on prediction
        loss = loss_fn(predicted_label, label)
        # compute gradient
        loss.backward()
        # take an optimization step
        optimizer.step()

        # for printing accuracy
        total_acc   += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        
    # `epoch` is not a parameter: it is read from the global loop variable in the training cell
    print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')
    
def evaluate(dataloader, model):

    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (text, label) in enumerate(dataloader):
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

### First Model

In [21]:
from torch import nn

class TextClassificationByLyrics(nn.Module):
    
    def __init__(self,vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
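        # one extra embedding row for the padding index (vocab_size)
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.fc   = nn.Linear(max_len*embedding_dim, num_class)
        
    def forward(self, x):
        x = self.embedding(x)      # (batch, max_len, embedding_dim)
        x = torch.flatten(x, 1)    # (batch, max_len * embedding_dim)
        x = self.fc(x)             # (batch, num_class)
        return x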

In [22]:
vocab_size = len(vocab)
embedding_dim = 3
lyrics_model = TextClassificationByLyrics(vocab_size, embedding_dim, max_len, 7)
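
For a sense of scale, the number of trainable parameters in this model can be worked out by hand. The sketch below assumes a vocabulary of 2875 tokens, which matches the padding index visible in the batch output above; `count_parameters` is a helper written for this example, not part of Torch.

```python
def count_parameters(vocab_size, embedding_dim, max_len, num_class):
    """Parameter counts for an embedding layer followed by one linear layer."""
    # one embedding vector per token, plus one row for the padding index
    embedding_params = (vocab_size + 1) * embedding_dim
    # linear layer on the flattened embeddings: weights plus one bias per class
    linear_params = max_len * embedding_dim * num_class + num_class
    return embedding_params, linear_params

emb_params, lin_params = count_parameters(
    vocab_size=2875, embedding_dim=3, max_len=500, num_class=7
)
print(emb_params, lin_params, emb_params + lin_params)
```

Under these assumptions the linear layer holds slightly more parameters than the embedding, because it has a separate weight for every position in the padded sequence.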

In [23]:
EPOCHS = 20
for epoch in range(1, EPOCHS + 1):
    train(train_loader, lyrics_model)

| epoch   1 | train accuracy    0.177 | time:  7.56s
| epoch   2 | train accuracy    0.183 | time:  7.51s
| epoch   3 | train accuracy    0.189 | time:  7.34s
| epoch   4 | train accuracy    0.185 | time:  7.34s
| epoch   5 | train accuracy    0.188 | time:  7.58s
| epoch   6 | train accuracy    0.192 | time:  7.38s
| epoch   7 | train accuracy    0.196 | time:  7.47s
| epoch   8 | train accuracy    0.197 | time:  7.51s
| epoch   9 | train accuracy    0.192 | time:  7.37s
| epoch  10 | train accuracy    0.196 | time:  7.39s
| epoch  11 | train accuracy    0.193 | time:  7.37s
| epoch  12 | train accuracy    0.199 | time:  7.37s
| epoch  13 | train accuracy    0.194 | time:  7.37s
| epoch  14 | train accuracy    0.197 | time:  7.36s
| epoch  15 | train accuracy    0.195 | time:  7.49s
| epoch  16 | train accuracy    0.197 | time:  7.35s
| epoch  17 | train accuracy    0.190 | time:  7.46s
| epoch  18 | train accuracy    0.195 | time:  7.48s
| epoch  19 | train accuracy    0.199 | time: 

In [24]:
evaluate(val_loader, lyrics_model)

0.186784140969163

At about 18.7%, the validation accuracy is below the roughly 25% base rate from always guessing pop, so this first model does not yet beat the baseline.

## Second Model: Only Engineered Features

## Third Model: Lyrics + Engineered Features