# ISO-Based Deep Learning Using LSTMs

This notebook establishes the model architecture used to learn the mapping from desired mood states to playlists. The notebook is split into four sections: *data loading*, *dataset*, *model architecture*, and *training and evaluation*. Each is summarized briefly below; a detailed description of its purpose can be found under the corresponding section header.

The **data loading** section loads all of the variables from preprocessing, including the tokenization of the training set, as well as the tokenizer used to perform the tokenizations. 

The **dataset** section creates an `ISODataset` class, which converts the DataFrame loaded in the data loading section into a format that torch can index and batch easily.

The **model architecture** section creates the actual model that is used for training. The specification of the model is given under its section heading. Note that PyTorch Lightning (`pytorch_lightning`) is used throughout the notebook, and in particular for designing the model architecture, so the code in the training and evaluation section is minimal. This section also encapsulates the loss function and the optimizer used for training; a learning rate scheduler is still a TODO.

The **training and evaluation** section contains code which kickstarts the training of the model. 

In [1]:
import json
import torch
import random
import pickle
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

## Data Loading

Data from preprocessing is loaded into this notebook, including the tokenizations, the cleaned audio features, and the tokenizer itself. For ease of operation, we remove up front all rows of the data frame that correspond to empty playlists, as handling these is still a work in progress.

In [2]:
df = pd.read_csv('train.csv', index_col=0)
df = df[df['features'] != '[null]']

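Reload the tokenizer. The `Tokenizer` class is redefined here because the saved object can only be unpickled when its class definition is in scope.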

In [3]:
class Tokenizer:
    def __init__(self):
        self.stoi = {}
        self.itos = {}
    
    def __len__(self):
        return len(self.stoi)
    
    def fit_on_moods(self, moods):
        flat = []
        
        Tokenizer.flatten(moods, flat)
        vocab = sorted(set(flat))
        vocab.append('<sos>')
        vocab.append('<eos>')
        vocab.append('<pad>')
        for index, word in enumerate(vocab):
            self.stoi[word] = index
        self.itos = {v : k for k, v in self.stoi.items()}

    @staticmethod
    def flatten(l, flat):
        """
        Recursively, flatten a list.
        """
        if type(l) != list:
            flat.append(l)
        else:
            for el in l:
                Tokenizer.flatten(el, flat)

    def moods_to_token(self, states, reverse=False):
        """
        Recursively tokenize moods, while preserving the
        structure of the list. When `reverse` is true, the
        method translates the tokens back into the mood strings
        """
        if type(states) != list:
            if reverse:
                return self.itos[states]
            else:
                return self.stoi[states]
        else:
            for index, state in enumerate(states):
                states[index] = self.moods_to_token(state, reverse)
            return states
tokenizer = torch.load('tokenizer.pth')
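
To illustrate what the tokenizer does, the following is a minimal, self-contained sketch of the same idea on a toy input. The helper names `flatten`, `build_vocab`, and `map_nested`, as well as the example moods, are hypothetical and are not part of the notebook. Unlike `moods_to_token`, `map_nested` returns a new list rather than modifying its argument in place.

```python
def flatten(nested):
    """Yield the leaves of an arbitrarily nested list."""
    for el in nested:
        if isinstance(el, list):
            yield from flatten(el)
        else:
            yield el


def build_vocab(moods):
    """Build string-to-index and index-to-string tables, special tokens last."""
    vocab = sorted(set(flatten(moods))) + ['<sos>', '<eos>', '<pad>']
    stoi = {word: index for index, word in enumerate(vocab)}
    itos = {index: word for word, index in stoi.items()}
    return stoi, itos


def map_nested(states, table):
    """Map every leaf through `table` while preserving the list structure."""
    if isinstance(states, list):
        return [map_nested(state, table) for state in states]
    return table[states]


# Two toy datapoints, each a list of mood states (hypothetical data).
moods = [[['calm', 'sad'], ['happy']], [['angry'], ['calm']]]
stoi, itos = build_vocab(moods)
tokens = map_nested(moods, stoi)       # strings -> token ids
decoded = map_nested(tokens, itos)     # token ids -> strings
print(tokens)
print(decoded == moods)
```

Because the vocabulary is sorted before the special tokens are appended, `<pad>` always receives the largest index, which is the value later used for padding.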

## Dataset

In this section, we package the training data into an `ISODataset` object so that `torch`'s batching system can work with it more easily. To make all of the sequences uniform, we assume that each state has at most 5 mood descriptors and pad shorter states up to that size. Each batch of token ids therefore has shape `(batch_size, n, 5)`, and after the embedding layer (with an embedding dimension of 3) the inputs to our network have shape `(batch_size, n, 5, 3)`, where $n$ is the pre-determined maximum number of mood states.

In [4]:
class ISODataset(Dataset):
    """
    The `ISODataset` class packages training data into a single index-able object.
    This makes it easy for torch to use as a generator.
    """
    def __init__(self, df, maxlen=5):
        """
        Initializer.
        :param maxlen: the maximum number of mood states (transitions) in a
        datapoint. The constant 5 used to pad each individual state in
        `__getitem__` is the number of descriptors allowed for each mood state.
        """
        self.df = df
        self.maxlen = maxlen
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        mood_states = json.loads(self.df.iloc[idx]['moods_states'])
        for index, state in enumerate(mood_states):
            mood_states[index] = np.pad(state, (0,5-len(state)), 
                                        constant_values=tokenizer.stoi['<pad>'])
        # Pad the number of states up to `maxlen` with all-<pad> rows.
        while len(mood_states) < self.maxlen:
            mood_states.append(np.full(5, tokenizer.stoi['<pad>']))
        mood_states = torch.LongTensor(mood_states)
        
        audio_features = self.df.iloc[idx]['features']
        audio_features = torch.Tensor(json.loads(audio_features))
        return mood_states, audio_features
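
The two-level padding performed in `__getitem__` can be sketched on a toy example. This is only an illustration under stated assumptions: the pad id `PAD = 0`, the constant names, and the helper `pad_mood_states` are hypothetical, whereas the notebook takes the pad id from `tokenizer.stoi['<pad>']`.

```python
import numpy as np

PAD = 0               # hypothetical pad id (the notebook uses tokenizer.stoi['<pad>'])
MAX_STATES = 5        # maximum number of mood states per datapoint
MAX_DESCRIPTORS = 5   # maximum number of descriptors per mood state


def pad_mood_states(mood_states, pad=PAD):
    """Pad a ragged list of token lists to a (MAX_STATES, MAX_DESCRIPTORS) array."""
    rows = [
        np.pad(np.asarray(state, dtype=np.int64),
               (0, MAX_DESCRIPTORS - len(state)),
               constant_values=pad)
        for state in mood_states
    ]
    # Append all-pad rows until the number of states is uniform.
    while len(rows) < MAX_STATES:
        rows.append(np.full(MAX_DESCRIPTORS, pad, dtype=np.int64))
    return np.stack(rows)


# Three mood states with 2, 1 and 3 descriptors respectively.
padded = pad_mood_states([[3, 7], [4], [9, 2, 5]])
print(padded.shape)
print(padded)
```

Every datapoint ends up with the same shape regardless of how many states or descriptors it originally had, which is what allows the samples to be stacked into a batch.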

Since the mood states vary not only in the number of states but also in the number of descriptors per state, and the audio feature sequences also differ in length, every batch must be padded to a common shape before a network can process it. This preprocessing is done by the `iso_collate` function below, which also records the unpadded length of each feature sequence.

In [42]:
def iso_collate(batch):
    moods, features, lengths = [], [], []
    for data_point in batch:
        moods.append(data_point[0])
        features.append(data_point[1])
        lengths.append(len(data_point[1]))
    features = pad_sequence(features, batch_first=True)
    moods = pad_sequence(moods, batch_first=True, padding_value=tokenizer.stoi['<pad>'])
    return moods, torch.LongTensor(lengths), features
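
As a rough illustration of what the collate step produces, the sketch below pads three feature sequences of different lengths with `pad_sequence`. The random tensors and the 11 features per step are assumptions chosen to match the output size used later in the model, and the boolean `mask` is an addition for illustration that the notebook does not compute.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three hypothetical target sequences with 3, 5 and 2 steps of 11 features.
features = [torch.rand(3, 11), torch.rand(5, 11), torch.rand(2, 11)]

# Record the true lengths before padding, as `iso_collate` does.
lengths = torch.LongTensor([len(f) for f in features])

# Zero-pad every sequence to the longest one in the batch.
padded = pad_sequence(features, batch_first=True)

# True where a position holds real data, False where it is padding.
mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]

print(padded.shape)   # (batch, longest sequence, features)
print(lengths)
print(mask)
```

The recorded lengths are what later lets the model stop predicting for a datapoint once its target sequence has ended.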

## Model Architecture
The model is split into two distinct components: the attention mechanism and the LSTM component. The interaction between these two components is shown in the figure below:

![ISONet](imgs/isonet-lstm-attention-architecture.png)
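
As a compact reference alongside the figure, the computation performed at each time step $t$ can be written as follows. The notation is introduced here for illustration and does not appear in the notebook: $m_i$ is the embedding of the $i$-th mood slot, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, and the $W$, $w$, and $b$ terms stand for the weights and biases of the linear layers defined in the code that follows.

$$
\begin{aligned}
e_i &= w^\top \,\mathrm{ReLU}\big(W_m m_i + b_m + W_h h_{t-1} + b_h\big) + b, \\
\alpha_i &= \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \\
z_t &= \sum_i \alpha_i \, m_i, \\
g_t &= \sigma\big(W_f h_{t-1} + b_f\big), \\
(h_t, c_t) &= \mathrm{LSTMCell}\big(g_t \odot z_t,\ (h_{t-1}, c_{t-1})\big), \\
\hat{y}_t &= W_o \,\mathrm{Dropout}(h_t) + b_o.
\end{aligned}
$$

The first three lines correspond to the attention mechanism, and the last three to the gating, the LSTM cell, and the output layer of the model.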

In [43]:
class Attention(pl.LightningModule):
    """
    The attention mechanism of the network. On each time step of the LSTM,
    the mechanism looks at the previous hidden state as well as the candidate
    inputs to the LSTM (the mood embeddings), then weights the individual
    mood embeddings based on both. This is done by applying two linear layers
    to the hidden state and the mood embeddings respectively, summing the
    outputs, running the result through a ReLU and another linear layer, and
    normalizing the final weights using the softmax function. The result of
    the attention layer is the weighted sum of the mood embeddings.
    """
    def __init__(self, embed_dim, attention_dim=40, maxlen=5):
        """
        Initializer for the attention mechanism of the network.
        :param embed_dim: the dimension of the embeddings (a hyperparameter).
        :param attention_dim: specifies the dimension of the hidden attention
        layer. This is simply a hyperparameter and will only affect the
        efficacy of the network, not its functionality.
        :param maxlen: specifies the maximum number of mood transitions allowed.
        """
        super().__init__()
        self.maxlen = maxlen
        self.embed_dim = embed_dim
        self.attention_dim = attention_dim
        self.mood_attention = nn.Linear(self.embed_dim, self.attention_dim)
        self.hidden_attention = nn.Linear(11, self.attention_dim)
        # the input to the hidden attention is 11 as that is the size of the
        # desired output dimension.
        self.attention = nn.Linear(self.attention_dim, 1)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)
    
    def forward(self, moods, hidden):
        """
        :param moods: The embedded mood states. The raw embeddings are of size
        (bs x maxlen x 5 x embed_dim); they *must* be reshaped to
        (bs x (maxlen * 5) x embed_dim) before being passed in.
        :param hidden: The previous hidden state of the LSTM cell, of size (bs x 11).
        This function finds a weighting over the mood slots, i.e. where to pay
        "attention", based on `moods` and the `hidden` state. The attention
        weights `alpha`, of size (bs x (maxlen * 5) x 1), are then used in a
        weighted sum over the moods (bs x (maxlen * 5) x embed_dim), yielding a
        tensor of size (bs x embed_dim). This single vector per datapoint then
        acts as the input to the LSTM cell.
        """
        att1 = self.mood_attention(moods)
        att2 = self.hidden_attention(hidden)
        att = self.attention(self.relu(att1 + att2.unsqueeze(1)))
        alpha = self.softmax(att)
        weighted_moods = (moods * alpha).sum(dim=1)
        return weighted_moods, alpha
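
The shapes flowing through the attention computation can be checked with a small standalone sketch. It re-creates the three linear layers outside of the class with random weights and random inputs, so the numbers themselves are meaningless; the variable names `score` and `n_slots` are hypothetical, while the sizes (25 slots, embedding size 3, hidden size 11, attention size 40) follow the defaults used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, n_slots, embed_dim, hidden_dim, attention_dim = 2, 25, 3, 11, 40

# Stand-ins for the layers defined in `Attention.__init__`.
mood_attention = nn.Linear(embed_dim, attention_dim)
hidden_attention = nn.Linear(hidden_dim, attention_dim)
score = nn.Linear(attention_dim, 1)

moods = torch.randn(bs, n_slots, embed_dim)   # flattened mood embeddings
hidden = torch.randn(bs, hidden_dim)          # previous LSTM hidden state

att1 = mood_attention(moods)                  # (bs, n_slots, attention_dim)
att2 = hidden_attention(hidden).unsqueeze(1)  # (bs, 1, attention_dim), broadcast over slots
att = score(torch.relu(att1 + att2))          # (bs, n_slots, 1)
alpha = torch.softmax(att, dim=1)             # normalize over the slot dimension
weighted = (moods * alpha).sum(dim=1)         # (bs, embed_dim)

print(alpha.shape, weighted.shape)
print(alpha.sum(dim=1).squeeze(-1))           # each row of weights sums to ~1
```

Applying the softmax over `dim=1` is what makes the weights a distribution over the mood slots of each datapoint rather than over the batch.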

In [50]:
class Model(pl.LightningModule):
    """
    An attention-based, unidirectional baseline model.
    """
    def __init__(self, tokenizer, dropout=0.0, maxlen=5, embed_dim=3, lr=1e-2,
                 weight_decay=0.):
        """
        Initializer.
        :param tokenizer: the tokenizer used to create the tokenizations for
        the mood states and descriptors. 
        :param dropout: the dropout probability applied between the hidden
        state and the final output of each LSTM cell.
        :param maxlen: the maximum number of mood transition states allowed.
        It must be at least as large as the number of states in any datapoint;
        otherwise, errors will occur.
        :param embed_dim: the dimensionality of each embedding.
        :param lr: learning rate of the network. TODO: implement separate learning rates
        for the attention network and the LSTM cell.
        :param weight_decay: weight decay passed to the optimizer.
        """
        super().__init__()
        self.lr = lr
        self.weight_decay = weight_decay
        self.maxlen = maxlen
        self.tokenizer = tokenizer
        self.embed_dim = embed_dim
        self.embedding = nn.Embedding(len(self.tokenizer.itos), self.embed_dim, 
                                      padding_idx=self.tokenizer.stoi['<pad>'])
        self.attention = Attention(self.embed_dim)
        self.h0 = nn.Linear(maxlen * self.embed_dim * 5, 11)
        self.c0 = nn.Linear(maxlen * self.embed_dim * 5, 11)
        self.dropout = nn.Dropout(p=dropout)
        self.lstm = nn.LSTMCell(self.embed_dim, 11)
        self.sigmoid = nn.Sigmoid()
        self.forget = nn.Linear(11, self.embed_dim)
        self.fc = nn.Linear(11, 11)
        
    def init_hidden_states(self, x):
        """
        Given the embedded mood states, flattened into a (bs x (maxlen * 5 * embed_dim))
        tensor, we use two separate linear layers to find the initial hidden state and
        initial cell state. This is preferred over random initialization because the
        attention computed for the first cell depends on h0.
        """
        return self.h0(x), self.c0(x)
        
    def forward(self, x):
        """
        Forward pass of the model. The network first converts the mood states into
        their embedding representations. The flattened embeddings are then used to
        determine the initial hidden/cell states of the LSTM. The datapoints are
        sorted by the length of their target sequences in descending order, so that
        at each time step only the leading rows of the batch still need a prediction.

        :param x: a tuple that contains three items: the tokenized mood states
        being queried, the lengths of the target sequences of each datapoint,
        and the audio features, which are the target outputs.
        """
        mood_states, lengths, audio_features = x
        bs = mood_states.size(0)
        mood_states = self.embedding(mood_states)
        moods = mood_states.view(bs, (self.maxlen * 5), self.embed_dim)
        
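        sorted_lengths, indices = lengths.sort(dim=0, descending=True)
        moods, audio_features = moods[indices], audio_features[indices]
        h, c = self.init_hidden_states(moods.view(bs, -1))  # (bs x 11)

        max_len = int(sorted_lengths[0])
        # Allocate on the same device as the inputs so that GPU training works.
        predictions = torch.zeros(bs, max_len, 11, device=moods.device)
        for timestep in range(max_len):
            # Rows are sorted by length, so the first `num_predict` are still active.
            num_predict = int((sorted_lengths > timestep).sum())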
            attention_weighted_moods, alphas = self.attention(moods[:num_predict], 
                                                      h[:num_predict])
            gate = self.sigmoid(self.forget(h[:num_predict]))
            weighted_moods = gate * attention_weighted_moods
            h, c = self.lstm(weighted_moods, 
                             (h[:num_predict], c[:num_predict]))
            preds = self.fc(self.dropout(h))
            
            predictions[:num_predict, timestep, :] = preds
        return predictions, audio_features
    
    def step(self, batch, batch_idx):
        """
        One "step" of the model. 
        """
        predictions, targets = self(batch)
        loss = F.mse_loss(predictions, targets)
        return loss, {'loss': loss}
    
    def training_step(self, batch, batch_idx):
        loss, logs = self.step(batch, batch_idx)
        self.log_dict({f'train_{k}': v for k, v in logs.items()},
                      on_step=True, on_epoch=True, sync_dist=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        loss, logs = self.step(batch, batch_idx)
        self.log_dict({f'val_{k}': v for k, v in logs.items()}, sync_dist=True)
        return loss
    
    def test_step(self, batch, batch_idx):
        return
    
    def configure_optimizers(self):
        """
        Configuration of the optimizer used to train the model.
        This method is called implicitly by PyTorch Lightning during training.
        Note that the learning rate and weight decay are given by the initialization
        parameters of the model.
        TODO: Implement an appropriate learning rate scheduler.
        """
        return optim.SGD(self.parameters(), lr=self.lr,
                         weight_decay=self.weight_decay)
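
The reason `forward` sorts the batch by target length can be seen in a small sketch: once the rows are in descending order of length, the datapoints that still need a prediction at a given time step are always the first few rows, so a simple slice selects them. The lengths below are hypothetical, and the name `active` is used only for this illustration.

```python
import torch

# Hypothetical target lengths for a batch of three datapoints.
lengths = torch.LongTensor([3, 5, 2])

# Sort in descending order and keep the permutation used to reorder the batch.
sorted_lengths, indices = lengths.sort(dim=0, descending=True)

# Number of rows that still need a prediction at each time step.
max_len = int(sorted_lengths[0])
active = [int((sorted_lengths > t).sum()) for t in range(max_len)]

print(sorted_lengths.tolist())  # [5, 3, 2]
print(indices.tolist())         # [1, 0, 2]
print(active)                   # [3, 3, 2, 1, 1]
```

Since the count never increases from one step to the next, slicing the hidden and cell states with `[:num_predict]` keeps each row aligned with the same datapoint throughout the loop.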

## Training and Evaluation
This section is simplified by the PyTorch Lightning interface. Training parameters, including the number of epochs and checkpointing, are set when the `Trainer` object is initialized. Note that there is no validation set, as there are too few data points; all of the data is therefore used for training.

In [53]:
iso = ISODataset(df)
train_loader = DataLoader(iso,
                          batch_size=4,
                          collate_fn=iso_collate)
model = Model(tokenizer, embed_dim=3)

In [54]:
trainer = pl.Trainer(max_epochs=100)
trainer.fit(model, train_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name      | Type      | Params
----------------------------------------
0 | embedding | Embedding | 96    
1 | attention | Attention | 681   
2 | h0        | Linear    | 836   
3 | c0        | Linear    | 836   
4 | dropout   | Dropout   | 0     
5 | lstm      | LSTMCell  | 704   
6 | sigmoid   | Sigmoid   | 0     
7 | forget    | Linear    | 36    
8 | fc        | Linear    | 132   
----------------------------------------
3.3 K     Trainable params
0         Non-trainable params
3.3 K     Total params
0.013     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]