# Preprocessing
This notebook covers all of the preprocessing that is needed before the model is trained. It implements two main functionalities. Note that all of the playlists and their associated mood labels, which have already been downloaded from Spotify, are contained in `train.csv`.

The first functionality is to tokenize all the playlist descriptions, that is, to assign each mood state a unique integer that will be used to generate the embedding later on. In addition, special tokens denoting the start and the end of the sequence are inserted at the start and appended to the end of the mood sequences, respectively.

Secondly, this notebook restructures the audio features, which Spotify returns as a dictionary that contains a lot of extraneous data.

Lastly, the notebook converts all of the preprocessed data into an accessible DataFrame whose entries are `json` encodings of the preprocessed lists.

## Extract all the unique moods

The description of each playlist summarizes the moods that the playlist is meant to traverse. The format of these descriptions is as follows:
- Each **stage** (an explicit state of moods that the playlist wishes to traverse) is delimited by the word `to`, surrounded by spaces.
- Each **stage** is then supplemented with additional mood descriptors that describe the state. These detailed descriptors are delimited by a `,`.

It should be noted that these mood descriptors are taken from [GEMS](https://www.zentnerlab.com/psychological-tests/geneva-emotional-music-scales) (Geneva Emotional Music Scale), which contains 45 labels. 

An example of this format is shown below:
```
agitated, nervous, irritated to fiery, energetic to inspired, moved to soothed, peaceful
```
A playlist with this description moves the user through 4 stages. The first stage is described through 3 mood keywords: `agitated, nervous, irritated`. The second stage is described through 2 mood keywords: `fiery, energetic`. The third stage is described using 2 mood keywords: `inspired, moved`. And lastly, the last stage is described using 2 mood keywords: `soothed, peaceful`.
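The format above can be parsed with two nested splits. The following is a minimal sketch, not the notebook's own cell; the helper name `parse_description` is hypothetical and introduced only for illustration.

```python
def parse_description(description):
    """Split a playlist description into stages of mood keywords.

    `parse_description` is a hypothetical helper used for illustration.
    """
    # Stages are separated by the word `to` surrounded by spaces.
    stages = description.split(' to ')
    # Within a stage, mood keywords are separated by commas.
    return [[word.strip().lower() for word in stage.split(',')]
            for stage in stages]


example = ('agitated, nervous, irritated to fiery, energetic '
           'to inspired, moved to soothed, peaceful')
stages = parse_description(example)
print(len(stages))  # number of stages
print(stages[0])    # mood keywords of the first stage
```

Splitting on `' to '` with the surrounding spaces avoids cutting a keyword that merely contains the letters `to`.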

In [15]:
import os
import json
import torch
import numpy as np
import pandas as pd

In [16]:
# Use a context manager so the file handle is closed after loading.
with open('data/playlist-tracks-features') as f:
    features = json.load(f)

In [17]:
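moods = {}
for name in features.keys():
    # The first five characters of each key are the playlist id (e.g. `iso25`);
    # the mood description starts at character 8.
    states = name[8:].split(' to ')
    # Split each stage into lower-cased, whitespace-stripped mood keywords.
    states = [[x.lower().strip() for x in s.split(', ')] for s in states]
    moods[name[:5]] = states
print(moods)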

{'iso25': [['sad'], ['tender']], 'iso29': [['tender'], ['powerful']], 'iso26': [['sad'], ['powerful']], 'iso24': [['sad'], ['happy']], 'iso27': [['happy'], ['tender']], 'iso13': [['nervous'], ['animated'], ['energetic']], 'iso21': [['tense'], ['happy']], 'iso22': [['tense'], ['tender']], 'iso09': [['sad'], ['happy']], 'iso10': [['anxious'], ['relaxed'], ['joyful']], 'iso17': [['serene'], ['animated'], ['energetic']], 'iso12': [['angry'], ['amused'], ['soothed']], 'iso11': [['sad'], ['soothed'], ['triumphant']], 'iso20': [['anxious'], ['meditative'], ['relaxed']], 'iso19': [['melancholic'], ['tender'], ['affectionate']], 'iso08': [['nervous'], ['calm']], 'iso04': [['calm', 'dreamy', 'melancholic'], ['calm', 'relaxed', 'meditative']], 'iso14': [['energetic'], ['dreamy'], ['relaxed']], 'iso18': [['irritated'], ['meditative'], ['soothed']], 'iso30': [['nervous', 'agitated', 'energetic'], ['calm', 'relaxed', 'soothed']], 'iso06': [['sad'], ['hopeful']], 'iso05': [['lonely'], ['connected']],

## Tokenize mood states and location labels

We now assign a unique integer to each mood keyword, as well as to the meta-tokens `<sos>`, `<eos>`, and `<pad>`. All of this is handled by the `Tokenizer` class, which encapsulates the tokenization process and also stores the dictionaries associated with this particular tokenization.
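As a rough illustration of how the meta-tokens could be used once the vocabulary exists, the sketch below builds a vocabulary the same way (sorted keywords followed by the three meta-tokens) and wraps a flat keyword sequence in `<sos>` and `<eos>`, padding with `<pad>`. The helpers `build_vocab` and `encode_sequence` and the `max_len` parameter are assumptions for illustration; they do not appear in this notebook.

```python
SPECIAL_TOKENS = ['<sos>', '<eos>', '<pad>']


def build_vocab(words):
    """Map each unique word, then each meta-token, to an integer (hypothetical helper)."""
    vocab = sorted(set(words)) + SPECIAL_TOKENS
    return {word: index for index, word in enumerate(vocab)}


def encode_sequence(words, stoi, max_len):
    """Wrap a flat keyword sequence in <sos>/<eos> and right-pad it to `max_len`.

    `max_len` is an assumed parameter; sequences already longer than
    `max_len` are returned without padding.
    """
    tokens = [stoi['<sos>']] + [stoi[w] for w in words] + [stoi['<eos>']]
    tokens += [stoi['<pad>']] * (max_len - len(tokens))
    return tokens


stoi = build_vocab(['sad', 'tender', 'happy', 'sad'])
encoded = encode_sequence(['sad', 'happy'], stoi, max_len=6)
print(encoded)  # [3, 1, 0, 4, 5, 5]
```

Appending the meta-tokens after the sorted keywords keeps the keyword indices stable regardless of how many meta-tokens are added later.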

In [18]:
class Tokenizer:
    """
    The tokenizer class provides three functionalities.
    The first is the `fit_on_moods` method, which extracts a unique
    list of all the mood keywords from a list of moods that can be
    multi-leveled and nested. The second and third are converting
    a list of states into the corresponding token representation
    and back. Note that tokenization preserves not only the order
    of the tokens but also the structure of the list passed in.
    An example is given below:
    [25, 2, [3, 4, [3, 5]]] -> [a, b, [c, d, [c, e]]].
    """
    def __init__(self):
        """
        attr: stoi – defines the dictionary that converts the 
        mood words into their token representations.
        attr: itos – defines the dictionary that converts token 
        representations back into the word-form representations.
        """
        self.stoi = {}
        self.itos = {}
    
    def __len__(self):
        return len(self.stoi)
    
    def fit_on_moods(self, moods):
        """
        Given a list of words stored in a un/nested list `mood`, 
        `fit_on_moods` extracts the unique words and creates a 
        look up table that forms a bijection between the words
        and a subset of the integers.
        """
        flat = []
        
        Tokenizer.flatten(moods, flat)
        vocab = sorted(set(flat))
        vocab.append('<sos>')
        vocab.append('<eos>')
        vocab.append('<pad>')
        for index, word in enumerate(vocab):
            self.stoi[word] = index
        self.itos = {v : k for k, v in self.stoi.items()}

    @staticmethod
    def flatten(l, flat):
        """
        Recursively, flatten the given list `l` into
        a one-dimensional list that is appended to a given
        list `flat`.
        """
        if not isinstance(l, list):
            flat.append(l)
        else:
            for el in l:
                Tokenizer.flatten(el, flat)

    def moods_to_token(self, states, reverse=False):
        """
        Recursively tokenize moods, while preserving the
        structure of the list. When `reverse` is true, the
        method translates the tokens back into the mood strings.
        Note that nested lists are modified in place.
        """
        if not isinstance(states, list):
            if reverse:
                return self.itos[states]
            else:
                return self.stoi[states]
        else:
            for index, state in enumerate(states):
                states[index] = self.moods_to_token(state, reverse)
            return states

In [19]:
tokenizer = Tokenizer()
tokenizer.fit_on_moods(list(moods.values()))

In [20]:
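# `moods_to_token` rewrites each nested list in place, so after this loop
# `moods` holds integer tokens instead of mood strings.
for l in moods.values():
    tokenizer.moods_to_token(l)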

## Vectorizing audio features
We now want to organize the audio features into a single vector. Currently, the data for each playlist is organized into a list of songs. Each song is associated with a dictionary that contains the following data:
```python
{'danceability': 0.388,
   'energy': 0.0859,
   'key': 7,
   'loudness': -16.061,
   'mode': 0,
   'speechiness': 0.0472,
   'acousticness': 0.969,
   'instrumentalness': 7.35e-05,
   'liveness': 0.108,
   'valence': 0.19,
   'tempo': 88.253,
   'type': 'audio_features',
   'id': '30QNjcM3Q1GnLFIIJjWQL1',
   'uri': 'spotify:track:30QNjcM3Q1GnLFIIJjWQL1',
   'track_href': 'https://api.spotify.com/v1/tracks/30QNjcM3Q1GnLFIIJjWQL1',
   'analysis_url': 'https://api.spotify.com/v1/audio-analysis/30QNjcM3Q1GnLFIIJjWQL1',
   'duration_ms': 169410,
   'time_signature': 3}
```
There is a lot of data here that we do not need, so the next few cells extract the useful information into a one-dimensional vector and discard the rest. All of the songs in the same playlist are then appended into a larger array, creating a two-dimensional array:
```python
playlist = [ [song1 features],
             [song2 features],
             [     ...      ],
                    .
                    . 
                    .,
           ]
```

Note that we also want to preserve the order in which these features appear, so that training and evaluation are consistent, as well as the order of the songs relative to each other.
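One way to keep the feature order fixed is to sort the feature names once and always index each song in that order. The sketch below applies this to the sample song shown above; the names `USEFUL`, `feature_order`, and `vector` are illustrative and not part of the notebook.

```python
# Feature names to keep; everything else in the song dictionary is dropped.
USEFUL = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
          'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

# The sample song from above, with two of the extraneous fields kept for contrast.
song = {'danceability': 0.388, 'energy': 0.0859, 'key': 7, 'loudness': -16.061,
        'mode': 0, 'speechiness': 0.0472, 'acousticness': 0.969,
        'instrumentalness': 7.35e-05, 'liveness': 0.108, 'valence': 0.19,
        'tempo': 88.253, 'type': 'audio_features', 'duration_ms': 169410}

# Sorting the names once gives a deterministic column order for every song.
feature_order = sorted(USEFUL)
vector = [song[name] for name in feature_order]

for name, value in zip(feature_order, vector):
    print(f'{name:>16}: {value}')
```

With alphabetical ordering, the first entry of every vector is `acousticness` and the last is `valence`.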

In [21]:
# Use a context manager so the file handle is closed after loading.
with open('data/playlist-tracks-features') as f:
    tracks = json.load(f)

In [22]:
useful_features = ['danceability', 'energy', 'key', 'loudness', 
                   'mode', 'speechiness', 'acousticness', 'instrumentalness',
                   'liveness', 'valence', 'tempo']
def extract_features(songs):
    """
    Extract the features of the songs of the same playlist
    into a two-dimensional array. If `songs` is `[None]`
    (no track data), it is returned unchanged.
    """
    if songs == [None]:
        return songs
    songs_features = []
    for song in songs:
        # we first sort the keys so we retain the same order
        # every time.
        keys = sorted(song.keys())
        song_features = []
        for key in keys:
            if key in useful_features:
                song_features.append(song[key])
        songs_features.append(song_features)
    return songs_features

For ease of storage, we now convert all of the two-dimensional arrays into `json` format and store these representations in the features dictionary.

In [23]:
features = {}
for k, v in tracks.items():
    features[k[:5]] = json.dumps(extract_features(v))
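The JSON encoding used here is reversible, which is what makes it convenient for storage. The short sketch below, using made-up feature values, shows the round trip and how a playlist without track data, represented as `[None]`, is encoded.

```python
import json

# Illustrative values only; real rows hold 11 features per song.
playlist_features = [[0.932, 0.433], [0.942, 0.252]]

encoded_features = json.dumps(playlist_features)
encoded_missing = json.dumps([None])  # Python None becomes JSON null

print(encoded_missing)  # [null]

# json.loads restores the original nested lists.
decoded = json.loads(encoded_features)
decoded_missing = json.loads(encoded_missing)
```

This is why playlists without features show up as `[null]` once the data is written out.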

## Combine into a DataFrame
With the mood states tokenized and the features organized into vectors, we can store everything in a DataFrame. For readability, we also want to store the order in which the features were encoded.

In [24]:
for k, v in moods.items():
    moods[k] = json.dumps(v)

In [25]:
df1 = pd.DataFrame(features.values(), index=features.keys(), columns=['features'])
df2 = pd.DataFrame(moods.values(), index=moods.keys(), columns=['moods_states'])
df = df2.join(df1)

In [26]:
df

| | moods_states | features |
|---|---|---|
| iso25 | [[22], [26]] | [null] |
| iso29 | [[26], [20]] | [null] |
| iso26 | [[22], [20]] | [null] |
| iso24 | [[22], [11]] | [null] |
| iso27 | [[11], [26]] | [null] |
| iso13 | [[19], [4], [9]] | [[0.932, 0.433, 0.329, 0.0474, 3, 0.1, -13.288... |
| iso21 | [[28], [11]] | [[0.942, 0.252, 0.314, 0.676, 5, 0.0892, -18.1... |
| iso22 | [[28], [26]] | [null] |
| iso09 | [[22], [11]] | [[0.778, 0.407, 0.308, 1.46e-05, 10, 0.092, -9... |
| iso10 | [[5], [21], [14]] | [[0.942, 0.252, 0.314, 0.676, 5, 0.0892, -18.1... |


In [29]:
torch.save(tokenizer, 'data/tokenizer.pth')
# Context managers make sure each file is flushed and closed after writing.
with open('data/tokenizer-itos.json', 'w') as f:
    json.dump(tokenizer.itos, f)
with open('data/tokenizer-stoi.json', 'w') as f:
    json.dump(tokenizer.stoi, f)
df.to_csv('data/train.csv')
with open('data/useful_features', 'w') as f:
    json.dump(sorted(useful_features), f)
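One detail to keep in mind when reading the `itos` dictionary back: JSON object keys are always strings, so the integer keys are stored as strings and need to be converted on load. The sketch below demonstrates this with a small made-up dictionary and an in-memory string rather than the actual file.

```python
import json

# A small stand-in for `tokenizer.itos`; the values are illustrative.
itos = {0: 'happy', 1: 'sad', 2: '<sos>'}

text = json.dumps(itos)    # integer keys are written as strings
loaded = json.loads(text)  # keys come back as '0', '1', '2'

# Convert the keys back to integers before using them as token ids.
restored = {int(token): word for token, word in loaded.items()}

print(list(loaded.keys()))
print(restored == itos)
```

The `stoi` dictionary does not need this step, because its keys are already strings and its integer values survive the round trip.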