# Data Augmentations & Synthetic Data Generation
This notebook will explore some options for synthetic data generation as well as augmentations on existing data.

In [2]:
import os
import json
import torch
import random
import numpy as np
import pandas as pd
from pprint import pprint
from sklearn import decomposition
from scipy.special import softmax
from torchvision import transforms

In [3]:
data = pd.read_csv('data/train.csv', index_col=0)
data = data[data['features'] != '[null]']  # remove empty plists

In [4]:
data

Unnamed: 0,moods_states,features
iso26,"[[22], [20]]","[[0.755, 0.479, 0.154, 0.0261, 2, 0.11, -15.05..."
iso23,"[[28], [20]]","[[0.455, 0.674, 0.615, 0.00091, 0, 0.208, -8.1..."
iso30,"[[19, 1, 9], [6, 21, 25]]","[[0.907, 0.23, 0.159, 0.924, 5, 0.102, -16.315..."
iso19,"[[17], [26], [0]]","[[0.985, 0.653, 0.178, 0.000339, 9, 0.134, -13..."
iso22,"[[28], [26]]","[[0.155, 0.221, 0.0879, 0.698, 7, 0.0869, -16...."
iso24,"[[22], [11]]","[[0.975, 0.462, 0.203, 0.113, 10, 0.102, -16.3..."
iso18,"[[13], [16], [25]]","[[0.124, 0.585, 0.52, 7.01e-05, 5, 0.131, -6.1..."
iso25,"[[22], [26]]","[[0.953, 0.582, 0.199, 0.00852, 10, 0.135, -10..."
iso17,"[[23], [4], [9]]","[[0.755, 0.479, 0.154, 0.0261, 2, 0.11, -15.05..."
iso20,"[[5], [16], [21]]","[[0.948, 0.571, 0.0274, 6.43e-06, 9, 0.322, -2..."


# Augmentations
- **Dimensionality reduction through PCA.** This strategy isn't actually an augmentation technique; instead, it shrinks the model by reducing the dimensionality of the output it has to predict.

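- **Random augmentation of audio features.** Based on our findings in the `Music Playlist EDA` notebook, each audio feature can be perturbed by up to $\pm 15\%$ while Spotify still returns an acceptable song.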

- **Reversing mood states and song order.** This is fairly straightforward: we generate new playlists by inverting the order of the mood states and their respective songs. For example, a playlist that moves from sad to happy would be inverted to move from happy to sad. Although such playlists may have no intrinsic value during inference, they may help the model learn the relationship between the various embeddings and audio features.


## Dimensionality Reduction with PCA
We proceed with PCA on all of the audio features of each playlist, then aggregate the components through an average. We start by combining each playlist into a matrix and transposing it so that each column represents an observation. Let $X$ be an $11\times N$ matrix with columns $X_1,\ldots,X_N$, where $N$ is the number of observations and $11$ is the number of audio features. We first compute the sample mean $M$ and the mean-deviation form $B$:

\begin{gather}
M = \frac{1}{N}\left(X_1 + \cdots + X_N\right), \\
B = \begin{bmatrix}\hat{X}_1 & \ldots & \hat{X}_N\end{bmatrix},~\hat{X}_i = X_i - M.
\end{gather}

Then, the covariance matrix is given by $S =\frac{1}{N-1}BB^T$. The goal of PCA is to find an orthogonal matrix $P$ and a matrix $Y$ such that $B = PY$ and the rows of $Y$ are uncorrelated, which means that the covariance matrix of $Y$ is a diagonal matrix. The covariance matrix of $Y$ is given by:

\begin{align}
Y &= P^TB, \\
YY^T &= P^TB(P^TB)^T, \\
\frac{1}{N-1}YY^T &= P^TBB^TP\frac{1}{N-1}, \\
Y_{\text{covar}} &= P^T\Big(\frac{1}{N-1}BB^T\Big)P, \\
Y_{\text{covar}} &= P^TSP.
\end{align}

Since $S$ is symmetric, it is orthogonally diagonalizable. Let $D$ be the diagonal matrix whose main-diagonal entries are the eigenvalues of $S$, and let the columns of $P$ be the corresponding unit eigenvectors $u_1,\ldots,u_p$. Then:
$$S = PDP^T.$$
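
The derivation can be checked numerically. The sketch below uses synthetic data in place of the playlist features (the random matrix `X`, its size, and the seed are assumptions for illustration only) and confirms that projecting onto the eigenvectors of $S$ produces uncorrelated rows whose variances are the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the 11 x N feature matrix (columns are observations)
X = rng.normal(size=(11, 200))
N = X.shape[1]

M = X.mean(axis=1, keepdims=True)   # sample mean, shape (11, 1)
B = X - M                           # mean-deviation form
S = B @ B.T / (N - 1)               # covariance matrix

# eigh is intended for symmetric matrices; the columns of P are unit eigenvectors
eigval, P = np.linalg.eigh(S)

Y = P.T @ B
Y_covar = Y @ Y.T / (N - 1)

# off-diagonal entries vanish up to rounding; the diagonal holds the eigenvalues
off_diag = Y_covar - np.diag(np.diag(Y_covar))
print(np.abs(off_diag).max())
print(np.allclose(np.diag(Y_covar), eigval))
```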

Suppose that the entries on the main diagonal of $D$ are $\lambda_1, \ldots, \lambda_p$. Each ratio $\frac{\lambda_i}{\text{tr}(D)}$ is the fraction of the total variance explained by component $i$, so we can reduce the dimensionality of $X$ by discarding the components for which this ratio is smallest.
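
One way to turn these ratios into a concrete choice is to keep the smallest number of components whose cumulative ratio reaches a threshold. The helper `num_components` and the 99% threshold are assumptions made for this sketch; the ratios are the ones printed by the `In [10]` cell in this notebook.

```python
import numpy as np

def num_components(eigval, threshold=0.99):
    """Smallest k such that the k largest eigenvalues explain `threshold` of the variance."""
    eigval = np.real(np.asarray(eigval))
    ratios = np.sort(eigval)[::-1] / eigval.sum()
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))

# variance ratios as printed by the notebook (they sum to roughly 1)
pcomp = np.array([9.65789438e-01, 2.34542771e-02, 1.04093030e-02,
                  1.50963492e-04, 7.66370831e-05, 5.57245469e-05,
                  3.18184879e-05, 1.80824453e-06, 1.38803950e-05,
                  6.62635240e-06, 9.52312074e-06])
print(num_components(pcomp, threshold=0.99))  # 3
```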

In [9]:
# construct a list of matrices, where each matrix represents a playlist
playlists = []
for entry in range(len(data)):
    playlist = json.loads(data.iloc[entry]['features'])
    # stack `song vectors` together into a matrix
    plist = []
    for song in playlist:
        plist.append(np.array(song))
    playlists.append(np.stack(plist))


In [10]:
all_playlists = np.vstack(playlists)
mdf = all_playlists - np.mean(all_playlists, axis=0)  # mean-deviation form
covar = np.cov(mdf.T)  # covariance matrix
eigval, eigvec = np.linalg.eig(covar)  # eigenvalue/eigenvectors of covar
pcomp = np.real(eigval) / sum(np.real(eigval))
print(pcomp)

[9.65789438e-01 2.34542771e-02 1.04093030e-02 1.50963492e-04
 7.66370831e-05 5.57245469e-05 3.18184879e-05 1.80824453e-06
 1.38803950e-05 6.62635240e-06 9.52312074e-06]


In [286]:
print(f'mean: {np.mean(playlist, axis=0)}')
print(f'%variance: {pcomp}')
print(f'corresponding component eigenvectors:')
for index, eig in enumerate(eigvec[[0,1,2],:]):
    print(f'component {index+1}: {eig}')

mean: [0.361640 0.663500 0.555900 0.000077 5.700000 0.129330 -6.746100 0.600000
 0.063500 122.359400 0.427800]
%variance: [0.962899 0.026112 0.010619 0.000160 0.000082 0.000061 0.000034 0.000002
 0.000015 0.000007 0.000010]
corresponding component eigenvectors:
component 1: [-0.002817 -0.045506 -0.003553 0.002965 0.797772 0.004380 0.483114
 -0.012710 -0.223494 -0.278353 0.021204]
component 2: [0.001499 0.019468 -0.001043 -0.020381 -0.112361 -0.237898 0.435760
 -0.070781 0.796545 -0.189837 0.254094]
component 3: [0.001670 0.035450 0.002969 -0.005613 -0.346160 0.023433 -0.050312
 -0.069317 -0.224059 -0.905039 -0.041101]


We note that the first three components account for over 99% of the total variance. Therefore, we can truncate $P$ in $B=PY$ to a $p\times 3$ matrix that contains only the corresponding eigenvectors. This makes prediction much easier: we can train our network to predict $Y$ (a $3\times N$ matrix) and then multiply it by the truncated matrix of eigenvectors to recover an approximation of $B$. We also note that $X = B+\mu$, where $\mu$ is the mean $M$ computed earlier. It is therefore ideal to have a numerically stable $\mu$, estimated from training data that covers the full range of playlists, before proceeding with this method. Alternatively, $\mu$ could be learned by the neural network.
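
The projection and reconstruction steps can be sketched as follows. The data here is synthetic and deliberately built around three dominant directions (an assumption for illustration; the real playlist features are not used), and `np.linalg.eigh` is used because the covariance matrix is symmetric. Note that NumPy returns eigenvectors as the columns of the eigenvector matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 11 x N data dominated by three directions (illustration only)
basis = np.linalg.qr(rng.normal(size=(11, 3)))[0]           # 11 x 3, orthonormal columns
latent = rng.normal(size=(3, 500)) * np.array([[10.0], [3.0], [1.0]])
X = basis @ latent + 0.01 * rng.normal(size=(11, 500)) + 5.0

mu = X.mean(axis=1, keepdims=True)
B = X - mu
S = np.cov(B)                          # rows are treated as variables

eigval, eigvec = np.linalg.eigh(S)     # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]       # largest variance first
P3 = eigvec[:, order[:3]]              # 11 x 3, one eigenvector per column

Y = P3.T @ B                           # 3 x N, the quantity a network would predict
X_hat = P3 @ Y + mu                    # approximate reconstruction of X

print(Y.shape, np.abs(X - X_hat).max())
```

With only the small noise term left unexplained, the reconstruction error stays close to the noise level.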

## Random Perturbation
We saw through experiments in the `Music Playlist EDA` notebook that Spotify was still able to return an acceptable song when the audio features were perturbed by $\pm 15\%$. This augmentation therefore modifies each component of the audio label by a random percentage of at most $15\%$ in either direction, drawn from a continuous uniform distribution. The probability that a given component is perturbed is a hyperparameter $\phi$. The code refers to the perturbation as `protuberance`.

In [340]:
class FeatureProtuberance:
    def __init__(self, max_protuberance, phi):
        """
        :param max_protuberance: the maximum percentage of protuberance.
        If 0.5 is given then each component, c, in the feature matrix 
        will have a potential new min/max of c +- 0.5 * c.
        :param phi: the probability that a given component is going to
        be augmented. 
        """
        self.protuberance = max_protuberance
        self.phi = phi
    
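    def __call__(self, features):
        # percentage per component, uniform in [-max_protuberance, +max_protuberance]
        pct = (torch.rand(features.size()) * 2 - 1) * self.protuberance
        # each component is selected for augmentation with probability phi
        aug = torch.rand(features.size()) < self.phi
        return features + aug * pct * features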
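
The intended behavior can be sanity-checked without PyTorch. The NumPy function `perturb` below is a hypothetical stand-in written for this sketch, not part of the notebook, and the feature values are toy numbers; it checks that no component moves by more than the maximum percentage.

```python
import numpy as np

def perturb(features, max_pct=0.15, phi=0.5, rng=None):
    """Perturb each component by a uniform percentage in [-max_pct, max_pct] with probability phi."""
    rng = np.random.default_rng() if rng is None else rng
    pct = rng.uniform(-max_pct, max_pct, size=features.shape)
    chosen = rng.random(features.shape) < phi
    return features + chosen * pct * features

# toy feature matrix: one row per song
features = np.array([[0.755, 0.479, 0.154],
                     [0.455, 0.674, 0.615]])
augmented = perturb(features, rng=np.random.default_rng(0))

# relative change of every component, which should never exceed max_pct
rel_change = np.abs(augmented - features) / np.abs(features)
print(rel_change.max())
```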

## Reversing Moods/Playlist Order
This augmentation is simple: we reverse both the mood state transitions and the order of the audio features.

In [364]:
class Reverse:
    def __init__(self, phi):
        """
        :param phi: (0, 1), the probability that the mood states and 
        features will be reversed.
        """
        self.phi = phi
        
    def __call__(self, moods, features):
        if random.random() > self.phi:
            return moods, features
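        # np.flipud returns a view with negative strides, which
        # torch.from_numpy cannot wrap, so copy before converting
        return (torch.from_numpy(np.flipud(moods).copy()),
                torch.from_numpy(np.flipud(features).copy()))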
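
A small NumPy sketch of the reversal on a toy playlist (the mood indices and feature values are made up for illustration). It also shows why a contiguous copy is useful before any tensor conversion: `np.flipud` returns a view with a negative stride along the first axis.

```python
import numpy as np

moods = np.array([[22], [20]])            # toy mood transition: 22 -> 20
features = np.array([[0.755, 0.479],      # one row per song, in playlist order
                     [0.455, 0.674],
                     [0.907, 0.230]])

# reverse along the first axis and materialise a contiguous copy
rev_moods = np.ascontiguousarray(np.flipud(moods))
rev_features = np.ascontiguousarray(np.flipud(features))

print(rev_moods.tolist())                  # [[20], [22]]
print(np.flipud(features).strides[0] < 0)  # True: the view walks memory backwards
print(rev_features.strides[0] > 0)         # True: the copy is laid out normally
```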

## Re-Querying Spotify Recommendations Based on Existing Playlists
This augmentation uses Spotify's recommendation system to recommend songs that are musically similar to each existing song in the playlist but belong to other genres. We proceed by first determining a set of genres to query. Then, we choose representative songs and artists in each of these genres to use as seeds. Below, we list all the genres being augmented against, together with their chosen artists and songs.
```
    "acoustic" - (Shawn Mendes) 7n2wHs1TKAczGzO7Dd2rGr; (Peaches - Acoustic) 2TitxRtrE8OLcAwcNucvmJ, (17 - Acoustic) 4ensio3acJ17PegjBcmXVg
    "blues" - (Muddy Waters) 4y6J8jwRAwO4dssiSmN91R; (At Last) 4Hhv2vrOTy89HFRcjU3QOx
    "country" - (Taylor Swift) 06HL4z0CvFAxyc27GXpf02; (10,000 Hours) 2wrJq5XKLnmhRXHIAf9xBa, (Speechless) 7q7jyVU0f0hnod8tsaUmxg
    "edm" - (Calvin Harris) 7CajNmpbOovFoOoasH2HaY, (Avicii) 1vCWHaC5f2uS3yhpwWbIA6; (Waiting for Love) 2P4OICZRVAQcYAV2JReRfj
    "indie" - (Phum Viphurit) 5mqguTgtaoCMNMZD6txCh6; (Serotonin) 1q2Zb0v5ZGxKDktEz7wSPb, (Lover Boy) 2rd4FH1cSaWGc0ZiUaMbX9
    "pop" - (Ariana Grande) 66CXWjxzNUsdJxJ2JdwvnR; (Bad Habits) 6PQ88X9TkUIAUIZJHW2upE; (MONTERO) 3BZEcbdtXQSo7OrvKRJ6mb
    "r-n-b" - (Doja Cat) 5cj0lLjcoR7YOSnhnX0Po5; (Leave The Door Open) 3XFY3bKOme4E3QhJqHS1Pi
    "reggae" - (Bob Andy) 6GFAZnwXhNER0WS76WMpS4; (Dream Land) 1haHZa75ytryLGUOtHiGl5
    "rock" - (The Beatles) 3WrFJ7ztbogyGnTHbHJFl2; (Seven Nation Army) 7i6r9KotUPQg3ozKKgEPIN
    "sleep" - (Chandeller) 0LzWeBGGtEazh0mJmyxFXi; (For You Always) 2DsYcmzLobaT4BLM6qoeds
    "soundtracks" - (John Williams) 3dRfiJ2650SZu6GbydcHNb; (He's a pirate) 4AD2dterIUjNt1LFNI9Bvi
    "study" - (afternoon bike ride) 1iXLcpr2SlUwrU2oCP8nI9; (Le Bonheur) 3J8dqKlO2trb3uRGhcx3ok
    "summer" - (Calvin Harris) 7CajNmpbOovFoOoasH2HaY; (good 4 u) 4ZtFanR9U6ndgddUvNcjcG, (Levitating) 463CkQjx2Zk1yXoBuierM9
```
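
One possible way to hold these seeds in code is a dictionary keyed by genre. The layout and the helper `seed_params` are assumptions made for this sketch; only two genres are shown, with IDs copied from the list above. The resulting keys match the `seed_genres`, `seed_artists`, and `seed_tracks` parameters used when querying recommendations.

```python
# two entries copied from the list above; the dict layout itself is an assumption
SEEDS = {
    'blues': {'artists': ['4y6J8jwRAwO4dssiSmN91R'],
              'tracks': ['4Hhv2vrOTy89HFRcjU3QOx']},
    'edm': {'artists': ['7CajNmpbOovFoOoasH2HaY', '1vCWHaC5f2uS3yhpwWbIA6'],
            'tracks': ['2P4OICZRVAQcYAV2JReRfj']},
}

def seed_params(genre, seeds=SEEDS):
    """Build the comma-separated seed parameters for one genre."""
    entry = seeds[genre]
    return {
        'seed_genres': genre,
        'seed_artists': ','.join(entry['artists']),
        'seed_tracks': ','.join(entry['tracks']),
    }

print(seed_params('edm'))
```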

In [5]:
import requests

def get_recommendation(feature, artists, tracks, genres,
                       playlist, order, lock):
    # `url`, `headers`, and `protuberance` are module-level names that
    # must be defined before this function is called
    params = {f'target_{k}': v + v * protuberance
              for k, v in feature.items()}
    # key, mode, and time signature are integer fields, so round them
    params['target_key'] = round(params['target_key'])
    params['target_mode'] = round(params['target_mode'])
    params['target_time_signature'] = round(params['target_time_signature'])
    # get seed artists, tracks, and genres
    params['seed_artists'] = ','.join(artists)
    params['seed_tracks'] = ','.join(tracks)
    params['seed_genres'] = ','.join(genres)
    # get id of the first recommended track
    reqst = requests.get(url, headers=headers, params=params)
    reqst = reqst.json()
    track_id = reqst['tracks'][0]['id']
    # store track id at the appropriate index; the lock guards the shared dict
    with lock:
        playlist[order] = track_id
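
The `playlist`, `order`, and `lock` arguments suggest that one call is issued per song and that the calls run concurrently. The sketch below assumes `threading` is the intended mechanism and uses a hypothetical stand-in, `fake_recommendation`, so that it runs without network access or Spotify credentials; the track ids are placeholders.

```python
import threading

def fake_recommendation(track_id, playlist, order, lock):
    # stand-in for get_recommendation: no network call, just derives a fake id
    with lock:
        playlist[order] = f'rec-{track_id}'

source_tracks = ['a', 'b', 'c']   # placeholder track ids
playlist = {}                     # shared dict: position in playlist -> recommended id
lock = threading.Lock()

threads = [threading.Thread(target=fake_recommendation,
                            args=(track, playlist, index, lock))
           for index, track in enumerate(source_tracks)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# rebuild the playlist in its original order, regardless of completion order
ordered = [playlist[i] for i in range(len(source_tracks))]
print(ordered)  # ['rec-a', 'rec-b', 'rec-c']
```

Keying the shared dict by position keeps the song order intact even when the threads finish out of order.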

In [8]:
data

Unnamed: 0,moods_states,features
iso26,"[[22], [20]]","[[0.755, 0.479, 0.154, 0.0261, 2, 0.11, -15.05..."
iso23,"[[28], [20]]","[[0.455, 0.674, 0.615, 0.00091, 0, 0.208, -8.1..."
iso30,"[[19, 1, 9], [6, 21, 25]]","[[0.907, 0.23, 0.159, 0.924, 5, 0.102, -16.315..."
iso19,"[[17], [26], [0]]","[[0.985, 0.653, 0.178, 0.000339, 9, 0.134, -13..."
iso22,"[[28], [26]]","[[0.155, 0.221, 0.0879, 0.698, 7, 0.0869, -16...."
iso24,"[[22], [11]]","[[0.975, 0.462, 0.203, 0.113, 10, 0.102, -16.3..."
iso18,"[[13], [16], [25]]","[[0.124, 0.585, 0.52, 7.01e-05, 5, 0.131, -6.1..."
iso25,"[[22], [26]]","[[0.953, 0.582, 0.199, 0.00852, 10, 0.135, -10..."
iso17,"[[23], [4], [9]]","[[0.755, 0.479, 0.154, 0.0261, 2, 0.11, -15.05..."
iso20,"[[5], [16], [21]]","[[0.948, 0.571, 0.0274, 6.43e-06, 9, 0.322, -2..."
