# Data Augmentations & Synthetic Data Generation
This notebook will explore some options for synthetic data generation as well as augmentations on existing data.

In [204]:
import os
import json
import torch
import numpy as np
import pandas as pd
from pprint import pprint
from sklearn import decomposition

In [206]:
data = pd.read_csv('data/train.csv', index_col=0)
data = data[data['features'] != '[null]']  # remove empty plists

In [207]:
data

Unnamed: 0,moods_states,features
iso13,"[[19], [4], [9]]","[[0.932, 0.433, 0.329, 0.0474, 3, 0.1, -13.288..."
iso21,"[[28], [11]]","[[0.942, 0.252, 0.314, 0.676, 5, 0.0892, -18.1..."
iso09,"[[22], [11]]","[[0.778, 0.407, 0.308, 1.46e-05, 10, 0.092, -9..."
iso10,"[[5], [21], [14]]","[[0.942, 0.252, 0.314, 0.676, 5, 0.0892, -18.1..."
iso17,"[[23], [4], [9]]","[[0.755, 0.479, 0.154, 0.0261, 2, 0.11, -15.05..."
iso12,"[[3], [2], [25]]","[[0.186, 0.548, 0.532, 0.000263, 5, 0.217, -7...."
iso11,"[[22], [25], [29]]","[[0.92, 0.587, 0.229, 0, 10, 0.1, -11.254, 0, ..."
iso20,"[[5], [16], [21]]","[[0.948, 0.571, 0.0274, 6.43e-06, 9, 0.322, -2..."
iso19,"[[17], [26], [0]]","[[0.985, 0.653, 0.178, 0.000339, 9, 0.134, -13..."
iso08,"[[19], [6]]","[[0.251, 0.513, 0.767, 0.278, 7, 0.221, -8.386..."


## Augmentations
- **Dimensionality reduction through PCA.** This strategy isn't actually an augmentaion technique, but reducing the size of the model by reducing the dimensionality of the required output.

- **Reversing mood states and song order.** This is fairly straightforward, we simply generate new playlists by inverting the order of the mood states and their respective songs. For example a playlist from sad to happy would be inverted from happy to sad. We note that although such playlists may have no intrinsic value during inference, it may help the model learn the relationship between the various embeddings and audio features.

- **Random augmentation of audio features.**

### Dimensionality Reduction with PCA
We proceed with PCA on all of the audio features on each playlist, then we aggregate all of the components together through an average. We start by combining each playlist into a matrix and applying transposition such that each column represents an observation. Let $X$ be a $11\times N$ matrix, where $N$ is the number of observations, and $11$ represents the number of audio features. We first compute the mean-deviation form $B$:

\begin{align}
M &= \frac{1}{N}\begin{bmatrix}X_1 & \ldots & X_N\end{bmatrix}, \\
B &= \begin{bmatrix}\hat{X_1} & \ldots & \hat{X_N}\end{bmatrix}, \qquad \hat{X_i} = X_i - M.
\end{align}

Then, the covariance matrix is given by $S =\frac{1}{N-1}BB^T$. Here, the goal of PCA is to find an orthogonal matrix $P$, and a matrix $Y$ such that $B = PY$. Such that the rows of $Y$ are uncorrelated. The covariance matrix for $Y$ is given by:

\begin{align}
Y &= P^TB, \\
YY^T &= P^TB(P^TB)^T, \\
\frac{1}{N-1}YY^T &= P^TBB^TP\frac{1}{N-1}, \\
Y_{\text{covar}} &= P^T\Big(\frac{1}{N-1}BB^T\Big)P, \\
Y_{\text{covar}} &= P^TSP.
\end{align}

Since $S$ is symmetrical it must be orthogonally diagonalizable. Let $D$ be a diagonal matrix with the entries on its main diagonals be the eigenvalues of $S$. Therefore $P$ corresponds to the unit eigenvectors $u_1,\ldots,u_p$:
$$S = PDP^T.$$

Suppose that the entries on the main diagonal of $D$ are $\lambda_1, \ldots, \lambda_p$. Then, we can reduce the dimensionality of $X$ by finding values which $\frac{\lambda_i}{\text{tr}(D)}$ are the smallest.

In [234]:
# construct list a of matrices, where each matrix represents a playlist
playlists = []
for entry in range(len(data)):
    playlist = json.loads(data.iloc[entry]['features'])
    # stack `song vectors` together into a matrix
    plist = []
    for song in playlist:
        plist.append(np.array(song))
    playlists.append(np.stack(plist))


In [250]:
all_playlists = np.vstack(playlists)
print(all_playlists.shape)

(234, 11)


In [253]:
all_playlists = np.vstack(playlists)
mdf = playlist - np.mean(playlist, axis=0)  # mean-deviation form
covar = np.cov(mdf.T)  # covariance matrix
eigval, eigvec = np.linalg.eig(covar)  # eigenvalue/eigenvectors of covar
pcomp = np.real(eigval) / sum(np.real(eigval))
print(pcomp)

In [258]:
print(f'mean: {np.mean(playlist, axis=0)}')
print(f'%variance: {pcomp}')
print(f'corresponding component eigenvectors:')
for index, eig in enumerate(eigvec[[0,1,2],:]):
    print(f'component {index+1}: {eig}')

mean: [0.361640 0.663500 0.555900 0.000077 5.700000 0.129330 -6.746100 0.600000
 0.063500 122.359400 0.427800]
%variance: [0.968516 0.025052 0.005954 0.000374 0.000076 0.000016 0.000008 0.000004
 0.000000 -0.000000 0.000000]
corresponding component eigenvectors:
component 1: [-0.007128 -0.016943 -0.110670 0.124335 0.627286 -0.043151 -0.546551
 -0.084440 -0.128198 0.007786 -0.504302]
component 2: [0.001854 0.011303 0.022219 0.038830 -0.504164 -0.341400 0.159444
 -0.282795 -0.355778 0.009416 -0.628613]
component 3: [-0.001611 0.012157 0.073040 -0.065926 -0.154649 -0.750761 -0.354121
 0.293144 0.432653 -0.000660 0.063927]


We note that the first three components make up over 99% of the variance of our covariance matrix. Therefore, we can reduce $P$ from $B=PY$ to a $p\times 3$ matrix. This will make prediction much easier as we can simply train our network to predict $Y: (3\times N)$, and then multiply it by our matrix of eigenvectors to yield $B$.