# KNN Model

The first model I'd like to try is a KNN (k-nearest neighbors) model. I expect this model to perform the worst of the four models I've decided to create. This is because...
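
As a quick reminder of what a KNN classifier does, here is a minimal sketch on synthetic data. None of it comes from FMA: the two clusters, the `Rock`/`Folk` labels, the `*_demo` names, and `n_neighbors=5` are all assumptions made for illustration. KNN is distance based, so the sketch standardizes the features before the neighbor search.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the FMA features): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
y_demo = np.array(['Rock'] * 50 + ['Folk'] * 50)

# Standardize first, then classify by majority vote among the 5 nearest neighbors.
knn_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_demo.fit(X_demo, y_demo)

# One query point near each cluster center.
pred_demo = knn_demo.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
print(pred_demo)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training data, which matters once a hyperparameter search is layered on top.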

In [9]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report

In [13]:
path = '../data/fma_metadata/'
tracks = pd.read_csv(path + 'tracks.csv', index_col=0, header=[0, 1])
features = pd.read_csv(path + 'features.csv', index_col=0, header=[0, 1, 2])

## Retrieve all of the features and labels

In [48]:
# This code is selecting the small dataset and extracting the features to use in the model.
# It is also separating the tracks into train, validation, and test sets.
small = tracks['set', 'subset'] == 'small'

# FMA has already separated the tracks into the three sets (train, val, test)
train = tracks['set', 'split'] == 'training'
val = tracks['set', 'split'] == 'validation'
test = tracks['set', 'split'] == 'test'

# Load the genre labels
y_train = tracks.loc[small & train, ('track', 'genre_top')]
y_val = tracks.loc[small & val, ('track', 'genre_top')]
y_test = tracks.loc[small & test, ('track', 'genre_top')]

# Load the MFCCs and convert them to NumPy ndarrays. I am only going to use the means for now.
# We have 20 MFCCs. The mfcc function in librosa, which FMA uses to compute all the feature
# statistics, returns an array with one row per coefficient and one column per frame. From what
# I've seen in a bunch of articles and papers, people generally take the mean of each coefficient
# across frames and use the vector of means as 20 features. This is what I will do as well.
X_train_mfcc = features.loc[small & train, 'mfcc']['mean'].to_numpy()
X_val_mfcc = features.loc[small & val, 'mfcc']['mean'].to_numpy()
X_test_mfcc = features.loc[small & test, 'mfcc']['mean'].to_numpy()

print('{} training examples'.format(y_train.size))
print('{} cross validation examples'.format(y_val.size))
print('{} testing examples'.format(y_test.size))
print('{} features, {} classes'.format(X_train_mfcc.shape[1], np.unique(y_train).size))

6400 training examples
800 cross validation examples
800 testing examples
20 features, 8 classes
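
To make the "vector of means" idea concrete, here is a small sketch of how a per-track MFCC matrix collapses to 20 values. The matrix is random data standing in for real librosa output, and the names `mfcc_demo` and `mfcc_mean_demo` as well as the frame count of 100 are hypothetical. FMA already ships these means in features.csv, so this is only an illustration of the reduction, not a step I need to run.

```python
import numpy as np

# Hypothetical MFCC matrix for one track: 20 coefficients x 100 frames.
# Random values stand in for real librosa output.
rng = np.random.default_rng(0)
mfcc_demo = rng.normal(size=(20, 100))

# Averaging over the time axis (axis=1) leaves one mean per coefficient.
mfcc_mean_demo = mfcc_demo.mean(axis=1)
print(mfcc_mean_demo.shape)
```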


### Adding more features

Now that I've performed PCA on the MFCCs, I'd like to add a few more features to the model: the spectral centroid, spectral rolloff, zero crossing rate, RMSE, and spectral bandwidth.

In [59]:
# Load the zero crossing rate (ZCR) means and convert them to NumPy ndarrays
X_train_zcr = features.loc[small & train, 'zcr']['mean'].to_numpy()
X_val_zcr = features.loc[small & val, 'zcr']['mean'].to_numpy()
X_test_zcr = features.loc[small & test, 'zcr']['mean'].to_numpy()

X_train_zcr.shape

(6400, 1)
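
Each extra feature arrives as its own column block, so at some point the blocks need to be joined into a single feature matrix. A minimal sketch of that step, assuming placeholder arrays with the same shapes as the MFCC and ZCR training blocks (`mfcc_block`, `zcr_block`, and `X_combined` are hypothetical names, and the values are dummies):

```python
import numpy as np

# Placeholder arrays shaped like the real training blocks:
# 20 MFCC means and 1 ZCR mean per track.
mfcc_block = np.zeros((6400, 20))
zcr_block = np.ones((6400, 1))

# Stack the blocks side by side so each row is one track's full feature vector.
X_combined = np.hstack([mfcc_block, zcr_block])
print(X_combined.shape)
```

The same call extends to the remaining features by adding their blocks to the list, as long as every block has one row per track in the same order.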