# Homework 6
**Total Points: 5**

**Instructions:**
1. Complete parts 1 through 5, filling in code or responses where marked with `# YOUR CODE HERE` or `# YOUR ANALYSIS HERE`.
2. The libraries you need, in the order you need them, have already been coded. Do not import additional libraries or move import commands.
3. When finished, run the full notebook by selecting <b>Kernel > Restart & Run All</b>. </li>
4. Submit this completed notebook file to <b>NYU Classes</b>. </li>**(Important: Only submit your .ipynb file! Do not submit the entire dataset.)**

In this assignment you will test different ML techniques to classify solo instruments. This assignment uses a large dataset (9+ GB) which you will download separately: *Medley-solos-DB*:

<blockquote>
V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
</blockquote>

**Grading:** Each part is worth 1 point.

**Important Note**: The way you implement the code in your work for each assignment is entirely up to you. There are often many ways to solve a particular problem, so use whatever method works for you. The only requirement is that you follow the instructions, which may prohibit or require certain libraries or commands. Refer to https://scikit-learn.org/ for implementation instructions and tutorials.

In [51]:
import numpy as np
import pandas as pd
import librosa
from librosa import feature
from sklearn import neighbors
from sklearn import neural_network
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


## Prologue: Download the Dataset and Metadata

The data you will need is a folder containing wav audio files, and a separate .csv file with metadata. You can download both from the following page:

https://zenodo.org/record/3464194#.X4G_oi2z3kJ

Place both the folder and csv file into the same directory as your `Homework-6.ipynb` file, such that the folder stucture is as follows:

`
 .
 <--   Homework-6.ipynb
 <--   Medley-solos-DB_metadata.csv
 <--   Medley-solos-DB
 |     <--   *.wav files
`

The audio files contain recordings from 8 different instuments which have already been labeled and separated into training, validation, and test sets. Each audio file is the same length, and there are many example files from each instrument.

Each audio file has a unique id number associated with it ('uuid4'). This id is important when extracting the audio data and making sure that the file has the correct label, as referenced in the csv file. The following two cells will load and display the metadata into a `Medley_Data` DataFrame. No changes should be made to the following code.

In [2]:
# Load amd Check the csv file

Medley_Data = pd.read_csv("Medley-solos-DB_metadata.csv")
#Medley_Data


### Helper Function: `get_file_name_and_label()` and `get_ids()`

The following helper functions have been provided for you.

In [3]:
def get_file_name_and_label(uuid, path='Medley-solos-DB/', dataset=Medley_Data):
    """ Returns full file name and path from a uuid
    
    Parameters
    ----------
    
    uuid: str 
        the unique id (uuid4) for the audio file
    
    path: str
        relative path to audio files
        
    dataset: pandas.DataFrame
        the DataFrame to consult (Medley_Data)
    
    Returns
    -------
    
    filename: str
        relative path and filename
    label: int
        the label associated with that filename
    
    """
    
    rd = dataset.loc[ (dataset['uuid4'] == uuid) ]
    file = path + 'Medley-solos-DB' + '_' + str(rd.values[0,0]) + '-'  + str(rd.values[0,2]) + '_' + rd.values[0,4] + '.wav'
    label = rd.values[0,2]
    return(file, label)
                       
def get_ids(subset, path = 'Medley-solos-DB/', dataset = Medley_Data):

    """ Get a np array of all uuids or a subset of files in the dataset
    
    Parameters
    ----------
    
        subset: str
            one of 'training', 'validation, 'test', or 'all'
            
        path: str
            relative path to the audio files
            
        dataset: pd.DataFrame
            The Medley-solos-DB dataframe to search
         
    Returns
    -------
        filename: np.array
            Medley-solos-DB file name (or 0 if not found)
    
    """
    
    file_array = np.array([])
    rd = dataset.loc[ (dataset['subset'] == subset) ]
    if len(rd.index) < 1:
        file_array = np.array([0])
    else:
        k = 0
        for i in range(len(rd.index)):
            file_array = np.append(file_array,rd.iloc[k,4])
            k += 1
    return(file_array)


# Divides up file names into training, validation, and test sets
tracks_train =  get_ids('training')
tracks_validate = get_ids('validation')
tracks_test = get_ids('test')

print("There are {} tracks in the training set".format(len(tracks_train)))
print("There are {} tracks in the validation set".format(len(tracks_validate)))
print("There are {} tracks in the test set".format(len(tracks_test)))

There are 5841 tracks in the training set
There are 3494 tracks in the validation set
There are 12236 tracks in the test set


## Part 1a: Compute Features

Create a function `compute_features()` such that the input is one audio file and the output is a single feature vector. This function should do the following:
1. Load audio into a sample array.
2. Compute the MFCCs of the input audio, and remove the first (0th) coeficient.
3. Compute the summary statistics of the MFCCs over time:
    1. Find the mean and standard deviation for each feature (2 values for each feature)
    2. stack these statistics into single 1-d vector of size (2*(n_mfccs-1))
4. Return the 1-d vector.

In [4]:
def compute_features(audiofile, n_fft=2048, hop_length=512, n_mels=128, n_mfcc=20):
    """Compute features for an audio file
    
    Parameters
    ----------
    audiofile : str
        name of audio file (with relative directory path)
    n_fft : int
        Number of points for computing the fft
    hop_length : int
        Number of samples to advance between frames
    n_mels : int
        Number of mel frequency bands to use
    n_mfcc : int
        Number of mfccs to compute
    
    Returns
    -------
    features: np.array (1, 2* (n_mfcc - 1))
        feature vector

    """
    # 1 Load audio into a sample array.
    y, fs = librosa.load(audiofile)
    
    # 2 Compute the MFCCs of the input audio, and remove the first (0th) coeficient.
    mfccs = librosa.feature.mfcc(y, fs) 
    mfccs = mfccs[1:len(mfccs)] # 19 by len

    # 3 Compute the summary statistics of the MFCCs over time:
    # 3a Find the mean and standard deviation for each feature (2 values for each feature)
    # 3b stack these statistics into single 1-d vector of size (2*(n_mfccs-1))
    means = mfccs.mean(axis = 1) # for mean of each row
    stds = mfccs.std(axis = 1) # for std of each row

    #features = np.concatenate(means,stds)
    features = np.append(means, stds)
    
    # 4 Return the 1-d vector. 
    return features


## Part 1b: Create Feature Set

Create a function `create_feature_set()` where the input is an array of audio files and output is a normalized feature set and an accompanying vector of class labels. This function should:
1. Iterate through all audio files in a list of uuids. The training, test, and validation lists have been created for you. For each uuid:
    1. Use `get_file_name_and_label()` to retrieve the audio file name and associated label
    2. Use `compute_features()` to get the 1-d vector for that audio file.
    3. Append the feature vector and label to their respective arrays, and continue to the next file.
2. When finished, output 2 numpy arrays: the feature matrix (n_samples, 2*(mfccs-1)) and the label (n_samples,)

In [5]:
def create_feature_set(id_list):
    """Create feature set from list of input ids.

    Parameters
    ----------
    id_list: np.array
        array of uuid (track_test, track_validate, track_train)

    Returns
    -------

    features: np.array (n_samples, n_features)
        The standard deviation of the features
    labels: np.array (n_samples)
        corresponding label for each feature

    """
    
    
    # 1 Iterate through all audio files in a list of uuids. The training, test, and validation lists have been created for you. For each uuid:
        
    # 2 Use get_file_name_and_label() to retrieve the audio file name and associated label
    # Use compute_features() to get the 1-d vector for that audio file.
    # Append the feature vector and label to their respective arrays, and continue to the next file.
    nTracks = len(id_list)
    
    features = np.zeros((nTracks, 38)) 
    labels = np.zeros(nTracks)
    counter = 0
    for id1 in id_list:
        id1_filename, id1_label = get_file_name_and_label(id1)
        id1_features = compute_features(id1_filename)
        features[counter,:] = id1_features
        labels[counter] = id1_label
        counter +=1
    
    # 3 When finished, output 2 numpy arrays: the feature matrix (n_samples, 2*(mfccs-1)) and the label (n_samples,)
    return features, labels

## Part 2a: Get Mean and Standard Deviation

Create a function `get_stats()` which gets the mean and standard deviation for each feature in the input matrix.


In [6]:
def get_stats(features):
    """ Get mean and standard deviation of each feature in a set
 
    Parameters
    ---------
    
    features: np.array (n_samples, n_features)
        feature set
 
    Returns
    -------
     
    mean: np.array (n_features)
        mean of input feature set
    std_dev: np.array (n_features)
        standard deviation of input feature set

    """
    # gets the mean and standard deviation for each feature in the input matrix.
    means = features.mean(axis = 0) # for mean of each col
    stds = features.std(axis = 0) # for std of each col
    return means, stds


### Getting Everything Ready

The code in the following cell has been done for you. When all is well, run the code to compute features and training labels for the 3 data sets in Medley-solos-DB.

**Hint:** Since you are processing many GB of data, this code will take a while to run. To make sure everything works as expected, you may want to test on small subset of the data, like `tracks_train[0:500]`. Although the output won't be valid for the ML experiments, you can verify that the shapes of the output matrices and vectors are correct. 

**Another Hint:** This code will save feature sets and labels to your computer so it won't need to be re-computed if not necessary. 

In [7]:
# THIS CODE IS PROVIDED FOR YOU

# Change this to True if you want to load prevously-computed features
load_saved_tests = True ######## originally false

if not load_saved_tests:
    test_set, test_labels = create_feature_set(tracks_test)
    print("Test Set: " + str(test_set.shape))
    train_set, train_labels = create_feature_set(tracks_train)
    print("Training Set: " + str(train_set.shape))
    validate_set, validate_labels = create_feature_set(tracks_validate)
    print("Validation Set: " + str(validate_set.shape))
    np.savetxt('test_set.csv', test_set, delimiter=',')
    np.savetxt('test_labels.csv', test_labels, delimiter=',')
    np.savetxt('train_set.csv', train_set, delimiter=',')
    np.savetxt('train_labels.csv', train_labels, delimiter=',')
    np.savetxt('validate_set.csv', validate_set, delimiter=',')
    np.savetxt('validate_labels.csv', validate_labels, delimiter=',')
else:
    test_set = np.loadtxt('test_set.csv',delimiter=',')
    test_labels = np.loadtxt('test_labels.csv',delimiter=',')
    train_set = np.loadtxt('train_set.csv',delimiter=',')
    train_labels = np.loadtxt('train_labels.csv',delimiter=',')
    validate_set = np.loadtxt('validate_set.csv',delimiter=',')
    validate_labels = np.loadtxt('validate_labels.csv',delimiter=',')
    print("Test Set: " + str(test_set.shape))
    print("Training Set: " + str(train_set.shape))
    print("Validation Set: " + str(validate_set.shape))
    

Test Set: (12236, 38)
Training Set: (5841, 38)
Validation Set: (3494, 38)


## Part 2b: Normalize Feature Sets

Using `get_stats()` find the mean and standard deviations for the training set. Then use those statistics to make all 3 data sets have a mean of 0 and standard deviation of 1.

In [8]:
# Using get_stats() find the mean and standard deviations for the training set. 
training_means, training_stds = get_stats(train_set)
testing_means, testing_stds = get_stats(test_set)
validation_means, validation_stds = get_stats(validate_set)

# Then use those statistics to make all 3 data sets have a mean of 0 and standard deviation of 1.
new_training = (train_set - training_means)/training_stds
new_testing = (test_set - testing_means)/testing_stds
new_validation= (validate_set - validation_means)/validation_stds


## Part 3: k-Nearest Neighbor

Using the data from part 1, run a kNN classification experienment:

- Use `sklearn` entirely
- Run tests on the validation set with k = 1, 5, 20, and 50
- When you decide on the best settings (best f-measure), run the experiment on the test set and output the f-measure and a confusion matrix.

In [61]:
knn = neighbors.KNeighborsClassifier(n_neighbors=1) #defaul nneighbors i 5

#3 "Fit" the training data to the model (this basically means "train the model").
thefit = knn.fit(new_training, train_labels)

#4 Use cross_val_scores() on the classifier, setting cv=10 and output the scores.
#f1 = cross_val_score(knn, new_training, train_labels, cv=10, scoring='precision')

#5 Get predictions using `predict() with the test data.
final_predictions = knn.predict(new_validation)

#6 Print the confusion matrix using the true vs. predicted labels.
c = confusion_matrix(validate_labels, final_predictions)
# print("confusion matrix:\n" , c) # uncomment to see the confusion matrix while trying different k values
# print("f1 score", f1_score(validate_labels, final_predictions, average="micro")) # uncomment to see the f1 scores while trying different k values
print("\n", "Now, we test on the Testing set:", "\n")
test_predictions = knn.predict(new_testing)
c = confusion_matrix(test_labels, test_predictions)
print("confusion matrix:\n" , c) # Final confusion matrix w test set
print("f1 score", f1_score(test_labels, test_predictions, average="micro")) # fomal f1 scores w test set






 Now, we test on the Testing set: 

confusion matrix:
 [[ 152   27    9   34  280   19    4  207]
 [   0  814    1    0   71    8    0   61]
 [   0    5  696    0   57  118    0  266]
 [  75   85   92  402 1293   17   83 1120]
 [   0   31    2    1 2558    0    0   17]
 [   0   89    6    0  111   25   11   83]
 [  15    1    1    5   33   17  222  112]
 [  14  146    9   31  604    4    2 2090]]
f1 score 0.5687316116377902


## Part 4: Multi-Layered Perceptron (Neural Network)

Using the same data, run the same test using the MLP classifier.

- Use `sklearn` entirely
- Run tests on the validation set to experiment with the number of iterations and size and number of hidden layers.
- Initially, try setting `max_iter=100` and `hidden_layer_sizes=(5,2)` (meaning 2 hidden layers of sizes 5 and 2.
- When you decide on the best settings (best f-measure), run the experiment on the test set and output the f-measure and a confusion matrix.


In [10]:
# INCOMPLETE
# mlp = neural_network.MLPClassifier()
# mlp.fit(new_training, train_labels) 

## Part 5: Analysis

For each machine learning method:

1. Predict the labels for the test set using hyperparameters from the validation set
2. Compute & print the f-measures
3. Compute and print the confusion matrix

For each method, report on the following:

4. Which instrument class has the best & worst performance?
5. For the worst source, what other sources are commonly confused? Why?
6. Listen to the audio for examples the classifier got wrong. What do they have in common?

In [11]:
# KNN: RESULTS FROM VARYING K
# k = 5 I got f1 = 0.6963365769891242
# k = 1 I got f1 = 0.6983400114481969
# k = 20 I got f1 = 0.6734401831711505
# k = 50 I got f1 = 0.6296508299942759 
# These were with 'micro' averaging. The highest f1 value is for k = 1, but the k=5 is very close! 
# Uncomment The F-measure and confusion matrix is printed at the bottom of part 3, the k value can be changed to see the other confusion matrices.

# PART 2 ---
# The f-score with the test set is 0.5687316116377902
# PART 3 ---
# The confusion matrix is printed above at the bottom of Part 3
# PART 4  ---
# The best performance: piano
# The worst performance: flute
# PART 5 ---
# Flute is often confused with paino violin. In fact, flute is predicted to be a piano or a violin much more often than it is precicted to be a flute. 
# PART 6 ---
# I listened to clips of the flute and violin. They play in the same frequency range and have a similar mild vibrato. I can see why the classifier confuses flute and violin.
# The piano also often plays in a high range. I think this dataset could benefit from more varied and representative piano samples, as I didn't hear many samples of large chords, or more on the low-frequency part of the piano. 
# Of course, I can't listen to all of the piano samples but this is the impression I got from the subset I did listen to. 
# It seems like this KNN classifier tends to classify things as piano more than anything else. This is some kind of bias that would be important to investigate if I was working on this as a research project.

# MULTI-LAYER PERCEPTRON
# I did not get to the MLP part of the assignment 


`# YOUR ANALYSIS HERE`