# Homework 5
**Total Points: 5**

**Instructions:**
1. Complete parts 1 through 5, filling in code or responses where marked with `# YOUR CODE HERE` or `# YOUR ANALYSIS HERE`.
2. The libraries you need, in the order you need them, have already been imported. Do not import additional libraries or move import commands.
3. When finished, run the full notebook by selecting <b>Kernel > Restart & Run All</b>.
4. Submit this completed notebook file to <b>NYU Classes</b>.

In this assignment you will create a Nearest Neighbor classifier, train it, and test it. The goal of your classifier is to predict whether the input audio contains a "talking" voice or a "singing" voice based on the training data. This assignment contains a subfolder called `audio`, which has multiple short audio files from the dataset *VocalSet: A Singing Voice Dataset*.

**Grading:** Each part is worth 1 point.

**Important Note**: The way you implement the code in your work for each assignment is entirely up to you. There are often many ways to solve a particular problem, so use whatever method works for you, including testing different parameter values. The only requirement is that you follow the instructions, which may prohibit or require certain libraries or commands.

Refer to https://scikit-learn.org/ for implementation instructions and tutorials.

In [4]:
import numpy as np
import librosa
from librosa import feature
from sklearn.model_selection import train_test_split

# Imported for plotting/testing
import matplotlib.pyplot as plt

## Part 1: Prepare Data
Create a function `prepare_data()` that will take an array of files and prepare them for a nearest neighbor machine learning task. This function should:
1. Take the input array of audio files for one class (arrays are provided in Part 2),
2. Concatenate the audio into one long audio file,
3. Generate an `mfccs` matrix from this one audio file (use `librosa.feature.mfcc()` for this),
4. Remove the first MFCC,
5. Generate a `labels` array that repeats the label number once per sample (every MFCC vector from this class gets the same label),
6. Output both `mfccs` and `labels` from the function.
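
Steps 4 and 5 can be checked on a stand-in matrix before any audio is loaded. The sketch below assumes the `(n_mfcc, n_frames)` layout that `librosa.feature.mfcc()` returns; `fake_mfccs`, the frame count, and the label value are made up for illustration.

```python
import numpy as np

# Stand-in for an MFCC matrix: 13 coefficients (rows) by 6 frames (columns).
# The values are arbitrary; only the shape matters here.
fake_mfccs = np.arange(13 * 6, dtype=float).reshape(13, 6)

# Step 4: drop the first coefficient (row 0), keeping rows 1..12.
trimmed = fake_mfccs[1:, :]

# Step 5: one label per frame, i.e. one per column of the matrix.
label = 0
labels = np.full(trimmed.shape[1], label)

print(trimmed.shape)  # (12, 6)
print(labels)         # [0 0 0 0 0 0]
```

`np.full` builds the label array in one call, which is equivalent to repeating the label in a loop.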

In [23]:
def prepare_data(audiofiles, label):
    
    """ Prepare data for Nearest Neighbor classification
        Hint: When generating MFCCs, you can use Librosa's default
        or experiment. 13 is a common number of MFCC coefficients to retain.
    
    Parameters
    ----------
    
    audiofiles: np.array
        array of input audio files
    
    label: int
        The label to give these audio features
    
    Returns
    -------
    
    mfccs: np.array
        mfcc feature set
    labels: np.array
        labels for feature set
        
    """
    
    # -------------------------------------------------- #
    # Note: mfcc has shape (features, samples).          #
    # labels is a one-dimensional array of length        #
    # (samples). The two are aligned in the next part.   #
    # -------------------------------------------------- #
    
    # Extracting data from files and combining them to one array
    combined_audio_file = np.array([])
    
    for file in audiofiles:
        data, sr = librosa.load(file)
        assert sr == 22050
        combined_audio_file = np.append(combined_audio_file, data)

    # Generate MFCC matrix for the combined audio file.
    # y and sr are passed by keyword; recent librosa versions require it.
    mfcc = librosa.feature.mfcc(y=combined_audio_file, sr=sr, n_mfcc=13)
    
    # Remove the first MFCC (row 0)
    mfcc = mfcc[ 1 : , : ]
    
    # Generating the label array
    labels = [ label for i in range(mfcc.shape[1]) ]
    labels = np.array(labels)

    return mfcc, labels

## Part 2: Set Up the Experiment

Using `prepare_data()` prepare your Nearest Neighbor classification experiment as follows:

- Run `prepare_data()` twice, once using the files in array `talking_files` and once using `singing_files`,
- Append the feature vectors and label vectors so that you have two data objects; one with all features and one with all labels.
- Take care that the objects' dimensions match and that `label[n]` corresponds to `feature_vector[n]`.

**Hint:** Rows should be samples, and columns should be features. NumPy indexes rows first, so the feature array has shape `(samples, features)`.
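
A minimal sketch of the alignment described above, using made-up arrays in place of real MFCCs. The names `class_a`, `class_b`, and the label values are hypothetical; the only assumption carried over from Part 1 is that each class arrives in `(features, samples)` layout.

```python
import numpy as np

# Hypothetical per-class outputs in (features, samples) layout.
class_a = np.zeros((12, 3))   # 3 samples of class "A"
class_b = np.ones((12, 5))    # 5 samples of class "B"
labels_a = np.full(3, "A")
labels_b = np.full(5, "B")

# Join along the sample axis, then transpose to (samples, features).
features = np.concatenate((class_a, class_b), axis=1).T
labels = np.concatenate((labels_a, labels_b))

# Row n of `features` and entry n of `labels` now describe the same sample.
assert features.shape[0] == labels.shape[0]
print(features.shape, labels.shape)  # (8, 12) (8,)
```

Concatenating the features and the labels in the same class order is what keeps `labels[n]` matched to `features[n]`.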
    

In [24]:
# Use singing_files and talking_files for Part 2; vibrato_files is for Part 5
singing_files = np.array(['audio/f1_row_straight.wav','audio/f2_row_straight.wav','audio/m1_row_straight.wav','audio/m2_row_straight.wav'])
talking_files = np.array(['audio/f1_row_spoken.wav','audio/f2_row_spoken.wav','audio/m1_row_spoken.wav','audio/m2_row_spoken.wav'])
vibrato_files = np.array(['audio/f1_row_vibrato.wav','audio/f2_row_vibrato.wav','audio/m1_row_vibrato.wav','audio/m2_row_vibrato.wav'])

singing_mfcc, singing_labels = prepare_data( singing_files, "S")
talking_mfcc, talking_labels = prepare_data( talking_files, "T")

feature_vector = np.append(singing_mfcc, talking_mfcc, axis = 1).T
labels = np.append(singing_labels, talking_labels)

print(feature_vector.shape)
print(labels.shape)

(4924, 12)
(4924,)


## Part 3: Find the Nearest Neighbor
Create a function `find_nearest_neighbor()` without using scikit-learn (except for the function in step 1). You may use a process of your choice. One way to write the code is as follows:

1. Separate your data into training and testing sets (use scikit-learn's `train_test_split()` for this; start with a `test_size` of 0.1, i.e. 10%).
2. For the first feature vector in the test set, find the Euclidean distance between it and every vector in the training set, keeping track of the distances in an array.
3. When finished, use `np.argmin()` to find the index of the smallest distance. That is the Nearest Neighbor of that first test feature vector.
4. If the label of the training vector with the lowest distance is the same as the label of the test vector, then it is a match. Otherwise it is wrong.
5. Repeat steps 2-4 for every test vector.
6. When finished, the program should output the number of correct vs incorrect matches.
7. Run the program multiple times, making sure that the test/training sets are random each time. You can also adjust the size of the test set.

**Note that this implementation will take a while to run.** Also note that, since this function is only finding the nearest neighbor (and not k-neighbors), there is no validation set.


The equation for finding the Euclidean distance between two N-dimensional vectors $p$ and $q$ is:

$$d(p,q) = \sqrt{ \sum_{n=0}^{N-1}{ (p_n - q_n)^2}}$$
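
The formula translates directly into NumPy. The sketch below is one possible way to write it, not the required solution; `euclidean_distance` and `nearest_index` are illustrative names, and the second function uses broadcasting to compare one test vector against every training row at once.

```python
import numpy as np

def euclidean_distance(p, q):
    """Distance between two vectors of equal length, per the formula above."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

def nearest_index(train, test_vector):
    """Index of the training row closest to test_vector.

    Broadcasting subtracts test_vector from every row of `train` at once,
    which avoids an explicit Python loop over the training set.
    """
    distances = np.sqrt(np.sum((train - test_vector) ** 2, axis=1))
    return int(np.argmin(distances))

# A 3-4-5 right triangle gives an easy value to verify by hand.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0

train = np.array([[0.0, 0.0], [10.0, 10.0], [3.0, 3.0]])
print(nearest_index(train, np.array([2.0, 4.0])))  # 2
```

Because the square root is monotonic, the nearest neighbor is the same whether or not the root is taken; it is kept here so the values match the formula.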

In [25]:
def find_nearest_neighbor(data, labels, test_size):
    
    """ Perform Nearest Neighbor classification
    
    Parameters
    ----------
    
    data: np.array
        feature set
    
    labels: np.array
        true labels for the features
        
    test_size: float
        Between 0 and 1, the amount of data set aside for testing
          
    
    Returns
    -------
    
    [correct, incorrect]: np.array
        the number of correct and incorrect results

    """
    # Separate the data to training and test sets
    X_train, X_test, y_train, y_test = train_test_split(data, labels, 
                                                        test_size = test_size )
    
    # Finding the nearest label for every vector in test set
    correct = 0
    incorrect = 0
    for i in range( len(X_test) ) :
        test_vector = X_test[i]
        temp_distances = []
        for train_vector in X_train:
            distance = np.sum( (test_vector - train_vector)**2 )**0.5
            temp_distances.append(distance)
        min_dist_index = np.argmin( temp_distances )
        
        if y_train[min_dist_index] == y_test[i]:
            correct += 1
        else:
            incorrect += 1
    
    return np.array([correct, incorrect])

find_nearest_neighbor(feature_vector, labels, test_size = 0.1)

array([455,  38])

## Part 4: Cross Validation and Confusion Matrix
Using Scikit-Learn, set up a Nearest Neighbor experiment using k-fold cross validation and output a confusion matrix. All the sklearn modules you need have been imported for you.
1. Break the data into training and testing sets (as before).
2. Create a new classifier by using `neighbors.KNeighborsClassifier()`.
3. "Fit" the training data to the model (this basically means "train the model").
4. Use `cross_val_score()` on the classifier, setting cv=10 and output the scores.
5. Get predictions using `predict()` with the test data.
6. Print the confusion matrix using the true vs. predicted labels.
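
To make `cv=10` less opaque, here is a rough NumPy-only sketch of what k-fold cross validation does: the data is cut into k folds, and each fold takes one turn as the held-out set while the rest is used for training. It is a simplification under stated assumptions: it uses a 1-nearest-neighbor rule on synthetic data, and plain shuffled folds, whereas scikit-learn's `cross_val_score()` uses stratified folds for classifiers. The function name `k_fold_scores` is hypothetical.

```python
import numpy as np

def k_fold_scores(X, y, k=10, seed=0):
    """Accuracy of a 1-nearest-neighbor rule on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    folds = np.array_split(order, k)

    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

        # Pairwise distances, shape (n_test, n_train), via broadcasting.
        diffs = X[test_idx][:, None, :] - X[train_idx][None, :, :]
        distances = np.sqrt(np.sum(diffs ** 2, axis=2))

        predicted = y[train_idx][np.argmin(distances, axis=1)]
        scores.append(np.mean(predicted == y[test_idx]))
    return np.array(scores)

# Two well-separated synthetic clusters, 50 points each.
rng = np.random.default_rng(1)
X = np.vstack((rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))))
y = np.array([0] * 50 + [1] * 50)

scores = k_fold_scores(X, y, k=10)
print(scores.shape)   # (10,)
print(scores.mean())
```

Each score comes from a model that never saw its own fold, which is why the spread across the ten values says something about how stable the classifier is.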


In [26]:
# YOU CAN USE THESE LIBRARIES
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# Break the data into training and testing sets (as before).
X_train, X_test, y_train, y_test = train_test_split( feature_vector, 
                                                    labels, 
                                                    test_size = 0.1)

# Create a new classifier by using neighbors.KNeighborsClassifier()
KN_classifier = neighbors.KNeighborsClassifier( n_neighbors = 3 )

# "Fit" the training data to the model.
KN_classifier.fit( X_train, y_train )

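# Use cross_val_score() on the classifier, setting cv=10 and output the scores.
# Note: the folds here are drawn from the held-out split (X_test, y_test),
# so each fold scores a model fit on about 90% of that split.
cv_score = cross_val_score( KN_classifier, X_test, y_test, cv = 10 )
print("cross validation scores: ", cv_score)

# Get predictions using `predict()` with the test data.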
predictions = KN_classifier.predict( X_test )

# Print the confusion matrix using the true vs. predicted labels.
confusion_matrix(y_test, predictions)

cross validation scores:  [0.8        0.86       0.82       0.79591837 0.79591837 0.7755102
 0.73469388 0.85714286 0.79591837 0.85714286]


array([[256,  13],
       [ 18, 206]])
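
The matrix above can be reduced to a few summary numbers. This sketch copies the printed values into an array by hand and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order ("S", then "T").

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are
# predicted labels, ordered "S" (singing) then "T" (talking).
cm = np.array([[256, 13],
               [18, 206]])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column sums)

print(round(float(accuracy), 3))  # 0.937
print(np.round(recall, 3))        # about 0.952 (singing) and 0.920 (talking)
print(np.round(precision, 3))     # about 0.934 (singing) and 0.941 (talking)
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum (462) over the total number of test samples (493).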

## Part 5: Analysis
How would you characterize your overall results? Does your classifier work better or worse than `sklearn`? Or are they similar?

There is a third class of audio included with this assignment, `vibrato_files`,  which contains singers using vibrato. Add this data to your full dataset and give it a unique label. Then run the whole test again (using `sklearn` - no need to run your code from Part 3 again). 
1. Perform 10-fold cross validation again. How do the results compare now that you have 3 classes?
2. Generate a confusion matrix again. How does that data compare?

(OPTIONAL) MFCC deltas ($\Delta$) may (or may not) improve the overall results. In addition to using the MFCCs generated by the MFCC function, append to each vector an MFCC difference such that, for each MFCC feature vector $f[m]$ at time index $m$, $\Delta f[m] = f[m] - f[m-1]$. This will double the number of features per vector. Report on your findings.
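
One possible way to compute the deltas, shown on a tiny made-up matrix in `(samples, features)` layout. The helper name `append_deltas` is hypothetical, and the handling of the first frame (which has no predecessor) is an assumption: its delta is set to zero here.

```python
import numpy as np

def append_deltas(features):
    """Append first-order differences to a (samples, features) matrix.

    Row m of the result holds f[m] followed by f[m] - f[m-1]. The first row
    has no predecessor, so its delta is set to zero (an assumption made here;
    dropping that row would be another option).
    """
    features = np.asarray(features, dtype=float)
    deltas = np.diff(features, axis=0, prepend=features[:1])
    return np.hstack((features, deltas))

# Three frames with two features each.
f = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [6.0, 7.0]])

with_deltas = append_deltas(f)
print(with_deltas.shape)  # (3, 4)
print(with_deltas)
# rows: [1, 10, 0, 0], [3, 10, 2, 0], [6, 7, 3, -3]
```

If this is applied to the real data, it is probably safer to compute deltas for each class before the classes are stacked, since a difference taken across the boundary between two classes does not describe either of them.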


In [27]:
# Adding vibrato files to the set
singing_files = np.array(['audio/f1_row_straight.wav','audio/f2_row_straight.wav','audio/m1_row_straight.wav','audio/m2_row_straight.wav'])
talking_files = np.array(['audio/f1_row_spoken.wav','audio/f2_row_spoken.wav','audio/m1_row_spoken.wav','audio/m2_row_spoken.wav'])
vibrato_files = np.array(['audio/f1_row_vibrato.wav','audio/f2_row_vibrato.wav','audio/m1_row_vibrato.wav','audio/m2_row_vibrato.wav'])

singing_mfcc, singing_labels = prepare_data( singing_files, "S")
talking_mfcc, talking_labels = prepare_data( talking_files, "T")
vibrato_mfcc, vibrato_labels = prepare_data( vibrato_files, "V")

feature_vector = np.hstack((singing_mfcc, talking_mfcc, vibrato_mfcc)).T
labels = np.hstack((singing_labels, talking_labels, vibrato_labels))

In [28]:
print(feature_vector.shape)
print(labels.shape)

(7575, 12)
(7575,)


In [29]:
# Break the data into training and testing sets (as before).
X_train, X_test, y_train, y_test = train_test_split( feature_vector, 
                                                    labels, 
                                                    test_size = 0.1)

# Create a new classifier by using neighbors.KNeighborsClassifier()
KN_classifier = neighbors.KNeighborsClassifier( n_neighbors = 4 )

# "Fit" the training data to the model.
KN_classifier.fit( X_train, y_train )

# Use cross_val_score() on the classifier, setting cv=10 and output the scores.
cv_score = cross_val_score( KN_classifier, X_test, y_test, cv = 10 )
print("cross validation scores: ", cv_score)

# Get predictions using `predict()` with the test data.
predictions = KN_classifier.predict( X_test )

# Print the confusion matrix using the true vs. predicted labels.
confusion_matrix_ = confusion_matrix(y_test, predictions)
print("\n\nConfusion Matrix Index:\nSinging, Talking, Vibrato")
print(confusion_matrix_)

cross validation scores:  [0.64473684 0.68421053 0.67105263 0.73684211 0.69736842 0.69736842
 0.76315789 0.61842105 0.61333333 0.52      ]


Confusion Matrix Index:
Singing, Talking, Vibrato
[[247  10  14]
 [ 26 179   8]
 [ 57  11 206]]


``
The confusion matrix shows that the current model separates Singing and Talking samples well. It is weaker at telling Vibrato from Singing, as seen in the bottom-left and top-right cells of the matrix: 57 of the 274 Vibrato samples are predicted as Singing (bottom left), and 14 of the 271 Singing samples are predicted as Vibrato (top right).
``
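
The most-confused pair of classes can also be located programmatically. This sketch copies the 3-class matrix printed above by hand, normalizes each row by the number of true samples in that class, and looks for the largest off-diagonal entry; the variable names are illustrative.

```python
import numpy as np

class_names = ["Singing", "Talking", "Vibrato"]
cm = np.array([[247, 10, 14],
               [26, 179, 8],
               [57, 11, 206]])

# Divide each row by its total so entries become fractions of the true class.
row_rates = cm / cm.sum(axis=1, keepdims=True)

# Zero the diagonal so only the misclassifications remain, then find the largest.
errors = row_rates.copy()
np.fill_diagonal(errors, 0.0)
true_idx, pred_idx = np.unravel_index(np.argmax(errors), errors.shape)

print(class_names[true_idx], "->", class_names[pred_idx])  # Vibrato -> Singing
print(round(float(errors[true_idx, pred_idx]), 3))         # 0.208
```

Normalizing by row matters when the classes have different sizes, because a raw count of 57 means something different for a class of 274 samples than for a class of 1000.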