# Homework 5
**Total Points: 5**

**Instructions:**
1. Complete parts 1 through 5, filling in code or responses where marked with `# YOUR CODE HERE` or `# YOUR ANALYSIS HERE`.
2. The libraries you need, in the order you need them, have already been coded. Do not import additional libraries or move import commands.
3. When finished, run the full notebook by selecting <b>Kernel > Restart & Run All</b>. </li>
4. Submit this completed notebook file to <b>NYU Classes</b>. </li>

In this assignment you will create a Nearest Neighbor classifier, train it, and test it. The goal of your classifier to predict whether the input audio contains a "talking" voice or a "singing" voice based on the training data. This assignment contains a subfolder called `audio` which has multiple short audio files from the dataset *VocalSet: A Singing Voice Dataset*.

**Grading:** Each part is worth 1 point.

**Important Note**: The way you implement the code in your work for each assignment is entirely up to you. There are often many ways to solve a particular problem, so use whatever method works for you, including testing different parameter values. The only requirement is that you follow the instructions, which may prohibit or require certain libraries or commands.

Refer to https://scikit-learn.org/ for implementation instructions and tutorials.

In [1]:
import numpy as np
import librosa
from librosa import feature
from sklearn.model_selection import train_test_split
from numpy import linalg 

## Part 1: Prepare Data
Create a function `prepare_data()` that will take an array of files and prepare them for a nearest neighbor machine learning task. This function should:
1. Take the input array of audio files for one class (arrays are provided in Part 2),
2. Concatenate the audio into one long audio file,
3. Generate an `mfccs` matrix from this one audio file (use `Librosa.feature.mfcc()` for this),
4. Remove the first MFCC.
5. Generate a `label` that is an array of the label number the same size as the number of samples (each set of MFCCs should have the same corresponding label),
6. Output both `mfccs` and `labels` from the function.

In [2]:
def prepare_data(audiofiles, label):
    
    """ Prepare data for Nearest Neighbor classification
        Hint: When generating MFCCs, you can use Librosa's default
        or experiment. 13 is a common number of MFCC coefficients to retain.
    
    Parameters
    ----------
    
    audiofiles: np.array
        array of input audio files
    
    label: int
        The label to give these audio features
    
    Returns
    -------
    
    mfccs: np.array
        mfcc feature set
    labels: np.array
        labels for feature set
        
    """
    fs = 44100 # sampling rate of files 
    
    # Concatenate the audio into one long audio file,
    long_audio = []
    y1, fs = librosa.load(audiofiles[0])
    y2, fs = librosa.load(audiofiles[1])
    y3, fs = librosa.load(audiofiles[2])
    y4, fs = librosa.load(audiofiles[3]) 

    long_audio = np.concatenate((y1,y2,y3,y4))

    # Generate an `mfccs` matrix from this one audio file use Librosa.feature.mfcc()
    mfccs = librosa.feature.mfcc(long_audio, fs) #default 20 mfccs should i change to 13?

    # Remove the first MFCC.
    mfccs = mfccs[1:len(mfccs)] # 2241 by 19
    mfccs = mfccs.T
    
    # Generate a `label` that is an array of the label number the same size as the number of samples (each set of MFCCs should have the same corresponding label)
    labels = np.empty([len(mfccs),1]) # 2241 by 1
    
    for i in range(len(labels)):
        labels[i] = label
    
    # output both returns
    return mfccs, labels


## Part 2: Set Up the Experiment

Using `prepare_data()` prepare your Nearest Neighbor classification experiment as follows:

- Run `prepare_data()` twice, once using the files in array `talking_files` and once using `singing_files`,
- Append the feature vectors and label vectors so that you have two data objects; one with all features and one with all labels.
- Take care that the objects' dimensions match and that `label[n]` corresponds to `feature_vector[n]`.

**Hint:** Rows should be samples, and columns should be features. Python is row-centric, so rows come first. 
    

In [3]:
# Use singing_files and talking_files for Part 2; the vibrato_files is for Part 5
singing_files = np.array(['audio/f1_row_straight.wav','audio/f2_row_straight.wav','audio/m1_row_straight.wav','audio/m2_row_straight.wav'])
talking_files = np.array(['audio/f1_row_spoken.wav','audio/f2_row_spoken.wav','audio/m1_row_spoken.wav','audio/m2_row_spoken.wav'])
vibrato_files = np.array(['audio/f1_row_vibrato.wav','audio/f2_row_vibrato.wav','audio/m1_row_vibrato.wav','audio/m2_row_vibrato.wav'])

# Run prepare_data() twice: once using the files in array talking_files and once using singing_files:
talk_mfccs, talk_labels = prepare_data(talking_files, 0) # 0 is label for talking
sing_mfccs, sing_labels = prepare_data(singing_files, 1) # 1 is label for singing

# Append the feature vectors and label vectors so that you have 2 data objects; one with all features and one with all labels.
all_mfccs = np.concatenate((talk_mfccs, sing_mfccs))
all_labels = np.concatenate((talk_labels, sing_labels))

## Part 3: Find the Nearest Neighbor
Create a function `find_nearest_neighbor()` without using scikit-learn (except for the function in step #1), You may use a process of your choice. One way to write the code is as follows:

1. Separate your data into training and testing sets (use scikit-learn's `train_test_split()` for this - start with a test_size of 0.1 (10%).
2. For the first feature vector in the test set, find the Euclidean distance between it and every vector in the training set, keeping track of the distances in an array.
3. When finished, use `np.argmin()` to find the index of the smallest distance. That is the Nearest Neighbor of that first test feature vector.
4. If the label of the training vector with the lowest distance is the same as the label of the test vector, then it is a match. Otherwise it is wrong.
5. Repeat steps 2-4 for every test vector.
8. When finished, the program should output the number of correct vs incorrect matches.
9. Run the program multiple times, making sure that the test/training sets are random each time. You can also adjust the size of the test set.

**Note that this implementation will take a while to run.** Also note that, since this function is only finding the nearest neighbor (and not k-neighbors), there is no validation set.


The equation for finding the Euclidean distance between two N-dimentional vectors $p$ and $q$ is:

$$d(p,q) = \sqrt{ \sum_{n=0}^{N-1}{ (p_n - q_n)^2}}$$

In [7]:
def find_nearest_neighbor(data, labels, test_size):
    
    """ Perform Nearest Neighbor classification    
    Parameters
    ----------
    data: np.array
        feature set
    
    labels: np.array
        true labels for the features
        
    test_size: float
        Between 0 and 1, the amount of data set aside for testing
    Returns
    -------    
    [correct, incorrect]: np.array
        the numer of correct and incorrect results
    """
    
    # 1. Separate your data into training and testing sets (use scikit-learn's train_test_split() for this - start with a test_size of 0.1 (10%).
    data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=test_size)
    
    # 2. For the first feature vector in the test set, find the Euclidean distance between it and every vector in the training set, keeping track of the distances in an array.
    correct_matches = 0 # counters
    incorrect_matches = 0 
    
    ntest_iterator = 0 # iterator to ocount which test vector we are at. There's probably a better way to do this.
    for ntest in data_test[:,:]: # For each test vector
        dist_arr = [] 
        for ntrain in data_train[:,:]: # for each train vector
            dist_arr.append(np.linalg.norm(ntest - ntrain)) # distances between 1 test vector and all training vectors.
        
        # 3. When finished, use np.argmin() to find the index of the smallest distance. 
        index = np.argmin(dist_arr) # index is the Nearest Neighbor of the ntest feature vector

        # 4. If the label of the training vector with the lowest distance is the same as the label of the ntest test vector, then it is a match. Otherwise wrong.
        if (labels_train[index] == labels_test[ntest_iterator]):
            correct_matches += 1
        else:
            incorrect_matches +=1
        ntest_iterator += 1 # goes to next test vector

    #6. Program outputs the number of correct vs incorrect matches.
    end = [correct_matches, incorrect_matches]
    print("find_nearest_neighbor results: ", end)
    return end

    # Run the program multiple times, making sure that the test/training sets are random each time. You can also adjust the size of the test set.
    
    ## I ran the program multiple times, getting scores of [464, 29], [466, 27], and [463, 30] which are 94%, 94.5%, 93.9% respectively. The test/train sets are random each time. The average accuracy is 94.1%
    ## Then I changed to an 80-20 train-test set and got [939, 46], [937, 48], and [944, 41] which is slightly better but not significantly! It's average accuracy is 95.3%


## Part 4: Cross Validation and Confusion Matrix
Using Scikit-Learn, set up a Nearest Neighbor experiment using k-fold cross validation and output a confusion matrix. All the sklearn modules you need have been imported for you.
1. Break the data into training and testing sets (as before).
2. Create a new classifier by using neighbors.KNeighborsClassifier()
3. "Fit" the training data to the model (this basically means "train the model").
4. Use `cross_val_scores()` on the classifier, setting cv=10 and output the scores.
5. Get predictions using `predict() with the test data.
6. Print the confusion matrix using the true vs. predicted labels.


In [5]:
# YOU CAN USE THESE LIBRARIES
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

#1 Break the data into training and testing sets (as before).
data_train, data_test, labels_train, labels_test = train_test_split(all_mfccs, all_labels, test_size=.1)

#2 Create a new classifier by using neighbors.KNeighborsClassifier()
neighbors = neighbors.KNeighborsClassifier() #defaul nneighbors i 5

#3 "Fit" the training data to the model (this basically means "train the model").
thefit = neighbors.fit(data_train, labels_train.ravel())

#4 Use cross_val_scores() on the classifier, setting cv=10 and output the scores.
cross_val_score(thefit, data_train, labels_train.ravel(), cv=10)

#5 Get predictions using `predict() with the test data.
final_predictions = neighbors.predict(data_test)

#6 Print the confusion matrix using the true vs. predicted labels.
c = confusion_matrix(labels_test, final_predictions)
print("confusion matrix:\n" , c)
print("Correct matches:" , c[0,0] + c[1,1])
print("Inorrect matches:" , c[1,0] + c[0,1])


confusion matrix:
 [[192  13]
 [ 12 276]]
Correct matches: 468
Inorrect matches: 25


## Part 5: Analysis
How would you characterize your overall results? Does your classifier work better or worse than `sklearn`? Or are they similar?

There is a third class of audio included with this assignment, `vibrato_files`,  which contains singers using vibrato. Add this data to your full dataset and give it a unique label. Then run the whole test again (using `sklearn` - no need to run your code from Part 3 again). 
1. Perform 10-fold cross validation again. How so the results compare now that you have 3 classes.
2. Generate a confusion matrix again. How does that data compare?

(OPTIONAL) MFCC deltas ($\Delta$) may (or may not) improve the overall results. In addition to using the MFCCs generated by the MFCC function, append to each vector an MFCC difference such that, for each MFCC feature vector $f[m]$ at time index $m$, $\Delta f[m] = f[m] - f[m-1]$. This will double the amount of features per vector. Report on your findings.


In [6]:
# YOU CAN USE THESE LIBRARIES
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

find_nearest_neighbor(all_mfccs, all_labels, 0.1) # Running code for part 3!!! 

# Vibrato_files
vibrato_mfccs, vibrato_labels = prepare_data(vibrato_files, 2) # 2 is label for vibato
all_mfccs_w_vibrato = np.concatenate((talk_mfccs, sing_mfccs, vibrato_mfccs))
all_labels_w_vibrato = np.concatenate((talk_labels, sing_labels, vibrato_labels))

#1 Break the data into training and testing sets (as before).
data_train1, data_test1, labels_train1, labels_test1 = train_test_split(all_mfccs_w_vibrato, all_labels_w_vibrato, test_size=.1)

#2 Create a new classifier by using neighbors.KNeighborsClassifier()
neighbors_vibrato = neighbors.KNeighborsClassifier() #defaul nneighbors i 5

#3 "Fit" the training data to the model (this basically means "train the model").
thefit_vibrato = neighbors_vibrato.fit(data_train1, labels_train1.ravel())

#4 Use cross_val_scores() on the classifier, setting cv=10 and output the scores.
cross_val_score(thefit_vibrato, data_train1, labels_train1.ravel(), cv=10)

#5 Get predictions using `predict() with the test data.
final_predictions_vibrato = neighbors_vibrato.predict(data_test1)

#6 Print the confusion matrix using the true vs. predicted labels.
c = confusion_matrix(labels_test1, final_predictions_vibrato)
print("confusion matrix:\n", c)
print("Correct matches:" , c[0,0] + c[1,1]+ c[2,2])
print("Inorrect matches:" , c[1,0] + c[2,0] + c[1,2] + c[2,1] + c[0,2] + c[0,1])


[correct_matches, incorrect_matches]:  [467, 26]
confusion matrix:
 [[194   9   5]
 [ 12 231  16]
 [  6  28 257]]
Correct matches: 682
Inorrect matches: 76


`# # Notes About sklearn vs my classifier:
'## I ran the sklearn classifier three times and got resuls [456, 37], [463, 30], and [459, 34]. These results are just slightly worse than the results I got from my classifier: [464, 29], [466, 27], and [463, 30]
'## I then changed the training-testing division to 80-20 for the sklearn function. I got results [910, 75], [902, 83], and [918, 67]. These results are comparable to the results of the 90-10 split.   
'## The split of correct/incorrect matches is consistent each time I run the classifier: both for mine and the sklearn functions
'## I did notice the sklear classifier runs much faster than the one I built. 

'## Notes About 3 classes w Vibrato:
'## With three classes, the results are less strong. I got results [674, 84] which is 88%, [672, 86] which is 89%, and [660,98] which is 87% for an average of 88%. Looking back at my results for two classes using sklear, the results averaged around 92%. So, with three classes the results were around 4% less strong, because there is more room for error. In my confusion matrix, the most common error by a factor of 2 was I got a lot of predicted 'sing' classification when it was in fact true 'vibrato.'
'##
