# Speech Recognition With a Single Layer Perceptron

We will be investigating a binary classification problem on audio data with two labels. The two labels are 'cat' and 'dog'. We will input raw audio data of humans saying the words 'cat' and 'dog'. We will then process this data and apply a few techniques to illustrate that perceptrons can be used to classify audio data.

You will need the dataset of words. Here are two ways to obtain it:

**Method 1**
The dataset that we will be using can be found at this link:
https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
Downloading this dataset might take a few minutes. The download contains many folders, one for each word. We are only interested in cat and dog.

**Method 2**
We have provided a zip file with all the datasets you need. You can simply unzip it **in the same directory** as this notebook.

Inside the respective cat and dog folders are raw .wav files. We will be using scipy to convert these raw wav files into discrete time signals. Let us first import all necessary packages:

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.decomposition import PCA
from scipy.io import wavfile
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import os

%matplotlib notebook

Now it is very important that you provide the relative file paths to the dog and cat folders. We will refer to dogs as 1 and cats as 0 in our binary classification scheme:

In [None]:
#Path to the data set containing dog audio samples
path_1  = "Materials/Word_Dataset/dog/"
#Path to the data set containing cat audio samples
path_0  = "Materials/Word_Dataset/cat/"
files_1 = os.listdir(path_1)
files_0 = os.listdir(path_0)

Now we will define some useful functions that will help us process the data.

The first function is **numpy_fillna**. This function takes in a collection of row vectors that are not all the same length and returns a 2d numpy array where all the row vectors have the same length. To do this, we simply zero-pad the shorter row vectors/data vectors to match the length of the longest one. Worry not, this does not change the relevant audio data at all!
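To make the zero-padding idea concrete, here is a minimal sketch on three tiny made-up vectors. The helper name `pad_rows` is hypothetical and is written with an explicit loop for readability; it is not the notebook's **numpy_fillna**, only an illustration of the same idea.

```python
import numpy as np

def pad_rows(rows):
    """Zero-pad a list of 1-D arrays to a common length (hypothetical helper)."""
    max_len = max(len(r) for r in rows)
    out = np.zeros((len(rows), max_len), dtype=np.float64)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r  # copy the samples, leave the tail as zeros
    return out

# Three made-up "audio vectors" of different lengths
rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0]), np.array([5.0, 6.0])]
padded = pad_rows(rows)
print(padded)
```

Every row keeps its original samples at the front, and only trailing zeros are added.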

The second function is **get_train_test**. This function performs a few important steps, let us break it down:
1. It will first use the arguments N (the total number of data points, 50% dog and 50% cat) and r_train (the fraction of the data used for training) to calculate the number of training samples that we need. The big idea here is to take our data set and partition it into a training dataset and a testing dataset. The training dataset (for example, 80% of all the data) will be used to actually train the perceptron. The testing dataset (whatever is left over, 20% in this example) will be used to test our perceptron on 'fresh' data.

2. When accumulating the data from the .wav files, this function will trim the audio files. It does this because the audio files often begin with an unnecessary sequence of zeros before the useful audio data arrives. It will first find the time point where the audio magnitude reaches 50% of the maximum value. Then, we instruct the function to start the audio vector 'pre' samples before this 50% time point. We can tune pre until the audio data appears to be aligned.

3. It will also normalize all the audio vectors by removing their mean and making their variance = 1. The reason for this is to treat very loud audio the same as very soft audio. We have no preference over loud or soft audio when classifying. We subtract the mean in order to remove the offset in audio, which we also do not care about.

4. Finally, the function will put all these normalized and trimmed audio vectors into a large 2d array. There is no guarantee that all the row vectors (data/audio vectors) in this 2d array will have the same length, so this is where we use **numpy_fillna** to enforce this property. Once this is done, we shuffle the row vectors (and their labels, in the same order) so that the data is randomized with respect to the labels. A minor detail: the function will also return the sample rate of the audio data.
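Steps 2 and 3 can be sketched on a small synthetic signal. The helper name `normalize_and_trim` and the toy signal are assumptions made for illustration only; the notebook's actual implementation is **get_train_test**.

```python
import numpy as np

def normalize_and_trim(signal, pre=2):
    """Standardize a 1-D signal, then drop the quiet lead-in (illustrative helper)."""
    signal = np.asarray(signal, dtype=np.float64)
    signal = (signal - signal.mean()) / signal.std()   # zero mean, unit variance
    thresh = 0.5 * np.max(signal)
    onset = int(np.argmax(np.abs(signal) >= thresh))   # first sample reaching 50% of the max
    start = max(onset - pre, 0)                        # never start before index 0
    return signal[start:]

# Ten samples of silence followed by a short burst (made-up data)
raw = np.concatenate([np.zeros(10), [0.0, 4.0, -4.0, 4.0, -4.0, 0.0]])
clean = normalize_and_trim(raw, pre=2)
print(len(raw), len(clean))
```

Clamping the start index at 0 matters: a negative index would make the slice count from the end of the array and silently discard most of the signal.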

In [None]:
def numpy_fillna(data):
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])

    # Mask of valid places in each row
    mask = np.arange(lens.max()) < lens[:,None]

    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=np.float64)
    out[mask] = np.concatenate(data)
    return out

def get_train_test(N, pre = 50, r_train = 0.5):
    
    N_train = int(r_train * N)
    
    X_1 = []
    X_0 = []
    
    for i in range(N//2):
        _, data_1 = wavfile.read(os.path.join(path_1, files_1[i]))
        samplerate = _
        data_1 = (data_1 - np.mean(data_1)) / np.std(data_1)
        ind = 0
        thresh = 0.5 * np.max(data_1)
        while abs(data_1[ind]) < thresh:
            ind += 1
        ind = max(ind - pre, 0)  # clamp so a negative start cannot wrap around
        data_1 = data_1[ind:]
        X_1.append(data_1)
        
        _, data_0 = wavfile.read(os.path.join(path_0, files_0[i]))
        data_0 = (data_0 - np.mean(data_0)) / np.std(data_0)
        ind = 0
        thresh = 0.5 * np.max(data_0)
        while abs(data_0[ind]) < thresh:
            ind += 1
        ind = max(ind - pre, 0)  # clamp so a negative start cannot wrap around
        data_0 = data_0[ind:]
        X_0.append(data_0)
        
    # Pass the ragged list directly; np.asarray on rows of unequal length fails in recent NumPy
    X = numpy_fillna(X_0 + X_1)
    y = np.append(np.zeros(N//2), np.ones(N//2)).reshape(-1, 1)
    
    rng_state = np.random.get_state()
    np.random.shuffle(X)
    np.random.set_state(rng_state)
    np.random.shuffle(y)
    
    y = y.flatten()
    
    X_train = X[:N_train]
    y_train = y[:N_train]
    X_test  = X[N_train:]
    y_test  = y[N_train:]
    
    return X_train, y_train, X_test, y_test, samplerate


Now that the data is nice and 'clean', let us define what we have mathematically. We will call the $i$th audio vector from the dog and cat folders $\vec{d_i}, \vec{c_i} \in \mathbb{R}^d$ respectively. The dimension $d$ of these vectors is much larger than the number of data points, $n$. The data matrix $X \in \mathbb{R}^{n \times d}$ and its corresponding labels $\vec{y} \in \mathbb{R}^n$ will take the following form:

$$X = \begin{bmatrix}
\vec{d_{21}}^T\\
\vec{d_{309}}^T\\
\vec{c_{19}}^T\\
\vec{d_{10}}^T\\
\vec{c_{111}}^T\\
.\\
.\\
.
\end{bmatrix} \; \; \; \;\; \;\; \; \; \; \vec{y} = \begin{bmatrix}
1\\
1\\
0\\
1\\
0\\
.\\
.\\
.
\end{bmatrix}$$

If we choose an 80%, 20% split for train and test data, then $X_{train}$ will just be the first 80% of the row vectors of $X$, and $X_{test}$ will be the remaining 20%. Similarly, $\vec{y_{train}}$ will be the first 80% of the entries of $\vec{y}$ and $\vec{y_{test}}$ will be the remaining 20%. Or mathematically:


## $$N_{train} = 0.8 N, \quad N_{test} = N - N_{train}$$


## $$X_{train} \in \mathbb{R}^{N_{train} \times d}, X_{test} \in \mathbb{R}^{N_{test} \times d}, y_{train} \in \mathbb{R}^{N_{train}}, y_{test} \in \mathbb{R}^{N_{test}}$$
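As a rough illustration of the shuffle and split, here is a sketch on toy random data. The sizes, the seeded generator, and the `_toy` variable names are all assumptions chosen for the example; it uses a single permutation to keep rows and labels aligned, which is a different mechanism from the one in **get_train_test** but has the same effect.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, chosen only for reproducibility
N, d = 10, 4                     # toy sizes, not the real dataset
X_toy = rng.normal(size=(N, d))
y_toy = np.append(np.zeros(N // 2), np.ones(N // 2))

perm = rng.permutation(N)        # one permutation keeps rows and labels aligned
X_toy, y_toy = X_toy[perm], y_toy[perm]

r_train = 0.8
N_train = int(r_train * N)
X_train_toy, X_test_toy = X_toy[:N_train], X_toy[N_train:]
y_train_toy, y_test_toy = y_toy[:N_train], y_toy[N_train:]
print(X_train_toy.shape, X_test_toy.shape, y_train_toy.shape, y_test_toy.shape)
```

With $N = 10$ and an 80% split, the training matrix has 8 rows and the testing matrix has 2, matching the dimensions in the formulas above.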


Now let us fetch the X and y training and testing pairs:


In [None]:
X_train, y_train, X_test, y_test, samplerate = get_train_test(N = 1500, pre = 500, r_train = 0.6)

Great, now that we have our testing and training data, we are ready to use the perceptron! To do this, we can use Sklearn's Perceptron class. In the previous part of this Jupyter notebook, we learned how a perceptron works and how to build one. For this part, we will use the imported Perceptron class for simplicity.

The first thing we do is create our perceptron. There are many parameters we can pass into the Perceptron constructor; however, the default parameters will be fine for this project. Here is a link if you are curious about how Sklearn's Perceptron works under the hood:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
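For orientation, here is a minimal sketch of the Perceptron workflow on a tiny made-up dataset (not the audio data). The arrays `X_toy` and `y_toy` and the names `w_toy` and `b_toy` are assumptions for illustration; the calls `fit`, `coef_`, `intercept_`, and `score` are part of Sklearn's Perceptron.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Toy, clearly separable data: label 1 when the first feature is positive
X_toy = np.array([[2.0, 1.0], [3.0, -1.0], [1.5, 0.5],
                  [-2.0, 1.0], [-3.0, -1.0], [-1.5, 0.5]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = Perceptron()          # default parameters
clf.fit(X_toy, y_toy)

w_toy = clf.coef_.T         # coef_ has shape (1, n_features); transpose to a column
b_toy = clf.intercept_      # learned bias term, shape (1,)
print(w_toy.shape, clf.score(X_toy, y_toy))
```

Note that the default Perceptron also learns an intercept, so its predictions depend on both the weights and that bias term.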


## **(TODO)Using Sklearn's perceptron class, find the weights to fit X_train and y_train. Report the weights as a column vector. Make sure to name the column vector of weights $w$.**

In [None]:
#Start Sklearn Perceptron code
#TODO
w = None

At this point we need a way to test our classifier. Below, find a value for training_accuracy, which we define as the fraction of correctly classified training points (the print statement converts it to a percentage). Recall that given some sample data, we can classify it by taking the inner product (dot product) of our weights with the input data. That quantity should come out to a scalar. If the scalar is positive, we assign a classification value of 1, and if it is negative we assign a classification value of 0. Or mathematically:
$$y_{predicted} = S(\vec{x_{sample}}^T \vec{weights})$$

If we have multiple input sample points in a matrix:
$$X = \begin{bmatrix}
\vec{x_{1}}^T\\
\vec{x_{2}}^T\\
\vec{x_{3}}^T\\
\vec{x_{4}}^T\\
\vec{x_{5}}^T\\
.\\
.\\
.
\end{bmatrix}, \; \; \; \; \vec{y_{predicted}} = S(X \vec{weights})$$

Where $S(x)$ is $1$ if $x$ is positive and $0$ otherwise. If we have $S(\vec{x})$ then we apply $S$ to each element in the vector $\vec{x}$ independently.
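The classification rule can be sketched in a few lines of NumPy. The data, weights, and labels below are made up for illustration, and mapping an input of exactly zero to 0 is an assumption of this sketch. It also ignores the intercept that Sklearn's Perceptron learns by default.

```python
import numpy as np

def S(z):
    """Elementwise step: 1 where z is positive, 0 otherwise."""
    return (np.asarray(z) > 0).astype(int)

X_demo = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -3.0]])
w_demo = np.array([[1.0], [1.0]])      # made-up column vector of weights
y_demo = np.array([1, 0, 0])           # made-up true labels

y_pred = S(X_demo @ w_demo).flatten()  # X w = [3.0, -0.5, -2.5]
accuracy = np.mean(y_pred == y_demo)   # fraction of correct predictions
print(y_pred, accuracy)
```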

## **(TODO)Find the percentage of correctly classified data points that the weights calculated earlier will predict on X_train in comparison to y_train. Store the accuracy into a variable named $training\_accuracy$ (HINT: look into the Perceptron's .score function).**

In [None]:
#TODO
#Start
training_accuracy = None
#End
print('Training Accuracy: {0:.2f}%'.format(training_accuracy * 100))

Looks good! It seems that our classifier has fit the training data very well. We will now do the same thing to the testing data:

## **(TODO)Find the percentage of correctly classified data points that the weights calculated earlier will predict on X_test in comparison to y_test. Store the accuracy into a variable named $testing\_accuracy$ (HINT: look into the Perceptron's .score function).**

In [None]:
#TODO
#Start
testing_accuracy = None
#End
print('Testing Accuracy: {0:.2f}%'.format(testing_accuracy * 100))

Not very good. What happened here? It seems that we have overfit our data. We know this because we fit the training data extremely well, yet the testing data is misclassified far more often (recall that random guessing already achieves about 50% accuracy on a balanced binary problem). That is, our perceptron only knows how to fit what it has seen before. This is often the product of having too many parameters. Let us view just how many parameters we have in relation to the number of training data points:

In [None]:
num_params = len(w)
num_data_points = len(X_train)
print('Number of Parameters:', num_params)
print('Number of Data Points:', num_data_points)

That is a ton of parameters. Perhaps it would be smart to perform PCA on this data in order to reduce its dimensionality (number of parameters) down to 2, just to see how distinguishable our data is. **You do not need to understand PCA in order to follow the notebook. Just know that PCA converts our data matrix from $\mathbb{R}^{n \times d}$ to $\mathbb{R}^{n \times 2}$. That is, each data point will only have dimension 2. This will only be used to visualize our high dimension data set.**

In [None]:
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
num_data, dim = X_train_pca.shape
print("Number of data points:", num_data)
print("Number of parameters/dimension of data:", dim)

As we can see, our data matrix still has the same number of data points, but each data point now has only dimension 2: effectively an 'x' and a 'y' component. What we now want is the ability to plot each of these training data points.

Let us build a function that takes in an input matrix and label array X and y respectively. The dimension of X will be N x 2 and y should be a 1 dimensional array of length N. The function will look one row at a time and plot the row of X as an ordered pair on a 2d plot. The color assigned to each label should depend on the value of y for that row. For example, let us say that we wish to plot the 5th row in our data matrix X. We can find the x and y component by looking at the first and second entry of the 5th row of X. Then, we can view the 5th entry in y. If the value is a 1, we can assign the color blue, and 0 gets the color red. Use the plt.scatter function to perform this task:

## **(TODO) Implement the $plot\_data\_2d$ function. This should plot label '1' data points as blue and label '0' data points as red**

In [None]:
def plot_data_2d(X_2d, y_labels):
    assert len(X_2d) == len(y_labels)
    assert X_2d.shape[1] == 2
    N = len(X_2d)
    #TODO
    #Start
    
    #End

Now that we have implemented the function, let us plot the 2d projection of our data points:

In [None]:
plot_data_2d(X_train_pca, y_train)

This does not look very good. Recall that a perceptron is a linear classifier. However, in 2 dimensions, our data is not linearly separable.

**THE DFT PORTION IS COMPLETELY OPTIONAL TO UNDERSTAND. READ IF YOU ARE CURIOUS**

At this point, we may feel like it is time to use other techniques to classify this data. However, 16B to the rescue: we have a very handy tool at our disposal, the discrete Fourier transform, or DFT! Why would we even think of using this? Well, first let us ask, what is the best machine for classifying audio? It is us humans who are the best at this task. This is partially due to the fact that the human inner ear performs something like a Fourier transform, separating sound into its frequency components.

Taking the DFT of the audio data will provide us with information on how dominant certain frequencies are. This is important for speech recognition because most of the energy in human speech lies in a narrow frequency band (roughly below 4 kHz). So, if we take the DFT of our data and omit all frequencies above 6000 Hz (to be safe), we will essentially be capturing only the relevant information from the audio sample. This is a method of taking an extremely long audio sample and reducing it down to its more fundamental components, which is what we wanted to do! Recall that our problem was that the vectors were too long/had too many parameters. By taking the DFT and omitting all frequencies above 6000 Hz, we will be reducing the number of parameters significantly.
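To see how a frequency maps to a position in the DFT output, here is a small sketch using a synthetic pure tone. The sample rate of 16000 Hz, the FFT length of 4096, and the 1000 Hz tone are all assumed values picked for illustration; the notebook reads the real sample rate from the audio files.

```python
import numpy as np

fs = 16000            # assumed sample rate in Hz
n_fft = 4096          # assumed FFT length
t = np.arange(n_fft) / fs
tone = np.sin(2 * np.pi * 1000 * t)      # a pure 1000 Hz tone

spectrum = np.abs(np.fft.fft(tone, n=n_fft))
cutoff = int(n_fft * 6000 / fs)          # index of the bin at 6000 Hz
kept = spectrum[:cutoff]                 # keep only bins below 6000 Hz

peak_bin = int(np.argmax(kept))
peak_freq = peak_bin * fs / n_fft        # bin k corresponds to k * fs / n_fft Hz
print(cutoff, peak_bin, peak_freq)
```

Under these assumptions, only 1536 of the 4096 DFT values are kept, and the tone shows up as a single peak at the bin for 1000 Hz.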

In 16B we learned how to compute DFTs using the DFT matrix. In practice, we use the FFT, or Fast Fourier Transform. This produces the same result; however, the FFT computes the DFT in $O(n \log n)$ time rather than the $O(n^2)$ time of the 16B method. For the purposes of this notebook, don't worry about how the FFT works, **just know that the output is the same as if you computed the DFT via the DFT matrix**.
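The equivalence can be checked numerically on a short made-up vector. This sketch assumes the unnormalized DFT convention that `np.fft.fft` uses; if your DFT matrix includes a $1/\sqrt{n}$ scaling, the two results differ by that constant factor.

```python
import numpy as np

n = 8
k = np.arange(n)
# DFT matrix with entries exp(-2j*pi*k*m/n), the unnormalized convention np.fft.fft uses
F = np.exp(-2j * np.pi * np.outer(k, k) / n)

x = np.array([1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, -2.0])   # made-up signal
via_matrix = F @ x           # O(n^2) matrix-vector product
via_fft = np.fft.fft(x)      # O(n log n) FFT
print(np.allclose(via_matrix, via_fft))
```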

So, let us go ahead and compute the 4096 point FFT (remember, the same input/output behavior of the DFT) of each raw audio data vector and construct a new matrix $X_w$:

In [None]:
#think of this as the resolution of our FFT
NFFT=4096 

#Compute fft for all data points at once; each row is cropped or zero-padded to NFFT samples
Xw_train = np.fft.fft(X_train, n = NFFT, axis = 1)
Xw_test = np.fft.fft(X_test, n = NFFT, axis = 1)

#compute the 6khz cutoff index
six_cutoff = int(NFFT * 6000 / samplerate)
        
#We only care about the magnitude of the complex numbers, hence np.abs
Xw_train = np.abs(Xw_train[:, :six_cutoff])
Xw_test = np.abs(Xw_test[:, :six_cutoff])

To get a better idea of what the data vectors now look like, let us plot one data vector in both the time domain and the frequency domain.

## **(OPTIONAL) Choose any data point from X_train. Plot it as a time domain signal**

In [None]:
#TODO
#Start
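plt.figure()  #start a new figure so this plot does not draw over the previous one
plt.plot(X_train[4])
plt.title('Time Signal Label {}'.format(int(y_train[4])))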
plt.xlabel('Sample')
plt.ylabel('Amplitude')
#end

## **(OPTIONAL) Choose the same data point from before, but now grab the corresponding row of Xw_train. Plot it as a frequency domain signal**

In [None]:
#TODO
#Start
fVals = np.arange(six_cutoff) * samplerate / NFFT
plt.figure()  #start a new figure so this plot does not draw over the previous one
plt.plot(fVals, Xw_train[4])
plt.title('One Sided FFT')
plt.xlabel('Frequency (Hz)')
plt.ylabel('|DFT Values|')
#End

## **(Optional) Comment on the differences between the frequency domain signal and the time domain signals:**

The frequency domain signal is able to capture seemingly relevant data with only 4000 samples while the time domain signal seems to have important values throughout 8000 samples.

Great, now we have successfully converted all of our data into the frequency domain! Let us do the same thing that we did before: project the data down to 2 dimensions using PCA in order to plot the 'dog's (label 1) in blue and the 'cat's (label 0) in red. We will use our previously implemented function to plot the data:

In [None]:
pca = PCA(n_components=2)
Xw_train_pca = pca.fit_transform(Xw_train)
plot_data_2d(Xw_train_pca, y_train)

At last, we have linearly separable data! Now, we will train a perceptron on the FFT data rather than the raw data.

## **(TODO) Use the same process from the previous parts to train a perceptron classifier but this time use the Xw_train data for training. Make sure to store the final weights into a variable called $w\_fft$**

In [None]:
#TODO
#Start
w_fft = None
#End
print('Number of parameters/dimension for FFT:', len(w_fft))
print('Number of parameters/dimension for Raw:', len(w))

We were able to reduce the dimensionality significantly and achieve linearly separable data! Let us see how it performs:

## **(TODO) Use the same process from the previous parts to report the testing accuracy and training accuracy for the FFT method. Remember to use Xw_train and Xw_test. Store the training and testing accuracies into variables named $training\_accuracy$ and $testing\_accuracy$ respectively.** 

In [None]:
#TODO
#Start
training_accuracy = None
testing_accuracy = None
#End
print('Training Accuracy FFT Method: {0:.2f}%'.format(training_accuracy * 100))
print('Testing Accuracy FFT Method: {0:.2f}%'.format(testing_accuracy * 100))

## **(TODO) Comment below what you learned about perceptrons, audio classification, dimensionality reduction, and linearly separable data.**

COMMENT HERE