# SIXT33N Project Phase 4: SVD/PCA Classification for Voice Commands

### EECS 16B: Designing Information Devices and Systems II, Fall 2021

Written by Nathaniel Mailoa and Emily Naviasky (2016). Updated by Julian Chan (2018), Peter Schafhalter (2019), and Vin Ramamurti and Zain Zaidi (Fall 2019).

nmailoa@berkeley.edu &emsp; enaviasky@berkeley.edu &emsp; julianchan0928@berkeley.edu &emsp; pschafhalter@berkeley.edu

## Table of Contents

* [Introduction / Lab Note](#intro)
* [Part 0: Preparing your Launchpad](#part0)
* [Part 1: Setting up your Circuit](#part1)
* [Part 2: Data Collection](#part2)
* [Part 3: Data Preprocessing](#part3)
* [Part 4: PCA via SVD](#part4)
* [Part 5: Clustering Data Points](#part5)
* [Part 6: Testing your Classifier](#part6)
* [Appendix: Formatting Vectors for Energia](#appendix)

<a id='intro'></a>
## <span style="color:navy">Introduction</span>


SIXT33N is an obedient little robot that will follow the directions that you tell it. There are four moves that SIXT33N can make: move straight far, move straight close, turn left, and turn right. However, SIXT33N does not speak human languages, and some words, like "left" and "right", sound very similar (a strong single syllable), while other words are easier to distinguish. Your job in this phase is to find four command words that are easy for SIXT33N to tell apart (consider syllables and intonation).

In this lab, you will develop the PCA classifier that allows SIXT33N to tell the difference between the four commands. You will examine several different words, and determine which ones will be easiest to classify by PCA.

**Please read the [lab note](https://drive.google.com/file/d/1vKmclZFsLs6KPv1bojH4VGsEs0vy2_5B/view?usp=sharing). It explains in detail what you will be doing in each part of the lab.**

**Remember to document all design choices you made and explain them in the lab report.**

### Side Note: Datasets in Machine Learning Applications
It is common practice, especially in machine learning applications, to split a dataset into a training set and a smaller test set (some common ratios for train:test are 80:20 or 70:30) when trying to make data-driven predictions/decisions. In this lab, we will collect data and split our dataset into 70% training data and 30% test data. 
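
As a rough illustration of a 70:30 split, here is a minimal sketch. `split_rows` is a hypothetical helper written only for this example; it is not the `utils.train_test_split` function used later in this notebook, whose implementation may differ (for instance, it may not shuffle the rows).

```python
import numpy as np

def split_rows(data, train_ratio, seed=0):
    """Hypothetical helper: shuffle the rows, then split them into train and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    num_train = int(round(train_ratio * len(data)))
    return shuffled[:num_train], shuffled[num_train:]

# 30 fake "recordings" with 4 samples each, standing in for one word's data
recordings = np.arange(30 * 4).reshape(30, 4)
train, test = split_rows(recordings, 0.7)
print(train.shape, test.shape)
```

Every recording ends up in exactly one of the two sets, so the test set contains only data the classifier has not seen during training.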

### Overview of Classification Procedure
Once you have some sample data collected, we will:
1. Split our data into 2 sets: train_data and test_data
2. Perform PCA and look at how well it separates the train_data 
3. Once you have a set of four words that you like, you will compute the mean of each of those four words in the PCA basis. We will classify each word according to the mean it is closest to in Euclidean distance.
4. To see how well our classifier does on data it has never seen before (this is called generalization in machine learning), we will project test_data onto the same PCA basis as train_data and find the mean closest in Euclidean distance to each data point.
5. Make sure you (and your GSI) are satisfied with the classifier's accuracy.


The goals of this phase are as follows:
- Generate envelope and utilize threshold to get snippets
- PCA + Classifier (4 commands)
- Check accuracy

When humans distinguish words, they listen for temporal and frequency differences to determine what is being said. However, SIXT33N does not have the memory or the processing power to distinguish words nearly as well as our human brains, so we will have to choose much simpler features for SIXT33N to look at (syllables, intonation, magnitude).

When you think of speech signals, you might notice that the shape of the speech wave is a very distinctive part of each word. Taking just the shape of the magnitude of a signal is called enveloping, exemplified in the image below. So, we want to do some filtering to retrieve the envelope of the audio signal. We train the PCA off of just this envelope and build a classifier to classify new data points.

<center>
<img width="400px" src="images/proj-envelope.png">
</center>

<b>Keeping in mind that the words that look most different have different shapes (or different amplitudes varied over time), brainstorm at least six words that you think will sort well. Consider syllables, intonation, and length of the word.</b>

**<span style="color:red">What words are you going to try? Why?</span>** 

"Good" audio data has a high signal-to-noise ratio. Recording words while far away from the microphone may cause your intended word to blend in with background noise. However, "oversaturation" of the audio signal (speaking too loudly and/or too close to the mic) will also distort the signal. You can probe the microphone output using the oscilloscope to test for over/under saturation.

### Submitting your datasets for future lab development

- At the end of the lab, please submit your collected `.csv` files for the 4 words you ended up using for your classifier to Gradescope. We will be using student data to help further develop this lab for future semesters. **The Gradescope assignment is "[Hands-on Lab] SVD/PCA Files."**
- If you don't want to submit data collected from your voice, feel free to use [Google Translate](https://translate.google.com/) or any other text-to-speech website from your phone as a voice alternative.

<a id='part0'></a>
## <span style="color:navy">Part 0: Preparing your Launchpad</span>

**Disconnect the 5V jumper wire that's powering the MSP through the 9V Battery and 9V -> 5V regulator**. As before, make sure that the MSP is not simultaneously being powered by both the computer (via the USB) and the 5V pin. Otherwise, you risk frying your MSP.

For the remainder of this lab, the MSP will be powered by only the computer, via the USB.


<a id='part1'></a>
## <span style="color:navy">Part 1: Setting up your Circuit</span>

### Materials
- Microphone front-end circuit
- Launchpad + USB

### Front End Verification

1. Hook up your front end circuit. **Make sure to disconnect the 5V jumper from the Launchpad**.
2. Without powering your circuit from the power supply yet, connect your circuit to the Launchpad:
    - **P6.0 to the microphone front end circuit output (output of non-inverting amplifier for low-pass filter).**
    - GND pin to the ground rail of the breadboard.
    - You can keep your Launchpad plugged via USB as long as **YOUR 5V JUMPER IS DISCONNECTED**.
3. Use the bench power supply to provide 9V to the 9V->5V and 9V->3.3V voltage regulators. You won't be using the motor circuits for this lab. You can leave that part of the circuit unpowered. 
4. Set the current limit to 0.1A.
5.  **Use the oscilloscope to probe the output of the microphone circuit.** Make sure the waveform averages to 1.65V (halfway between 0V and 3.3V) and the peak-to-peak is large enough.
    - Make a noise at the microphone; you should see the signal change to reflect the sound you just made. If you are close enough or loud enough, you should be able to get the peak-to-peak of your signal all the way up to 3.3V.
    - Hint: set the oscilloscope's x-axis to 10ms per division and y-axis to 1 volt per division.

<a id='part2'></a>
## <span style="color:navy">Part 2: Data Collection</span>

### Materials
- Microphone front-end circuit
- Launchpad + USB

### Tasks

#### Data Collection 

Now we will record 40 audio samples for each of your 6 chosen words. We recommend that each partner record 3 words, to maximize variance between our recorded signals. Again, think about how to make the words distinct from one another (syllable count, intonation).

Make sure your breadboard is powered and you see an audio signal at the microphone front end circuit's output! **Also make sure P6.0 on Launchpad is connected to the microphone front end circuit output (output of non-inverting amplifier for low-pass filter)!**

**For each chosen word, do the following:**
1. Upload the sketch **`collect-data-envelope.ino`** to your Launchpad.
    - This sketch records 2 seconds of audio at a time, sampled every 0.35ms, and sends it to your computer.
    - When the red LED is on, the Launchpad is recording audio. Say the word to the micboard right after the red LED blinks on. 
2. Run **`python collect-data-envelope.py YOUR_WORD.csv`**.
    - Download the collect-data-envelope folder to your local computer. Press Shift + right click within the file explorer to view the option to "Open PowerShell window here".
    - Open the PowerShell terminal and navigate to the directory with the `collect-data-envelope.py` script and then run the above command.
    - Make sure the serial monitor/plotter in Energia is closed before running the script!
    - This program will capture audio data collected by the Launchpad and write it to `YOUR_WORD.csv`. You should see a console output in your PowerShell terminal after each sample is recorded. 
    - Choose your words carefully! Think about the PCA algorithm and what characteristics of your word might affect its output.
    - After you collect a few test words, check `YOUR_WORD.csv` and make sure that it looks like a sound wave as opposed to being full of 0s. *It might help to plot the data in Excel.*
3. **When the red light goes on, say the word you want to record.**
    - **Pronounce the word consistently and always speak around the same time relative to when the red light turns on.** This will help you collect data that is less "noisy" which will result in better classification.
    - The red LED on the Launchpad works like the light on a recording room. When the red light goes on, the Launchpad is recording. Say the word you want to record before the red LED turns off.
4. Once you've recorded ~40 audio samples of the word, stop the python program (e.g. by pressing Ctrl + C in the command prompt).
5. Go into the .csv file and delete outlier samples such that you are left with **exactly 30 audio recordings of the word**. We recommend deleting the first and last couple of samples, as well as samples whose values look very different from other samples at the same timesteps. The best way to identify outlier samples is to plot the data in Excel. Plot all your samples using a line plot. **Don't spend too much time on this;** our enveloping function in Part 3 also helps with some of the outliers!
6. If you are working on the lab on DataHub, you will need to upload your `.csv` files to DataHub into the `PCA_data` folder so the jupyter notebook can access them.
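
If you would rather screen for outliers in Python than in Excel, the idea in step 5 can be sketched as follows. This is only an illustrative heuristic under the assumption that each row of the array is one recording; `rank_by_outlier_score` is a hypothetical helper, not part of the provided lab code, and the toy data stands in for a loaded `.csv` file.

```python
import numpy as np

def rank_by_outlier_score(recordings):
    """Return row indices sorted from most to least unusual, plus the scores.

    Score = mean absolute deviation from the per-timestep median
    (an illustrative heuristic, not part of the lab's provided code).
    """
    median_trace = np.median(recordings, axis=0)
    scores = np.mean(np.abs(recordings - median_trace), axis=1)
    return np.argsort(scores)[::-1], scores

# Toy data: 5 similar recordings plus one that is clearly different (row 3)
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, np.pi, 50))
recordings = base + 0.01 * rng.standard_normal((6, 50))
recordings[3] += 1.0

order, scores = rank_by_outlier_score(recordings)
print("Most unusual recording:", order[0])
```

A ranking like this only suggests which rows to look at first; plotting the samples is still the most reliable check.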


### Before moving on, please note that:

You may realize in the next section that one or two of your words are not sorting quite as well as you would like. Don't be afraid to come back to this section and try collecting different words based on what you have learned makes a word sortable. 

<a id='part3'></a>
## <span style="color:navy">Part 3: Data Preprocessing</span>

Before we can use the recorded data for PCA, we must first process the data. It is not necessary for you to understand the enveloping function well enough to implement it (since we have already done it for you), but just in case you are curious the enveloping function is described in the following pseudocode:

<code><b>Enveloping function</b>
Divide the whole signal into chunks of 16 samples
For each chunk:
    Find the mean of the chunk
    Subtract the mean from each sample
    Sum the absolute values of the resulting samples
</code>
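
For the curious, the pseudocode above could look roughly like this in Python. This is a sketch, not the code that runs on the Launchpad; the function name `envelope` and the choice to drop an incomplete final chunk are assumptions made for the example.

```python
import numpy as np

def envelope(signal, chunk_size=16):
    """Sketch of the enveloping pseudocode: one output value per chunk."""
    signal = np.asarray(signal, dtype=float)
    num_chunks = len(signal) // chunk_size   # assumption: drop any incomplete final chunk
    chunks = signal[:num_chunks * chunk_size].reshape(num_chunks, chunk_size)
    centered = chunks - chunks.mean(axis=1, keepdims=True)   # subtract each chunk's mean
    return np.abs(centered).sum(axis=1)                      # sum of absolute values

# Toy signal: silence, then an oscillation, both sitting on a 1.65 offset
t = np.arange(64)
signal = 1.65 + np.where(t >= 32, np.sin(2 * np.pi * t / 8), 0.0)
print(envelope(signal))
```

The quiet chunks give values near zero and the loud chunks give large values, so the output traces the overall shape of the sound rather than its individual oscillations.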

What you really need to know, however, is what the enveloped signal looks like for each word. Spend a little time looking at the data you just collected in the python plots below.


### 3.1 Load Data from CSV

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import csv
import utils
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

cm = ['blue', 'red', 'green', 'orange', 'black', 'purple']

In [None]:
# YOUR CODE HERE: Fill in the six words you recorded

all_words_arr = ['', '', '', '', '', '']

In [None]:
# Load data from csv
train_test_split_ratio = 0.7
train_dict = {}
test_dict = {}

# Build the dictionary of train and test samples.
for i in range(len(all_words_arr)):
    word_raw = utils.read_csv("PCA_data/{}.csv".format(all_words_arr[i]))
    word_raw_train, word_raw_test = utils.train_test_split(word_raw, train_test_split_ratio)
    train_dict[all_words_arr[i]] = word_raw_train
    test_dict[all_words_arr[i]] = word_raw_test

# Count the minimum number of samples you took across the six recorded words. These variables might be useful for you!
num_samples_train = min(list(map(lambda x : np.shape(x)[0], train_dict.values())))
num_samples_test = min(list(map(lambda x : np.shape(x)[0], test_dict.values())))

# Crop the number of samples for each word to the minimum number so all words have the same number of samples.
for key, raw_word in train_dict.items():
    train_dict[key] = raw_word[:num_samples_train,:]

for key, raw_word in test_dict.items():
    test_dict[key] = raw_word[:num_samples_test,:]


Plot your data and get a feel for how it looks enveloped.

**<span style="color:red">Important: It's okay if the data isn't aligned. The code in the next part will automatically align the data.</span>** 

In [None]:
# Plot all training samples
for word_raw_train in train_dict.values():
    plt.plot(word_raw_train.T)
    plt.show()

### 3.2 Align Audio Recordings

As you can see above, the speech is only a small part of the 2 second window, and each sample starts at different times. PCA is not good at interpreting delay, so we need to somehow start in the same place each time and capture a smaller segment of the 2 second sample where the speech is present. To do this, we will use a thresholding algorithm.
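
One way to see why delay is a problem: PCA treats each recording as a vector, so two recordings only look similar if they are large at the same indices. The sketch below uses a made-up rectangular `pulse` (a hypothetical stand-in for an enveloped word) and compares inner products of unit-norm pulses at different delays.

```python
import numpy as np

def pulse(start, width=10, total=100):
    """Unit-norm rectangular pulse beginning at index `start`."""
    x = np.zeros(total)
    x[start:start + width] = 1.0
    return x / np.linalg.norm(x)

aligned = pulse(20) @ pulse(20)     # same shape, same start
delayed = pulse(20) @ pulse(27)     # same shape, 7 samples late
disjoint = pulse(20) @ pulse(40)    # same shape, no overlap at all
print(aligned, delayed, disjoint)
```

The similarity drops quickly as the delay grows, even though the shape of the pulse never changes. Aligning the recordings removes this source of variation before PCA sees the data.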

First, we define a **`threshold`** relative to the maximum value of the data. We say that any signal that crosses the threshold is the start of a speech command. In order to not lose the first couple samples of the speech command, we say that the command starts **`pre_length`** samples before the threshold is crossed. We then take a window of the data that is **`length`** long, and try to capture the entire sound of the command in that window.

<b>Play around with the parameters `length`, `pre_length` and `threshold`</b> in the cells below to find appropriate values corresponding to your voice and chosen commands. You should see the results and how much of your command you captured in the plots generated below. When you are satisfied, note down the values of `length`, `pre_length` and `threshold` - <b>you will need to add them to the Launchpad sketch later.</b>

In [None]:
def get_snippets(data, length, pre_length, threshold):
    """Attempts to align audio samples in data.
    
    Args:
        data (np.ndarray): Matrix where each row corresponds to a recording's audio samples.
        length (int): The length of each aligned audio snippet.
        pre_length (int): The number of samples to include before the threshold is first crossed.
        threshold (float): Used to find the start of the speech command. The speech command begins where the
            magnitude of the audio sample is greater than (threshold * max(samples)).
    
    Returns:
        Matrix of aligned recordings.
    """
    assert isinstance(data, np.ndarray) and len(data.shape) == 2, "'data' must be a 2D matrix"
    assert isinstance(length, int) and length > 0, "'length' of snippet must be an integer greater than 0"
    assert 0 <= threshold <= 1, "'threshold' must be between 0 and 1"
    snippets = []

    # Iterate over the rows in data
    for recording in data:
        # Find the threshold
        recording_threshold = threshold * np.max(recording)

        # Figure out when interesting snippet starts
        i = pre_length
        while recording[i] < recording_threshold:
            i += 1
            
        snippet_start = min(i - pre_length, len(recording) - length)
        snippet = recording[snippet_start:snippet_start + length]

        # Normalization
        snippet = snippet / np.sum(snippet)
        
        snippets.append(snippet)

    return np.vstack(snippets)

In [None]:
length = 80 # Default: 80         # Adjust this
pre_length = 5 # Default: 5       # Adjust this
threshold = 0.5 # Default: 0.5    # Adjust this

processed_train_dict = {}

for key, word_raw_train in train_dict.items():
    word_processed_train = get_snippets(word_raw_train, length, pre_length, threshold)
    processed_train_dict[key] = word_processed_train
    plt.plot(word_processed_train.T)
    plt.show()

You should now see a mostly organized set of samples for each word. Can you tell which word is which just by the envelope? Can you tell them apart? If you can't tell the words apart, then PCA will have a difficult time as well.

<a id='part4'></a>
## <span style="color:navy">Part 4: PCA via SVD</span>

### 4.0 SVD/PCA Resources
- http://www.ams.org/publicoutreach/feature-column/fcarc-svd
- https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
- https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8

### 4.1 Generate and Preprocess PCA Matrix

Now that we have our data in a nice format, we can build the PCA input matrix from that data by **stacking all the data vertically**.

**Sanity check:** What should be the dimensions of processed_A? Feel free to use np.shape() if you aren't sure.

In [None]:
processed_A = np.vstack(list(processed_train_dict.values()))

The first step of PCA is to zero-mean your data as `demeaned_A`. Centering the data can be helpful to obtain principal components that are representative of the shape of the variations in the data. Please note that you want to **get the mean of each feature** (***what are the features?***). The function [`np.mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) might be helpful here, along with specifying the axis parameter.

**Sanity check:** Does the dimension of `mean_vec` make sense given what we averaged across?
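
If the `axis` parameter is unfamiliar, this toy example (made-up numbers, unrelated to your recordings) shows what each choice of axis averages over and how NumPy broadcasting subtracts a mean vector from every row.

```python
import numpy as np

# Toy matrix: 3 recordings (rows) x 4 timesteps (columns)
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [3.0, 2.0, 1.0, 0.0],
              [2.0, 5.0, 2.0, 2.0]])

mean_over_rows = np.mean(A, axis=0)   # one mean per column (timestep): shape (4,)
mean_over_cols = np.mean(A, axis=1)   # one mean per row (recording): shape (3,)
print(mean_over_rows.shape, mean_over_cols.shape)

# Broadcasting subtracts the length-4 vector from every row
centered = A - mean_over_rows
print(np.allclose(centered.mean(axis=0), 0))
```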

In [None]:
# Zero-mean the matrix A
# YOUR CODE HERE
mean_vec = np.mean(processed_A, axis=0)
demeaned_A = processed_A - mean_vec
print(processed_A.shape)
print(mean_vec.shape)

### 4.2 Principal Component Analysis

In [None]:
# Take the SVD of matrix demeaned_A (np.linalg.svd)
# YOUR CODE HERE #

U, S, Vt = ...

Take a look at your sigma values. They should show you very clearly how many principal components you need.

In [None]:
# Plot out the sigma values (Hint: Use plt.stem for a stem plot)
# YOUR CODE HERE #


**<span style="color:red">How many principal components do you need? Given that you are sorting 6 words, what is the number you expect to need?</span>** 

There is no single correct answer here. We can project our data onto as many principal components as we like to get the "best" separation (most variance), but at some point the benefit of an extra basis vector is no longer worth its cost. For example, in our project, we are loading these basis vectors onto the [MSP430 Launchpad](http://www.ti.com/tool/MSP-EXP430F5529LP), and we can only store 2-3 principal components before we run into memory issues.
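
One common way to quantify this trade-off is to look at how much of the total variance the first few singular values account for. The sketch below does this on synthetic data (not your recordings) that was built to vary mostly along two directions; all names and numbers in it are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a demeaned data matrix: 40 rows that mostly vary along 2 directions
directions = rng.standard_normal((2, 80))
weights = rng.standard_normal((40, 2)) * np.array([5.0, 2.0])
data = weights @ directions + 0.05 * rng.standard_normal((40, 80))
data = data - data.mean(axis=0)

# Singular values only; compute_uv=False skips U and Vt
s = np.linalg.svd(data, compute_uv=False)

# Fraction of total variance captured by the first k components
explained = np.cumsum(s**2) / np.sum(s**2)
for k in (1, 2, 3):
    print("first {} component(s): {:.3f}".format(k, explained[k - 1]))
```

When the curve flattens out after a few components, adding more basis vectors buys very little extra separation for the memory they cost.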

### 4.3 Choosing a Basis using Principal Components

Set the `new_basis` argument to be a basis of these principal components. (Hint: Of the three outputs from the SVD function call, which one will contain the principal components onto which we want to project our data points? Do we need to transpose it? **The lab note will help!**)

When you plot `new_basis` you should see a number of line plots equal to the number of principal components you've chosen.

In [None]:
# Plot the principal component(s)
# YOUR CODE HERE
new_basis = ...        # This should be the basis containing your principal components
plt.plot(new_basis)

Now project the data in the matrix A onto the new basis and plot it. Do you see clustering? Do you think you can separate the data easily? If not, you might need to try new words.

In [None]:
# Project the data onto the new basis
# YOUR CODE HERE. Hint: np.dot() may help, as well as printing the dimensions. 
proj = ...

'''Uncomment this block for 3 basis vectors
fig=plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
for i in range(len(all_words_arr)):
    Axes3D.scatter(ax, *proj[i*num_samples_train:num_samples_train*(i+1)].T, c=cm[i], marker = 'o', s=20)
plt.legend(all_words_arr,loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()'''


fig=plt.figure(figsize=(10,7))
for i in range(len(all_words_arr)):
    plt.scatter(proj[i*num_samples_train:num_samples_train*(i+1),0], proj[i*num_samples_train:num_samples_train*(i+1),1], edgecolor='none')

plt.legend(all_words_arr,loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()


Your data might look noisy and might not classify perfectly. That is completely okay; we are just looking for good enough. As in many AI applications, the data we are classifying is noisy, so some classification error is okay. The important part is that you see strong clustering of your words.

If you don't see clustering, try to think about why this might be the case. Things you might want to ask yourself:
- How does PCA create the clusters? 
- What characteristics of your waveform will PCA favor when clustering? 
- How can you choose your words such that it maximizes the distinction between your different classes?

Once you think you have decent clustering, you can move on to getting your code to automate classification and you will make up for some of the error there, too. **Choose 4 out of the 6 words that form the most distinct clusters. You will be using these four words for the rest of this lab.**

In [None]:
# YOUR CODE HERE
selected_words_arr = ['', '', '', '']

selected_train_dict = {k: train_dict[k] for k in selected_words_arr}
selected_processed_train_dict = {k: processed_train_dict[k] for k in selected_words_arr}
selected_test_dict = {k: test_dict[k] for k in selected_words_arr}

num_samples_train = min(list(map(lambda x : np.shape(x)[0], selected_train_dict.values())))
num_samples_test = min(list(map(lambda x : np.shape(x)[0], selected_test_dict.values())))

# Reconstruct data based on 4 chosen words.
processed_A = np.vstack(list(selected_processed_train_dict.values()))
mean_vec = np.mean(processed_A, axis=0)
demeaned_A = processed_A - mean_vec
proj = demeaned_A.dot(new_basis)

<a id='part5'></a>
## <span style="color:navy">Part 5: Clustering Data Points</span>

#### Implement `find_centroids` which finds the center of each cluster.

In [None]:
def find_centroids(clustered_data):
    """Find the center of each cluster by taking the mean of all points in a cluster.
    It may be helpful to recall how you constructed the data matrix (e.g. which rows correspond to which word)
    
    Parameters:
        clustered_data: the data already projected onto the new basis
        
    Returns: 
        The centroids of the clusters
    """
    centroids = []
    # YOUR CODE HERE
    # Hint: the variable num_samples_train may help you splice into your clustered_data, as well as np.mean()
    # Feel free to ignore the skeleton code if you wish to write it your way.
    for word_num in range(len(clustered_data)//num_samples_train):
        centroid = ...  # FILL IN HERE
        centroids.append(centroid)
    
    return centroids

In [None]:
# Determine the centroids of each cluster
# YOUR CODE HERE: hint: call find_centroids()
centroids = ...
print(centroids)

In [None]:
centroid_list = np.vstack(centroids)
colors = cm[:(len(centroids))]

for i, centroid in enumerate(centroid_list):
    print('Centroid {} is at: {}'.format(i, str(centroid)))

    
''' Uncomment this for 3 basis vectors:
fig=plt.figure(figsize=(10,7))

ax = fig.add_subplot(111, projection='3d')
for i in range(len(selected_words_arr)):
    Axes3D.scatter(ax, *proj[i*num_samples_train:num_samples_train*(i+1)].T, c=cm[i], marker = 'o', s=20)
    Axes3D.scatter(ax, *np.array([centroids[i]]).T, c=cm[i], marker = '*', s=300)

plt.show()'''


fig=plt.figure(figsize=(10,7))
for i in range(len(selected_words_arr)):
    plt.scatter(proj[i*num_samples_train:num_samples_train*(i+1),0], proj[i*num_samples_train:num_samples_train*(i+1),1], c=colors[i], edgecolor='none')

plt.scatter(centroid_list[:,0], centroid_list[:,1], c=colors, marker='*', s=300)
plt.legend(selected_words_arr,loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("Training Data")
plt.show()


<a id='part6'></a>
## <span style="color:navy">Part 6: Testing your Classifier</span>

Great! We now have the mean (centroid) of each word. Now let's see how well our test data performs. Recall that we will classify each data point according to the centroid it is closest to in Euclidean distance.

Before we perform classification, we need to do the same preprocessing to the test data that we did to the training data (enveloping, demeaning, projecting onto the PCA basis). You have already written most of the code for this part. However, note the difference in variable names as we are now working with test data.

First let's look at what our raw test data looks like.

In [None]:
# Plot all test samples
for word_raw_test in selected_test_dict.values():
    plt.plot(word_raw_test.T)
    plt.show()

Perform enveloping and trimming of our test data.

In [None]:
processed_test_dict = {}

for key, word_raw_test in selected_test_dict.items():
    word_processed_test = get_snippets(word_raw_test, length, pre_length, threshold)
    processed_test_dict[key] = word_processed_test
    plt.plot(word_processed_test.T)
    plt.show()

Construct the PCA matrix by stacking all the test data.

In [None]:
selected_processed_test_dict = {k: processed_test_dict[k] for k in selected_words_arr}

processed_A_test = np.vstack(list(selected_processed_test_dict.values()))

**Now we will do something slightly different.**

Previously, you projected data onto your PCA basis with $ (x - \bar{x})P $, where $\bar{x}$ is the mean vector, x is a single row of `processed_A`, and P is `new_basis`. 

We can rewrite this operation as 

$$ (x - \bar{x})P = xP - \bar{x}P = xP - \bar{x}_{\text{proj}} $$ 
$$ \bar{x}_{\text{proj}} = \bar{x}P $$

Why might we want to do this? We'll later perform these operations on our car. Our launchpads have limited memory, so we want to store as little as possible. Instead of storing a length $n$ vector $\bar{x}$, we can precompute $ \bar{x}_{\text{proj}} $ (length 2 or 3) and store that instead!
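
The identity is easy to check numerically. The sketch below uses random stand-ins for the data and the basis (the names `X`, `P`, and `x_bar` and the sizes are illustrative, not the notebook's variables) and confirms that both orderings give the same projection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 80, 2                       # snippet length and number of principal components (illustrative)
X = rng.random((10, n))            # stand-in for rows of a data matrix
P = np.linalg.qr(rng.standard_normal((n, k)))[0]   # stand-in n x k basis with orthonormal columns
x_bar = X.mean(axis=0)

direct = (X - x_bar) @ P           # needs the full length-n mean vector
x_bar_proj = x_bar @ P             # only k numbers to store
shortcut = X @ P - x_bar_proj

print(np.allclose(direct, shortcut))
print(x_bar.shape, x_bar_proj.shape)
```

With these sizes, the stored mean shrinks from 80 numbers to 2.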

Compute $ \bar{x}_{\text{proj}} $ using the **same mean vector** as the one computed with the training data.

In [None]:
# YOUR CODE HERE
projected_mean_vec = ...

Project the test data onto the **same PCA basis** as the one computed with the training data.

In [None]:
# YOUR CODE HERE
projected_A_test = ...

Zero-mean the projected test data using the **`projected_mean_vec`**.

In [None]:
# YOUR CODE HERE
proj = ...

Plot the projections to see how well your test data clusters in this new basis. This will give you an idea of how well your test data will classify.

In [None]:

''' Uncomment this for 3 basis vectors:
fig=plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
for i in range(len(selected_words_arr)):
    Axes3D.scatter(ax, *proj[i*num_samples_test:num_samples_test*(i+1)].T, c=cm[i], marker = 'o', s=20)
    Axes3D.scatter(ax, *np.array([centroids[i]]).T, c=cm[i], marker = '*', s=300)

plt.show()'''


fig=plt.figure(figsize=(10,7))
for i in range(len(selected_words_arr)):
    plt.scatter(proj[i*num_samples_test:num_samples_test*(i+1),0], proj[i*num_samples_test:num_samples_test*(i+1),1], c=colors[i], edgecolor='none')

plt.scatter(centroid_list[:,0], centroid_list[:,1], c=colors, marker='*', s=300)
plt.legend(selected_words_arr,loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("Test Data")
plt.show()

Now that we have some idea of how our test data looks in this new basis, let's see how our data actually performs. Implement the classify function, which takes in a data point (AFTER enveloping is applied) and returns the number of the word whose centroid is closest to the data point in Euclidean distance.
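
The nearest-centroid rule on its own can be sketched with toy numbers. The centroids and the point below are made up, and `nearest_centroid` is a hypothetical helper that only handles a point already expressed in the PCA basis; your `classify` function still has to do the projection and mean subtraction first.

```python
import numpy as np

def nearest_centroid(point, centroids):
    """Return the 0-based index of the centroid closest to `point`."""
    distances = np.linalg.norm(np.asarray(centroids) - point, axis=1)
    return int(np.argmin(distances))

# Hypothetical 2D centroids for four words in the PCA basis
toy_centroids = np.array([[0.0, 0.0],
                          [1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])

point = np.array([0.9, 0.2])
print(nearest_centroid(point, toy_centroids) + 1)   # +1 converts to 1-based word numbers
```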

In [None]:
def classify(data_point):
    """Classifies a new voice recording into a word.
    
    Args:
        data_point: new data point vector before demeaning and projection
    Returns:
        Word number (should be in {1, 2, 3, 4} -> you might need to offset your indexing!)
    Hint:
        Remember to use 'projected_mean_vec'!
        np.argmin() and np.linalg.norm() may also help!
    """
    # YOUR CODE HERE
    projected_data_point = ...
    demeaned = ...
    # TODO: classify the demeaned data point by comparing its distance to the centroids

In [None]:
# Try out the classification function
print(classify(processed_A_test[0,:])) # Modify the row index of processed_A_test to use other vectors

**Our goal is 80% accuracy for each word.** Apply the `classify` function to each sample and compute the accuracy for each word.

In [None]:
# Try to classify the whole A matrix
correct_counts = np.zeros(4)

for (row_num, data) in enumerate(processed_A_test):
    word_num = row_num // num_samples_test + 1
    if classify(data) == word_num:
        correct_counts[word_num - 1] += 1
        
for i in range(len(correct_counts)):
    print("Percent correct of word {} = {}%".format(i + 1, 100 * correct_counts[i] / num_samples_test))

<img width='30px' align='left' src="http://inst.eecs.berkeley.edu/~ee16b/sp16/lab_pics/check.png">

## <span style="color:green">CHECKOFF</span>

- **Show your GSI that you've achieved 80% accuracy on your test data for all 4 words.** 
- Your GSI will check **all your PCA code and plots.**
- Your GSI will also check that you have **submitted your `.csv` files for the 4 words to the Gradescope assignment: "[Hands-on Lab] SVD/PCA Files."** Only one person in each lab group needs to submit; that person can add partners to the same submission on Gradescope!
- **Make sure you have read the lab note before requesting a checkoff! Many checkoff questions are pulled straight from the lab note.**
- **Make sure to save the formatted vectors below for next week's lab: Advanced Controls!**

## SAVE ALL YOUR DATA!!

- **Data stored on the lab computers often gets deleted automatically.** Please store it on your personal flash drive or cloud storage like Google Drive, and not on the lab computers! If you used DataHub, it should save through your CalNet ID.
- **You will need everything for the final lab report. Lab report logistics have been posted on Piazza.**

<a id='appendix'></a>
## <span style="color:navy">Appendix: Formatting Vectors for Energia</span>

In next week's lab, copy/paste the following printed code into **`classify.ino`**

In [None]:
print("Paste the code below into 'CODE BLOCK PCA1':")
print("")
print(utils.format_constant_energia("SNIPPET_SIZE", length))
print(utils.format_constant_energia("PRELENGTH", pre_length))
print(utils.format_constant_energia("THRESHOLD", threshold))

In [None]:
print("Paste the code below into 'CODE BLOCK PCA2':")
print("")
print(utils.format_array_energia("pca_vec1", new_basis[:, 0]))
print(utils.format_array_energia("pca_vec2", new_basis[:, 1]))
# print(utils.format_array_energia("pca_vec3", new_basis[:, 2]))   # Uncomment this line if you have 3 PCA vectors
print(utils.format_array_energia("projected_mean_vec", projected_mean_vec))
print(utils.format_array_energia("centroid1", centroids[0]))
print(utils.format_array_energia("centroid2", centroids[1]))
print(utils.format_array_energia("centroid3", centroids[2]))
print(utils.format_array_energia("centroid4", centroids[3]))