In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab11.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

# Lab 11: k-NN

## References

* [Sections 17.0 - 17.6 of the Textbook](https://inferentialthinking.com/chapters/17/Classification.html)
* [Sections 18.0 - 18.2 of the Textbook](https://inferentialthinking.com/chapters/18/Updating_Predictions.html)
* [datascience Documentation](https://datascience.readthedocs.io/)
* [Python Quick Reference](https://ccsf-math-108.github.io/materials-sp24/resources/quick-reference.html)

## Assignment Reminders

- Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- For every task marked with a 🔎, which requires a written explanation, provide your answer in the designated space.
- Throughout this assignment and all future ones, please be sure to not re-assign variables throughout the notebook! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!_
- Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.
- View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.

Run the following cell to set up the lab, and make sure you run the cell at the top of the notebook that initializes Otter.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 🐧 Penguins of the Palmer Archipelago

<a href="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" title="The Palmer Archipelago penguins. Artwork by @allison_horst."><img src="lter_penguins.png" width="50%" alt="The Palmer Archipelago penguins. Artwork by @allison_horst."></a>

[Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) is a renowned marine biologist whose research often focuses on penguins, particularly in Antarctica. She was with the [Palmer Station Long Term Ecological Research Program](https://pallter.marine.rutgers.edu/), part of the [US Long Term Ecological Research Network](https://lternet.edu/). From 2007 through 2009, Gorman gathered several measurements from three species of penguins, Adélie [[1](#Citations)], chinstrap [[2](#Citations)], and gentoo [[3](#Citations)], along the Palmer Archipelago near Palmer Station.

<a href="https://en.wikipedia.org/wiki/Palmer_Archipelago" title="Palmer Archipelago"><img src="./palmer_archipelago.jpeg" alt="Map of the Palmer Archipelago" width=30%></a>

We imported the data sets for each penguin species directly from the [Environmental Data Initiative (EDI) Data Portal](https://portal.edirepository.org/nis/home.jsp) and combined them into one file `penguins.csv`. These data are available for use by [CC0 license](https://creativecommons.org/public-domain/cc0/) ("No Rights Reserved").

Run the following code cell to load the data as `penguins`.

In [None]:
penguins = Table.read_table('penguins.csv')
penguins

## Features

In this lab, you will build a kNN classifier to predict a penguin's species based on two of their numerical attributes (features) using Gorman's data set.

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

The first thing you want to do is visualize the relationship between several pairs of numerical measurements (features) for the penguins to see which pair would make the best classifier. The features you will focus on initially are `'Culmen Length (mm)'`, `'Culmen Depth (mm)'`, `'Flipper Length (mm)'`, and `'Body Mass (g)'`.

The culmen refers to the ridge along the penguin's bill.

<a href="https://twitter.com/allison_horst/status/1270046411002753025" title="Allison Horst X post"><img src="./culmen.jpeg" alt="An illustration of a penguin's culmen." width=40%></a>

For this task, create a scatter plot for each of the pairs of features listed above. To help you out, there are 6 ways to pair up the listed features, and we've stored those in an array called `feature_pairs`. 
* Each item in `feature_pairs` contains two strings, corresponding to the relevant column labels in the `penguins` table.
* The code template loops through all the pairs and starts to make the scatter plots by creating the titles.

To complete this task, you need to add the single line of code that makes the scatter plot from the `penguins` table. Use `group='Species'` to overlay the data from the three species on the same graph.

**Notes:** 
* This task does not have an auto-grader, so check your scatter plots with a classmate, a tutor, or the instructor to make sure you're on the right track.
* The code in the template uses Python's `combinations` function from the [`itertools` module](https://docs.python.org/3/library/itertools.html). You are not expected to know about `itertools` for this class. That module simply provides us with an easy way to determine all the unique pairs of the listed features.
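
As a small illustration of what `combinations` does, the sketch below pairs up four placeholder labels (`'A'` through `'D'` are hypothetical and stand in for the feature names). Each unordered pair appears exactly once, which is why four features give 6 pairs.

```python
from itertools import combinations

# Hypothetical short labels, used only for this illustration
example_features = ['A', 'B', 'C', 'D']

# combinations(items, 2) yields every unordered pair exactly once
example_pairs = list(combinations(example_features, 2))

print(len(example_pairs))  # 6
for pair in example_pairs:
    print(pair)
```

Notice that `('A', 'B')` is listed but `('B', 'A')` is not, since the order within a pair does not matter here.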

In [None]:
from itertools import combinations

features = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)']
feature_pairs = np.array(list(combinations(features, 2)))

for feature_pair in feature_pairs:
    ...
    plt.title(f'{feature_pair.item(0)} and {feature_pair.item(1)}')
    plt.show()

<!-- END QUESTION -->

### Task 02 📍

Based on the scatter plots, which pair of features do you think best shows separation in the species of penguins?

Assign `a_good_pair` to one of the following integers that represent a good pair of features for showing separation of penguin species (especially for Adélie penguins):

1. Culmen Length (mm) and Culmen Depth (mm)
2. Culmen Length (mm) and Flipper Length (mm)
3. Culmen Length (mm) and Body Mass (g)
4. Culmen Depth (mm) and Flipper Length (mm)
5. Culmen Depth (mm) and Body Mass (g)
6. Flipper Length (mm) and Body Mass (g)

In [None]:
a_good_pair = ...

In [None]:
grader.check("task_02")

## Binary Classification

There are three islands (Biscoe, Dream, and Torgersen) represented in this data set. 

Run the following code cell to see the distribution of island and species combinations.

In [None]:
(penguins.select('Island', 'Species')
         .group(['Island', 'Species']))

Adélie penguins are found on all three islands, but gentoo and chinstrap penguins are not. In fact, the data show that gentoo penguins were observed only on Biscoe and chinstrap penguins only on Dream. This means we could easily differentiate between gentoo and chinstrap penguins based on the island where the penguin was observed. However, we cannot identify an Adélie penguin based on the island alone.

For this reason, we will reduce this classification problem to a binary classification and guide you to build a [kNN classifier](https://inferentialthinking.com/chapters/17/1/Nearest_Neighbors.html#k-nearest-neighbors) that will classify a penguin as Adélie or not Adélie. 

Since you are learning about how kNN works, we'll have you build the classifier in pieces.

### Task 03 📍

First, we need to create our binary labels. There are 3 species, but we are just focusing on whether a penguin is an Adélie penguin or not.

1. Write a function `is_adelie` that takes the species of a penguin (as listed in the `penguins` table) and returns a `bool` value of `True` or `False` depending on whether it is an Adélie penguin or not.
2. Add a column called `Adelie` to the end of the `penguins` table that contains the correct `True`/`False` value for each penguin.

In [None]:
def is_adelie(species):
    ...

penguins = penguins.with_column(
    'Adelie', ...)

penguins

In [None]:
grader.check("task_03")

## Feature Selection

We want you to use the features `'Culmen Length (mm)'` and `'Flipper Length (mm)'` to classify Adélie penguins. 

Run the following code cell to reduce the `penguins` table down to the relevant information for the rest of this lab.

In [None]:
penguins = penguins.select('Adelie', 
                           'Culmen Length (mm)', 
                           'Flipper Length (mm)')
penguins

## Distance

The kNN classifier works by identifying the majority label of the _nearest_ labeled data to the data you are trying to label. To determine the nearest data, we need to measure how far all the data pairs are from the unlabeled data. The (Euclidean) distance ($d$) between two points $(x_1, y_1)$ and $(x_2, y_2)$ was defined for you as $$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.$$ Use this distance to define the distance between two penguins.
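
To make the formula concrete, here is a worked example for two made-up penguins. The measurements below are hypothetical and chosen so the arithmetic is easy to check by hand: the differences are 3 and 4, so the distance is $\sqrt{3^2 + 4^2} = 5$.

```python
import math

# Hypothetical measurements: (culmen length in mm, flipper length in mm)
penguin_a = (40.0, 190.0)
penguin_b = (43.0, 194.0)

dx = penguin_a[0] - penguin_b[0]  # difference in culmen length: -3.0
dy = penguin_a[1] - penguin_b[1]  # difference in flipper length: -4.0

# Square each difference, add them, and take the square root
example_distance = math.sqrt(dx**2 + dy**2)
print(example_distance)  # 5.0
```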

In one of the recent lectures, we created a `distance` function and a `row_distance` function for you. The `distance` function performs the above calculation for arrays, and the `row_distance` function calculates the distance between two rows of a table where each row contains only numerical data.

Run the following cell to define these functions.

In [None]:
def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(sum((pt1 - pt2)**2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

## Training and Testing Data

The kNN algorithm that you will implement is tied to a specific set of data, referred to as a training set. The training set is formed by using a certain percentage of the sample data. A common percentage is 80\%. 

If 80\% of the sample data is used for the nearest neighbors classification process, then what is the remaining 20\% used for? The remaining sample data, known as the test set, is reserved for the evaluation of the classifier. You'll go through that process at the end of the lab.
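
The idea of an 80/20 split can be sketched with a plain Python list standing in for the rows of a table. This is only an illustration under assumed names (`example_rows`, `example_train`, `example_test`) and an arbitrary seed; it uses the standard `random` module rather than the table methods you will use in the lab.

```python
import random

random.seed(0)  # arbitrary seed so the shuffle is repeatable

# Ten hypothetical "rows", identified by the numbers 0 through 9
example_rows = list(range(10))

# Shuffle first so the split does not depend on the original order
shuffled = random.sample(example_rows, k=len(example_rows))

cutoff = int(0.8 * len(shuffled))  # 8 of the 10 rows go to training
example_train = shuffled[:cutoff]
example_test = shuffled[cutoff:]

print(len(example_train), len(example_test))  # 8 2
```

Every row lands in exactly one of the two sets, so the classifier is never evaluated on a row it was allowed to use as a neighbor.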

### Task 04 📍

Define a function called `train_test_split` with one input, a table, that returns two tables reflecting the training and testing tables. 
* The first table returned should reflect the training data and consist of a random sample of 80\% of the rows of the input table.
* The second table returned should reflect testing data and consist of the remaining rows from the input table.

The line `train, test = train_test_split(penguins)` in the following code template uses your function to create a training set and a testing set from `penguins`.

**Notes:** 
* You have used a function like `minimize` that returns multiple values. We haven't had you make a function before that returns two things so we will help you with that. This is done in the line `return training_tbl, testing_tbl`. 
* To help make sure you are doing the tasks correctly, we'll make sure that you always get the same random training and test sets as everyone else. This is done with the code `np.random.seed(2024)`.
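
If returning two things from a function is new to you, the following sketch shows the pattern on its own. The function `smallest_and_largest` is hypothetical and unrelated to the lab data; it only demonstrates that two returned values can be unpacked into two names in a single line.

```python
def smallest_and_largest(values):
    """Return two results at once; Python packs them into a tuple."""
    return min(values), max(values)

# Unpack the two returned values into two separate names
low, high = smallest_and_largest([3, 1, 4, 1, 5])
print(low, high)  # 1 5
```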

In [None]:
def train_test_split(data):
    np.random.seed(2024)
    percentile_80_row_index = int(np.ceil(0.8 * data.num_rows))
    data_shuffled = ...
    training_tbl = ...
    testing_tbl = ...
    return training_tbl, testing_tbl

# Use the function to create training and testing sets from the penguins table. 
train, test = train_test_split(penguins)

print(f'The following training set contains {train.num_rows / penguins.num_rows * 100:.2f}% of the rows of the penguins table.')

display(train)
print(f'\nThe following test set contains {test.num_rows / penguins.num_rows * 100:.2f}% of the rows of the penguins table.')
display(test)

In [None]:
grader.check("task_04")

## An Example Point

Your goal is to be able to classify a penguin as Adélie or not based on the labeled data in Gorman's data set. Let's use the first penguin listed in the `test` table as an example to use for a few tasks. That penguin should have a culmen length of 43.1 mm and a flipper length of 197 mm. You will now make and use a kNN classifier to predict the Adélie label for that penguin. You will be focusing on using table rows to do this. We know the true label for this penguin is `True`, but for now, let's ignore it and see how the classifier works. So an unlabeled example row for that penguin would be:

In [None]:
an_unlabeled_example = test.drop('Adelie').row(0)
an_unlabeled_example

Notice where this example penguin fits within the collected data in `penguins`.

In [None]:
penguins.scatter('Culmen Length (mm)', 'Flipper Length (mm)', group='Adelie')
plt.scatter(an_unlabeled_example.item(0), an_unlabeled_example.item(1), marker='*', s=100, color='red')
plt.show()

Visually, it doesn't seem that the unlabeled example is an Adélie penguin because it is not near the points of other Adélie penguins. Now, teach the computer how to see that as well!

## Building a Classifier

### Task 05 📍

For this task, create a function called `distances` that:
1. Calculates the row distances between a row of features called `unlabeled_example` and every row in a table like `penguins` called `labeled_features_tbl`.
    * The `labeled_features_tbl` should have 1 column that contains the `True`/`False` labels.
    * Make sure to provide that column label (`label_col`) as input to the function.
2. Adds the distances to the table `labeled_features_tbl` as the column `'Distance'`.
3. Returns the resulting table.

In [None]:
def distances(unlabeled_example, labeled_features_tbl, label_col):
    distances_array = make_array()
    features = labeled_features_tbl.drop(...)
   
    for row in features.rows:
        distances_array = np.append(distances_array, ...)
    
    labeled_features_tbl = labeled_features_tbl.with_column('Distance', ...)
    return ...

# Here are the results for the unlabeled example:
distances(an_unlabeled_example, penguins, 'Adelie').show(3)

In [None]:
grader.check("task_05")

### Task 06 📍

With a way to calculate distances between penguins based on their numerical attributes, the next part of the kNN process is to find the `k` nearest points. 

Define a function called `closest`.
1. The function should have 4 input arguments:
    * `unlabeled_example`: an example row (e.g. `an_unlabeled_example`)
    * `labeled_features_tbl`: a table of labeled features (e.g. `penguins`)
    * `label_col`: the label of the column that contains the labels (e.g. `'Adelie'`)
    * `k`: an odd integer that indicates the number of nearest points to consider (e.g. `5`)
2. If the value of `k` is even, the function should raise an error. Otherwise, the function should return a table with the `k` nearest rows in the `labeled_features_tbl` table to the row `unlabeled_example`.

**Note:** We'll help you raise the error since that is not part of this class; however the if/then programming logic is part of this class. 🤓
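
The general idea of "sort by distance, then keep the first `k`" can be sketched with plain Python lists instead of tables. Everything here is hypothetical (the helper `k_nearest`, the toy points, and their labels), and it relies on `math.dist` from the standard library rather than the functions defined in this lab.

```python
import math

def k_nearest(example, labeled_points, k):
    """Hypothetical helper: labeled_points is a list of ((x, y), label) pairs."""
    if k % 2 == 0:
        # An even k could produce a tied vote between two labels
        raise ValueError("The value of k should be odd.")
    # Sort the labeled points from nearest to farthest from the example
    by_distance = sorted(labeled_points,
                         key=lambda item: math.dist(example, item[0]))
    return by_distance[:k]

toy_points = [((0, 0), True), ((1, 1), True), ((5, 5), False),
              ((6, 5), False), ((2, 1), True)]

toy_nearest = k_nearest((0, 1), toy_points, 3)
print(toy_nearest)  # [((0, 0), True), ((1, 1), True), ((2, 1), True)]
```

Calling `k_nearest((0, 1), toy_points, 4)` would raise the `ValueError` instead of returning a list.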

In [None]:
def closest(unlabeled_example, labeled_features_tbl, label_col, k):
    '''
    Return a table of the k (odd) closest neighbors to the unlabeled example
    '''
    if ...:
        # Raise an error since k should be odd
        raise ValueError("The value of k should be odd.")
    else:
        # Use the value of k to make a table of the k closest rows
        distances_tbl = ...
        distances_tbl_sorted = ...
        nearest_k = ...
        return ...

# Test even and odd values of k to see if your function is working

k = ... # Try using k = 4 and k = 5
try:
    result = closest(an_unlabeled_example, penguins, 'Adelie', k)
    display(result)
except ValueError as the_error:
    display(the_error)

In [None]:
grader.check("task_06")

### Task 07 📍

Using the fact that `closest` returns the `k` closest rows in the data table to the provided example, create a function called `majority_class` that takes a table (`top_k_closest`) as input and returns the majority label associated with those `k` rows.
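
A majority vote simply counts how often each label appears and picks the most common one. The sketch below shows this on a hypothetical list of five labels using `collections.Counter` from the standard library; with tables, grouping by the label column is one way to get the same counts.

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors
example_labels = [True, False, True, True, False]

counts = Counter(example_labels)

# most_common(1) returns a list with the single (label, count) pair
# that has the highest count
majority_label, majority_count = counts.most_common(1)[0]
print(majority_label, majority_count)  # True 3
```

Because `k` is odd and there are only two labels, the vote can never end in a tie.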

In [None]:
def majority_class(top_k_closest):
    ...

In [None]:
grader.check("task_07")

### Task 08 📍

Finally, using the `majority_class` and `closest` functions, create your kNN classifier as a function called `knn_classifier`. 

1. The function should have 4 input arguments:
    * `unlabeled_example`: an example row (e.g. `an_unlabeled_example`)
    * `labeled_features_tbl`: a table of labeled features (e.g. `penguins`)
    * `label_col`: the label of the column that contains the labels (e.g. `'Adelie'`)
    * `k`: an integer value that indicates the number of nearest points to consider (e.g. `5`)
2. The function should return a `bool` value for whether the unlabeled example is an Adélie penguin or not using the kNN algorithm.

In [None]:
def knn_classifier(unlabeled_example, labeled_features_tbl, label_col, k):
    ...

In [None]:
grader.check("task_08")

## Classifier Accuracy

Great work building a classifier! It did a great job classifying the first row of the test set, but how good of a job does it do in general?

A standard way to analyze how well a classifier does is to measure its accuracy in predicting the labels of the data in the test set. You can do this by:

1. Using the classifier (including the training data) to predict the labels for the data in the test set.
2. Define the accuracy of the model to be the proportion of the correctly labeled data points in the testing data set.

Over the next few tasks, we'll help you define the accuracy of your classifier.
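
As a quick illustration of the accuracy calculation, suppose a classifier made the five hypothetical predictions below. Comparing each prediction to the true label gives one `True`/`False` value per row, and the accuracy is the proportion of `True` values.

```python
# Hypothetical true labels and predictions for 5 test rows
true_labels = [True, False, True, True, False]
predicted = [True, False, False, True, False]

# True where the prediction matches the true label
matches = [t == p for t, p in zip(true_labels, predicted)]

# In a sum, True counts as 1 and False counts as 0
example_accuracy = sum(matches) / len(matches)
print(example_accuracy)  # 0.8
```

Here 4 of the 5 predictions are correct, so the accuracy is 80%.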

### Task 09 📍

Now, use the `train` and `test` tables to calculate the accuracy of your kNN classifier `knn_classifier`. We'll guide you through this in a few steps:
1. Use the `train` set with the `knn_classifier` and `k=3` to create an array `predicted_labels` of the predicted labels for all the rows of the `test` table.
2. Create a table called `test_with_predictions` that is a copy of `test` with an added column called `'Predicted Adelie'` containing the predictions in `predicted_labels`.
3. Create an array called `success` that contains `bool` values showing whether or not the predicted label was correct for each row of the test set.
4. Assign `accuracy_k_3` to the proportion of `True` values in `success`.

In [None]:
# Step 1
predicted_labels = make_array()

for test_row in test.drop('Adelie').rows:
    predicted_label = ...
    predicted_labels = np.append(predicted_labels, predicted_label)

# Step 2
test_with_predictions = ...

# Step 3
success = test_with_predictions.column('Adelie') == ...

# Step 4
accuracy_k_3 = ...

print(f'The accuracy of your kNN classifier (with k = 3) using your randomly \
generated train and test sets is: {accuracy_k_3*100:.02f}%.')

In [None]:
grader.check("task_09")

Now, when you apply the classifier you built to a new data point (a pair of culmen length and flipper length measurements), you can be more confident that it will correctly label the species of the penguin.

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

Fine-tuning existing classifiers to improve performance metrics like accuracy is a big part of machine learning. Since kNN classifiers use all the available data to make a prediction, there are not many parameters to adjust to improve accuracy. 

For this task, try at least one of the following ideas to improve the accuracy of your classifier:
* Find an optimal value for `k`. In this class, that means trying some different values of `k`.
* Pick different or more features. We used culmen length and flipper length, but maybe there is a better pair of features or maybe it is better to consider more features at once.
* Standardize the data so that the scale of the data values for each feature doesn't cause a bias in one feature. Maybe culmen length and flipper length are great, but if we standardize the data, then we might see that flipper length has less of a biased influence on the classification. Right now flipper lengths are above 100 in general, but culmen lengths are less than 100 in general.
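
To see what standardizing does, the sketch below converts two hypothetical lists of measurements to standard units (subtract the mean, then divide by the standard deviation). The helper `standard_units` and the numbers are assumptions for illustration, and the population standard deviation is used here.

```python
import statistics

def standard_units(values):
    """Hypothetical helper: convert a list of numbers to standard units."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical measurements on very different scales
flipper_mm = [180.0, 190.0, 200.0, 210.0]
culmen_mm = [36.0, 38.0, 40.0, 42.0]

flipper_su = standard_units(flipper_mm)
culmen_su = standard_units(culmen_mm)
print(flipper_su)
print(culmen_su)
```

Although the original flipper lengths are spread over 30 mm and the culmen lengths over only 6 mm, both lists have the same pattern, so they end up (up to rounding) with the same values in standard units. After standardizing, neither feature dominates the distance calculation just because of its scale.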

In [None]:
...

<!-- END QUESTION -->

## Submit your Lab to Canvas

Once you have finished working on the lab questions, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the requirements for a Complete score for this lab assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items `File`, `Save and Export Notebook As...`, and `Html_embed` in the notebook's Toolbar to download an HTML version of this notebook file.
5. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded HTML file.

## Citations

1. Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f (Accessed 2024-05-02).

2. Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguins (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 8. Environmental Data Initiative. https://doi.org/10.6073/pasta/ce9b4713bb8c065a8fcfd7f50bf30dde (Accessed 2024-05-02).

3. Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female gentoo penguins (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 7. Environmental Data Initiative. https://doi.org/10.6073/pasta/9fc8f9b5a2fa28bdca96516649b6599b (Accessed 2024-05-02).

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()