In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla13.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 13: kNN Classification

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Describe kNN classification
2. Outline a procedure for creating and evaluating a kNN classifier
3. Build a kNN classifier for a real data set
4. Assess the quality of the classifier

---

## Configure the Notebook

Run the following code cell to set up the notebook.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Understanding k-Nearest Neighbors (kNN)

---

### Goal

Use the relationship between numerical features and categorical labels to predict the label of a new observation from its numerical features alone.

---

### How kNN Works

- Predict labels using the majority vote of k nearest known data points
- We measure distance using Euclidean distance
- The value of k is selected during the training process
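
The ideas above can be sketched on a small example. This is a minimal illustration using plain NumPy rather than the course's `datascience` tables; the data values and the helper name `knn_predict` are made up for this sketch.

```python
import numpy as np

# Toy training data (made up for illustration): two numerical features per point
train_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                         [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(points, labels, new_point, k):
    """Predict a 0/1 label by majority vote among the k nearest points."""
    # Euclidean distance from new_point to every training point
    distances = np.sqrt(np.sum((points - new_point) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote: predict 1 if more than half of the neighbors are labeled 1
    return int(np.sum(labels[nearest]) > k / 2)

prediction_low = knn_predict(train_points, train_labels, np.array([1.2, 1.4]), k=3)
prediction_high = knn_predict(train_points, train_labels, np.array([6.4, 6.2]), k=3)
print(prediction_low, prediction_high)
```

Using an odd k with two classes avoids tied votes.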

---

### What Makes a kNN Classifier?

A training set + an optimal k value + selected features

---

### Evaluating Quality
- A metric: Accuracy -> proportion of correctly labeled points
- Data splitting strategy:
  - Training set: builds the model
  - Validation set: optimizes k value
  - Test set: evaluates final performance (used only once)
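
As a small illustration of the accuracy metric, the sketch below compares made-up true labels with made-up predictions. The arrays are hypothetical and are not taken from any data set in this activity.

```python
import numpy as np

# Hypothetical labels for five points (made up for illustration)
true_labels = np.array([1, 0, 1, 1, 0])
predicted_labels = np.array([1, 0, 0, 1, 0])

# Accuracy is the proportion of predictions that match the true labels.
# Comparing the arrays gives True/False values; their mean is that proportion.
accuracy = np.mean(true_labels == predicted_labels)
print(accuracy)
```

Here four of the five predictions match, so the accuracy is 4/5.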

---

### Improving Performance

- **Standardization**: Ensures no feature dominates due to scale
- **Feature Selection**: Choose non-redundant features that best distinguish between classes
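
Standardization can be sketched as follows. This is a minimal illustration with made-up numbers. It assumes that the mean and standard deviation come from the training set only and are then reused for the other sets.

```python
import numpy as np

# Made-up values of a single feature (for illustration only)
train_feature = np.array([10.0, 20.0, 30.0, 40.0])
validation_feature = np.array([25.0, 45.0])

# Compute the mean and SD from the training set only
train_mean = np.mean(train_feature)
train_sd = np.std(train_feature)

# Apply the same training-set statistics to both sets
train_standardized = (train_feature - train_mean) / train_sd
validation_standardized = (validation_feature - train_mean) / train_sd

print(train_standardized)
print(validation_standardized)
```

After this transformation the training values have mean 0 and SD 1. The validation values are on the same scale, though their own mean and SD will generally differ from 0 and 1.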

---

### Pipeline

The following flowchart outlines a process for converting a labeled data set into a kNN classifier that is ready to make predictions.

```mermaid
flowchart TD
    A["Raw Data"] --> B["Clean Data"]

    B -->|"e.g. 80%"| C["(Pre) Training Set"]
    B -->|"e.g. 20%"| D["Testing Set"]

    C -->|"e.g. 75%"| E["Training Set"]
    C -->|"e.g. 25%"| F["Validation Set"]

    E --> G["Standardize & Select Features"]
    
    subgraph Transform["Transform Data"]
        H1["Transform Training Set"]
        H2["Transform Validation Set"]
        H3["Transform Testing Set"]
    end

    G --> H1
    G --> H2
    G --> H3

    F --> H2
    D --> H3

    H1 --> I["Evaluate Accuracy for several\n k values on Validation Set"]
    H2 --> I
    I --> J["Pick Optimal k"]

    J --> K["Evaluate Final Accuracy on Testing Set"]
    H3 --> K

```
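
The two splitting steps in the flowchart can be sketched with row indices. This is only an illustration in plain NumPy: the row count of 100, the seed, and the variable names are assumptions made for this sketch, and the activity itself works with `datascience` Tables instead.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n_rows = 100  # hypothetical number of rows in the cleaned data
shuffled = rng.permutation(n_rows)  # row indices in random order

# First split: 80% pre-training, 20% testing
n_pre_train = int(0.8 * n_rows)
pre_train_idx, test_idx = shuffled[:n_pre_train], shuffled[n_pre_train:]

# Second split: 75% of pre-training for training, 25% for validation
n_train = int(0.75 * len(pre_train_idx))
train_idx, validation_idx = pre_train_idx[:n_train], pre_train_idx[n_train:]

print(len(train_idx), len(validation_idx), len(test_idx))
```

Because every row index appears exactly once in the shuffled order, no row ends up in more than one of the three sets.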

---

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

Feature selection is an important part of the classification process. The following graphic shows two scatterplots using a [wine quality data set](https://archive.ics.uci.edu/dataset/186/wine+quality) and two different sets of features.

<img src="./feature_selection_wine.png" width=800px alt="Two scatterplots showing various class separation based on two different feature selections">

Which feature selection seems more useful for classifying wine based on the provided category labels?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Detecting Gallstone Disease

---

### Gallstone Disease

<a href="https://en.wikipedia.org/wiki/Gallstone#/media/File:Gallstones.png" target="_blank"><img src="./gallstones.png" width=400px alt="A gallstone blocking a bile duct"></a>

According to [Johns Hopkins Medicine](https://www.hopkinsmedicine.org/health/conditions-and-diseases/gallstone-disease):

> Gallstone disease is the most common disorder affecting the biliary system, the body's system of transporting bile. Gallstones are solid, pebble-like masses that form in the gallbladder or the biliary tract (the ducts leading from the liver to the small intestine). They form when the bile hardens and are caused by an excess of cholesterol, bile salts or bilirubin.

---

### Research

A recent paper titled [_Early prediction of gallstone disease with a machine learning-based method from bioimpedance and laboratory data_](https://journals.lww.com/md-journal/fulltext/2024/02230/early_prediction_of_gallstone_disease_with_a.40.aspx) showed that vitamin D, C-reactive protein (CRP) level, total body water, and lean mass are crucial features in predicting gallstones.

---

### Data

The UC Irvine Machine Learning Repository hosts [the dataset from this study](https://archive.ics.uci.edu/dataset/1150/gallstone-1), which contains information on 319 individuals, 161 of whom were diagnosed with gallstone disease.

---

### Task 02 📍

The gallstone data is contained within the file `dataset-uci.csv`. Assign `gallstone` to a Table containing that data.

In [None]:
gallstone = ...
gallstone

In [None]:
grader.check("task_02")

---

### Task 03 📍

Split `gallstone` into two Tables. One table called `pre_train` should contain a random selection of 80% of the data in `gallstone`, and the other table called `test` should contain the remaining rows of `gallstone`.

In [None]:
np.random.seed(123) # Included for reproducibility
...
display(pre_train)

In [None]:
grader.check("task_03")

---

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

Next, we'd want to clean the pre-training data by checking for missing values, typos, outliers, etc. To keep this activity focused, visualize the distributions of vitamin D, C-reactive protein level, total body water, and lean mass, and check them for outliers or missing values.

**Note**: There is no right way to do this. You are just exploring for any anomalies in the key features outlined by the research paper.

In [None]:
key_features = ['Vitamin D', 'C-Reactive Protein (CRP)',
                'Total Body Water (TBW)', 'Lean Mass (LM) (%)']
...

<!-- END QUESTION -->

---

### Outlier Detection

We've created a function for you that takes an array of values and returns the indices of the values considered outliers by the standard 1.5 IQR method. When the array is a column of a table, these indices correspond to row indices of that table. Run the following code cell to define that function and test it on an array with two outliers.

In [None]:
def detect_outlier_indices(data):
    """
    Detects outliers in an array using the 1.5 * IQR method.
    
    Parameters:
        data: Input data array
        
    Returns:
        indices (list): List of indices corresponding to outlier values
    """
    q1 = percentile(25, data)
    q3 = percentile(75, data)
    iqr = q3 - q1

    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    outlier_indices = np.where((data < lower_bound) | (data > upper_bound))[0]
    return outlier_indices.tolist()

# Testing the function on an array with 2 outliers
detect_outlier_indices(make_array(-20, 1, 2, 3, 4, 50))

The function should return a list of 2 numbers that correspond to the fact that `-20` and `50` are outliers (by the 1.5 IQR method) for the given data set.
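
To see where that result comes from, the bounds can be worked out step by step. The sketch below uses plain NumPy with a hypothetical helper, `nearest_rank_percentile`. It assumes that `percentile` returns the smallest value that is at least as large as the given percentage of the data. If `percentile` is defined differently, the quartiles may differ slightly from the ones shown here.

```python
import math
import numpy as np

data = np.array([-20, 1, 2, 3, 4, 50])

def nearest_rank_percentile(p, values):
    """Smallest value that is at least as large as p% of the values
    (an assumed stand-in for the percentile function used above)."""
    ordered = np.sort(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

q1 = nearest_rank_percentile(25, data)   # 2nd smallest value: 1
q3 = nearest_rank_percentile(75, data)   # 5th smallest value: 4
iqr = q3 - q1                            # 4 - 1 = 3

lower_bound = q1 - 1.5 * iqr             # 1 - 4.5 = -3.5
upper_bound = q3 + 1.5 * iqr             # 4 + 4.5 = 8.5

# Values outside [lower_bound, upper_bound] are flagged as outliers
outlier_indices = np.where((data < lower_bound) | (data > upper_bound))[0].tolist()
print(outlier_indices)
```

Only `-20` (index 0) and `50` (index 5) fall outside the interval from -3.5 to 8.5.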

---

### Task 05 📍

You should have noticed some potential outliers in two of the features. Use `detect_outlier_indices` to remove the rows of the pre-training data that correspond to those outliers. Assign the resulting table to `clean_pre_train`.

In [None]:
clean_pre_train = ...
clean_pre_train

In [None]:
grader.check("task_05")

---

### Task 06 📍

Further split `clean_pre_train` into two Tables. One table called `train` should contain a random selection of 75% of the data in `clean_pre_train`, and the other table called `validation` should contain the remaining rows of `clean_pre_train`.

In [None]:
np.random.seed(1234) # Included for reproducibility
...
display(train)

In [None]:
grader.check("task_06")

---

### Task 07 📍

Next, you need to standardize the data in the 3 data sets based on the values in the training set. We've set up the code to standardize the key features from the data and then transform the train, test, and validation tables by selecting only the standardized features and the Gallstone Status label. You just need to get the average and standard deviation!

In [None]:
for feature in key_features:
    mean_value_train = ...
    SD_value_train = ...

    for tbl_name, tbl in [('train', train), ('validation', validation), ('test', test)]:
        transformed_feature_values = (tbl.column(feature) - mean_value_train) / SD_value_train
        new_column_name = feature + '_standardized'
        tbl = tbl.with_column(new_column_name, transformed_feature_values)

        if tbl_name == 'train':
            train = tbl
        elif tbl_name == 'validation':
            validation = tbl
        else:
            test = tbl

# Transform train, validation, and test sets
standardized_features = ['Gallstone Status', 'Vitamin D_standardized', 
                         'C-Reactive Protein (CRP)_standardized', 'Total Body Water (TBW)_standardized',
                         'Lean Mass (LM) (%)_standardized']

train = train.select(standardized_features)
validation = validation.select(standardized_features)
test = test.select(standardized_features)
display(train)

In [None]:
grader.check("task_07")

---

### Task 08 📍

Complete the following functions created in [the MATH 108 textbook](https://ccsf-math-108.github.io/textbook) that were used to build a kNN classifier function called `classify`.

In [None]:
def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array 
    consisting of the coordinates of the point"""
    return ...

def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop(...)
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    return attributes.apply(distance_from_point)

def table_with_distances(training, new_point):
    """Augments the training table 
    with a column of distances from new_point"""
    return training.with_column('Distance', all_distances(training, new_point))

def closest(training, new_point, k):
    """Returns a table of the k rows of the augmented table
    corresponding to the k smallest distances"""
    with_dists = table_with_distances(training, new_point)
    sorted_by_distance = ...
    topk = ...
    return topk

def majority(topkclasses):
    ones = topkclasses.where(..., are.equal_to(1)).num_rows
    zeros = topkclasses.where(..., are.equal_to(0)).num_rows
    if ones > zeros:
        return 1
    else:
        return 0

def classify(training, new_point, k):
    closestk = ...
    topkclasses = ...
    return ...

In [None]:
grader.check("task_08")

---

### Task 09 📍

Complete the following functions created in [the MATH 108 textbook](https://ccsf-math-108.github.io/textbook) that were used to define a function called `evaluate_accuracy` that is used to evaluate the accuracy of `classify` based on a training set and a test/validation set.

In [None]:
def count_zero(array):
    """Counts the number of 0's in an array"""
    return len(array) - np.count_nonzero(array)

def count_equal(array1, array2):
    """Takes two numerical arrays of equal length
    and counts the indices where the two are equal"""
    return ...

def evaluate_accuracy(training, testing, k):
    test_attributes = testing.drop('Gallstone Status')
    def classify_testrow(row):
        return classify(training, row, k)
    c = test_attributes.apply(classify_testrow)
    return ...

In [None]:
grader.check("task_09")

---

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

Using the transformed training and validation sets and several values of `k`, create a visual showing the accuracy values for each `k`. Which `k` value seems optimal to use?

_Type your answer here, replacing this text._

In [None]:
accuracy_array = make_array()
ks = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
for k in ks:
    accuracy_array = ...

Table().with_columns(
    'k', ks,
    'Accuracy', accuracy_array)

<!-- END QUESTION -->

---

### Task 11 📍

Now that you've fine-tuned your kNN classifier by picking out an optimal value of `k`, calculate the accuracy of the classifier using the test set. Assign that value to `classifier_accuracy`.

In [None]:
classifier_accuracy = ...
classifier_accuracy

In [None]:
grader.check("task_11")

---

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

If you evaluate the accuracy with the test set based on `k=5`, you'll actually find the accuracy is higher! Why is it bad to change your classifier to use `k=5`?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Reflection

In this activity, you reviewed the main ideas behind kNN classification, including a workflow for building a classifier. Additionally, you built a kNN classifier and applied it to a data set on gallstone disease detection. Finally, you evaluated the accuracy of the classifier and reflected on how to handle the test set.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>