# Assignment 1: k-Nearest Neighbors

Fill in your name and student ID here.
- Name: Ong Jia Xi
- Student ID: A0276092Y

## Overview

In this assignment, we'll implement KNN step-by-step:

1. **Euclidean Distance:** Measure similarity between data points.
2. **Find Neighbors:** Select the 'K' closest points to a query point.
3. **Predict (Classification and Regression):** Use a majority vote over the neighbors' labels for classification and the average of the neighbors' values for regression.
4. **Build KNN Class:** Combine everything into a reusable class.
5. **Practical**: Train a KNN classifier on the training dataset using scikit-learn.

By the end, you'll understand how KNN works and how it makes predictions based on distance. Let’s dive in!

## Instructions

1. Fill in your name and student ID at the top of the ipynb file.
2. The parts you need to implement are clearly marked with the following:

    ```
    """ YOUR CODE STARTS HERE """

    """ YOUR CODE ENDS HERE """
    ```

    You must **ONLY** write your code between these two lines.
3. **IMPORTANT**: Make sure that all of the cells run without raising an exception, even if the answer is incorrect. This will significantly help us in grading your solutions.
4. For Tasks 1 to 4, you are only allowed to use basic Python functions in your code (no `NumPy` or its equivalents), unless otherwise stated. You may reuse any functions you have defined earlier. If you are unsure whether a particular function is allowed, feel free to ask any of the TAs.
5. For Task 5, you may use the `scikit-learn` library.
6. Your solutions will be evaluated against a set of hidden test cases to prevent hardcoding of the answer. You may assume that the test cases are always valid and do not require exception or input-mismatch handling. Partial marks may be given for partially correct solutions.

### Submission Instructions
Items to be submitted:
* **This notebook, NAME-STUID-assignment1.ipynb**: This is where you fill in all your code. Replace "NAME" with your full name and "STUID" with your student ID, which starts with "A", e.g. `"John Doe-A0123456X-assignment1.ipynb"`

Submit your assignment to Canvas by **Sunday, 31 August 23:59**. Points will be deducted for late submissions.


## Task 1 - Compute Euclidean Distance [1 Point]

To find the nearest neighbors, we first need a distance measure to determine how **close** two data points are. 

A possible distance measure is **Euclidean distance**—the straight-line distance between points $\mathbf{p} = [p_1, p_2, ..., p_m]$ and $\mathbf{q} = [q_1, q_2, ..., q_m]$ in $m$ dimensions:

$$
d_\text{euclidean}(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{m} (p_i - q_i)^2}
$$

Implement a function that calculates the Euclidean distance using `math.sqrt()`, without using `numpy`.
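
As a quick sanity check of the formula, the sketch below works through one pair of vectors by hand and compares the result with `math.dist`, which computes the same Euclidean distance. It assumes Python 3.8 or later, where `math.dist` is available; the variable names are illustrative and not part of the assignment template.

```python
import math

p = [1, 2, 3]
q = [4, 5, 6]

# Squared difference per dimension: (1-4)^2, (2-5)^2, (3-6)^2 = 9, 9, 9
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]

# Square root of the summed squared differences: sqrt(27)
manual_distance = math.sqrt(sum(squared_diffs))

# math.dist (Python 3.8+) computes the same quantity and serves as a cross-check
reference_distance = math.dist(p, q)
```

Both values should agree up to floating-point rounding, which is why the test cases compare with `math.isclose` rather than `==`.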

In [1]:
# TASK 1
import math 

def euclidean_distance(vec_p, vec_q):
    """
    TODO: Write a function to compute distance between two vectors.
    Call the math.sqrt() function to compute the square root of the sum of squared differences.

    Args:
        vec_p: List or tuple p, with a list of m numbers
        vec_q: List or tuple q, with a list of m numbers

    Returns:
        A float representing the Euclidean distance between the two vectors
    """

    distance = 0

    """ YOUR CODE STARTS HERE """   
    # Fail fast on invalid input instead of printing a message and computing anyway
    if len(vec_p) == 0 or len(vec_q) == 0:
        raise ValueError('Please provide vector of correct dimension!')
    if len(vec_p) != len(vec_q):
        raise ValueError('Vector dimension is inconsistent!')

    # Both vectors have the same length at this point
    v_len = len(vec_p)

    # Track sum of squared difference
    ssd = 0 
    for i in range(v_len):
        ssd += (vec_p[i] - vec_q[i]) ** 2 
    
    distance = math.sqrt(ssd)
    """ YOUR CODE ENDS HERE """

    return distance


# TESTCASES 1
assert math.isclose(euclidean_distance([1, 2, 3], [4, 5, 6]), 5.196152422706632, rel_tol=1e-5)
assert math.isclose(euclidean_distance([5.5, 5.5], [5.5, 5.5]), 0.0, rel_tol=1e-5)
assert math.isclose(euclidean_distance([-6], [-6]), 0.0, rel_tol=1e-5)
print('All test cases passed!') 

All test cases passed!


## Task 2 - Get the K Nearest Neighbors [1 Point]

Now that we can measure distance, let’s use that to find the `k` closest training points to a given test point. If there are multiple points with the same distance, we keep the ones that appear first in the training data.
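
The tie-breaking rule follows naturally from a stable sort. The sketch below is a minimal illustration that assumes hypothetical precomputed distances rather than the assignment's data: Python's `sorted` is stable, so entries with equal distance keep their original relative order.

```python
# Hypothetical (distance, label) pairs, listed in training-data order
distance_label_pairs = [(2.0, 'A'), (1.0, 'B'), (2.0, 'C'), (1.0, 'D')]

# sorted() is stable: pairs with equal distance keep their original relative order
ranked = sorted(distance_label_pairs, key=lambda pair: pair[0])

k = 3
nearest_labels = [label for _, label in ranked[:k]]
```

Here `'B'` and `'D'` tie at distance 1.0 and stay in input order, and `'A'` wins the tie at distance 2.0 over `'C'` because it appears earlier.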

In [5]:
# TASK 2
def get_k_nearest_neighbors(training_data, test_point, k):
    """
    TODO: Return the k nearest neighbors to the test_point.

    Args:
        training_data: list of tuples [(feature_vector, label), ...]
        test_point: list of numbers (the point we're classifying)
        k: number of neighbors to consider
        
    Returns:
        list of labels of the k nearest neighbors in the correct order
    """

    k_neighbors = []
    
    """ YOUR CODE STARTS HERE """  
    
    # Compute all the distances of one test point to the other points in training_data
    distances = []
    for train_pt in training_data:
        X_train = train_pt[0]
        distances.append(euclidean_distance(test_point, X_train))

    # Sort by distance; ties are broken by the original index in training_data
    distances = sorted(enumerate(distances), key=lambda x: (x[1], x[0]))

    # Collect the labels (or target values) of the k closest points, nearest first
    for idx, _ in distances[:k]:
        label = training_data[idx][1]
        k_neighbors.append(label)
    """ YOUR CODE ENDS HERE """

    return k_neighbors

# TESTCASES 2.1
training_data = [
    ([1, 2], 'A'),
    ([2, 3], 'B'),
    ([3, 4], 'A'),
    ([5, 5], 'B')
]

assert get_k_nearest_neighbors(training_data, [1.5, 2.5], k=2) == ['A', 'B']
assert get_k_nearest_neighbors(training_data, [4, 4], k=1) == ['A']
assert get_k_nearest_neighbors(training_data, [0, 0], k=3) == ['A', 'B', 'A']
print('All test cases passed!') 


All test cases passed!


## Task 3 - Prediction [2 Points]

### Task 3.1 - Compute Majority Voting [1 Point]

Once we have the `k` nearest neighbors, we need to decide the final label, which is the label that appears the most frequently. If there is a tie, return the label that appears first in the input list.
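
One possible way to express this rule uses `collections.Counter` from the standard library. This is a sketch under assumptions, not the required approach: whether `Counter` counts as a "basic Python function" under the rules above is something to confirm with the TAs, and the first-encountered tie order relies on Python 3.7 or later.

```python
from collections import Counter

labels = ['B', 'A', 'A', 'B', 'C']

# Counter keeps insertion order, and most_common() lists labels with equal
# counts in the order they were first encountered (Python 3.7+)
counts = Counter(labels)
winner, votes = counts.most_common(1)[0]
```

In this example `'B'` and `'A'` both receive two votes, and `'B'` is selected because it appears first in the input list.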

In [6]:
# TASK 3.1
def knn_majority_vote(neighbors):
    """
    TODO: Return the most common label in neighbors.

    Args:
        neighbors: list of labels

    Returns:
        The label that appears most frequently
        If there's a tie, return the label that appears first
    """
    
    most_common_label = None
    
    """ YOUR CODE STARTS HERE """ 
    # Count how often each label occurs
    counts = {}
    for label in neighbors:
        counts[label] = counts.get(label, 0) + 1

    # Scan in input order so that a tie goes to the label that appears first
    majority_count = max(counts.values())
    for label in neighbors:
        if counts[label] == majority_count:
            most_common_label = label
            break

    """ YOUR CODE ENDS HERE """ 

    return most_common_label

# TESTCASES 3.1
assert knn_majority_vote(['A', 'B', 'A']) == 'A'
assert knn_majority_vote(['B', 'B', 'A']) == 'B'
assert knn_majority_vote(['A', 'A', 'A']) == 'A'
assert knn_majority_vote(['A', 'B']) == 'A'
assert knn_majority_vote(['B', 'A']) == 'B'
print('All test cases passed!') 


All test cases passed!


### Task 3.2 - Compute KNN Regression Prediction [1 Point]

Now, you will implement K-Nearest Neighbors (KNN) for regression tasks.
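
For reference, a common way to write the KNN regression prediction is as the unweighted mean of the targets of the $k$ nearest neighbors. The notation below ($\hat{y}$, $N_k$) is introduced here for illustration and does not come from the assignment text:

$$
\hat{y}(\mathbf{x}_\text{query}) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x}_\text{query})} y_i
$$

where $N_k(\mathbf{x}_\text{query})$ denotes the indices of the $k$ training points closest to the query under the Euclidean distance from Task 1. For example, with training inputs $1, 2, 3, 4, 5$, targets $1.1, 1.9, 3.0, 3.9, 5.1$, query $2.5$ and $k = 2$, the nearest neighbors are the points at $2$ and $3$, so $\hat{y} = (1.9 + 3.0) / 2 = 2.45$.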

In [7]:
# TASK 3.2
def knn_regression(X_train, y_train, x_query, k):
    """
    TODO: Implement the KNN regression algorithm.

    Args:
        X_train (list[list[float]]): Training features
        y_train (list[float]): Target values
        x_query (list[float]): Query point
        k (int): Number of neighbors

    Returns:
        float: Predicted target value by averaging the k nearest neighbors
    """

    prediction = 0

    """ YOUR CODE STARTS HERE """ 
    training_data = [(X, y) for X, y in zip(X_train, y_train)]
    
    # Get k nearest neighbors
    neighbors = get_k_nearest_neighbors(training_data, x_query, k)

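    # Average the targets of the neighbors actually returned
    # (this stays correct even if k exceeds the number of training points)
    prediction = sum(neighbors) / len(neighbors)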

    """ YOUR CODE ENDS HERE """ 

    return prediction

# TESTCASES 3.2
X_train = [[1], [2], [3], [4], [5]]
y_train = [1.1, 1.9, 3.0, 3.9, 5.1]
x_query = [2.5]
assert math.isclose(knn_regression(X_train, y_train, x_query, 2), 2.45, rel_tol=1e-5)

X_train = [[1], [2], [3]]
y_train = [1, 2, 3]
x_query = [2.1]
assert math.isclose(knn_regression(X_train, y_train, x_query, 1), 2, rel_tol=1e-5)

X_train = [[1], [2], [3]]
y_train = [1, 2, 3]
x_query = [2]
assert math.isclose(knn_regression(X_train, y_train, x_query, 3), 2, rel_tol=1e-5)

X_train = [[1, 2], [2, 3], [3, 4]]
y_train = [10, 20, 30]
x_query = [2, 2.5]
assert math.isclose(knn_regression(X_train, y_train, x_query, 2), 15.0, rel_tol=1e-5)

print('All test cases passed!') 


All test cases passed!


## Task 4 - Wrapping in Classes [4 Points]

### Task 4.1 -  KNN Classifier [2 Points]

Here we combine everything into a reusable class to train your own model and make predictions! 

In [10]:
# TASK 4.1
class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.training_data = []  # Will hold tuples of (feature_vector, label)

    def fit(self, X, y):
        """
        TODO: Store the training data. DO NOT return anything!

        Args:
            X: list of feature vectors
            y: list of labels corresponding to the feature vectors
        """

        """ YOUR CODE STARTS HERE """ 
        self.training_data = [(feature, label) for feature, label in zip(X, y)]
        """ YOUR CODE ENDS HERE """ 


    def predict(self, X_test):
        """
        TODO: Predict the class label for each test point in X_test.

        Args:
            X_test: list of feature vectors to classify

        Returns:
            list of predicted labels 
        """
    
        predictions = []

        """ YOUR CODE STARTS HERE """ 
        for test in X_test:
            pred = knn_majority_vote(get_k_nearest_neighbors(self.training_data, test, self.k))
            predictions.append(pred)
        """ YOUR CODE ENDS HERE """ 
        
        return predictions

# TESTCASES 4.1
knn = KNNClassifier(k=3)
X_train = [[1,2],[2,3],[3,4],[5,5]]
y_train = ['A','B','A','B']
knn.fit(X_train, y_train)

X_test = [[1.5,2.5],[4,4]]
assert knn.predict(X_test) == ['A', 'B']

knn = KNNClassifier(k=1)
X_train = [[1,1],[2,2],[3,3],[4,4]]
y_train = ['A','A','B','B']
knn.fit(X_train, y_train)

X_test = [[1.5,1.5],[3.5,3.5]]
assert knn.predict(X_test) == ['A', 'B']

print('All test cases passed!') 

All test cases passed!


### Task 4.2 - KNN Regressor [2 Points]

Similarly, we do the same for the regressor.

In [12]:
# TASK 4.2
class KNNRegressor:
    def __init__(self, k=3):
        self.k = k
        self.training_data = []  # Will hold tuples of (feature_vector, target_value)

    def fit(self, X, y):
        """
        TODO: Store the training data. DO NOT return anything!

        Args:
            X: list of feature vectors
            y: list of target values corresponding to the feature vectors
        """

        """ YOUR CODE STARTS HERE """ 
        self.training_data = [(feature, label) for feature, label in zip(X, y)]
        """ YOUR CODE ENDS HERE """ 

    def predict(self, X_test):
        """
        TODO: Predict the target value for each test point in X_test

        Args:
            X_test: list of feature vectors to predict

        Returns:
            list of predicted target values 
        """

        predictions = []

        """ YOUR CODE STARTS HERE """ 
        for test in X_test:
            neighbors = get_k_nearest_neighbors(self.training_data, test, self.k)
            # Divide by the number of neighbors actually returned (may be fewer than k)
            pred = sum(neighbors) / len(neighbors)
            predictions.append(pred)
        """ YOUR CODE ENDS HERE """ 

        return predictions

# TESTCASES 4.2
regressor = KNNRegressor(k=3)
X_train = [[1], [2], [3], [4], [5]]
y_train = [10.0, 20.0, 30.0, 40.0, 50.0]
regressor.fit(X_train, y_train)
X_test = [[2.5], [4.5]]
predictions = regressor.predict(X_test)
assert math.isclose(predictions[0], 20.0)
assert math.isclose(predictions[1], 40.0)

regressor = KNNRegressor(k=1)
X_train = [[1, 1], [2, 2], [3, 3]]
y_train = [10.0, 20.0, 30.0]
regressor.fit(X_train, y_train)
X_test = [[1.2, 1.2], [2.8, 2.8], [0.5, 0.5]]
predictions = regressor.predict(X_test)
assert math.isclose(predictions[0], 10.0)
assert math.isclose(predictions[1], 30.0)
assert math.isclose(predictions[2], 10.0)

print('All test cases passed!')     

All test cases passed!


## Task 5 - Practical [2 Points]

Train a KNN classifier on the training dataset using `scikit-learn` and tune its hyperparameters to optimize performance. We recommend using PCA as described in the lecture (or any other preprocessing method that you think is suitable) to further improve model performance. You may find `make_pipeline()` useful in this task.
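
As a minimal sketch of how `make_pipeline()` chains PCA and KNN, the example below fits a two-step pipeline on synthetic stand-in data. The data, the variable names, and the hyperparameter values are assumptions chosen for illustration and are not the assignment's dataset or recommended settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 20))
y_demo = np.array([0, 1, 2] * 20)

# make_pipeline names each step after its lowercased class name
demo_pipe = make_pipeline(
    PCA(n_components=5, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
demo_pipe.fit(X_demo, y_demo)

# predict() applies the fitted PCA projection before the KNN lookup
step_names = list(demo_pipe.named_steps)
demo_predictions = demo_pipe.predict(X_demo[:4])
```

The generated step names (`pca`, `kneighborsclassifier`) are what hyperparameter keys such as `pca__n_components` refer to when tuning the pipeline.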

You will get full marks if your modelling is appropriate and performs well. But remember, you **MUST NOT** use or access `X_test` and `y_test` in your code, as this defeats the purpose of a hidden test set. Any model that does so will be given 0 marks.

Make sure that you have installed `scikit-learn` in your Python environment.

**HINT**: Set the `random_state` parameter (if it exists) to a fixed constant to make your model reproducible (same result on every run).
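
One way to estimate performance without touching the test set is cross-validation on the training data alone. The sketch below uses synthetic stand-in data and illustrative settings (both assumptions, not the assignment's data or tuned values) to show that a fixed `random_state` makes the fold assignment, and therefore the scores, repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a training split (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 10))
y_demo = np.array([0, 1, 2] * 30)

# A fixed random_state makes the shuffled fold assignment identical on every run
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold is scored on held-out *training* data, so no test set is involved
scores_first = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_demo, y_demo, cv=folds)
scores_second = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_demo, y_demo, cv=folds)
```

Running the evaluation twice yields the same five scores, which is the kind of reproducibility the hint is asking for.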

In [None]:
# TASK 5.1
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

# [Optional] TODO: Add other sklearn imports for your code 
""" YOUR CODE STARTS HERE """
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.base import BaseEstimator, TransformerMixin
""" YOUR CODE ENDS HERE """

# Load dataset
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = lfw_people.data  # Flattened images
# print(len(X[0])) # 1850 dim
y = lfw_people.target
# print(np.unique(y, return_counts=True)) # 7 classes + imbalanced dataset

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

h, w = lfw_people.images.shape[1], lfw_people.images.shape[2]

class GaussianBlur(BaseEstimator, TransformerMixin):
    """Blur each flattened image with a Gaussian filter."""

    def __init__(self, sigma=0.0, h=h, w=w):
        # Reuse the image shape computed above instead of re-fetching the dataset
        # on every construction, and store parameters unmodified, as scikit-learn's
        # estimator conventions expect.
        self.sigma = sigma
        self.h = h
        self.w = w
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if self.sigma == 0.0:
            return X
        X_reshaped = X.reshape(-1, self.h, self.w) # reshape to (n, h, w) from (n, h*w)
        X_blurred = np.empty_like(X_reshaped)
        # Apply gaussian blur filter 
        for i in range(X_reshaped.shape[0]):
            X_blurred[i] = gaussian_filter(X_reshaped[i], sigma=self.sigma, mode='nearest')
        return X_blurred.reshape(X.shape[0], self.h * self.w)


def train_model(X_train, y_train):
    """
    TODO: Train and return a kNN classifier.
    If using PCA, use a make_pipeline() to combine PCA and kNN.
    When .predict() is called, the model should be able to perform any necessary transformations (like PCA) 
    on the test data automatically.

    Args:
        X_train: Training feature vectors
        y_train: Training labels

    Returns:
        A trained sklearn model, your model will be used to predict the labels of test data
    """

    model = None

    """ YOUR CODE STARTS HERE """ 
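    gauss = GaussianBlur()
    pca = PCA(whiten=True, svd_solver='randomized', random_state=42)
    knn = KNeighborsClassifier()
    # The string 'scaler' creates a placeholder step named 'scaler'. It is not a
    # valid transformer by itself and is always replaced via param_grid['scaler'].
    pipe = make_pipeline(gauss, 'scaler', pca, knn)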

    param_grid = {
        'gaussianblur__sigma': np.linspace(0.0, 3.0, 7),
        'scaler': ['passthrough', StandardScaler(), Normalizer(norm='l2')],
        # 'pca__n_components': range(50, 251, 50),  # first pass: 100, eval_acc = 0.82
        'pca__n_components': range(50, 151, 25),  # second pass: finer search around the first-pass result
        'kneighborsclassifier__n_neighbors': range(1, 44, 2),  # odd k to reduce ties; upper bound sqrt(d) = 43
        'kneighborsclassifier__weights': ['uniform', 'distance'],
        'kneighborsclassifier__metric': ['minkowski', 'cosine'],  # cosine distance is worth trying on high-dimensional data
        }

    f1_scorer = make_scorer(f1_score, average='weighted')  # weighted F1 because the classes are imbalanced

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    grid = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        scoring=f1_scorer, 
        cv=skf,
        verbose=0,
        n_jobs=-1,
        refit=True,
    )
                
    grid.fit(X_train, y_train)

    print("Best CV weighted-F1:", grid.best_score_)
    print("Best params:", grid.best_params_)

    model = grid.best_estimator_
    """ YOUR CODE ENDS HERE """ 

    return model

# TESTCASES 5.1
# Our hidden test cases will use your code to train a model to predict the labels of the test data, not necessarily on the same train-test split.
# Note: If your model is poorly designed or performs poorly, points may be deducted.

model = train_model(X_train, y_train)
# Check if the model can predict
predictions = model.predict(X_test)
assert len(predictions) == len(X_test)
accuracy_score = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy_score:.2f}")

Best CV weighted-F1: 0.8215373207711668
Best params: {'gaussianblur__sigma': np.float64(1.5), 'kneighborsclassifier__metric': 'cosine', 'kneighborsclassifier__n_neighbors': 9, 'kneighborsclassifier__weights': 'distance', 'pca__n_components': 100, 'scaler': StandardScaler()}
Model accuracy: 0.85


## END OF ASSIGNMENT