# Lab Assignment 2 - Part B: k-Nearest Neighbor Classification
Please refer to the `README.pdf` for full laboratory instructions.


## Problem Statement
In this part, you will implement the k-Nearest Neighbor (k-NN) classifier and evaluate it on two datasets:
- **Lenses Dataset**: A small dataset for contact lens prescription
- **Credit Approval (CA) Dataset**: Credit card application data with binary labels (+/-)

### Your Tasks
1. **Preprocess the data**: Handle missing values and normalize features
2. **Implement k-NN** with L2 distance
3. **Evaluate** on both datasets for different values of k
4. **Discuss** your results

### Datasets
The data files are located in the `credit 2017/` folder:
- `lenses.training`, `lenses.testing`
- `crx.data.training`, `crx.data.testing`
- `crx.names` (describes the features)


## Setup


In [5]:
# Library declarations
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter


In [6]:
# Data paths
DATA_PATH = "credit 2017/"

# Load Lenses data
def load_lenses_data():
    """Load the lenses dataset."""
    train_data = np.loadtxt(DATA_PATH + "lenses.training", delimiter=',')
    test_data = np.loadtxt(DATA_PATH + "lenses.testing", delimiter=',')
    
    # First column is ID, last column is label
    # Syntax description: take all rows and include all columns except for first/last column
    X_train = train_data[:, 1:-1]
    y_train = train_data[:, -1]
    X_test = test_data[:, 1:-1]
    y_test = test_data[:, -1]
    
    return X_train, y_train, X_test, y_test

# Load Credit Approval data
def load_credit_data():
    """
    Load the raw Credit Approval dataset as string arrays.
    Note: This dataset contains missing values (?) and mixed types,
    so it still needs preprocessing (see preprocess_credit_data in Task 1).
    """
    # The data is comma-separated with mixed types, so read every field as a string.
    # Missing values stay as '?'; the last column is the label ('+' or '-').
    train_data = np.loadtxt(DATA_PATH + "crx.data.training", delimiter=',', dtype=str)
    test_data = np.loadtxt(DATA_PATH + "crx.data.testing", delimiter=',', dtype=str)

    # All columns except the last are features (same slicing for train and test)
    X_train = train_data[:, :-1]
    y_train = train_data[:, -1]
    X_test = test_data[:, :-1]
    y_test = test_data[:, -1]
    return X_train, y_train, X_test, y_test

# Test loading lenses data
X_train_lenses, y_train_lenses, X_test_lenses, y_test_lenses = load_lenses_data()
print(f"Lenses - Train: {X_train_lenses.shape}, Test: {X_test_lenses.shape}")


Lenses - Train: (18, 3), Test: (6, 3)


## Task 1: Data Preprocessing
For the Credit Approval dataset, you need to:
1. **Handle missing values** (marked with '?'):
   - Categorical features: replace with mode/median
   - Numerical features: replace with label-conditioned mean
2. **Normalize features** using z-scaling:
   $$z_i^{(m)} = \frac{x_i^{(m)} - \mu_i}{\sigma_i}$$

Document exactly how you handle each feature!
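
As a minimal sketch of the two imputation rules above, the toy example below fills a numerical column with the mean of each row's own class and a categorical column with the most frequent observed value. The arrays `num`, `cat`, and `y` are made-up illustration data, not taken from the Credit Approval files.

```python
import numpy as np
from collections import Counter

# Toy data (hypothetical): one numerical column, one categorical column, and labels.
num = np.array([1.0, np.nan, 3.0, 10.0, np.nan, 14.0])
cat = np.array(["a", "b", None, "b", "b", "a"], dtype=object)
y = np.array(["+", "+", "+", "-", "-", "-"])

# Numerical: fill each missing entry with the mean of its own class.
num_filled = num.copy()
for label in ("+", "-"):
    in_class = y == label
    class_mean = np.nanmean(num[in_class])  # mean over observed values only
    num_filled[in_class & np.isnan(num)] = class_mean

# Categorical: fill missing entries with the most frequent observed value.
observed = [v for v in cat if v is not None]
mode_val = Counter(observed).most_common(1)[0][0]
cat_filled = np.array([mode_val if v is None else v for v in cat], dtype=object)

print(num_filled)  # missing '+' entry becomes 2.0, missing '-' entry becomes 12.0
print(cat_filled)  # missing entry becomes 'b'
```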


In [None]:
def preprocess_credit_data(train_file, test_file):
    """
    Preprocess the Credit Approval dataset.
    
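    Steps:
    1. Load the data
    2. Handle missing values (mode for categorical, label-conditioned mean for numerical)
    3. Convert numerical features to float and encode labels ('+' -> 1, '-' -> 0);
       categorical features are kept as strings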
    
    Returns:
    --------
    X_train, y_train, X_test, y_test : numpy arrays
    """
    # Load using pandas (only for loading)
    import pandas as pd
    X_train = pd.read_csv(train_file, header=None, na_values='?').values
    X_test = pd.read_csv(test_file, header=None, na_values='?').values
    
    # Separate features and labels
    y_train = X_train[:, -1]
    y_test = X_test[:, -1]
    X_train = X_train[:, :-1]
    X_test = X_test[:, :-1]
    
    # Feature indices: categorical vs numerical
    categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]
    numerical_features = [1, 2, 7, 10, 13, 14]
    
    # Handle missing values for categorical features (use mode)
    for col in categorical_features:
        # Find NaN mask
        mask = np.array([isinstance(x, float) and np.isnan(x) for x in X_train[:, col]])
        
        if mask.any():
            # Get mode using Counter
            non_missing = X_train[~mask, col]
            mode_val = Counter(non_missing).most_common(1)[0][0]
            
            X_train[mask, col] = mode_val
            test_mask = np.array([isinstance(x, float) and np.isnan(x) for x in X_test[:, col]])
            X_test[test_mask, col] = mode_val
    
    # Handle missing values for numerical features (use label-conditioned mean)
    for col in numerical_features:
        for label in ['+', '-']:
            # Find missing values for this label
            label_mask_train = y_train == label
            missing_mask_train = np.array([isinstance(x, float) and np.isnan(x) for x in X_train[:, col]])
            train_mask = label_mask_train & missing_mask_train
            
            label_mask_test = y_test == label
            missing_mask_test = np.array([isinstance(x, float) and np.isnan(x) for x in X_test[:, col]])
            test_mask = label_mask_test & missing_mask_test
            
            # Compute mean for this label
            label_data = X_train[label_mask_train, col]
            label_data_float = np.array([float(x) for x in label_data if not (isinstance(x, float) and np.isnan(x))])
            label_mean = np.mean(label_data_float)
            
            X_train[train_mask, col] = label_mean
            X_test[test_mask, col] = label_mean
    
    # Convert numerical features to float (categorical features stay as strings)
    for col in numerical_features:
        X_train[:, col] = X_train[:, col].astype(float)
        X_test[:, col] = X_test[:, col].astype(float)
    
    # Encode labels: '+' -> 1, '-' -> 0
    y_train = np.array([1 if label == '+' else 0 for label in y_train])
    y_test = np.array([1 if label == '+' else 0 for label in y_test])
    
    return X_train, y_train, X_test, y_test


def z_normalize(X_train, X_test, feature_indices):
    """
    Apply z-score normalization to specified features.
    
    Parameters:
    -----------
    X_train, X_test : numpy arrays
    feature_indices : list of indices for numerical features
    
    Returns:
    --------
    X_train_normalized, X_test_normalized : numpy arrays
    """
    X_train_normalized = X_train.copy()
    X_test_normalized = X_test.copy()
    
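    for col in feature_indices:
        # Cast to float first: columns of a mixed-type (object) array hold Python objects
        train_col = X_train[:, col].astype(float)
        test_col = X_test[:, col].astype(float)
        mean = np.mean(train_col)
        std = np.std(train_col)
        
        # Scale test data with the training statistics; skip constant columns
        if std > 0:
            X_train_normalized[:, col] = (train_col - mean) / std
            X_test_normalized[:, col] = (test_col - mean) / std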
    
    return X_train_normalized, X_test_normalized

## Task 2: Implement k-NN Classifier
Implement k-NN with L2 (Euclidean) distance:
$$\mathcal{D}_{L2}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}$$

For **categorical attributes**, use:
- Distance = 1 if values are different
- Distance = 0 if values are the same
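
One way to read these two rules together, and the one the implementation in this notebook appears to follow, is to add each categorical mismatch as a 0/1 term under the same square root as the numerical squared differences. The index sets $\text{num}$ and $\text{cat}$ are notation introduced here for illustration:

$$\mathcal{D}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i \in \text{num}} (a_i - b_i)^2 + \sum_{i \in \text{cat}} \mathbb{1}[a_i \neq b_i]}$$

For example, with $\mathbf{a} = (\text{b},\ 0.5,\ \text{t})$ and $\mathbf{b} = (\text{a},\ 2.5,\ \text{t})$, the distance is $\sqrt{1 + (0.5 - 2.5)^2 + 0} = \sqrt{5} \approx 2.236$.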


In [9]:
def l2_distance(a, b):
    """
    Compute the L2 (Euclidean) distance between two vectors.
    String-valued (categorical) entries contribute 1 if they differ and 0 if they match.
    
    Parameters:
    -----------
    a, b : numpy arrays of same shape
    
    Returns:
    --------
    distance : float
    """
    a = np.array(a)
    b = np.array(b)

    squared_distance = 0.0
    for i in range(len(a)):
        # Check if categorical (string) or numerical
        if isinstance(a[i], str) or isinstance(b[i], str):
            # Categorical: 1 if different, 0 if same
            squared_distance += 1.0 if a[i] != b[i] else 0.0
        else:
            # Numerical
            squared_distance += (a[i] - b[i]) ** 2
    
    return np.sqrt(squared_distance)



def knn_predict(X_train, y_train, X_test, k):
    """
    Predict labels for test data using k-NN.
    
    Parameters:
    -----------
    X_train : numpy array of shape (n_train, n_features)
    y_train : numpy array of shape (n_train,)
    X_test : numpy array of shape (n_test, n_features)
    k : int, number of neighbors
    
    Returns:
    --------
    predictions : numpy array of shape (n_test,)
    """

    predictions = []

    # For each test sample:
    for row in X_test:
        # 1. Compute distance to all training samples
        distances = np.array([l2_distance(sample, row) for sample in X_train])
        # 2. Find the k nearest neighbors (indices of the k smallest distances)
        k_nearest_indices = np.argsort(distances)[:k]
        k_nearest_labels = y_train[k_nearest_indices]

        # 3. Predict using majority voting: Counter -> most common (label, count) pair -> label
        prediction = Counter(k_nearest_labels).most_common(1)[0][0]
        predictions.append(prediction)

    predictions = np.array(predictions)
    return predictions



def compute_accuracy(y_true, y_pred):
    """
    Compute classification accuracy.
    
    Returns:
    --------
    accuracy : float (between 0 and 1)
    """

    # Compare labels element-wise in a list comprehension, then divide the matches by the total
    return np.sum(np.array([y_true[i] == y_pred[i] for i in range(len(y_true))])) / len(y_true)
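
A note on ties in the majority vote: with an even k, two labels can receive the same number of votes. The sketch below, using made-up neighbor labels, shows how `Counter.most_common` behaves in that case. In Python 3.7+ elements with equal counts are returned in the order first encountered, and since the labels are sorted from nearest to farthest, a tie goes to the label seen first. One consequence is that k=2 always gives the same prediction as k=1, which is consistent with the identical k=1 and k=2 accuracies in the Credit Approval output.

```python
from collections import Counter

# Labels of the nearest neighbors, already sorted from nearest to farthest (toy values).
nearest_labels = [1, 0, 0, 1, 1]

votes = {}
for k in (1, 2, 3, 4):
    # Majority vote over the k nearest labels; ties go to the label encountered first
    votes[k] = Counter(nearest_labels[:k]).most_common(1)[0][0]
    print(k, votes[k])
```

If a different tie rule is wanted (for example, falling back to k-1 neighbors), it would have to be added explicitly; that is a design choice, not something the assignment text specifies.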

## Task 3: Evaluate on Lenses Dataset
Test your k-NN implementation on the Lenses dataset for different values of k.


In [None]:
# Evaluate k-NN on Lenses dataset
# Try different values of k (e.g., 1, 3, 5, 7)

k_values = [1, 3, 5, 7]
lenses_results = []

for k in k_values:
    predictions = knn_predict(X_train_lenses, y_train_lenses, X_test_lenses, k)
    accuracy = compute_accuracy(y_test_lenses, predictions)
    lenses_results.append((k, accuracy))
    print(f"k={k}: Accuracy = {accuracy:.4f}")

[3. 3. 3. 3. 3. 3.]


## Task 4: Evaluate on Credit Approval Dataset
First preprocess the data, then evaluate k-NN.


In [11]:
# Preprocess Credit Approval data
X_train_credit, y_train_credit, X_test_credit, y_test_credit = preprocess_credit_data(
    DATA_PATH + "crx.data.training",
    DATA_PATH + "crx.data.testing"
)
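
The cell above returns the features unscaled, and the evaluation cell as shown does not apply z-scaling before calling `knn_predict`. The sketch below shows one way the scaling step from Task 1 might be applied to a mixed-type array. `zscore_columns` is a hypothetical helper written for this example, and the toy rows stand in for the preprocessed credit data. Applying such a step would change the accuracies, and those numbers are not reported here.

```python
import numpy as np

def zscore_columns(X_train, X_test, cols):
    """Hypothetical helper: z-scale the given columns using training statistics only."""
    X_train, X_test = X_train.copy(), X_test.copy()
    for c in cols:
        train_col = X_train[:, c].astype(float)
        mu, sigma = train_col.mean(), train_col.std()
        if sigma > 0:  # leave constant columns untouched
            X_train[:, c] = (train_col - mu) / sigma
            X_test[:, c] = (X_test[:, c].astype(float) - mu) / sigma
    return X_train, X_test

# Toy mixed-type rows (object dtype): column 0 is categorical, column 1 is numerical.
X_tr = np.array([["a", 10.0], ["b", 20.0], ["a", 30.0]], dtype=object)
X_te = np.array([["b", 20.0], ["a", 40.0]], dtype=object)

X_tr_z, X_te_z = zscore_columns(X_tr, X_te, cols=[1])
print(X_tr_z[:, 1])  # training column now has mean 0 and standard deviation 1
print(X_te_z[:, 1])  # test column scaled with the training mean and std
```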

In [18]:
# Evaluate k-NN on Credit Approval dataset
credit_results = []

for k in range(1, 9):
    predictions = knn_predict(X_train_credit, y_train_credit, X_test_credit, k)
    accuracy = compute_accuracy(y_test_credit, predictions)
    credit_results.append((k, accuracy))
    print(f"k={k}: Accuracy = {accuracy:.4f}")

k=1: Accuracy = 0.6522
k=2: Accuracy = 0.6522
k=3: Accuracy = 0.6812
k=4: Accuracy = 0.6739
k=5: Accuracy = 0.6667
k=6: Accuracy = 0.6812
k=7: Accuracy = 0.6304
k=8: Accuracy = 0.6377


## Summary and Discussion

### Results Table

| Dataset | k=1 | k=3 | k=5 | k=7 |
|---------|-----|-----|-----|-----|
| Lenses | 1 | 1 | 0.5 | 0.83 |
| Credit Approval | 0.6522 | 0.6812 | 0.6667 | 0.6304 |

### Discussion
*Answer these questions:*
1. **Which value of k works best for each dataset? Why do you think that is?**
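Lenses reaches an accuracy of 1 at both k=1 and k=3, and Credit Approval peaks at 0.6812 for k=3 (tied with k=6). The Lenses test set has only 6 samples, so each misclassification shifts accuracy by about 0.17 and the differences between k values are noisy.

As for Credit Approval, I am not too sure, and the accuracy is quite disappointing. Perhaps the treatment of categorical features alongside numerical ones (a 0/1 mismatch distance) was not a perfect translation. One possible cause is that the Task 4 cell as shown never calls `z_normalize`, so unscaled numerical features may dominate the 0/1 categorical terms. I would also guess that the optimal k grows with the size of the dataset.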

2. **How did preprocessing affect your results on the Credit Approval dataset?**
After preprocessing, the Credit Approval dataset is complete, with no NaN values. Missing-value imputation also preserved class-specific patterns (label-conditioned mean for numerical features).

Without preprocessing, the dataset would be unusable (numerical distances cannot be computed on NaN or on unparsed strings). So preprocessing didn't artificially inflate accuracy; it just made the data usable while preserving its structure.

3. **What are the trade-offs of using different values of k?**
The main trade-off is between bias and variance. With a low k, there is low bias but higher variance, since the prediction can overfit to the few points closest to the test point. Conversely, as k increases, we begin to underfit, since the predicted label approaches the majority class of the training data.
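
The large-k end of this trade-off can be illustrated with a small sketch on made-up 1-D data: when k equals the size of the training set, every test point receives the majority class, no matter where it lies. `knn_1d` is a hypothetical helper for this example only and uses absolute difference as the distance.

```python
import numpy as np
from collections import Counter

# Toy 1-D training set: class 0 is the majority (5 of 8 points).
X_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1])
X_test = np.array([10.5, 11.5])  # both sit inside the class-1 cluster

def knn_1d(x, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    order = np.argsort(np.abs(X_train - x))
    return Counter(y_train[order[:k]].tolist()).most_common(1)[0][0]

preds = {}
for k in (1, 3, len(X_train)):
    preds[k] = [knn_1d(x, k) for x in X_test]
    print(k, preds[k])
```

With k=1 and k=3 both test points are labeled 1, matching their local cluster; with k=8 the vote covers the whole training set and both are labeled 0.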

4. **What did you learn from this exercise?**
I learned how to become comfortable with the low-level manipulation and transformation of arrays.
Using list comprehensions and conditional expressions really forces you to visualize the whole operation, since you aren't breaking it down step by step in distinct for loops.

In the age of AI, where my work may perhaps be replaced by AI, I begin to wonder about the importance of low-level coding. The utility comes not from the code I write, but from the understanding I gain from writing it.
