# Neighbor-Based Predictions: K-Nearest Neighbors (KNN)

## Neighbor-Based Learning

In previous sessions, we used tree-based models (Random Forest, XGBoost), which split the data with learned rules.
Now we introduce **K-Nearest Neighbors (KNN)**, an *instance-based* learner that assumes similar customers have similar spending patterns.

**Dataset**: Malaysia Customer Transactions 2025 — predicting `total_transaction_amount`.

**How KNN Works:**
1. **Store** all training data.
2. For a new prediction, calculate the **distance** to all training points.
3. Find the **K** nearest neighbors.
4. **Average** their target values (for Regression) as the prediction.
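
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy values, not the scikit-learn implementation; the function name `knn_predict` and the data are assumptions for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one target value as the mean of the k nearest training targets."""
    # Step 2: Euclidean distance from the query to every stored training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: average the neighbours' target values
    return y_train[nearest].mean()

# Step 1: "training" is just storing the data (toy values, not from the dataset)
X_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
y_train = np.array([100.0, 200.0, 300.0, 1000.0])

# The three closest points are the first three rows, so the far-away
# fourth row (target 1000) does not influence the prediction.
prediction = knn_predict(X_train, y_train, np.array([2.0, 2.5]), k=3)
print(prediction)
```

Because nothing is fitted in advance, all the work happens at prediction time: every query is compared against every stored training point.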

**Key Difference from Trees:**
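- KNN is sensitive to the **scale** of the features, because every prediction depends on distances between points.
- Trees are invariant to feature scale, because a split depends only on the ordering of values within a feature.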

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn tools
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Data Loading & Preprocessing

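**Critical Step: Scaling**
Unlike XGBoost, KNN is highly sensitive to feature scale, so scaling is effectively required.
Consider two features:
- `region_encoded`: range 0–15 (16 Malaysian states and federal territories)
- `number_of_purchases`: range 0–25

Without scaling, the feature with the larger numeric spread contributes more to every distance, regardless of how informative it is. We use `StandardScaler` to standardise each feature to mean 0 and variance 1.

Note that `LabelEncoder` assigns region codes in an arbitrary (alphabetical) order, so KNN treats regions with adjacent codes as "close". One-hot encoding avoids this implied ordering.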
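
To make the effect concrete, here is a small sketch with an exaggerated scale gap. The column `annual_income_myr` and all values are hypothetical and are not part of the dataset; the standardisation is done by hand with the same formula `StandardScaler` applies (subtract the mean, divide by the population standard deviation).

```python
import numpy as np

# Hypothetical customers: [number_of_purchases, annual_income_myr]
X = np.array([
    [2.0, 50_000.0],
    [24.0, 50_500.0],
    [3.0, 58_000.0],
    [12.0, 90_000.0],
])
query = np.array([3.0, 51_000.0])

def nearest_index(points, q):
    """Index of the row in `points` with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Standardise with statistics computed from X only, then reuse them for the query
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
query_scaled = (query - mean) / std

nearest_raw = nearest_index(X, query)                     # income dominates the distance
nearest_scaled = nearest_index(X_scaled, query_scaled)    # both features count

print(f"Nearest neighbour without scaling: row {nearest_raw}")
print(f"Nearest neighbour with scaling:    row {nearest_scaled}")
```

Without scaling, the query is matched to the customer with 24 purchases because their incomes differ by only MYR 500. After scaling, the match is the customer with 2 purchases, who is similar on both features.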

In [None]:
# Load data
df = pd.read_csv("dummy_malaysia_customer_transactions_2025.csv")

# Encode categorical features
le_region = LabelEncoder()
le_quarter = LabelEncoder()
df['region_encoded'] = le_region.fit_transform(df['region'].fillna('Unknown'))
df['quarter_encoded'] = le_quarter.fit_transform(df['quarter'].fillna('Unknown'))

# Select Features and Target
features = ['region_encoded', 'quarter_encoded', 'number_of_purchases']
target = 'total_transaction_amount'

data = df[features + [target]].dropna()
X = data[features]
y = data[target]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training Data Shape: {X_train.shape}")
print("Data has been scaled.")

## 2. Baseline KNN Model

We train a standard KNN Regressor.
Key parameter:
- `n_neighbors`: The 'K' in KNN. Number of neighbors to use.

In [None]:
# Initialize model with K=3 (scikit-learn's default is n_neighbors=5)
knn_model = KNeighborsRegressor(n_neighbors=3)

# Train (on SCALED data)
knn_model.fit(X_train_scaled, y_train)

# Predict
y_pred = knn_model.predict(X_test_scaled)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("--- Baseline KNN (K=3) ---")
print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.4f}")

## 3. The Effect of 'K' (Overfitting vs Underfitting)

The choice of K dramatically changes the model:
- **Small K (e.g., 1)**: Captures noise — **Overfitting** (High Variance).
- **Large K**: Averages too many points — **Underfitting** (High Bias).

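Let's visualize how the test error changes as we increase K. This plot is for illustration only: choosing K from test-set error would leak test information, which is why Section 4 selects hyperparameters with cross-validation.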
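
The two extremes can also be checked on synthetic data. The sketch below assumes a noisy 1-D sine signal (illustrative only, unrelated to the transactions dataset) and a hypothetical NumPy helper `knn_predict`. With K=1 every training point is its own nearest neighbour, so the training error is zero; with K equal to the training-set size, every prediction collapses to the mean of the training targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (illustrative only): a smooth signal plus noise
X_tr = np.sort(rng.uniform(0, 10, size=80))
y_tr = np.sin(X_tr) + rng.normal(0, 0.3, size=80)
X_te = rng.uniform(0, 10, size=40)
y_te = np.sin(X_te) + rng.normal(0, 0.3, size=40)

def knn_predict(X_train, y_train, X_query, k):
    """Uniform-weight KNN regression for 1-D inputs."""
    # Pairwise absolute distances, shape (n_query, n_train)
    dists = np.abs(X_query[:, None] - X_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for k in (1, 10, 80):
    train_mse = mse(y_tr, knn_predict(X_tr, y_tr, X_tr, k))
    test_mse = mse(y_te, knn_predict(X_tr, y_tr, X_te, k))
    results[k] = (train_mse, test_mse)
    print(f"K={k:>2}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A large gap between training and test error at K=1 is the signature of overfitting, while similar but high errors at very large K indicate underfitting.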

In [None]:
k_values = range(1, 21)
mse_scores = []

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_p = knn.predict(X_test_scaled)
    mse_scores.append(mean_squared_error(y_test, y_p))

plt.figure(figsize=(10, 6))
plt.plot(k_values, mse_scores, marker='o', linestyle='--')
plt.title('MSE vs. n_neighbors (K) — Customer Transactions')
plt.xlabel('K')
plt.ylabel('Mean Squared Error')
plt.xticks(k_values)
plt.grid(True)
plt.show()

## 4. Hyperparameter Tuning

We can tune:
- `n_neighbors`: The number of neighbors.
- `weights`: 'uniform' (all neighbors equal) vs 'distance' (closer neighbors have more say).
- `p`: Power parameter of the Minkowski distance. 1 = Manhattan (L1), 2 = Euclidean (L2).
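
A short NumPy sketch of what `p` and `weights` change, using toy values chosen for easy arithmetic. The helper name `minkowski` is an assumption for this example; scikit-learn computes these quantities internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
manhattan = minkowski(a, b, p=1)   # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)

# Two neighbours at distances 1 and 3 with targets 10 and 20 (toy values)
dists = np.array([1.0, 3.0])
targets = np.array([10.0, 20.0])

# weights='uniform': plain average of the neighbours' targets
uniform_pred = float(targets.mean())

# weights='distance': each neighbour is weighted by 1 / distance
inv = 1.0 / dists
distance_pred = float(np.sum(inv * targets) / np.sum(inv))

print(manhattan, euclidean)
print(uniform_pred, distance_pred)
```

With distance weighting, the prediction moves towards the target of the closer neighbour (12.5 instead of 15.0 here).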

In [None]:
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 15],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

knn_tuned = KNeighborsRegressor()
grid_search = GridSearchCV(
    estimator=knn_tuned,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    verbose=1,
)
grid_search.fit(X_train_scaled, y_train)

best_knn = grid_search.best_estimator_

print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV MSE: {-grid_search.best_score_:.2f}")
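
One refinement worth considering: when the data is scaled before cross-validation, each validation fold has already influenced the scaler's mean and variance. Wrapping the scaler and the model in a `Pipeline` re-fits the scaler inside every fold. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions for the example); in this notebook you would pass the unscaled `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 25, size=(200, 3))
y_demo = 50.0 * X_demo[:, 2] + rng.normal(0, 20, size=200)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)  # unscaled input: the pipeline scales within each fold

print(f"Best Params: {search.best_params_}")
print(f"Best CV MSE: {-search.best_score_:.2f}")
```

The fitted `search.best_estimator_` is itself a pipeline, so it can predict directly on unscaled test data.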

## 5. Visualization: Actual vs Predicted

KNN has no built-in feature importance, unlike tree-based models. Instead, we assess performance by how close the predictions are to the actual values.
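
If a rough feature ranking is still wanted, a model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the score drops. This is a swapped-in technique, not something KNN provides itself. The sketch below uses synthetic data in which only the first column matters (all names and values are assumptions for the example).

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (illustrative only): the target depends on
# column 0 and ignores column 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=300)

X_fit, X_val = X_demo[:200], X_demo[200:]
y_fit, y_val = y_demo[:200], y_demo[200:]

model = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit)

# Shuffle one column at a time and record the drop in the model's R2 score
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

for name, score in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: mean R2 drop = {score:.3f}")
```

In the notebook, the same call could be applied to `best_knn` with `X_test_scaled` and `y_test`.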

In [None]:
final_pred = best_knn.predict(X_test_scaled)

plt.figure(figsize=(8, 8))
sns.scatterplot(x=y_test, y=final_pred, alpha=0.6)
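lims = [min(y_test.min(), final_pred.min()), max(y_test.max(), final_pred.max())]
plt.plot(lims, lims, 'r--', linewidth=2)  # Identity line (perfect predictions), spanning the plotted points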
plt.xlabel('Actual Transaction Amount (MYR)')
plt.ylabel('Predicted Transaction Amount (MYR)')
plt.title('Actual vs Predicted (KNN) — Customer Transactions')
plt.show()

final_r2 = r2_score(y_test, final_pred)
print(f"Best KNN R2: {final_r2:.4f}")

## 6. Exercises

Challenge yourself with these tasks!

In [None]:
# Task 1: The Effect of Scaling
# Train a KNN model (K=5) on the UN-SCALED data (X_train, X_test).
# Compare the MSE with the scaled version. How big is the difference?

# Your code here

In [None]:
# Task 2: Manual Inference
# Pick a random row from X_test_scaled.
# Find its nearest neighbor in X_train_scaled manually using numpy (Euclidean distance).
# Does that neighbor's target value match the prediction of a K=1 model?
# Why can it differ from the prediction of best_knn?

# Your code here

## Summary

You've learned **K-Nearest Neighbors**!
1. **Similarity-based**: Predicts based on what is close.
2. **Scaling is Key**: Distances are distorted without proper scaling.
3. **Choice of K**: Balances noise (small K) vs smoothness (large K).