# Day 52 – K-Nearest Neighbors (KNN) Classification

## Introduction

In this notebook, I focus on **K-Nearest Neighbors (KNN) Classification**. KNN is a distance-based algorithm that predicts the class of a data point by looking at the majority class of its nearest neighbors.

I begin with a short overview of the theory: how KNN works, why scaling matters, and how the choice of `k` affects the model. Then I implement KNN on a dataset, comparing `k` = 3, 4, and 5 under two distance metrics on scaled features, plus one unscaled baseline.

By the end of this notebook, it becomes clear how scaling affects KNN, how accuracy changes with different `k`, and why selecting the right `k` is important for building a reliable classifier.

---

## What is Classification?

* **Classification** is a supervised learning technique where the goal is to predict the **category (class label)** of given input data.
* Examples:

  * Spam vs. Not Spam (emails)
  * Disease vs. No Disease (medical test results)
  * Will Buy vs. Will Not Buy (customer behavior)

---

## What is KNN Classification?

**K-Nearest Neighbors (KNN)** classifies a data point based on how its **neighbors** are classified. It is a **non-parametric**, **lazy learning** algorithm:

* **Non-parametric** means it makes no assumptions about the underlying data distribution.
* **Lazy learning** means it doesn't build an explicit model during the training phase. Instead, it stores the training dataset and defers the computation until a new data point needs a prediction.


## How it Works: The Core Idea

The fundamental principle of KNN is that "similar things are near each other." To classify a new, unseen data point, the KNN algorithm follows these steps:

1.  **Distance Calculation**: It calculates the distance between the new data point and every point in the training dataset.
    * Common distance metrics:
      * Euclidean Distance
      * Manhattan Distance
      * Minkowski Distance
2.  **Find the K-Nearest Neighbors**: It identifies the `K` data points that are closest to the new point (i.e., those with the smallest distances). The value of `K` is a hyperparameter you choose.
3.  **Predict the Class**: For a classifier, the predicted class for the new data point is determined by a **majority vote** among its `K` nearest neighbors. The new data point is assigned the class that is most common among its neighbors.
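
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN. The function name `knn_predict` and the toy arrays `X_toy` and `y_toy` are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.8, 1.6]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([6.2, 6.4]), k=3)
print(pred_low, pred_high)  # prints: 0 1
```

Every prediction scans the whole training set, which is why the cost of KNN grows with the amount of stored data.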


## Choosing the Hyperparameter `K`

The choice of `K` is crucial and can significantly impact the model's performance:
* **Small `K`** (e.g., K=1): The model is very sensitive to noise in the data, which can lead to **high variance** and **overfitting**.
* **Large `K`**: The model becomes overly generalized and might miss fine-grained patterns, which can lead to **high bias** and **underfitting**.
* The best `K` is usually found with **cross-validation**: several candidate values are scored on held-out folds and the best-scoring one is kept.
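
One possible way to pick `K` by cross-validation is sketched here. The data is a synthetic stand-in generated with `make_classification` (an assumption, not this notebook's dataset), and the candidate range of odd values from 1 to 15 is arbitrary. The scaler sits inside a pipeline so that each fold is scaled using only its own training portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 2 features, binary target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

cv_scores = {}
for k in range(1, 16, 2):  # odd values of k from 1 to 15
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```

The winning `k` depends on the data and the folds, so the value found on the synthetic data says nothing about the best `k` for another dataset.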


## Distance Metrics

The choice of distance metric determines how the algorithm measures "nearness." Some common metrics include:

* **Euclidean Distance**: The most common metric. It is the straight-line distance between two points and works well with dense, continuous data. scikit-learn's `KNeighborsClassifier` uses it by default (Minkowski with `p=2`).
* **Manhattan Distance**: Also known as "city block distance," it sums the absolute differences of the coordinates (Minkowski with `p=1`).
* **Minkowski Distance**: A generalization with an exponent `p`: `p=1` gives Manhattan and `p=2` gives Euclidean. This is the `p` parameter varied in the models later in this notebook.
* **Cosine Distance**: One minus the cosine of the angle between two vectors. It is particularly useful for high-dimensional data, such as document comparison in natural language processing (NLP), because it depends on orientation rather than magnitude.
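
To make the metrics concrete, here is a small sketch that computes each of them by hand for two made-up vectors `a` and `b`. The helper `minkowski` is written for this example only. It also shows that Minkowski distance with `p=1` and `p=2` reproduces Manhattan and Euclidean distance.

```python
import numpy as np

# Two hypothetical points in 3-D space
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(a - b))           # 3 + 4 + 0 = 7.0

def minkowski(u, v, p):
    """Minkowski distance of order p between vectors u and v."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# Cosine distance = 1 - cosine similarity
cosine_distance = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
print("Minkowski p=1:", minkowski(a, b, 1), "p=2:", minkowski(a, b, 2))
print("Cosine distance:", round(cosine_distance, 4))
```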

---

## Key Characteristics

* **No Training Phase**: The "training" phase for KNN is just storing the dataset.
* **Feature Scaling is Crucial**: Since KNN relies on distance, features with larger scales can dominate the distance calculation. Therefore, **feature scaling (standardization or normalization)** is a vital preprocessing step for KNN.
* **Computational Cost**: KNN can be computationally expensive for very large datasets, as it needs to calculate the distance to every data point for each new prediction.
* **Simple and Interpretable**: The algorithm's logic is easy to understand, making its decisions highly interpretable.

---

## Advantages of KNN

* Simple and intuitive.
* Works well with smaller datasets.
* No assumption about data distribution (non-parametric).

## Limitations of KNN

* **Computationally expensive** for large datasets (distance calculation for each prediction).
* Sensitive to irrelevant features and feature scaling.
* Choosing the right **value of K** is crucial.

---

## Why Is Feature Scaling Important in KNN?

* KNN relies on **distance measures**.
* Features with larger ranges dominate the distance calculation.
* Example: Age (20–60) vs. Salary (30,000–150,000). A salary difference of 1,000 outweighs the entire 40-year age range in an unscaled distance.
* To fix this → apply **Normalization or Standardization** before KNN.
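
The effect can be shown with three made-up people `A`, `B`, and `C`. The sketch below uses min-max normalization, and it assumes the Age and Salary ranges quoted above as the minimum and maximum of each feature. Before scaling, `B` looks closest to `A` even though their ages differ by 30 years. After scaling, `C` is closest.

```python
import numpy as np

# Hypothetical people as [Age, Salary]
A = np.array([25.0, 50_000.0])
B = np.array([55.0, 52_000.0])   # 30 years older, salary differs by 2,000
C = np.array([27.0, 60_000.0])   # 2 years older, salary differs by 10,000

def dist(u, v):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(u - v))

raw_AB, raw_AC = dist(A, B), dist(A, C)

# Min-max normalization using the ranges from the example above (assumed bounds)
lo = np.array([20.0, 30_000.0])
hi = np.array([60.0, 150_000.0])

def minmax(x):
    """Rescale each feature to the [0, 1] interval."""
    return (x - lo) / (hi - lo)

scaled_AB = dist(minmax(A), minmax(B))
scaled_AC = dist(minmax(A), minmax(C))

print("raw:    A-B =", round(raw_AB, 1), " A-C =", round(raw_AC, 1))
print("scaled: A-B =", round(scaled_AB, 3), " A-C =", round(scaled_AC, 3))
```

In the raw distances the age difference is almost invisible next to the salary difference. After rescaling, both features contribute on comparable terms.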

---

## Evaluation Metrics for Classification

When evaluating KNN classification, common metrics include:

* **Confusion Matrix**
* **Accuracy Score**
* **Precision, Recall, F1 Score**
* **ROC Curve and AUC**
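
As a quick illustration of these metrics, the sketch below evaluates a set of made-up labels and predictions with scikit-learn. The arrays `y_true`, `y_hat`, and `y_score` are hypothetical and are not results from this notebook. ROC-AUC needs a score or probability per sample rather than a hard label, which is what `y_score` stands in for.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat   = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7, 0.95, 0.05]

cm = confusion_matrix(y_true, y_hat)   # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)  # tp / (tp + fp)
rec = recall_score(y_true, y_hat)      # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)
auc = roc_auc_score(y_true, y_score)

print(cm)
print("accuracy:", acc, "precision:", prec, "recall:", rec, "f1:", f1, "auc:", auc)
```

With a fitted `KNeighborsClassifier`, scores of this kind could come from `predict_proba`, which reports the fraction of neighbors voting for each class.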

---

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

## Load the dataset

In [2]:
dataset = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\logit classification.csv")

In [3]:
dataset

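| Unnamed: 0 | User ID | Gender | Age | EstimatedSalary | Purchased |
| --- | --- | --- | --- | --- | --- |
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 15755018 | Male | 36 | 33000 | 0 |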


## Feature Selection
Split into features (X) and target (y)
- X: Features (Age, EstimatedSalary)
- y: Target (Purchased)

In [4]:
X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

## Splitting the dataset into the Training set and Test set

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

## Feature Scaling
### Apply StandardScaler

In [6]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
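
The scaler is fitted on the training set only, and the test set is then transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. The sketch below checks this on small made-up arrays. `X_tr` and `X_te` are hypothetical and are not this notebook's split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Age, EstimatedSalary] rows
X_tr = np.array([[20.0, 30_000.0],
                 [35.0, 60_000.0],
                 [50.0, 90_000.0],
                 [45.0, 120_000.0]])
X_te = np.array([[30.0, 75_000.0]])

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)   # learns mean and std from the training rows
X_te_sc = scaler.transform(X_te)       # reuses those training statistics

# The same transformation written out by hand
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print("train means after scaling:", X_tr_sc.mean(axis=0).round(6))
print("train stds after scaling: ", X_tr_sc.std(axis=0).round(6))
print("test row, scaler vs manual:", X_te_sc, manual)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row generally does not, because it is measured against the training statistics.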

## Train the KNN Model with different parameters

### Model 1: k=3, p=1 (Manhattan)

In [7]:
model1 = KNeighborsClassifier(n_neighbors=3, p=1)
model1.fit(X_train_sc, y_train)
y_pred1 = model1.predict(X_test_sc)
cm1 = confusion_matrix(y_test, y_pred1)
print("Model 1: k = 3, p = 1 (Manhattan) -", accuracy_score(y_test,y_pred1))
print("Confusion Matrix of Model 1:\n", cm1)

Model 1: k = 3, p = 1 (Manhattan) - 0.93
Confusion Matrix of Model 1:
 [[64  4]
 [ 3 29]]


### Model 2: k=4, p=1 (Manhattan)

In [8]:
model2 = KNeighborsClassifier(n_neighbors=4, p=1)
model2.fit(X_train_sc, y_train)
y_pred2 = model2.predict(X_test_sc)
cm2 = confusion_matrix(y_test, y_pred2)
print("Model 2: k = 4, p = 1 (Manhattan) -", accuracy_score(y_test,y_pred2))
print("Confusion Matrix of Model 2:\n", cm2)

Model 2: k = 4, p = 1 (Manhattan) - 0.93
Confusion Matrix of Model 2:
 [[64  4]
 [ 3 29]]


### Model 3: k=5, p=1 (Manhattan)

In [9]:
model3 = KNeighborsClassifier(n_neighbors=5, p=1)
model3.fit(X_train_sc, y_train)
y_pred3 = model3.predict(X_test_sc)
cm3 = confusion_matrix(y_test, y_pred3)
print("Model 3: k = 5, p = 1 (Manhattan) -", accuracy_score(y_test,y_pred3))
print("Confusion Matrix of Model 3:\n", cm3)

Model 3: k = 5, p = 1 (Manhattan) - 0.93
Confusion Matrix of Model 3:
 [[64  4]
 [ 3 29]]


### Model 4: k=3, p=2 (Euclidean)

In [10]:
model4 = KNeighborsClassifier(n_neighbors=3, p=2)
model4.fit(X_train_sc, y_train)
y_pred4 = model4.predict(X_test_sc)
cm4 = confusion_matrix(y_test, y_pred4)
print("Model 4: k = 3, p = 2 (Euclidean) -", accuracy_score(y_test,y_pred4))
print("Confusion Matrix of Model 4:\n", cm4)

Model 4: k = 3, p = 2 (Euclidean) - 0.93
Confusion Matrix of Model 4:
 [[64  4]
 [ 3 29]]


### Model 5: k=4, p=2 (Euclidean)

In [11]:
model5 = KNeighborsClassifier(n_neighbors=4, p=2)
model5.fit(X_train_sc, y_train)
y_pred5 = model5.predict(X_test_sc)
cm5 = confusion_matrix(y_test, y_pred5)
print("Model 5: k = 4, p = 2 (Euclidean) -", accuracy_score(y_test,y_pred5))
print("Confusion Matrix of Model 5:\n", cm5)

Model 5: k = 4, p = 2 (Euclidean) - 0.92
Confusion Matrix of Model 5:
 [[64  4]
 [ 4 28]]


### Model 6: k=5, p=2 (Euclidean)

In [12]:
model6 = KNeighborsClassifier(n_neighbors=5, p=2)
model6.fit(X_train_sc, y_train)
y_pred6 = model6.predict(X_test_sc)
cm6 = confusion_matrix(y_test, y_pred6)
print("Model 6: k = 5, p = 2 (Euclidean) -", accuracy_score(y_test,y_pred6))
print("Confusion Matrix of Model 6:\n", cm6)

Model 6: k = 5, p = 2 (Euclidean) - 0.93
Confusion Matrix of Model 6:
 [[64  4]
 [ 3 29]]


### Model 7: k=3, p=1 (Manhattan), without scaling

In [13]:
model7 = KNeighborsClassifier(n_neighbors = 3, p=1)
model7.fit(X_train, y_train)
y_pred7 = model7.predict(X_test)
cm7 = confusion_matrix(y_test, y_pred7)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred7))
print("Confusion Matrix of Model 7:\n", cm7)

Accuracy without scaling: 0.78
Confusion Matrix of Model 7:
 [[57 11]
 [11 21]]


## Results Comparison Table

In [14]:
results_df = pd.DataFrame({
    "Model": ["1","2","3","4","5","6","(Without Scaling) 7"],
    "k": [3,4,5,3,4,5,3],
    "Distance Metric (p)": [1,1,1,2,2,2,1],
    "Accuracy": [
        accuracy_score(y_test, y_pred1),
        accuracy_score(y_test, y_pred2),
        accuracy_score(y_test, y_pred3),
        accuracy_score(y_test, y_pred4),
        accuracy_score(y_test, y_pred5),
        accuracy_score(y_test, y_pred6),
        accuracy_score(y_test, y_pred7)]})

print("Comparison Table:\n")
print(results_df)

Comparison Table:

                 Model  k  Distance Metric (p)  Accuracy
0                    1  3                    1      0.93
1                    2  4                    1      0.93
2                    3  5                    1      0.93
3                    4  3                    2      0.93
4                    5  4                    2      0.92
5                    6  5                    2      0.93
6  (Without Scaling) 7  3                    1      0.78
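
The six scaled models differ only in `k` and `p`, so the same comparison could be written as a loop. The sketch below shows one way to do that. It runs on synthetic stand-in data from `make_classification` (an assumption, since the CSV file is not available here), so its accuracies will not match the table above.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler()
Xtr_sc = scaler.fit_transform(Xtr)
Xte_sc = scaler.transform(Xte)

rows = []
for p, k in product([1, 2], [3, 4, 5]):   # same (p, k) grid as Models 1 to 6
    model = KNeighborsClassifier(n_neighbors=k, p=p)
    model.fit(Xtr_sc, ytr)
    rows.append({"k": k,
                 "Distance Metric (p)": p,
                 "Accuracy": accuracy_score(yte, model.predict(Xte_sc))})

grid_df = pd.DataFrame(rows)
print(grid_df)
```

A loop like this keeps the settings and the results table in sync, and adding another `k` or metric is a one-line change.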


## Comparison of K Values

| **K Value / Condition** | **Accuracy** | **Confusion Matrix** | **Interpretation** |
| --- | --- | --- | --- |
| **k = 3 (with scaling)** | **0.93** (p=1 and p=2) | `[[64, 4], [3, 29]]` | Very good accuracy: 93 of 100 test samples are classified correctly, with 4 false positives and 3 false negatives. Both distance metrics give the same result. |
| **k = 4 (with scaling)** | **0.93** (p=1), **0.92** (p=2) | `[[64, 4], [3, 29]]` (p=1), `[[64, 4], [4, 28]]` (p=2) | With Manhattan distance, accuracy and confusion matrix are identical to k=3. With Euclidean distance there is one additional false negative. |
| **k = 5 (with scaling)** | **0.93** (p=1 and p=2) | `[[64, 4], [3, 29]]` | Accuracy is again 0.93, so the model is not very sensitive to small changes in k around this range. k=5 is a common default and, being odd, cannot produce tied votes in a two-class problem. |
| **k = 3 (no scaling)** | **0.78** (p=1) | `[[57, 11], [11, 21]]` | Accuracy drops by 15 percentage points without scaling. EstimatedSalary, with its much larger range, dominates the distance calculation. This highlights the importance of **feature scaling in KNN**. |


## Key Observations

* With scaling, **k=3, 4, and 5 give accuracy of 0.92 to 0.93** across both distance metrics (only k=4 with Euclidean distance falls to 0.92), showing stability in this range.
* Without scaling, accuracy drops to **0.78** on the same split, strong evidence that scaling is **crucial for distance-based algorithms** like KNN.
* An odd, moderate value such as **k=5** is often preferred: it balances flexibility (low bias) against stability (low variance) and avoids tied votes in binary classification.

---


## Summary

* Implemented KNN with **k = 3, 4, 5 (with scaling)** and two distance metrics → accuracy of **0.93** in five of six runs and **0.92** for k = 4 with Euclidean distance.
* Using **k = 3 without scaling** dropped accuracy to **0.78**, showing how much KNN depends on scaling for this dataset.
* The confusion matrices show 7 to 8 misclassified test samples out of 100 with scaling, versus 22 without.
* KNN was stable across the tested `k` values, while scaling made the largest difference in results.


## Key Takeaways

* KNN is **distance-based** and sensitive to feature scaling.
* Small `k` → flexible, risk of overfitting; larger `k` → smoother, risk of underfitting.
* **Scaling features is essential** for good KNN performance when features have very different ranges, as Age and EstimatedSalary do here.
* Best accuracy achieved in this notebook: **0.93 (scaled data, k=3–5)**.

