# 🤖 Introduction to k-Nearest Neighbors (k-NN) Algorithm and Its Applications

---

## 📌 **What is k-Nearest Neighbors (k-NN)?**

The **k-Nearest Neighbors (k-NN)** algorithm is a **simple, instance-based learning method**.  
It classifies new data points based on the **majority class** among their *k* nearest neighbors in the **feature space**.

---

## 🧠 **Key Characteristics**

- **📍 Instance-Based Learning**  
  - k-NN **stores all training instances** instead of learning an explicit model.  
  - Predictions are deferred until a query is made (also called *lazy learning*).  

- **📏 Distance Metric**  
  - Uses distance measures like:  
    - **Euclidean Distance**:  
      \[
      d(p, q) = \sqrt{\sum (p_i - q_i)^2}
      \]  
    - **Manhattan Distance**:  
      \[
      d(p, q) = \sum |p_i - q_i|
      \]

- **🏷 Classification**  
  - Predicts **class labels** based on **majority voting** among neighbors.  

- **📈 Regression**  
  - Predicts **continuous values** by **averaging the values** of nearest neighbors.  
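
To make the characteristics above concrete, here is a minimal from-scratch sketch of a brute-force k-NN predictor. The function name `knn_predict`, its `task` parameter, and the toy data are all hypothetical, introduced only for illustration. Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict for one query point with brute-force k-NN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)

    # "Lazy" step: nothing was fitted beforehand, so all work happens at query time.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    if task == "classification":
        # Majority vote among the k neighbors.
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: average of the neighbors' target values.
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two clusters in 2-D.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

predicted_label = knn_predict(X_toy, labels, [1.1, 1.0], k=3)
predicted_value = knn_predict(X_toy, values, [5.0, 5.0], k=3, task="regression")
print(predicted_label, predicted_value)  # a 50.0
```

The only "training" is keeping `X_train` and `y_train` around, which is why k-NN is called instance-based.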

---

## 🌟 **Applications of k-NN**

- 🖼 **Image Recognition**  
  - Classify images based on similarity in pixel or feature space.  

- 📚 **Recommendation Systems**  
  - Suggest items to users based on similar preferences or behaviors.  

- 🏥 **Medical Diagnosis**  
  - Identify diseases or conditions by comparing patient data to similar historical cases.  

- 🛍 **Customer Segmentation**  
  - Group customers with similar purchasing behaviors for targeted marketing.  

---

## ▶ **Optional: Quick Python Example**
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Initialize and train k-NN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
```

# k-Nearest Neighbors (k-NN) — Classification & Regression

## 📖 What is k-NN?
k-NN is a simple **instance-based learning** algorithm:
- Classifies a new data point based on the **majority class** among its *k* nearest neighbors.
- For regression, predicts the output as the **average** of the nearest neighbors' values.
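
The regression case can be sketched with scikit-learn's `KNeighborsRegressor`. The toy data below is made up for illustration, and the default uniform weighting is assumed, so the prediction is a plain average of the neighbors' targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data (made up): targets follow y = 2x exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

reg = KNeighborsRegressor(n_neighbors=3)  # default weights="uniform" gives a plain average
reg.fit(X_train, y_train)

# The 3 nearest neighbors of x = 3.1 are x = 3, 4, 2, so the prediction is mean(6, 8, 4) = 6.0.
prediction = reg.predict([[3.1]])[0]
print(prediction)
```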

---

## 🔑 Key Characteristics
- **Instance-Based Learning**: Stores all training data (no explicit model).
- **Distance Metrics**: Common choices — Euclidean, Manhattan, Minkowski.
- **Classification**: Uses majority voting.
- **Regression**: Uses averaging of neighbor values.

---

## 🔢 Step-by-Step Process

### 1️⃣ Feature Scaling  
Normalize or standardize features so that no single feature dominates the distance because of its numeric range.  
- **Min–Max Scaling**:  
  \[
  x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
  \]  
- **Standardization (Z-score)**:  
  \[
  x' = \frac{x - \mu}{\sigma}
  \]
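
Both formulas can be checked by hand with NumPy. This is a small sketch on hypothetical data (an age column and an income column, chosen only to show very different numeric ranges).

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 100_000.0]])

# Min-max scaling: each column is mapped onto [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: each column gets mean 0 and standard deviation 1.
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
X_standard = (X - mu) / sigma

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

After either transform, a difference of 20 years in age is no longer swamped by a difference of a few thousand dollars in income.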

### 2️⃣ Calculate Distances  
Compute distances between the new point and all training points.  
- **Euclidean Distance**:  
  \[
  d(p,q) = \sqrt{\sum_{i=1}^d (p_i - q_i)^2}
  \]  
- **Manhattan Distance**:  
  \[
  d(p,q) = \sum_{i=1}^d |p_i - q_i|
  \]  
- **Minkowski Distance** (order \(r \ge 1\); \(r = 1\) gives Manhattan, \(r = 2\) gives Euclidean):  
  \[
  d(p,q) = \left(\sum_{i=1}^d |p_i - q_i|^r\right)^{1/r}
  \]
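
As a quick check that Minkowski distance generalizes the other two, here is a short sketch. The helper `minkowski_distance` is hypothetical, and its exponent is named `r` to keep it distinct from the point `p`.

```python
import numpy as np


def minkowski_distance(p, q, r=2):
    """Minkowski distance of order r between two points."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float((np.abs(p - q) ** r).sum() ** (1.0 / r))


p = [1.0, 2.0]
q = [4.0, 6.0]

manhattan = minkowski_distance(p, q, r=1)  # |1 - 4| + |2 - 6| = 7
euclidean = minkowski_distance(p, q, r=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```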

### 3️⃣ Identify k Nearest Neighbors  
- Choose **k** (a hyperparameter).  
- Select **k** training points with the smallest distances.  
- **Small k** → Low bias, high variance (sensitive to noise).  
- **Large k** → High bias, low variance (may underfit).  
- Use cross-validation to choose the best **k**.
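
One possible way to pick **k** by cross-validation is sketched below with scikit-learn. The range of candidate values (1 to 15) and the 5-fold setting are assumptions made for this example, not recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 16):
    # Scaling sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Pick the k with the highest mean accuracy (the first one wins on ties).
best_k = max(cv_scores, key=cv_scores.get)
print(f"best k = {best_k}, mean CV accuracy = {cv_scores[best_k]:.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, which matters on small datasets where several values of **k** can look identical on one split.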

### 4️⃣ Make Predictions  
- **Classification (Majority Voting)**:  
  \[
  \hat{y} = \text{mode}\{y_{(1)}, y_{(2)}, \dots, y_{(k)}\}
  \]  
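- **Weighted Voting** (closer neighbors count more; \(d_i\) is the distance to the \(i\)-th neighbor and \(\varepsilon\) is a small constant that avoids division by zero):  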
  \[
  w_i = \frac{1}{d_i + \varepsilon}, \qquad 
  \hat{y} = \arg\max_c \sum_{i: y_{(i)}=c} w_i
  \]  
- **Regression (Average)**:  
  \[
  \hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}
  \]
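
The difference between majority voting and weighted voting shows up when a few very close neighbors disagree with a larger group of distant ones. The sketch below uses hypothetical neighbor labels and distances chosen to produce that situation.

```python
from collections import defaultdict

import numpy as np

# Hypothetical neighbors of one query point: labels and distances (k = 5).
neighbor_labels = ["a", "b", "b", "b", "a"]
neighbor_distances = np.array([0.1, 1.0, 1.2, 1.5, 0.2])
eps = 1e-9  # guards against division by zero when a distance is exactly 0

# Plain majority vote: "b" wins 3 to 2.
majority = max(set(neighbor_labels), key=neighbor_labels.count)

# Distance-weighted vote: each neighbor contributes 1 / (d_i + eps) to its class.
weights = 1.0 / (neighbor_distances + eps)
class_weight = defaultdict(float)
for label, w in zip(neighbor_labels, weights):
    class_weight[label] += w
weighted = max(class_weight, key=class_weight.get)

print(majority, weighted)  # b a
```

Here the two "a" neighbors carry a combined weight of about 15, against about 2.5 for the three "b" neighbors, so the weighted vote flips the result.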

---

## 📌 k-NN Classification Example with Iris Dataset
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load dataset
X, y = load_iris(return_X_y=True)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train k-NN
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')  # Distance-weighted voting
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
```

## 🔎 Understanding the Model’s Limitations (k-NN)

- **Computationally Expensive**  
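  - A brute-force prediction computes distances to **all** training samples; tree-based indexes (KD-tree, ball tree) can reduce this cost when the number of features is small.  
  - Prediction time grows with the size of the training set.  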

- **Feature Scaling Dependence**  
  - Requires **proper scaling** (e.g., Min–Max, Standardization) to avoid dominance by features with larger numeric ranges.  

- **Not Robust to Imbalanced Data**  
  - Classes with more samples can **dominate predictions**.  
  - Consider techniques like resampling, synthetic data generation (SMOTE), or using weighted k-NN.  

### ➕ Additional Considerations  
- **Choice of k Matters**: Too small → sensitive to noise. Too large → may underfit.  
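- **Curse of Dimensionality**: As the number of features increases, distances become less meaningful because the nearest and farthest points end up at similar distances. Dimensionality reduction (e.g., PCA) may help; t-SNE is mainly used for visualization.  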
- **Storage Requirement**: All training data must be stored for predictions, increasing memory usage.  
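
The curse of dimensionality can be illustrated with a small experiment on random data. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and the helper `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of nearest to farthest Euclidean distance from one random query."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return float(distances.min() / distances.max())


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>5} features: nearest/farthest = {ratio:.2f}")
```

As the ratio approaches 1, the "nearest" neighbor is barely closer than the farthest point, so the neighborhood carries little information.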


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data
data = load_iris()
X, y = data.data, data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

```python
# Compare test accuracy for k = 1..10
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"k = {k}, Accuracy = {accuracy:.2f}")
```

```
k = 1, Accuracy = 1.00
k = 2, Accuracy = 1.00
k = 3, Accuracy = 1.00
k = 4, Accuracy = 1.00
k = 5, Accuracy = 1.00
k = 6, Accuracy = 1.00
k = 7, Accuracy = 1.00
k = 8, Accuracy = 1.00
k = 9, Accuracy = 1.00
k = 10, Accuracy = 1.00
```

```python
from sklearn.linear_model import LogisticRegression

# Train logistic regression model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Evaluate on the same test split
y_pred_lr = log_reg.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)

# "L R A" stands for logistic regression accuracy
print("L R A: ", accuracy_lr)
```

```
L R A:  1.0
```

```python
# Every k from 1 to 10 reached the same test accuracy above, so k = 5 is an arbitrary pick
best_k = 5
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"k-NN Accuracy k = {best_k}: ", accuracy_knn)
```

```
k-NN Accuracy k = 5:  1.0
```