## KNN Imputation (K-Nearest Neighbors Imputation)

### What it does
- KNN Imputation fills a missing value using values from the **K most similar rows**.
- Similarity is computed using a distance metric (usually **Euclidean distance**) on the **available features only**.

---

### Core working idea
1. Select a row with a missing value.
2. Compute distances between this row and all other rows **that have the target value present**.
3. Ignore features that are missing in either row while computing distance.
4. Identify the **K nearest neighbors**.
5. Impute the missing value using:
   - **Mean** of neighbors (numerical)
   - **Mode** of neighbors (categorical)
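
The five steps above can be sketched for a single missing cell as follows. This is a minimal illustration, not scikit-learn's implementation: the helper name `knn_impute_cell` and the toy array are made up for the example, and the missing-aware distance is borrowed from scikit-learn's `nan_euclidean_distances`.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def knn_impute_cell(X, row, col, k):
    """Impute X[row, col] as the mean of its k nearest donors (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Step 2: only rows that have the target column present can donate a value
    donors = np.flatnonzero(~np.isnan(X[:, col]))
    donors = donors[donors != row]
    # Step 3: the missing-aware distance skips coordinates missing in either row
    dist = nan_euclidean_distances(X[[row]], X[donors])[0]
    # Step 4: keep the k closest donors
    nearest = donors[np.argsort(dist)[:k]]
    # Step 5: numerical feature, so average the neighbors' values
    return X[nearest, col].mean()


toy = np.array([
    [1.0, 2.0, np.nan],   # row 0: third feature is missing
    [1.5, 2.0, 3.0],
    [8.0, 9.0, 10.0],
    [1.0, 2.5, 5.0],
])
imputed = knn_impute_cell(toy, row=0, col=2, k=2)
print(imputed)  # rows 1 and 3 are nearest, so this is the mean of 3.0 and 5.0
```

For a categorical feature, the last step would take the most frequent value among the neighbors instead of the mean.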

---

### Distance calculation (important)
Euclidean distance with missing-aware weighting:

\[
d(x,y) = \sqrt{w \times \sum (x_i - y_i)^2}
\]

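Where:
- The sum runs over **only the coordinates present in both rows**
- \(w = \dfrac{\text{Total features}}{\text{Number of present features}}\)

The weight scales the distance up when fewer features are shared, so a pair of rows compared on a single coordinate is not automatically treated as closer than a pair compared on all of them.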
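
As a sketch, the formula can be written directly in NumPy. The function name `nan_euclidean` is made up for this example; scikit-learn ships the same metric as `sklearn.metrics.pairwise.nan_euclidean_distances`, which the snippet uses only as a cross-check.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def nan_euclidean(x, y):
    """Missing-aware Euclidean distance between two 1-D vectors (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~np.isnan(x) & ~np.isnan(y)   # coordinates present in both rows
    if not present.any():
        return np.nan                        # no shared coordinate: distance undefined
    w = x.size / present.sum()               # total features / present features
    return np.sqrt(w * np.sum((x[present] - y[present]) ** 2))


a = [np.nan, 45.0, 68.0, 12.0]
b = [23.0, 51.0, 71.0, 18.0]
d_manual = nan_euclidean(a, b)
d_sklearn = nan_euclidean_distances([a], [b])[0, 0]
print(d_manual, d_sklearn)  # both are approximately sqrt(108)
```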

---

### Example (conceptual)
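If a row is missing **Feature 1** (out of four features):
- Use Feature 2, Feature 3, and Feature 4 to compute distances
- Find the K nearest rows that have Feature 1 present (they do not need to be complete in the other features)
- Average their Feature 1 values → impute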

---

### Hyperparameter: K
- Small K → low bias, high variance
- Large K → high bias, low variance
- Typical starting point: **K = 3–7**
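
One way to pick K is to treat it as a hyperparameter of the whole modelling pipeline and cross-validate it. The sketch below assumes a classification task and uses synthetic data with values knocked out at random; the dataset, the 10% missing rate, and the grid of K values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% of the entries removed at random (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

pipe = Pipeline([
    ("scale", StandardScaler()),        # StandardScaler ignores NaNs when fitting
    ("impute", KNNImputer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Try several K values; the imputer is refit inside every CV fold
search = GridSearchCV(pipe, {"impute__n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)

best_k = search.best_params_["impute__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Tuning K against the downstream score, rather than against imputation error, keeps the choice tied to what the imputed data is used for.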

---

### Advantages
- Preserves **local data structure**
- Often more accurate than mean/median imputation, because it uses information from the other features
- Maintains correlation between features
- Makes no assumption of a linear relationship between features

---

### Disadvantages
- Computationally expensive (distance to all points)
- Slow for large datasets
- Sensitive to feature scaling
- Poor performance with high missingness
- Requires storing full training data
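
The sensitivity to feature scaling can be seen on a tiny example. The sketch below uses made-up data (height in metres, income, and a score to impute) and compares imputing on raw values with imputing after `StandardScaler`; the column meanings and numbers are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Columns: height (m), income, score. The last row is missing its score.
X = np.array([
    [1.60, 50_000.0, 10.0],
    [1.95, 50_100.0, 90.0],
    [1.61, 52_000.0, 12.0],
    [1.62, 50_100.0, np.nan],
])

# Without scaling, income dominates the distance, so row 1 is the nearest neighbor
raw = KNNImputer(n_neighbors=1).fit_transform(X)[3, 2]

# With scaling, height counts as well, and row 0 becomes the nearest neighbor
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)               # NaNs are ignored in fit, kept in transform
filled = KNNImputer(n_neighbors=1).fit_transform(X_scaled)
scaled = scaler.inverse_transform(filled)[3, 2]  # back to the original units

print(raw, scaled)  # the two imputed scores come from different neighbors
```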

---

### When to use
- Small to medium datasets
- Strong relationship between features
- When the downstream model is itself distance-based
- When accuracy matters more than speed

---

### When to avoid
- Very large datasets
- High-dimensional data
- Real-time inference systems
- If features are poorly correlated

---

### sklearn implementation
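```python
from sklearn.impute import KNNImputer

# X: 2-D numeric array or DataFrame with np.nan marking the missing values.
# weights='distance' weights each neighbor by the inverse of its distance;
# the default, weights='uniform', takes a plain mean of the neighbors.
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_imputed = imputer.fit_transform(X)
```

## KNN Imputation: Worked Numerical Example (Step by Step)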

Assume we have the following dataset with **one missing value** in `Feature 1`.

### Given data

| Row | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|-----|----------|----------|----------|----------|
| 1   | 33       | 67       | 68       | 21 |
| 2   | NaN      | 45       | 68       | 12 |
| 3   | 23       | 51       | 71       | 18 |
| 4   | 40       | NaN      | 81       | NaN |
| 5   | 35       | 60       | 79       | NaN |

We want to **impute Feature 1 for Row 2** using **KNN Imputation**.

---

### Step 1: Identify the row with missing value
Target row (Row 2):

\[
(?,\ 45,\ 68,\ 12)
\]

We compute the distance between this row and every other row **that has Feature 1 present**.

---

### Step 2: Distance formula (missing-aware)

\[
d(x,y) = \sqrt{w \times \sum (x_i - y_i)^2}
\]

Where:

\[
w = \frac{\text{Total number of features}}{\text{Number of present features}}
\]

Total features = 4

---

### Step 3: Distance from Row 1

Row 1: (33, 67, 68, 21)

Common features: Feature 2, 3, 4 → **3 features**

\[
w = \frac{4}{3}
\]

\[
d_1 = \sqrt{\frac{4}{3} \left[(67-45)^2 + (68-68)^2 + (21-12)^2\right]}
\]

\[
= \sqrt{\frac{4}{3} (484 + 0 + 81)}
\]

\[
= \sqrt{\frac{4}{3} \times 565} = \sqrt{753.33} \approx 27.45
\]

---

### Step 4: Distance from Row 3

Row 3: (23, 51, 71, 18)

Common features: Feature 2, 3, 4 → **3 features**

\[
w = \frac{4}{3}
\]

\[
d_3 = \sqrt{\frac{4}{3} [(51-45)^2 + (71-68)^2 + (18-12)^2]}
\]

\[
= \sqrt{\frac{4}{3} (36 + 9 + 36)}
\]

\[
= \sqrt{108} \approx 10.39
\]

---

### Step 5: Distance from Row 4

Row 4: (40, NaN, 81, NaN)

Row 4 is missing Feature 2 and Feature 4, so only Feature 3 is common → **1 feature**

\[
w = \frac{4}{1} = 4
\]

\[
d_4 = \sqrt{4 (81-68)^2} = \sqrt{4 \times 169} = \sqrt{676} = 26
\]

---

### Step 6: Distance from Row 5

Row 5: (35, 60, 79, NaN)

Common features: Feature 2 and 3 → **2 features**

\[
w = \frac{4}{2} = 2
\]

\[
d_5 = \sqrt{2 [(60-45)^2 + (79-68)^2]}
\]

\[
= \sqrt{2 (225 + 121)} = \sqrt{692} \approx 26.31
\]

---

### Step 7: Nearest neighbors (K = 2)

| Row | Distance |
|-----|----------|
| Row 3 | **10.39** ✅ |
| Row 4 | **26.00** ✅ |
| Row 5 | 26.31 |
| Row 1 | 27.45 |

Nearest neighbors’ Feature 1 values:
- Row 3 → 23
- Row 4 → 40

---

### Step 8: Impute missing value

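For numerical data → **mean** of the K neighbors' values. With uniform weights, each neighbor counts equally:

\[
\text{Imputed value} = \frac{23 + 40}{2} = 31.5
\]

With `weights='distance'`, as in the sklearn snippet earlier, the closer neighbor (Row 3) would count more and the imputed value would move toward 23.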

---

### ✅ Final Result

Row 2, Feature 1 = **31.5**

---

### Key observations
- Distance ignores missing coordinates
- The weight scales up distances computed from fewer shared features
- K controls bias–variance tradeoff
- This preserves local structure better than mean imputation
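
As a cross-check, the same table can be passed to `KNNImputer` with `n_neighbors=2`. The sketch below runs it once with uniform weights, which matches the hand calculation, and once with inverse-distance weights for comparison.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same five rows as in the worked example
X = np.array([
    [33.0,   67.0,   68.0, 21.0],
    [np.nan, 45.0,   68.0, 12.0],
    [23.0,   51.0,   71.0, 18.0],
    [40.0,   np.nan, 81.0, np.nan],
    [35.0,   60.0,   79.0, np.nan],
])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[1, 0])    # plain mean of the two neighbors: (23 + 40) / 2
print(weighted[1, 0])   # inverse-distance weighting pulls the value toward Row 3's 23
```

Note that `fit_transform` also fills the other missing cells (Row 4 and Row 5) in the same call, each from its own set of neighbors.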


```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv('train.csv')[['Age', 'Pclass', 'Fare', 'Survived']]

df.head()
```

|   | Age  | Pclass | Fare    | Survived |
|---|------|--------|---------|----------|
| 0 | 22.0 | 3      | 7.25    | 0        |
| 1 | 38.0 | 1      | 71.2833 | 1        |
| 2 | 26.0 | 3      | 7.925   | 1        |
| 3 | 35.0 | 1      | 53.1    | 1        |
| 4 | 35.0 | 3      | 8.05    | 0        |


```python
X = df.drop(columns=['Survived'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

X_train.head()
```

|     | Age  | Pclass | Fare    |
|-----|------|--------|---------|
| 30  | 40.0 | 1      | 27.7208 |
| 10  | 4.0  | 3      | 16.7    |
| 873 | 47.0 | 3      | 9.0     |
| 182 | 9.0  | 3      | 31.3875 |
| 876 | 20.0 | 3      | 9.8458  |


```python
knn = KNNImputer(n_neighbors=5, weights='distance')

# Fit on the training split only, then reuse the fitted imputer on the test split
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)

pd.DataFrame(X_train_trf, columns=X_train.columns)
```

|     | Age       | Pclass | Fare     |
|-----|-----------|--------|----------|
| 0   | 40.000000 | 1.0    | 27.7208  |
| 1   | 4.000000  | 3.0    | 16.7000  |
| 2   | 47.000000 | 3.0    | 9.0000   |
| 3   | 9.000000  | 3.0    | 31.3875  |
| 4   | 20.000000 | 3.0    | 9.8458   |
| ... | ...       | ...    | ...      |
| 707 | 30.000000 | 3.0    | 8.6625   |
| 708 | 25.047962 | 3.0    | 8.7125   |
| 709 | 71.000000 | 1.0    | 49.5042  |
| 710 | 29.530076 | 1.0    | 221.7792 |


```python
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)
```

Output: `0.7039106145251397`

```python
# Comparison with SimpleImputer (default strategy: mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)
```

```python
lr = LogisticRegression()

lr.fit(X_train_trf2, y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test, y_pred2)
```

Output: `0.6927374301675978`
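
A single train/test split gives a noisy comparison: the two accuracies above differ by about one percentage point on one particular split. A sketch of a more stable comparison with k-fold cross-validation follows. The helper name `compare_imputers` is hypothetical, and the demo runs on synthetic data because `train.csv` is not bundled here; passing the `X` and `y` from the cells above should work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_imputers(X, y, cv=5):
    """Mean cross-validated accuracy per imputer (hypothetical helper)."""
    imputers = {
        "knn": KNNImputer(n_neighbors=5, weights="distance"),
        "mean": SimpleImputer(strategy="mean"),
    }
    scores = {}
    for name, imputer in imputers.items():
        # The pipeline refits the imputer on the training part of every fold
        model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
        scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# Synthetic stand-in for the real data, with about 15% of entries missing
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=2)
rng = np.random.default_rng(2)
X_demo[rng.random(X_demo.shape) < 0.15] = np.nan

scores = compare_imputers(X_demo, y_demo)
print(scores)
```

Which imputer comes out ahead depends on the data, so the synthetic scores here say nothing about the result on the original dataset.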