# Multivariate Imputation

---
🔹 What is Multivariate Imputation?

Unlike univariate imputation (e.g., filling missing values in each column independently with mean/median/mode), multivariate imputation uses the relationships among multiple features to estimate missing values.

It exploits the correlation structure of the dataset, so imputed values stay consistent with the rest of the row instead of collapsing to a column-wide average.

🔹 Key Idea

Each variable with missing values is predicted from the other variables.

Missing values are imputed using regression models, iterative methods, or stochastic approaches.

When features are correlated, this typically produces more accurate and internally consistent estimates than simple univariate imputation.

🔹 Popular Approaches
1. MICE (Multiple Imputation by Chained Equations)

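Works iteratively.

Start by filling every missing value with a simple placeholder such as the column mean.

For each feature with missing values, build a regression model using the other features as predictors.

Re-impute that feature's missing values from the model, then move to the next feature.

Repeat the cycle several times until the imputed values stabilize.

Run the whole procedure several times with random draws to produce multiple imputed datasets, which reflects the uncertainty in the imputed values.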
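
The cycle above can be sketched by hand. This is a minimal illustration, not a full MICE implementation: the function name `chained_imputation`, the toy data, the column-mean starting guess, and the choice of `LinearRegression` are all assumptions made for the example, and it produces one deterministic imputation rather than several random draws.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_imputation(X, n_cycles=10):
    """Hypothetical helper: single-imputation sketch of the chained-equations loop."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace every missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)  # all other features as predictors
            observed = ~missing[:, j]
            # Fit on rows where column j was actually observed.
            model = LinearRegression().fit(others[observed], filled[observed, j])
            # Overwrite only the originally missing entries of column j.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Toy data (made up): experience is roughly age minus 20.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=50)
experience = age - 20 + rng.normal(0, 1, size=50)
X = np.column_stack([age, experience])
X[::7, 1] = np.nan    # remove some experience values
X[3::11, 0] = np.nan  # remove some ages

X_filled = chained_imputation(X)
print(np.isnan(X_filled).sum())
```

Observed entries are never modified; only the positions that were originally missing are updated on each pass.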

2. IterativeImputer (Scikit-learn)

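Scikit-learn's implementation of a MICE-like algorithm. By default it returns a single imputed dataset; setting sample_posterior=True and varying random_state gives multiple imputations.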

Models each feature with missing values as a function of other features, in a round-robin fashion.
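
One way to get several imputed datasets out of `IterativeImputer` is to enable `sample_posterior=True` and vary `random_state`. The sketch below uses made-up synthetic data; the number of imputations (5) and the simple averaging at the end are illustrative choices, not a recommended pooling procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data (assumption): the third column depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_missing = X.copy()
X_missing[rng.random(100) < 0.2, 2] = np.nan  # hide about 20% of column 2

# Each seed draws imputed values from the predictive distribution,
# so the five completed datasets differ at the missing positions.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

stacked = np.stack(imputations)  # shape: (5, 100, 3)
pooled = stacked.mean(axis=0)    # average across imputations
spread = stacked.std(axis=0)     # per-cell disagreement between imputations
print(spread[np.isnan(X_missing)].mean())
```

The spread is zero wherever a value was observed and positive where it was imputed, which is a rough picture of the uncertainty that a single imputation hides.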

3. KNN Imputation (Multivariate by distance)

Missing values are imputed by finding the k nearest neighbors (based on similarity in other features).

Takes into account multivariate structure.

🔹 Example

Dataset before imputation:

| ID | Age | Salary | Experience |
|----|-----|--------|------------|
| 1  | 25  | 50000  | 2          |
| 2  | NaN | 60000  | 5          |
| 3  | 40  | NaN    | 10         |
| 4  | 35  | 70000  | NaN        |

Using IterativeImputer (MICE):

Age (row 2) could be predicted using Salary + Experience.

Salary (row 3) could be predicted using Age + Experience.

Experience (row 4) could be predicted using Age + Salary.

🔹 In Scikit-learn
import numpy as np
import pandas as pd
# Required: IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Sample data
data = pd.DataFrame({
    "Age": [25, np.nan, 40, 35],
    "Salary": [50000, 60000, np.nan, 70000],
    "Experience": [2, 5, 10, np.nan]
})

# Multivariate imputer (MICE)
imputer = IterativeImputer(max_iter=10, random_state=42)
imputed = imputer.fit_transform(data)

print(pd.DataFrame(imputed, columns=data.columns))


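This fills each missing value with a prediction from a regression model (BayesianRidge by default) fit on the other features. fit_transform returns a NumPy array, so the result is wrapped back into a DataFrame to restore the column names.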

🔹 Advantages

✅ Uses correlation across features → often more accurate than univariate methods.
✅ Works well for datasets with strong feature relationships.
✅ Reflects uncertainty better (especially with multiple imputations).

🔹 Disadvantages

❌ More computationally expensive.
❌ Assumes data is Missing at Random (MAR).
❌ Can overfit if dataset is small.

✅ Summary

Multivariate Imputation = handling missing values by leveraging relationships across features.

Simple methods: KNN imputation.

Advanced methods: MICE (IterativeImputer).

It’s usually preferred when features are strongly correlated and the dataset is not too small.


# KNN Imputation

---
🔹 What is KNN Imputation?

KNN (K-Nearest Neighbors) Imputation is a technique where missing values are filled based on the values of the k most similar samples (neighbors).

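Similarity is usually measured using Euclidean distance (or other distance metrics). Scikit-learn's KNNImputer uses a NaN-aware variant (nan_euclidean) that skips missing coordinates and rescales the distance over the coordinates that are present.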
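
A small worked example of a NaN-aware Euclidean distance, using scikit-learn's `nan_euclidean_distances`. The two rows are made up for illustration. Only the coordinates present in both rows contribute, and the squared distance is scaled by (total coordinates / present coordinates) before the square root.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [3.0, 4.0, np.nan],  # third coordinate is missing
    [0.0, 0.0, 0.0],
])

# Present coordinates: (3, 4) vs (0, 0) -> squared distance 9 + 16 = 25
# Scale factor: 3 total coordinates / 2 present = 1.5
# Distance: sqrt(1.5 * 25) = sqrt(37.5)
d = nan_euclidean_distances(X)[0, 1]
print(d, np.sqrt(37.5))
```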

Instead of using mean/median for the whole column, KNN looks at rows that are "closest" to the row with missing data and imputes using their values.

🔹 How It Works

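Find the k nearest neighbors of the sample with a missing value (distances are computed on the features observed in both rows, and only rows that have a value for the missing feature can serve as neighbors).

Look up the neighbors' values for the feature that is missing.

Replace the missing value with:

Mean/Median of neighbor values (for continuous features). Scikit-learn's KNNImputer uses the mean, optionally weighted by distance.

Mode of neighbor values (for categorical features). KNNImputer does not do this; it treats every column as numeric.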
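
The steps above can be sketched for a single missing cell. The helper name `knn_impute_value` and the toy Age/Salary numbers (salary in thousands) are hypothetical; the sketch uses a plain mean of the neighbors and a NaN-aware distance, which is one reasonable reading of the procedure rather than the only one.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def knn_impute_value(X, row, col, k=2):
    """Hypothetical helper: impute X[row, col] from its k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    # Donors are rows that actually have a value in the target column.
    donors = np.flatnonzero(~np.isnan(X[:, col]))
    donors = donors[donors != row]
    # Distance from the target row to every donor, ignoring missing coordinates.
    dist = nan_euclidean_distances(X[[row]], X[donors])[0]
    nearest = donors[np.argsort(dist)[:k]]
    # Continuous feature: average the neighbors' values.
    return X[nearest, col].mean()


# Columns: Age, Salary (in thousands). Age is missing in row 1.
X = np.array([
    [25.0, 50.0],
    [np.nan, 60.0],
    [40.0, 55.0],
    [28.0, 72.0],
])

value = knn_impute_value(X, row=1, col=0, k=2)
print(value)  # neighbors by salary are rows 2 and 0 -> (40 + 25) / 2 = 32.5
```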

🔹 Example

Dataset:

| ID | Age | Salary | Gender |
|----|-----|--------|--------|
| 1  | 25  | 50000  | M      |
| 2  | NaN | 60000  | F      |
| 3  | 40  | 55000  | M      |
| 4  | 28  | NaN    | F      |

Suppose we want to impute Age for ID=2.

We find the 2 or 3 nearest rows (based on Salary and Gender).

Use their Age values to fill the missing one.

🔹 Advantages

✅ More accurate than simple mean/median imputation (considers similarity between rows).
✅ Can preserve relationships between features.
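✅ The idea applies to both numerical and categorical data, although scikit-learn's KNNImputer accepts numeric input only and averages neighbor values, so encoded categories can come back as fractions.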

🔹 Disadvantages

❌ Computationally expensive for large datasets (distance calculation for every missing value).
❌ Performance depends on choice of k and distance metric.
❌ Sensitive to irrelevant or highly scaled features → requires normalization/standardization before use.
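
One possible way to handle the scaling issue is to standardize before imputing and map the result back afterwards. This is a sketch on made-up Age/Salary values; it relies on `StandardScaler` ignoring NaNs when fitting and passing them through when transforming.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: Age, Salary. Unscaled, Salary would dominate every distance.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [40.0, 55000.0],
    [28.0, np.nan],
    [33.0, 58000.0],
])

# Scale first so both features contribute comparably, then impute.
pipeline = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_imputed = pipeline.fit_transform(X)

# Map the completed data back to the original units.
scaler = pipeline.named_steps["standardscaler"]
X_imputed = scaler.inverse_transform(X_scaled_imputed)
print(X_imputed)
```

Whether to invert the scaling depends on what the downstream model expects; many models are happy to consume the standardized values directly.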

🔹 In Scikit-learn

Scikit-learn provides KNNImputer:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Sample data
data = pd.DataFrame({
    "Age": [25, np.nan, 40, 28],
    "Salary": [50000, 60000, 55000, np.nan],
    "Gender": [1, 0, 1, 0]   # (M=1, F=0 for simplicity)
})

# Initialize KNN imputer (k=2 neighbors)
imputer = KNNImputer(n_neighbors=2)
transformed = imputer.fit_transform(data)

print(pd.DataFrame(transformed, columns=data.columns))

🔹 When to Use

When the dataset is not too large.

When missing values are not too frequent.

When you believe that similar samples have similar values.

✅ Summary:
KNN Imputation replaces missing values by looking at the values of the k most similar rows. It is more sophisticated than mean/median imputation but more computationally expensive and requires careful preprocessing (like scaling).


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [5]:
df = pd.read_csv('titanic.csv',usecols = ['Age','Pclass','Fare','Survived'])

In [6]:
df.head()

Unnamed: 0,Survived,Pclass,Age,Fare
0,0,3,22.0,7.25
1,1,1,38.0,71.2833
2,1,3,26.0,7.925
3,1,1,35.0,53.1
4,0,3,35.0,8.05


In [7]:
x= df.drop(columns = 'Survived')
y = df['Survived']

In [8]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 2)

In [10]:
x_train.head()


Unnamed: 0,Pclass,Age,Fare
30,1,40.0,27.7208
10,3,4.0,16.7
873,3,47.0,9.0
182,3,9.0,31.3875
876,3,20.0,9.8458


In [11]:
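# Defaults: n_neighbors=5, weights='uniform'
ki = KNNImputer()
# Fit on the training split only, then reuse the fitted imputer on the test split
x_train_trf = ki.fit_transform(x_train)
x_test_trf = ki.transform(x_test)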

In [12]:
pd.DataFrame(x_train_trf,columns = x_train.columns)

Unnamed: 0,Pclass,Age,Fare
0,1.0,40.0,27.7208
1,3.0,4.0,16.7000
2,3.0,47.0,9.0000
3,3.0,9.0,31.3875
4,3.0,20.0,9.8458
...,...,...,...
707,3.0,30.0,8.6625
708,3.0,24.2,8.7125
709,1.0,71.0,49.5042
710,1.0,28.0,221.7792


In [13]:
lr = LogisticRegression()
lr.fit(x_train_trf,y_train)

y_pred =lr.predict(x_test_trf)

accuracy_score(y_test,y_pred)

0.7039106145251397

In [15]:
# Try k = 1..10 neighbors and print the test accuracy for each
for i in range(1,11):
  ki = KNNImputer(n_neighbors = i)
  x_train_trf = ki.fit_transform(x_train)
  x_test_trf = ki.transform(x_test)

  lr = LogisticRegression()
  lr.fit(x_train_trf,y_train)

  y_pred =lr.predict(x_test_trf)

  print(accuracy_score(y_test,y_pred))

0.7206703910614525
0.7094972067039106
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.6983240223463687


In [19]:
# weights can be 'uniform' or 'distance'; the default is 'uniform'

# with 'distance', each neighbor's value is weighted by the reciprocal of its distance

for i in range(1,11):
  ki = KNNImputer(n_neighbors = i, weights='distance')
  x_train_trf = ki.fit_transform(x_train)
  x_test_trf = ki.transform(x_test)

  lr = LogisticRegression()
  lr.fit(x_train_trf,y_train)

  y_pred =lr.predict(x_test_trf)

  print(accuracy_score(y_test,y_pred))

0.7206703910614525
0.7094972067039106
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
0.7039106145251397
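
To make the difference between the two weighting schemes concrete, here is the arithmetic for one imputed value. The neighbor values and distances are made-up numbers chosen for easy mental calculation.

```python
import numpy as np

# Hypothetical neighbors of a row with a missing Age
neighbor_values = np.array([20.0, 30.0, 40.0])
distances = np.array([1.0, 2.0, 4.0])

# weights='uniform': plain average of the neighbors
uniform_estimate = neighbor_values.mean()  # (20 + 30 + 40) / 3 = 30.0

# weights='distance': each value is weighted by 1 / distance
weights = 1.0 / distances                  # [1.0, 0.5, 0.25]
distance_estimate = np.sum(weights * neighbor_values) / np.sum(weights)
# (20 + 15 + 10) / 1.75 = 45 / 1.75, roughly 25.71

print(uniform_estimate, distance_estimate)
```

The distance-weighted estimate is pulled toward the closest neighbor. With k=1 the two schemes give the same value, since there is only one neighbor to average.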


In [24]:
# Univariate baseline: SimpleImputer defaults to strategy='mean'
si = SimpleImputer()
x_train_trf2 = si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)



In [28]:
lr = LogisticRegression()
lr.fit(x_train_trf2,y_train)

y_pred2 = lr.predict(x_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978
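
The accuracy gaps above come from a single train/test split, so they can shift with a different split. One way to get a steadier comparison is to put the imputer and the model in one pipeline and cross-validate it. The sketch below uses synthetic data from `make_classification` with values removed at random (an assumption, not the Titanic file), so its numbers say nothing about the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data with about 10% of cells removed at random
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

scores = {}
for name, imputer in [("knn", KNNImputer(n_neighbors=5)), ("mean", SimpleImputer())]:
    # The imputer is refit inside every fold, so no test-fold data leaks into it.
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```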