## Multivariate imputation

Multivariate imputation is a technique used in statistics and data analysis to handle missing values in a dataset by predicting or estimating them based on the relationships between variables. Here's a simple explanation:

Imagine you have a dataset with multiple variables, such as age, income, and education level. Sometimes, some of the values for these variables might be missing. Instead of just leaving those missing values as gaps in your data, multivariate imputation tries to fill them in by considering the information from the other variables.

For example, if you have missing values for age but you have information on income and education level for the same individuals, you can use those variables to predict what the missing age values might be. This prediction is typically done using statistical models or algorithms that learn patterns from the available data.

Because the estimates draw on the other variables, multivariate imputation usually preserves the relationships in the data better than a single fill-in value, which reduces the bias that missing data can introduce into later analyses.
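As a rough illustration of the age example above, the sketch below predicts a missing age from income and education with an ordinary least-squares fit. The data values, the column order, and the choice of a linear model are all assumptions made for the example; real multivariate imputers may use other models.

```python
import numpy as np

# Toy data (made-up values). Columns: age, income in thousands, years of education.
data = np.array([
    [25.0, 30.0, 12.0],
    [35.0, 50.0, 16.0],
    [45.0, 70.0, 16.0],
    [55.0, 90.0, 18.0],
    [np.nan, 60.0, 16.0],   # age is missing for this person
])

missing = np.isnan(data[:, 0])

# Design matrix: intercept, income, education
A = np.column_stack([np.ones(len(data)), data[:, 1:]])

# Learn age ~ income + education from the rows where age is observed
coef, *_ = np.linalg.lstsq(A[~missing], data[~missing, 0], rcond=None)

# Predict age for the rows where it is missing
imputed = data.copy()
imputed[missing, 0] = A[missing] @ coef

print(imputed[missing, 0])  # approximately 40 for this toy data
```

In this toy data age happens to be an exact linear function of income, so the fit recovers it; on real data the prediction would only be an estimate.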

## Univariate imputation

In univariate imputation, missing values in a dataset are filled in based only on information from the variable with the missing values itself, without considering other variables.

For example, let's say you have a dataset with a single variable like age, and some of the age values are missing. With univariate imputation, you might fill in those missing age values using measures like the mean, median, or mode of the available age data.

So, if you have missing age values, you would calculate the mean (average), median (middle value), or mode (most frequent value) of the ages that are present in your dataset, and then use that value to replace the missing ones.

Univariate imputation is simpler and quicker to implement than multivariate imputation, but it ignores the relationships between variables, and replacing every gap with the same value shrinks the variable's variance. It can still be useful as a quick baseline, or when no other variables are available to inform the estimate.
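A minimal pandas sketch of the three fill-in strategies described above. The `age` values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy age column with two gaps (values are illustrative only)
age = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan], name="Age")

# Each statistic is computed from the observed values only (NaNs are skipped)
mean_filled = age.fillna(age.mean())
median_filled = age.fillna(age.median())
mode_filled = age.fillna(age.mode()[0])   # mode() returns a Series; take the first mode

print(mean_filled.tolist())    # [22.0, 38.0, 32.5, 35.0, 35.0, 32.5]
print(median_filled.tolist())  # [22.0, 38.0, 35.0, 35.0, 35.0, 35.0]
print(mode_filled.tolist())    # [22.0, 38.0, 35.0, 35.0, 35.0, 35.0]
```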

## K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) imputation is a technique used to fill in missing values in a dataset based on similar instances in the data. Here's how it works:

1. **Identifying Similar Instances**: For each instance with missing values, KNN identifies the K nearest instances in the dataset that have an observed value for the feature being filled, based on a distance metric. (Some descriptions require the neighbors to be complete in every feature; scikit-learn's `KNNImputer` does not.) The most common distance metric is Euclidean distance, but other metrics such as Manhattan distance or cosine distance can also be used.

2. **Imputation**: Once the K nearest neighbors are identified, the missing values are imputed by taking either the average or weighted average of the corresponding values from those neighbors.

Here's a simplified step-by-step explanation:

1. **Identify missing values**: First, identify which values in your dataset are missing.

2. **Find nearest neighbors**: For each instance with missing values, find the K nearest instances (neighbors) in the dataset that have an observed value for the feature being filled. The "nearest" is determined by a chosen distance metric.

3. **Impute missing values**: Once you have identified the nearest neighbors, impute the missing values by taking the average (or weighted average) of the corresponding values from those neighbors.

4. **Repeat for all missing values**: Repeat steps 2 and 3 for all instances with missing values in your dataset.

5. **Finalize imputed dataset**: Once all missing values have been imputed, you will have a dataset with no missing values.

KNN imputation can be quite effective, especially when the dataset has complex patterns and relationships between variables. However, it's important to choose an appropriate value of K and distance metric to ensure accurate imputation results. Additionally, KNN imputation may not perform well with high-dimensional datasets or datasets with sparse features.
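The steps above can be sketched by hand in NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper `knn_impute_column` is hypothetical, it fills a single column, and it assumes the other columns are complete so that plain Euclidean distance can be used.

```python
import numpy as np

def knn_impute_column(X, col, k=2):
    """Fill NaNs in column `col` with the mean of that column over the k nearest rows.

    Distances are computed on the other columns, which are assumed complete here.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    other = [j for j in range(X.shape[1]) if j != col]
    donors = np.flatnonzero(~np.isnan(X[:, col]))        # rows that can lend a value
    for i in np.flatnonzero(np.isnan(X[:, col])):        # rows that need a value
        dists = np.linalg.norm(X[donors][:, other] - X[i, other], axis=1)
        nearest = donors[np.argsort(dists)[:k]]          # k closest donor rows
        filled[i, col] = X[nearest, col].mean()          # unweighted average
    return filled

# Toy rows (made-up values). Columns: Pclass, Age, Fare
X = np.array([
    [1.0, 40.0, 30.0],
    [1.0, 50.0, 32.0],
    [3.0, 20.0, 8.0],
    [3.0, 24.0, 9.0],
    [3.0, np.nan, 7.0],
])

filled = knn_impute_column(X, col=1, k=2)
print(filled[4, 1])  # 22.0, the mean age of the two closest rows (20 and 24)
```

Because `Fare` spans a much larger range than `Pclass`, it dominates the distance in this toy data; in practice, features are often scaled before neighbors are searched.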

![](knnalgo.png)

The general K-NN classification algorithm, which KNN imputation builds on, works as follows:

1. Select the number K of neighbors.

2. Calculate the Euclidean distance from the new data point to every point in the training data.

3. Take the K nearest neighbors according to the calculated distances.

4. Among these K neighbors, count the number of data points in each category.

5. Assign the new data point to the category with the most neighbors.

6. The model is ready.
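The classification steps listed above can be sketched in a few lines. The function name `knn_classify` and the toy points are assumptions made for illustration; a library classifier would normally be used instead.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: count labels among the neighbors and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(X_train, y_train, [2, 2], k=3))  # A
print(knn_classify(X_train, y_train, [6, 5], k=3))  # B
```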

**`KNNImputer` uses `nan_euclidean_distances`**

Compute the euclidean distance between each pair of samples in X and Y, where Y=X is assumed if Y=None. When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates:

![](nan_euclidean_distances_formula.png)

`nan_euclidean_distances` is a variant of the Euclidean distance that tolerates missing values encoded as NaN (Not a Number). A plain Euclidean distance evaluates to NaN as soon as either sample has a missing coordinate, which would make a neighbor search impossible on incomplete data. This variant instead skips every coordinate that is missing in either sample and multiplies the squared distance over the remaining coordinates by (total number of coordinates) / (number of coordinates present in both samples) before taking the square root.
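A small worked example may make the rescaling concrete. It computes the distance by hand and compares it with `sklearn.metrics.pairwise.nan_euclidean_distances`; the two sample vectors are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

x = np.array([3.0, np.nan, np.nan, 6.0])
y = np.array([1.0, np.nan, 4.0, 5.0])

# Coordinates observed in both samples (here: the first and the last)
present = ~np.isnan(x) & ~np.isnan(y)

# Scale factor: total coordinates / coordinates present in both = 4 / 2
weight = x.size / present.sum()

# sqrt(2 * ((3 - 1)^2 + (6 - 5)^2)) = sqrt(10)
manual = np.sqrt(weight * np.sum((x[present] - y[present]) ** 2))

# The library function takes 2-D inputs and returns a distance matrix
library = nan_euclidean_distances([x], [y])[0, 0]

print(manual, library)  # both approximately 3.1623
```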

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

Data: https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day39-knn-imputer/train.csv

In [2]:
# Load only the columns used in this example
df = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day39-knn-imputer/train.csv',
                 usecols=['Age', 'Pclass', 'Fare', 'Survived'])

In [3]:
df.head()

| | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 53.1 |
| 4 | 0 | 3 | 35.0 | 8.05 |


In [4]:
df.isnull().sum()

Survived      0
Pclass        0
Age         177
Fare          0
dtype: int64

In [5]:
df.isnull().mean()*100

Survived     0.00000
Pclass       0.00000
Age         19.86532
Fare         0.00000
dtype: float64

In [6]:
x = df.drop(columns=['Survived'])
y = df['Survived']

In [7]:
x.head()

| | Pclass | Age | Fare |
|---|---|---|---|
| 0 | 3 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 3 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 3 | 35.0 | 8.05 |


In [8]:
X_train,X_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=2)

In [9]:
X_train.head()

| | Pclass | Age | Fare |
|---|---|---|---|
| 30 | 1 | 40.0 | 27.7208 |
| 10 | 3 | 4.0 | 16.7 |
| 873 | 3 | 47.0 | 9.0 |
| 182 | 3 | 9.0 | 31.3875 |
| 876 | 3 | 20.0 | 9.8458 |


In [10]:
X_train.shape, X_test.shape

((712, 3), (179, 3))

In [11]:
df.shape

(891, 4)

`KNNImputer` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

In [12]:
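knn = KNNImputer(n_neighbors=4, weights="distance")

# Fit the imputer on the training split only, then reuse it on the test split.
# Calling fit_transform on X_test would draw neighbors from the test data itself.
# Outputs shown below were recorded with fit_transform(X_test); rerunning may differ slightly.
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)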

In [13]:
X_train_trf

array([[  1.        ,  40.        ,  27.7208    ],
       [  3.        ,   4.        ,  16.7       ],
       [  3.        ,  47.        ,   9.        ],
       ...,
       [  1.        ,  71.        ,  49.5042    ],
       [  1.        ,  31.77665827, 221.7792    ],
       [  1.        ,  49.4290547 ,  25.925     ]])

In [14]:
pd.DataFrame(X_train_trf, columns=X_train.columns)

| | Pclass | Age | Fare |
|---|---|---|---|
| 0 | 1.0 | 40.000000 | 27.7208 |
| 1 | 3.0 | 4.000000 | 16.7000 |
| 2 | 3.0 | 47.000000 | 9.0000 |
| 3 | 3.0 | 9.000000 | 31.3875 |
| 4 | 3.0 | 20.000000 | 9.8458 |
| ... | ... | ... | ... |
| 707 | 3.0 | 30.000000 | 8.6625 |
| 708 | 3.0 | 26.968023 | 8.7125 |
| 709 | 1.0 | 71.000000 | 49.5042 |
| 710 | 1.0 | 31.776658 | 221.7792 |


In [15]:
np.isnan(X_test_trf).sum()

0

In [16]:
np.isnan(X_test).sum()

Pclass     0
Age       29
Fare       0
dtype: int64

In [17]:
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)

0.7039106145251397

In [18]:
# Comparison with SimpleImputer (default strategy: mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [19]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978