# KNN Imputer


In [1]:
import pandas as pd
from sklearn.impute import KNNImputer

In [2]:
df = pd.read_csv("../../Datasets/Data.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        8 non-null      float64
 2   Salary     8 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


In [4]:
df.describe()

|       | Age      | Salary       |
|-------|----------|--------------|
| count | 8.0      | 8.0          |
| mean  | 40.25    | 62750.0      |
| std   | 6.734771 | 12691.391908 |
| min   | 30.0     | 48000.0      |
| 25%   | 36.5     | 53500.0      |
| 50%   | 39.0     | 59500.0      |
| 75%   | 45.0     | 70000.0      |
| max   | 50.0     | 83000.0      |


In [5]:
df.isna().sum()

Country      0
Age          2
Salary       2
Purchased    0
dtype: int64

In [6]:
df.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')

`KNNImputer` is a scikit-learn class that fills in missing values using k-nearest neighbors. Each missing value is replaced with the mean of that feature over the k nearest samples that have a value for it. The mean is either uniform or distance-weighted, depending on the `weights` parameter.
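As a minimal sketch of that idea, the snippet below imputes a tiny array with the default uniform weighting. The array `X` is illustrative data, not part of the dataset used in this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative 4x3 array with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Uniform weights: each gap becomes the plain mean of its 2 nearest donors
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Row 0, column 2 is filled with mean(3, 5) = 4.0
# Row 2, column 0 is filled with mean(3, 8) = 5.5
print(X_filled)
```

Only rows that have a value in the column being imputed can act as donors, so the neighbors used can differ from one column to the next.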

The `KNNImputer` class takes several parameters that control the imputation process. The main ones are:

- `n_neighbors`: The number of neighbors to use for imputation. The default value is `5`.
- `weights`: The weight function used in prediction. The default value is `'uniform'`. Other options include `'distance'`, which weights the neighbors by the inverse of their distance, and a callable function that takes an array of distances and returns an array of weights.
- `metric`: The distance metric used to find the nearest neighbors. The default value is `'nan_euclidean'`, which computes the Euclidean distance over only the coordinates present in both samples and scales the result up to compensate for the coordinates it skipped. The only other option is a callable that takes two arrays plus a `missing_values` keyword argument and returns a scalar distance.
- `copy`: Whether to create a copy of the input data before imputing missing values. The default value is `True`.
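To make the default metric concrete, here is a small sketch that compares `nan_euclidean_distances` from `sklearn.metrics.pairwise` against the same distance computed by hand. The two vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two illustrative samples; the second coordinate of `a` is missing
a = np.array([[3.0, np.nan, 6.0]])
b = np.array([[1.0, 4.0, 5.0]])

# Distance as computed by scikit-learn
dist = nan_euclidean_distances(a, b)[0, 0]

# By hand: sum squared differences over the 2 shared coordinates,
# then scale by (total coordinates / shared coordinates) = 3 / 2
shared_sq = (3.0 - 1.0) ** 2 + (6.0 - 5.0) ** 2
manual = np.sqrt(3 / 2 * shared_sq)

print(dist, manual)
```

Because of the scaling, samples that share few coordinates are not treated as artificially close just because fewer differences were summed.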

Here's an example of how to use `KNNImputer` to impute the missing `Age` and `Salary` values in the DataFrame:


In [7]:
imputer = KNNImputer(n_neighbors=2, weights="distance")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])
df

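|   | Country | Age    | Salary  | Purchased |
|---|---------|--------|---------|-----------|
| 0 | France  | 44.0   | 71800.0 | No        |
| 1 | Spain   | 31.875 | 48000.0 | Yes       |
| 2 | Germany | 30.0   | 54000.0 | No        |
| 3 | Spain   | 38.0   | 61000.0 | No        |
| 4 | Germany | 40.0   | 63400.0 | Yes       |
| 5 | France  | 35.0   | 58000.0 | Yes       |
| 6 | Spain   | 31.25  | 52000.0 | No        |
| 7 | France  | 48.0   | 79000.0 | Yes       |
| 8 | Germany | 50.0   | 83000.0 | No        |
| 9 | France  | 37.0   | 67000.0 | Yes       |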


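In this example, we create a `KNNImputer` object called `imputer` with `n_neighbors=2` and `weights='distance'`, then call its `fit_transform` method to impute the missing values. Only the numeric columns `Age` and `Salary` are passed in, because `KNNImputer` cannot compute distances on string columns such as `Country` and `Purchased`. The result is assigned back to those two columns, so the four previously missing cells (`Age` in rows 1 and 6, `Salary` in rows 0 and 4) now hold distance-weighted averages of their two nearest neighbors.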
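The imputed values can be checked by hand. The sketch below is an illustration rather than part of the original notebook: it rebuilds the two numeric columns from the output table, resets the four imputed cells to `NaN` (an assumption about what `Data.csv` contains), and recomputes the `Age` for row 1 as an inverse-distance weighted mean. The names `data`, `donors`, and `age_by_hand` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns rebuilt from the output above; the four imputed cells
# are set back to NaN (assumed to match the original file)
data = pd.DataFrame({
    "Age": [44, np.nan, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [np.nan, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

imputed = KNNImputer(n_neighbors=2, weights="distance").fit_transform(data)

# Row 1 has Salary 48000 and a missing Age. Donors must have an Age,
# and need a Salary so that a distance to row 1 can be computed.
donors = data.dropna()
gap = (donors["Salary"] - 48000).abs()

# The two closest donors and their inverse-distance weights.
# The nan_euclidean scaling factor is the same for every donor here,
# so it cancels out of the weighted mean.
nearest = gap.nsmallest(2)
weights = 1 / nearest
age_by_hand = (donors.loc[nearest.index, "Age"] * weights).sum() / weights.sum()

print(age_by_hand, imputed[1, 0])  # both 31.875
```

The two nearest donors are rows 2 and 5 (salary gaps of 6000 and 10000), so the closer row 2 pulls the estimate toward its age of 30.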
