## KNN imputation

Each missing value is estimated as the average of that variable across the K nearest neighbours of the observation.

[KNNImputer from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)

- The same K is used to impute all variables
- K cannot easily be optimised to predict the missing values themselves more accurately
- K can be optimised to better predict the target
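
As a minimal sketch of the averaging described above, the toy array below (illustrative values, not the house price data) is imputed with `n_neighbors=2` and uniform weights. Neighbours are found with the imputer's default `nan_euclidean` metric, which ignores coordinates that are missing in either row.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy data: two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)

# row 0, column 2 -> mean of its 2 nearest donors (3.0 and 5.0) = 4.0
# row 2, column 0 -> mean of its 2 nearest donors (3.0 and 8.0) = 5.5
print(X_imputed)
```

Note that the same `n_neighbors` is applied to both columns, which is the first limitation listed above.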

**Note**

If the goal is to predict the missing values as accurately as possible, we would not use the KNN imputer. Instead, we would build an individual KNN model for each variable with missing data, predicting it from the remaining variables. This is a standard regression problem.
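
The idea in the note can be sketched as follows, on synthetic data rather than the house price dataset. The column names `a`, `b` and `c` are hypothetical, and the sketch assumes the predictor columns are complete. A `KNeighborsRegressor` is tuned with cross-validation on the rows where `c` is observed, then used to fill in the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: c depends on a and b, and about 20% of c is missing
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["c"] = 2 * df["a"] - df["b"] + rng.normal(scale=0.1, size=200)
df.loc[rng.random(200) < 0.2, "c"] = np.nan

predictors = ["a", "b"]
observed = df["c"].notna()

# here K is optimised to predict the variable itself, not the final target
model = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())]),
    param_grid={"knn__n_neighbors": [3, 5, 10]},
    cv=3,
    scoring="neg_mean_squared_error",
)
model.fit(df.loc[observed, predictors], df.loc[observed, "c"])

# fill only the rows where c was missing
df_imputed = df.copy()
df_imputed.loc[~observed, "c"] = model.predict(df.loc[~observed, predictors])
```

One such model would be needed per variable with missing data, each with its own K.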

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

# multivariate imputation
from sklearn.impute import KNNImputer

In [2]:
# load data

# list with numerical variables

cols_to_use = [
    'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
    'WoodDeckSF',  'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold',
    'SalePrice'
]

In [5]:
# load dataset

data = pd.read_csv('../house_price.csv', usecols=cols_to_use)
data.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,0,61,0,0,0,0,0,2,2008,208500
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,298,0,0,0,0,0,0,5,2007,181500
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,...,0,42,0,0,0,0,0,9,2008,223500
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,...,0,35,272,0,0,0,0,2,2006,140000
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,...,192,84,0,0,0,0,0,12,2008,250000


In [8]:
# print out all columns with null values

for col in data.columns:
    if data[col].isnull().sum()>0:
        print(col, data[col].isnull().sum())

LotFrontage 259
MasVnrArea 8
GarageYrBlt 81


In [9]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [12]:
# reset index, so we can compare values later on
# in the demo

X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

## KNN imputation

In [13]:
imputer = KNNImputer(
    n_neighbors=5, # the number of neighbours K
    weights='distance', # the weighting factor
    metric='nan_euclidean', # the metric to find the neighbours
    add_indicator=False, # whether to add a missing indicator
)
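
To make the `weights='distance'` option more concrete, here is a small sketch on a toy array (illustrative values only). It recomputes one imputed value by hand as an average weighted by the inverse of the `nan_euclidean` distance, and compares it with what `KNNImputer` returns. The manual steps reflect my reading of the scikit-learn documentation, so treat them as an approximation of the internals.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputed = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# manual calculation for row 0, column 2
donors = np.where(~np.isnan(X[:, 2]))[0]              # rows with a value in column 2
dist = nan_euclidean_distances(X[[0]], X[donors])[0]  # distances from row 0 to each donor
nearest = np.argsort(dist)[:2]                        # the 2 closest donors
w = 1.0 / dist[nearest]                               # inverse-distance weights
manual = np.sum(w * X[donors[nearest], 2]) / np.sum(w)

# the two values should agree
print(manual, imputed[0, 2])
```

Closer neighbours therefore pull the imputed value towards their own value more strongly than with `weights='uniform'`.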

In [14]:
imputer.fit(X_train)

KNNImputer(weights='distance')

In [15]:
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# sklearn returns a NumPy array,
# so let's convert it back to a dataframe
train_t = pd.DataFrame(train_t, columns=X_train.columns)
test_t = pd.DataFrame(test_t, columns=X_test.columns)

train_t.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,60.0,70.115142,9375.0,7.0,5.0,1997.0,1998.0,573.0,739.0,0.0,...,645.0,576.0,36.0,0.0,0.0,0.0,0.0,0.0,2.0,2009.0
1,120.0,42.533053,2887.0,6.0,5.0,1996.0,1997.0,0.0,1003.0,0.0,...,431.0,307.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,2008.0
2,20.0,50.0,7207.0,5.0,7.0,1958.0,2008.0,0.0,696.0,0.0,...,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2010.0
3,50.0,60.0,9060.0,6.0,5.0,1939.0,1950.0,0.0,204.0,0.0,...,280.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,2009.0
4,30.0,60.0,8400.0,2.0,5.0,1920.0,1950.0,0.0,290.0,0.0,...,246.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2009.0


In [17]:
# print out all columns with null values

for col in train_t.columns:
    if train_t[col].isnull().sum()>0:
        print(col, train_t[col].isnull().sum())

In [18]:
# variables without NA after the imputation

train_t[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [20]:
# observations with NA in MasVnrArea in the original train set
X_train[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420   NaN
490   NaN
642   NaN
824   NaN
921   NaN
Name: MasVnrArea, dtype: float64

In [21]:
# the same observations in the imputed dataset, now filled in
train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420     99.765717
490     34.106592
642      0.000000
824    375.749332
921     85.817715
Name: MasVnrArea, dtype: float64

In [22]:
# the mean value of the variable
train_t['MasVnrArea'].mean()

103.62958841087315

In some cases, the imputed values are very different from the mean value we would have used with mean imputation.

## Imputing a slice of the dataframe

We can use Feature-engine to apply the KNNImputer to a slice of the dataframe.

In [23]:
from feature_engine.wrappers import SklearnTransformerWrapper

In [30]:
data = pd.read_csv('../house_price.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 80), (438, 80))

In [31]:
# set up the KNNImputer inside the SklearnTransformerWrapper

imputer = SklearnTransformerWrapper(
    transformer = KNNImputer(weights='distance'),
    variables = cols_to_use,
)

In [32]:
# fit the wrapper + KNNImputer
imputer.fit(X_train)

# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# feature-engine returns a dataframe
train_t.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
64,65,60.0,RL,70.115142,9375.0,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,GdPrv,,0.0,2.0,2009.0,WD,Normal
682,683,120.0,RL,42.533053,2887.0,Pave,,Reg,HLS,AllPub,...,0.0,0.0,,,,0.0,11.0,2008.0,WD,Normal
960,961,20.0,RL,50.0,7207.0,Pave,,IR1,Lvl,AllPub,...,0.0,0.0,,,,0.0,2.0,2010.0,WD,Normal
1384,1385,50.0,RL,60.0,9060.0,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,MnPrv,,0.0,10.0,2009.0,WD,Normal
1100,1101,30.0,RL,60.0,8400.0,Pave,,Reg,Bnk,AllPub,...,0.0,0.0,,,,0.0,1.0,2009.0,WD,Normal


In [45]:
# print out all columns with null values for numerical variables

for col in train_t.columns:
    if train_t[col].isnull().sum()>0 and train_t[col].dtype != 'O':
        print(col, train_t[col].isnull().sum())

In [34]:
# no NA after the imputation

train_t['MasVnrArea'].isnull().sum()

0

In [46]:
# same imputation values as previously

train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

1278     99.765717
936      34.106592
650       0.000000
234     375.749332
973      85.817715
Name: MasVnrArea, dtype: float64
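
For comparison, the same slice-wise imputation can be sketched with plain pandas and scikit-learn, without the Feature-engine wrapper. This is an alternative technique, not what the wrapper does internally. The toy dataframes and the column names `area`, `rooms` and `zone` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.DataFrame({
    "area": [100.0, np.nan, 120.0, 90.0, 110.0],
    "rooms": [3.0, 4.0, np.nan, 2.0, 3.0],
    "zone": ["RL", "RM", None, "RL", "RM"],
})
test = pd.DataFrame({
    "area": [np.nan, 95.0],
    "rooms": [3.0, np.nan],
    "zone": ["RL", None],
})
numeric_cols = ["area", "rooms"]

imputer = KNNImputer(n_neighbors=2)

# fit on the numerical slice of the train set only,
# then write the imputed values back into copies of the dataframes
train_t = train.copy()
test_t = test.copy()
train_t[numeric_cols] = imputer.fit_transform(train[numeric_cols])
test_t[numeric_cols] = imputer.transform(test[numeric_cols])

# the categorical column is left untouched, including its NA
print(train_t)
```

The wrapper is more convenient inside a pipeline, because it keeps the dataframe structure and the untouched columns in a single `transform` call.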

## Automatically find best imputation parameters

We can optimise the parameters of the KNN imputation to better predict our outcome.

In [47]:
# import extra classes for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [64]:
data = pd.read_csv('../house_price.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [65]:
# setting up the pipeline

pipe = Pipeline(steps=[
    ('imputer', KNNImputer(
        n_neighbors=5,
        weights='distance',
        add_indicator=False)),
    
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])
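
One design choice worth checking in a pipeline like this is the order of the steps. KNN distances depend on the scale of the variables, and `StandardScaler` ignores NaN when fitting and keeps them when transforming, so scaling can also be placed before the imputer. The sketch below compares both orders on synthetic data. It is an exploration under these assumptions, not a claim that one order is better for the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data with one variable on a much larger scale
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
X[:, 0] *= 1000.0
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # about 10% missing values

impute_then_scale = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=2000)),
])
scale_then_impute = Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", KNNImputer()),
    ("regressor", Lasso(max_iter=2000)),
])

scores_a = cross_val_score(impute_then_scale, X, y, cv=3, scoring="r2")
scores_b = cross_val_score(scale_then_impute, X, y, cv=3, scoring="r2")

print("impute then scale: %.3f" % scores_a.mean())
print("scale then impute: %.3f" % scores_b.mean())
```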

In [66]:
# set up the gridSearch param

param_grid = {
    'imputer__n_neighbors': [5,10],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [100, 200],
}

In [67]:
# now create the GridSearchCV object
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, scoring='r2')

In [68]:
# and now we train over all the possible combinations 
# of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.845


In [69]:
# let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.730


In [70]:
# best params

grid_search.best_params_

{'imputer__add_indicator': True,
 'imputer__n_neighbors': 5,
 'imputer__weights': 'distance',
 'regressor__alpha': 200}
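
Beyond the single best combination, the `cv_results_` attribute of the fitted search shows how much the score changes across imputation parameters. The sketch below uses synthetic data and a smaller grid than the one above, so the numbers it prints are not comparable with the house price results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # about 10% missing values

pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=2000)),
])
param_grid = {
    "imputer__n_neighbors": [5, 10],
    "imputer__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="r2")
search.fit(X, y)

# one row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)
summary = results[[
    "param_imputer__n_neighbors",
    "param_imputer__weights",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]].sort_values("rank_test_score")
print(summary)
```

If the mean scores differ by less than their standard deviation across folds, the choice of imputation parameters probably matters little for the target.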

## Compare with univariate imputation

In [71]:
from sklearn.impute import SimpleImputer

In [72]:
data = pd.read_csv('../house_price.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [73]:
pipe = Pipeline(steps = [
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000))
])

In [74]:
param_grid = {
    'imputer__strategy' : ['mean', 'median'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv = 3, scoring='r2')

In [75]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('imputer', SimpleImputer()),
                                       ('scaler', StandardScaler()),
                                       ('regressor', Lasso(max_iter=2000))]),
             param_grid={'imputer__add_indicator': [True, False],
                         'imputer__strategy': ['mean', 'median'],
                         'regressor__alpha': [10, 100, 200]},
             scoring='r2')

In [77]:
print('Score train data grid search : %.3f'
     %grid_search.score(X_train,y_train))

Score train data grid search : 0.845


In [78]:
print('Score test data grid search : %.3f'
     %grid_search.score(X_test,y_test))

Score test data grid search : 0.726


In [79]:
grid_search.best_params_

{'imputer__add_indicator': True,
 'imputer__strategy': 'mean',
 'regressor__alpha': 200}

Mean imputation returns approximately the same performance as KNN imputation (test R² of 0.726 versus 0.730). So the additional complexity of training models to impute NA, before predicting the target we are actually interested in, may not be worthwhile.
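
The two searches above can also be combined into one, so that both imputers are evaluated on the same cross-validation folds. The sketch below treats the `imputer` step itself as a parameter of the grid. It runs on synthetic data, so it illustrates the mechanics only and says nothing about the house price results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # about 10% missing values

pipe = Pipeline([
    ("imputer", SimpleImputer()),  # placeholder, replaced by the grid
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=2000)),
])

# one sub-grid per imputer, each with its own parameters
param_grid = [
    {"imputer": [SimpleImputer()],
     "imputer__strategy": ["mean", "median"]},
    {"imputer": [KNNImputer()],
     "imputer__n_neighbors": [5, 10],
     "imputer__weights": ["uniform", "distance"]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="r2")
search.fit(X, y)

# best cross-validated score obtained with each imputer
results = pd.DataFrame(search.cv_results_)
results["imputer_name"] = results["param_imputer"].map(lambda est: type(est).__name__)
best_per_imputer = results.groupby("imputer_name")["mean_test_score"].max()
print(best_per_imputer)
```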