What is Missing Value Imputation?

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a complete data set that can be used to train machine learning models.

In [5]:
import pandas as pd
import numpy as np
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

In [6]:
!kaggle datasets download -d abhisheksingh000261/houseprice

Downloading houseprice.zip to /content
  0% 0.00/94.0k [00:00<?, ?B/s]
100% 94.0k/94.0k [00:00<00:00, 35.7MB/s]


In [7]:
!unzip \*.zip && rm *.zip

Archive:  houseprice.zip
  inflating: houseprice.csv          


In [25]:
df = pd.read_csv("/content/houseprice.csv")

In [26]:
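def missing_value_columns(dataframe):
  """Print the missing-value count of each affected column and return those column names."""
  columns = []
  for column in dataframe.columns:
    # Count the missing entries once and reuse the result
    n_missing = dataframe[column].isnull().sum()
    if n_missing > 0:
      print(f"{column} : {n_missing}")
      columns.append(column)
  print(f"Total columns with missing values: {len(columns)}")
  return columns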

### Check missing values

In [27]:
columns = missing_value_columns(df)

LotFrontage : 259
Alley : 1369
MasVnrType : 8
MasVnrArea : 8
BsmtQual : 37
BsmtCond : 37
BsmtExposure : 38
BsmtFinType1 : 37
BsmtFinType2 : 38
Electrical : 1
FireplaceQu : 690
GarageType : 81
GarageYrBlt : 81
GarageFinish : 81
GarageQual : 81
GarageCond : 81
PoolQC : 1453
Fence : 1179
MiscFeature : 1406
Total columns with missing values: 19
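
The same summary can also be computed without an explicit loop. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from houseprice.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from houseprice.csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Fraction of rows that are missing in each of those columns
missing_share = toy[missing_counts.index].isnull().mean()

print(missing_counts)
print(missing_share)
```

Looking at the fraction alongside the raw count helps decide whether a column is worth imputing at all.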


In [28]:
df[columns].head(5)

Unnamed: 0,LotFrontage,Alley,MasVnrType,MasVnrArea,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Electrical,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageQual,GarageCond,PoolQC,Fence,MiscFeature
0,65.0,,BrkFace,196.0,Gd,TA,No,GLQ,Unf,SBrkr,,Attchd,2003.0,RFn,TA,TA,,,
1,80.0,,,0.0,Gd,TA,Gd,ALQ,Unf,SBrkr,TA,Attchd,1976.0,RFn,TA,TA,,,
2,68.0,,BrkFace,162.0,Gd,TA,Mn,GLQ,Unf,SBrkr,TA,Attchd,2001.0,RFn,TA,TA,,,
3,60.0,,,0.0,TA,Gd,No,ALQ,Unf,SBrkr,Gd,Detchd,1998.0,Unf,TA,TA,,,
4,84.0,,BrkFace,350.0,Gd,TA,Av,GLQ,Unf,SBrkr,TA,Attchd,2000.0,RFn,TA,TA,,,


### Dropping rows with missing values

In [30]:
# Drop every row in which LotFrontage is missing
x1 = df.copy()
x1.dropna(axis=0, subset=['LotFrontage'], inplace=True)
x1['LotFrontage'].isnull().sum()

0
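
Dropping rows is simple but costs data: every row with a gap is discarded, including its valid values in other columns. A minimal sketch of that trade-off, using a made-up frame (the `toy` values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from houseprice.csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, np.nan],
})

# Drop rows only where LotFrontage is missing
subset_dropped = toy.dropna(axis=0, subset=["LotFrontage"])

# Drop rows that have a missing value in any column
all_dropped = toy.dropna(axis=0)

print(len(toy), len(subset_dropped), len(all_dropped))
```

Restricting `subset` to the columns that matter keeps more rows than dropping on every column at once.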

### Univariate Imputation

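Univariate imputation fills each column using only that column's own observed values. sklearn's SimpleImputer provides the basic strategies (mean, median, most frequent and constant), and feature_engine's MeanMedianImputer covers mean and median.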

sklearn.impute.SimpleImputer

feature_engine.imputation.MeanMedianImputer

In [31]:
!pip install feature-engine
from sklearn.impute import SimpleImputer
from feature_engine.imputation import MeanMedianImputer



In [32]:
# Replace the missing values in the MasVnrArea column with the column mean
mean_imputer = SimpleImputer(missing_values= np.nan, strategy= 'mean')
x2 = df.copy()
x2[['MasVnrArea']] = mean_imputer.fit_transform(x2[['MasVnrArea']])
x2['MasVnrArea'].isnull().sum()
# SimpleImputer also supports strategy='median', 'most_frequent' and 'constant' (with fill_value)

0
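
The other SimpleImputer strategies work the same way. A minimal sketch on a made-up single column; the `values` array and the `fill_value` of -1.0 are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing entry in the last row
values = np.array([[0.0], [0.0], [100.0], [300.0], [np.nan]])

median_imputer = SimpleImputer(strategy="median")
frequent_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)

median_filled = median_imputer.fit_transform(values)
frequent_filled = frequent_imputer.fit_transform(values)
constant_filled = constant_imputer.fit_transform(values)

# Compare what each strategy wrote into the missing slot
print(median_filled[-1, 0], frequent_filled[-1, 0], constant_filled[-1, 0])
```

On skewed columns the median and the most frequent value can differ a lot from the mean, so the choice of strategy matters.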

In [35]:
median_imputer = MeanMedianImputer(imputation_method='median', variables= ['GarageYrBlt'])
x3 = df.copy()
x3 = median_imputer.fit_transform(x3)
x3['GarageYrBlt'].isnull().sum()

0

### Multivariate Imputation

Multivariate imputation models each feature that has missing values as a function of the other features, cycling through the features in a round-robin fashion.
Here, we can pass an ML model as the estimator that predicts the missing values.

sklearn.impute.IterativeImputer

In [36]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
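from sklearn.tree import DecisionTreeRegressor

In [37]:
# GarageYrBlt is numeric, so the estimator must be a regressor, not a classifier.
# Pass several columns: with a single column IterativeImputer only applies its initial (mean) fill.
numeric_columns = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
decisiontree_imputer = IterativeImputer(estimator=DecisionTreeRegressor())
x4 = df.copy()
x4[numeric_columns] = decisiontree_imputer.fit_transform(x4[numeric_columns])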
x4['GarageYrBlt'].isnull().sum()

0
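
To see how the other features inform the estimate, here is a minimal sketch on made-up data in which the second column is exactly twice the first. The array `X` is an assumption for illustration, and the imputer uses its default estimator rather than a decision tree.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is exactly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
    [6.0, 12.0],
])

# The default estimator is BayesianRidge
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)

imputed = X_filled[3, 1]
column_mean = np.nanmean(X[:, 1])  # what a univariate mean imputer would use

print(imputed, column_mean)
```

The imputed value should land close to 8, following the relationship between the two columns, whereas a mean imputer would fill in the column average.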

### Categorical Imputer

The CategoricalImputer() replaces missing data in categorical variables with a string such as ‘Missing’ or any other value supplied by the user. Alternatively, it replaces missing data with the most frequent category.

In [39]:
from feature_engine.imputation import CategoricalImputer

In [44]:
categorical_imputer = CategoricalImputer(fill_value='Gd', variables= ['BsmtQual'])
x5 = df.copy()
x5 = categorical_imputer.fit_transform(x5)
x5['BsmtQual'].isnull().sum()

0
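
The two strategies described above can be compared side by side. This is a minimal sketch in plain pandas, not feature_engine's implementation; the `quality` Series is made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
quality = pd.Series(["Gd", "TA", np.nan, "Gd", np.nan], name="BsmtQual")

# Option 1: mark missing entries with an explicit label
labelled = quality.fillna("Missing")

# Option 2: fill with the most frequent observed category
most_frequent = quality.mode()[0]
mode_filled = quality.fillna(most_frequent)

print(labelled.tolist())
print(mode_filled.tolist())
```

An explicit label keeps the fact that a value was missing visible to the model, while the most frequent category hides it.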