# Missing value imputation: AddMissingIndicator


AddMissingIndicator adds one binary column per selected variable, flagging whether each observation is missing: the indicator is 1 if the value is NaN and 0 otherwise.

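The transformer works with both numerical and categorical variables.
When no variable list is passed, it considers all variables in the dataset; with the default missing_only=True, indicators are added only for the variables that show missing values during fit.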
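
As a rough illustration of the idea, and not the library's actual implementation, the same kind of indicator can be built by hand with pandas. The toy DataFrame and the helper name `add_na_indicators` are made up for this sketch; the `_na` suffix matches the column names produced later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical variable with NaN
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, "Grvl"],
})


def add_na_indicators(df, variables):
    """Return a copy of df with one binary <variable>_na column per variable."""
    out = df.copy()
    for var in variables:
        # 1 where the observation is missing, 0 otherwise
        out[var + "_na"] = out[var].isna().astype(int)
    return out


toy_t = add_na_indicators(toy, ["LotFrontage", "Alley"])
print(toy_t)
```

The original columns are left untouched; the indicators are appended so that a later imputation step can still fill in the missing values.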

**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing
Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

http://jse.amstat.org/v19n3/decock.pdf

https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)




In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.imputation import missing_indicator

In [2]:
data = pd.read_csv('houseprice.csv')
data.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# separate the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((1022, 79), (438, 79))

In [4]:
addBinary_imputer = missing_indicator.AddMissingIndicator(
    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

addBinary_imputer.fit(X_train)

AddMissingIndicator(variables=['Alley', 'MasVnrType', 'LotFrontage',
                               'MasVnrArea'])

In [5]:
# After transformation we see the binary _na variable for each of the indicated variables

train_t = addBinary_imputer.transform(X_train)
test_t = addBinary_imputer.transform(X_test)

train_t[['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']].head()

,Alley_na,MasVnrType_na,LotFrontage_na,MasVnrArea_na
64,1,0,1,0
682,1,0,1,0
960,1,0,0,0
1384,1,0,0,0
1100,1,0,0,0


In [6]:
train_t[['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']].mean()

Alley_na          0.939335
MasVnrType_na     0.004892
LotFrontage_na    0.184932
MasVnrArea_na     0.004892
dtype: float64
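
Because each indicator is 0 or 1, its mean is the fraction of observations that are missing in that variable; in the output above, about 94% of the Alley values in the train set are missing. A minimal sketch of this equivalence on a hypothetical toy column (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with 3 missing values out of 4
alley = pd.Series(["Pave", np.nan, np.nan, np.nan], name="Alley")

indicator = alley.isna().astype(int)     # binary missing indicator
fraction_missing = alley.isna().mean()   # share of NaN in the column

print(indicator.mean(), fraction_missing)
```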

### Automatically select the variables

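When no variable list is passed, the imputer considers all variables and, with the default missing_only=True, adds indicators only for those that show missing values in the train set.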

In [7]:
addBinary_imputer = missing_indicator.AddMissingIndicator()
addBinary_imputer.fit(X_train)

AddMissingIndicator()

In [8]:
data.shape

(1460, 81)

In [9]:
# with the default missing_only=True, indicators are added only for the variables
# with missing values in the train set: 79 original columns + 19 indicators = 98
train_t = addBinary_imputer.transform(X_train)
test_t = addBinary_imputer.transform(X_test)

train_t.shape

(1022, 98)
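
One way to picture how the number of output columns comes about is sketched below on a hypothetical toy DataFrame. It mimics the selection described in this notebook with plain pandas and is not feature-engine's own code; the column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical train set: only columns "a" and "c" contain NaN
train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10, 20, 30],
    "c": ["x", "y", np.nan],
})

# Select the variables that show missing values in the train set
vars_with_na = [col for col in train.columns if train[col].isna().any()]

# One extra column is expected per selected variable
expected_n_columns = train.shape[1] + len(vars_with_na)
print(vars_with_na, expected_n_columns)
```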

### The missing_only flag

True (default): an indicator is created only for the variables that show missing values during fit.

False: an indicator is created for every variable.
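
The difference between the two settings shows up when the test set has missing values in a variable that was complete in the train set. The sketch below emulates both behaviours with plain pandas under that assumption; `select_variables` and `add_indicators` are hypothetical helpers written for this illustration, not part of feature-engine.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has NaN in train only, "b" has NaN in test only
train = pd.DataFrame({"a": [1.0, np.nan], "b": [1.0, 2.0]})
test = pd.DataFrame({"a": [3.0, 4.0], "b": [np.nan, 5.0]})


def select_variables(df, missing_only=True):
    """Choose the variables that will receive an indicator (the 'fit' step)."""
    if missing_only:
        return [col for col in df.columns if df[col].isna().any()]
    return list(df.columns)


def add_indicators(df, variables):
    """Append a binary <variable>_na column per variable (the 'transform' step)."""
    out = df.copy()
    for var in variables:
        out[var + "_na"] = out[var].isna().astype(int)
    return out


# missing_only=True: only "a" showed NaN in train, so only a_na is created
test_true = add_indicators(test, select_variables(train, missing_only=True))

# missing_only=False: every variable gets an indicator, including "b"
test_false = add_indicators(test, select_variables(train, missing_only=False))

print(list(test_true.columns))
print(list(test_false.columns))
```

In this sketch, the NaN in `b` that appears only in the test set gets no indicator when the selection is restricted to variables with missing values in the train set, which is one reason to consider `missing_only=False` when new data may be incomplete in other variables.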

In [11]:
addBinary_imputer = missing_indicator.AddMissingIndicator(missing_only=False)
addBinary_imputer.fit(X_train)

AddMissingIndicator(missing_only=False)

In [12]:
# with missing_only=False every variable gets an indicator,
# so the number of columns doubles: 79 * 2 = 158
train_t = addBinary_imputer.transform(X_train)
test_t = addBinary_imputer.transform(X_test)

train_t.shape

(1022, 158)