# Missing value imputation: DropMissingData


DropMissingData deletes rows that contain NA values.

DropMissingData works with both numerical and categorical variables. When no variable list is passed, it considers all variables in the dataset. In addition, the parameter missing_only indicates whether rows are dropped based on NA in any variable, or only on NA in those variables that showed missing data during fit.
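As a rough illustration of the row deletion that DropMissingData performs, the pandas-only sketch below drops rows either for a chosen variable or for every column. The toy column names and values are made up, and `dropna` is a stand-in here, not feature_engine's implementation.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'a' and 'b' contain NA, 'c' is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
    "c": [10, 20, 30, 40],
})

# Comparable to restricting the check to one variable:
# only NA in 'a' removes a row.
subset_drop = df.dropna(subset=["a"])

# Comparable to checking every variable:
# NA in any column removes a row.
full_drop = df.dropna()

print(subset_drop.shape)  # (3, 3)
print(full_drop.shape)    # (2, 3)
```

Restricting the check to a few variables keeps more rows, which matters when some columns are mostly empty.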

**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing
Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

http://jse.amstat.org/v19n3/decock.pdf

https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.imputation import DropMissingData

In [3]:
data = pd.read_csv('houseprice.csv')
data.head()

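| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |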


In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((1022, 79), (438, 79))

In [5]:
# let's create an instance of the imputer

na_imputer = DropMissingData(
    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

na_imputer.fit(X_train)

DropMissingData(variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

In [6]:

na_imputer.variables_

['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']

In [7]:
# Number of NA values per variable before the transformation
X_train[na_imputer.variables_].isna().sum()

Alley          960
MasVnrType       5
LotFrontage    189
MasVnrArea       5
dtype: int64

In [8]:
# After the transformation, the rows with NA values are deleted from the dataframe
train_t = na_imputer.transform(X_train)
test_t = na_imputer.transform(X_test)


In [9]:
# Number of NA values per variable after the transformation
train_t[na_imputer.variables_].isna().sum()

Alley          0
MasVnrType     0
LotFrontage    0
MasVnrArea     0
dtype: int64

In [10]:
train_t.shape

(59, 79)
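Only 59 of the 1022 training rows survive because a row is kept only when all four variables are present, and Alley alone is missing in 960 rows. The sketch below, on made-up toy data with pandas only, shows one way to count the complete rows beforehand and to realign the target with the rows that remain. The names `X`, `y`, `X_t` and `y_t` are illustrative stand-ins, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical values).
X = pd.DataFrame(
    {
        "Alley": ["Grvl", None, None, "Pave"],
        "LotFrontage": [65.0, 80.0, np.nan, np.nan],
    },
    index=[10, 11, 12, 13],
)
y = pd.Series([200, 180, 220, 140], index=X.index)

variables = ["Alley", "LotFrontage"]

# A row is kept only if every listed variable is present.
keep = X[variables].notna().all(axis=1)
print(keep.sum())  # 1

# Drop the incomplete rows, then realign the target
# using the index of the rows that survived.
X_t = X.dropna(subset=variables)
y_t = y.loc[X_t.index]
```

Because the transformed dataframe keeps its original index labels, selecting the target by that index keeps features and target in step.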

In [11]:
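# Example of the "return_dropped_data" method, which returns the rows that contain NA values
# in the variables learned during fit or passed by the user, that is, the rows that transform() removes.
# (Recent feature_engine releases expose this method as return_na_data.)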
na_imputer.return_dropped_data(X_train)

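| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 60 | RL | NaN | 9375 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | GdPrv | NaN | 0 | 2 | 2009 | WD | Normal |
| 682 | 120 | RL | NaN | 2887 | Pave | NaN | Reg | HLS | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2008 | WD | Normal |
| 960 | 20 | RL | 50.0 | 7207 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
| 1384 | 50 | RL | 60.0 | 9060 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 10 | 2009 | WD | Normal |
| 1100 | 30 | RL | 60.0 | 8400 | Pave | NaN | Reg | Bnk | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2009 | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 60 | RL | 82.0 | 9430 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 180 | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal |
| 835 | 20 | RL | 60.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
| 1216 | 90 | RM | 68.0 | 8930 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal |
| 559 | 120 | RL | NaN | 3196 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 10 | 2006 | WD | Normal |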
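The output above contains the rows with NA in at least one of the selected variables, that is, the complement of what `transform` returns. A pandas-only sketch of that split on toy data follows; the values and the names `dropped` and `kept` are hypothetical, and this is not the library's code.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 'LotArea' is complete.
X = pd.DataFrame({
    "LotFrontage": [np.nan, 50.0, 60.0],
    "Alley": [None, None, "Grvl"],
    "LotArea": [9375, 7207, 9060],
})
variables = ["LotFrontage", "Alley"]

# True for rows with NA in at least one of the listed variables.
has_na = X[variables].isna().any(axis=1)

dropped = X[has_na]   # rows that would be removed
kept = X[~has_na]     # rows that would remain

# The two parts partition the original dataframe.
assert len(dropped) + len(kept) == len(X)
```

Inspecting the removed rows before discarding them is a quick way to check whether the missingness is concentrated in particular observations.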


### Automatically selecting all variables

In [12]:
# let's create an instance of the imputer

na_imputer = DropMissingData()

na_imputer.fit(X_train)

DropMissingData()

In [13]:
# Variables identified by the transformer
na_imputer.variables_

['LotFrontage',
 'Alley',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']
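The list above contains only the columns that show NA in the training set. A minimal pandas sketch of such a selection is given below, assuming the check is simply "does this column contain any NA in the data used for fit". The toy dataframe and the name `vars_with_na` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy training data (hypothetical values); 'LotArea' is complete.
train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "LotArea": [8450, 9600, 11250],
    "Alley": [None, "Grvl", None],
})

# Columns with at least one NA in the training data, in column order.
vars_with_na = [col for col in train.columns if train[col].isna().any()]

print(vars_with_na)  # ['LotFrontage', 'Alley']
```

Columns that are complete in the training data are left out of the list, so NA appearing in them later would not cause rows to be dropped under this selection.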

In [15]:
# Number of NA values per variable before the transformation
X_train[na_imputer.variables_].isna().sum()

LotFrontage      189
Alley            960
MasVnrType         5
MasVnrArea         5
BsmtQual          24
BsmtCond          24
BsmtExposure      24
BsmtFinType1      24
BsmtFinType2      25
Electrical         1
FireplaceQu      478
GarageType        54
GarageYrBlt       54
GarageFinish      54
GarageQual        54
GarageCond        54
PoolQC          1019
Fence            831
MiscFeature      978
dtype: int64

In [16]:
# After the transformation, the rows with NA values are deleted from the dataframe
train_t = na_imputer.transform(X_train)
test_t = na_imputer.transform(X_test)

In [17]:
# Number of NA values per variable after the transformation
train_t[na_imputer.variables_].isna().sum()

LotFrontage     0
Alley           0
MasVnrType      0
MasVnrArea      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64
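Zero NA counts say nothing about how many rows are left. PoolQC is missing in 1019 of the 1022 training rows, so at most 3 rows can remain once all 19 variables are considered. The sketch below, using pandas on made-up toy data, shows one way to check the NA share per column and the number of complete rows before dropping.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two very sparse columns.
X = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "Fence": ["MnPrv", None, None, None],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

# Fraction of NA per column, sparsest columns first.
na_share = X.isna().mean().sort_values(ascending=False)

# Rows that are complete across every column.
complete_rows = int(X.notna().all(axis=1).sum())

print(complete_rows)  # 0
```

When the count of complete rows is this small, restricting the variable list to the columns of interest, as in the first part of this notebook, keeps more data.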