# Missing value imputation: Na_transformer


Na_transformer deletes the rows in which NA values are encountered in the selected variables.

Na_transformer works with both numerical and categorical variables. When no variable list is passed, it defaults to all variables in the dataset. (The related AddMissingIndicator transformer additionally takes a missing_only parameter, which controls whether missing indicators are added for all variables or only for those with missing data.)
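
The row-dropping behaviour can be approximated in plain pandas. The sketch below is not the transformer's implementation; it relies on `DataFrame.dropna` with the `subset` argument, applied to a small made-up frame. The helper name `drop_rows_with_na` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def drop_rows_with_na(df, variables=None):
    """Return a copy of df without the rows that have NA in `variables`.

    Hypothetical helper for illustration; when `variables` is None,
    every column is checked.
    """
    subset = list(df.columns) if variables is None else variables
    return df.dropna(subset=subset).copy()


# Made-up data: one categorical and one numerical column with gaps
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

only_alley = drop_rows_with_na(toy, variables=["Alley"])
all_columns = drop_rows_with_na(toy)

print(only_alley.index.tolist())   # rows kept when only 'Alley' is checked
print(all_columns.index.tolist())  # rows kept when every column is checked
```

Restricting the check to a variable list keeps more rows than checking every column, which matters in a dataset where some columns are mostly empty.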

**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing
Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

http://jse.amstat.org/v19n3/decock.pdf

https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.imputation import Na_transformer

In [2]:
data = pd.read_csv('houseprice.csv')
data.head()

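| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |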


In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((1022, 79), (438, 79))

In [21]:
# let's create an instance of the imputer

na_imputer = Na_transformer(
    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

na_imputer.fit(X_train)

Na_transformer(variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

In [22]:

na_imputer.variables_

['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']

In [23]:
X_train[na_imputer.variables].isna().sum()

Alley          960
MasVnrType       5
LotFrontage    189
MasVnrArea       5
dtype: int64
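
The per-column counts above cannot simply be added up to get the number of rows that will be dropped, because one row can have NA in several of the selected variables. A minimal sketch of the difference, using made-up values rather than the Ames data:

```python
import numpy as np
import pandas as pd

# Made-up data: MasVnrType and MasVnrArea are missing on the same rows
toy = pd.DataFrame({
    "MasVnrType": ["BrkFace", np.nan, "Stone", np.nan],
    "MasVnrArea": [196.0, np.nan, 162.0, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})
variables = ["MasVnrType", "MasVnrArea", "LotFrontage"]

# NA count per column (what .isna().sum() reports)
per_column = toy[variables].isna().sum()

# Number of rows with NA in at least one of the selected variables
rows_with_any_na = int(toy[variables].isna().any(axis=1).sum())

print(per_column)
print(rows_with_any_na)
```

Running the same `.isna().any(axis=1).sum()` expression on `X_train[na_imputer.variables]` would give the number of training rows affected.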

In [24]:
# After transformation, the rows with NA in the selected variables are removed from the dataframe
train_t = na_imputer.transform(X_train)
test_t = na_imputer.transform(X_test)


In [25]:
train_t[na_imputer.variables].isna().sum()

Alley          0
MasVnrType     0
LotFrontage    0
MasVnrArea     0
dtype: int64
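
Because rows are removed from the predictors, the target no longer lines up with them. One way to realign it, sketched here on toy data, is to select the target by the surviving index labels. This assumes the transformed frame keeps the original index, which holds for pandas `dropna` (used below as a stand-in for the transformer) but should be checked for `train_t`.

```python
import numpy as np
import pandas as pd

# Made-up predictors and target sharing the same index labels
X = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, np.nan],
        "LotArea": [8450, 9600, 11250, 9550],
    },
    index=[10, 11, 12, 13],
)
y = pd.Series([208500, 181500, 223500, 140000], index=X.index, name="SalePrice")

# Stand-in for the transformer: drop rows with NA in 'LotFrontage'
X_t = X.dropna(subset=["LotFrontage"])

# Keep only the target values whose rows survived
y_t = y.loc[X_t.index]

print(X_t.shape, y_t.shape)
```

Applied to the notebook's variables, the equivalent step would be `y_train.loc[train_t.index]`, under the same index assumption.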

In [None]:
X_train