Missing Category imputation with Scikit-learn: SimpleImputer
Scikit-learn provides a class that implements most of the common data imputation techniques.

The SimpleImputer class provides basic strategies for imputing missing values, including:

Mean and median imputation for numerical variables
Most frequent category imputation for categorical variables
Arbitrary value imputation for both categorical and numerical variables
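As a quick illustration of the strategies listed above, here is a minimal sketch on a toy dataframe. The column names `area` and `quality` and their values are made up for this example and are not part of the house price dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one numerical and one categorical column
toy = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 90.0],
    "quality": ["Gd", "TA", np.nan, "Gd"],
})
# keep the categorical column as plain Python objects, as in the notebook
toy["quality"] = toy["quality"].astype(object)

# mean and median imputation for the numerical variable
area_mean = SimpleImputer(strategy="mean").fit_transform(toy[["area"]])
area_median = SimpleImputer(strategy="median").fit_transform(toy[["area"]])

# most frequent category imputation for the categorical variable
quality_mode = SimpleImputer(strategy="most_frequent").fit_transform(toy[["quality"]])

# arbitrary value imputation: every missing entry becomes 'Missing'
quality_const = SimpleImputer(
    strategy="constant", fill_value="Missing"
).fit_transform(toy[["quality"]])

print(area_mean.ravel())
print(quality_mode.ravel())
print(quality_const.ravel())
```

Each call returns a 2-D numpy array with one column, because a single-column dataframe was passed in.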

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

In [2]:
#load the dataset with a few categorical variables

cols_to_use = ['BsmtQual', 'FireplaceQu', 'SalePrice']
data = pd.read_csv('houseprice.csv', usecols = cols_to_use)
data.head()

  BsmtQual FireplaceQu  SalePrice
0       Gd         NaN     208500
1       Gd          TA     181500
2       Gd          TA     223500
3       TA          Gd     140000
4       Gd          TA     250000


In [3]:
data.isnull().mean()

BsmtQual       0.025342
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

In [4]:
# train test split
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

X_train.shape, X_test.shape

((1022, 2), (438, 2))

In [5]:
#check for missing
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
dtype: float64

In [6]:
# let's inspect the categories of BsmtQual
X_train['BsmtQual'].unique()

array(['Gd', 'TA', 'Fa', nan, 'Ex'], dtype=object)

In [7]:
# let's inspect the categories of FireplaceQu
X_train['FireplaceQu'].unique()

array([nan, 'Gd', 'TA', 'Fa', 'Po', 'Ex'], dtype=object)

In [8]:
#impute the missing values with SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value='Missing')

#fit the imputer to the train set
imputer.fit(X_train)

SimpleImputer(add_indicator=False, copy=True, fill_value='Missing',
              missing_values=nan, strategy='constant', verbose=0)

In [9]:
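# we can look at the fill value stored for each column like this
# (with strategy='constant' nothing is learnt from the data; it is just fill_value):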
imputer.statistics_

array(['Missing', 'Missing'], dtype=object)
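To make clear what `statistics_` holds, the sketch below fits two imputers on the same toy data: one with `strategy='constant'` and one with `strategy='most_frequent'`. The toy values are invented for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan, "Gd"],
    "FireplaceQu": [np.nan, "TA", "TA", "Gd"],
}).astype(object)

constant_imputer = SimpleImputer(strategy="constant", fill_value="Missing").fit(toy)
mode_imputer = SimpleImputer(strategy="most_frequent").fit(toy)

# constant strategy: statistics_ simply repeats fill_value for each column
constant_stats = list(constant_imputer.statistics_)

# most_frequent strategy: statistics_ holds the mode learnt from each column
mode_stats = list(mode_imputer.statistics_)

print(constant_stats)
print(mode_stats)
```

With the constant strategy, fitting on the train set does not change the value that is imputed, but calling `fit` is still required before `transform`.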

In [10]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

array([['Gd', 'Missing'],
       ['Gd', 'Gd'],
       ['TA', 'Missing'],
       ...,
       ['Missing', 'Missing'],
       ['Gd', 'TA'],
       ['Gd', 'Missing']], dtype=object)

In [11]:
# convert the train set back to a dataframe:

X_train = pd.DataFrame(X_train, columns=cols_to_use)
X_train.head()

  BsmtQual FireplaceQu
0       Gd     Missing
1       Gd          Gd
2       TA     Missing
3       TA     Missing
4       TA     Missing
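Note that rebuilding the dataframe from the numpy array resets the index to 0, 1, 2, ..., so the rows no longer carry the labels that `y_train` kept after the split. A minimal sketch of one way to keep both the column names and the index is shown below. The helper `impute_to_frame` and the toy data are hypothetical, and the helper has to be called on the original dataframe, before it is overwritten by the array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_to_frame(imputer, X):
    """Hypothetical helper: transform X and rebuild a dataframe with X's labels."""
    return pd.DataFrame(imputer.transform(X), columns=X.columns, index=X.index)


# Toy data (hypothetical) with a shuffled-looking index, as after train_test_split
toy = pd.DataFrame(
    {
        "BsmtQual": ["Gd", np.nan, "TA"],
        "FireplaceQu": [np.nan, "Gd", np.nan],
    },
    index=[64, 682, 960],
).astype(object)

imputer = SimpleImputer(strategy="constant", fill_value="Missing").fit(toy)
imputed = impute_to_frame(imputer, toy)

print(imputed)
```

Keeping the original index means the imputed features can still be aligned with the target by label.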
