# Filling missing data with SimpleImputer, pipelines, and make_column_transformer

_By Jeff Hale_

---

## Learning Objectives
By the end of this lesson students will be able to:

- Explain why you might want to use sklearn's `SimpleImputer`
- Use `SimpleImputer` to fill missing values with the mean or a constant
- Use `SimpleImputer` in a pipeline
- Use `SimpleImputer` with `make_column_transformer` and a pipeline to apply different imputation strategies to different columns

---

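`SimpleImputer` is an sklearn transformer class. It learns its fill values when you fit it on `X_train` and reuses them when you transform `X_test`, so you don't have to worry about X_test information leaking into your imputation statistics.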

It can be used in a pipeline.

#### Parameters
 
    missing_values : number, string, np.nan (default) or None
        The placeholder for the missing values. All occurrences of
        `missing_values` will be imputed.
        
    strategy : string, default='mean'
        The imputation strategy.
        - If "mean", then replace missing values using the mean along
          each column. Can only be used with numeric data.
        - If "median", then replace missing values using the median along
          each column. Can only be used with numeric data.
        - If "most_frequent", then replace missing using the most frequent
          value along each column. Can be used with strings or numeric data.
        - If "constant", then replace missing values with fill_value. Can be
          used with strings or numeric data.
      

    fill_value : string or numerical value, default=None
        When strategy == "constant", fill_value is used to replace all
        occurrences of missing_values.
        If left to the default, fill_value will be 0 when imputing numerical
        data and "missing_value" for strings or object data types.

    add_indicator : boolean, default=False
        If True, a :class:`MissingIndicator` transform will stack onto output
        of the imputer's transform. This allows a predictive estimator
        to account for missingness despite imputation. If a feature has no
        missing values at fit/train time, the feature won't appear on
        the missing indicator even if there are missing values at
        transform/test time.
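
As a small illustration of the `strategy`, `fill_value`, and `add_indicator` parameters described above, here is a minimal sketch on a made-up four-row column. The array and variable names are assumptions for illustration only and are not part of the Titanic walkthrough.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric column with two missing values (made-up data)
X_num = np.array([[1.0], [np.nan], [3.0], [np.nan]])

# strategy='constant' replaces every missing value with fill_value
const_imputer = SimpleImputer(strategy='constant', fill_value=-1)
X_const = const_imputer.fit_transform(X_num)
print(X_const.ravel())   # [ 1. -1.  3. -1.]

# add_indicator=True appends a 0/1 column flagging which rows were imputed
flag_imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_flagged = flag_imputer.fit_transform(X_num)
print(X_flagged)         # first column: imputed values, second column: missing flag
```

The indicator column can help a downstream model when the fact that a value was missing is itself informative.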

## Set things up with Titanic dataset

In [100]:
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [101]:
# read in the data
df_titanic = sns.load_dataset("titanic")  # seaborn dataset names are lowercase

In [102]:
# inspect 
df_titanic.head(2)

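| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |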


In [103]:
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


Let's say we just want to use `age` to predict survival. It has some missing data.

In [104]:
df_titanic.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [105]:
# break into X and y
X = df_titanic[['age']]
y = df_titanic['survived']

In [106]:
X.head(2)

| | age |
|---|---|
| 0 | 22.0 |
| 1 | 38.0 |


In [107]:
y.head(2)

0    0
1    1
Name: survived, dtype: int64

#### Train test split

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [109]:
X_train.head(2)

| | age |
|---|---|
| 660 | 50.0 |
| 852 | 9.0 |


In [110]:
y_train.head(2)

660    1
852    0
Name: survived, dtype: int64

In [111]:
X_train.shape

(668, 1)

In [112]:
y_train.shape

(668,)

In [113]:
X_test.shape

(223, 1)

In [114]:
y_test.shape

(223,)

## Use SimpleImputer 

The `.fit_transform()` method will compute the mean of X_train's existing `age` values (the fit step) and then fill the missing values in X_train's `age` column with that mean (the transform step).

The `.transform()` method, called on `X_test`, will reuse the mean computed from `X_train` during the fit step, so nothing is learned from the test data.
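
To make the fit and transform split concrete, here is a minimal sketch on made-up arrays (the `_toy` names are hypothetical and separate from the Titanic variables). After fitting, the learned fill value is stored in the imputer's `statistics_` attribute, and that training value is what fills the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up train and test columns; the test values are deliberately much larger
X_train_toy = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_test_toy = np.array([[100.0], [np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train_toy)          # learns the mean from the training rows only
print(imputer.statistics_)        # [20.]

# The missing test value is filled with the training mean, not the test mean
X_test_toy_filled = imputer.transform(X_test_toy)
print(X_test_toy_filled.ravel())  # [100.  20.]
```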

#### Instantiate a SimpleImputer object that will use the mean to fill missing values.

In [115]:
si = SimpleImputer(strategy='mean')

#### Fit it to the training data

In [116]:
si.fit(X_train, y_train)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

#### Transform the training data

In [118]:
si.transform(X_train)

array([[50.        ],
       [ 9.        ],
       [25.        ],
       [27.        ],
       [40.5       ],
       [30.        ],
       [28.        ],
       [30.        ],
       [74.        ],
       [25.        ],
       [34.        ],
       [31.        ],
       [27.        ],
       [35.        ],
       [ 7.        ],
       [29.93432892],
       [23.        ],
       [30.        ],
       [ 2.        ],
       [24.        ],
       [29.93432892],
       [29.93432892],
       [59.        ],
       [51.        ],
       [22.        ],
       [34.        ],
       [33.        ],
       [24.        ],
       [47.        ],
       [47.        ],
       [36.        ],
       [29.93432892],
       [14.        ],
       [41.        ],
       [29.93432892],
       [29.93432892],
       [29.93432892],
       [33.        ],
       [31.        ],
       [17.        ],
       [19.        ],
       [ 2.        ],
       [18.        ],
       [29.93432892],
       [52.        ],
       [34

#### Save the transformed training data

In [119]:
X_train_filled = si.transform(X_train)

#### Verify no nulls in X_train_filled

In [121]:
np.isnan(X_train_filled).sum()

0

#### Transform the test data

In [122]:
si.transform(X_test)

array([[ 1.        ],
       [29.93432892],
       [30.        ],
       [61.        ],
       [27.        ],
       [46.        ],
       [40.        ],
       [27.        ],
       [18.        ],
       [29.        ],
       [ 9.        ],
       [28.        ],
       [52.        ],
       [24.5       ],
       [50.        ],
       [ 0.75      ],
       [58.        ],
       [29.93432892],
       [45.        ],
       [29.        ],
       [29.93432892],
       [ 8.        ],
       [39.        ],
       [ 4.        ],
       [ 2.        ],
       [25.        ],
       [22.        ],
       [17.        ],
       [19.        ],
       [29.93432892],
       [ 2.        ],
       [40.        ],
       [34.        ],
       [26.        ],
       [19.        ],
       [11.        ],
       [42.        ],
       [51.        ],
       [24.        ],
       [40.        ],
       [14.        ],
       [29.93432892],
       [29.93432892],
       [63.        ],
       [16.        ],
       [25

In [123]:
X_test_filled = si.transform(X_test)

## How would we fill with the most common value?

In [124]:
si = SimpleImputer(strategy='most_frequent')

In [125]:
si.fit_transform(X_train, y_train)

array([[50.  ],
       [ 9.  ],
       [25.  ],
       [27.  ],
       [40.5 ],
       [30.  ],
       [28.  ],
       [30.  ],
       [74.  ],
       [25.  ],
       [34.  ],
       [31.  ],
       [27.  ],
       [35.  ],
       [ 7.  ],
       [22.  ],
       [23.  ],
       [30.  ],
       [ 2.  ],
       [24.  ],
       [22.  ],
       [22.  ],
       [59.  ],
       [51.  ],
       [22.  ],
       [34.  ],
       [33.  ],
       [24.  ],
       [47.  ],
       [47.  ],
       [36.  ],
       [22.  ],
       [14.  ],
       [41.  ],
       [22.  ],
       [22.  ],
       [22.  ],
       [33.  ],
       [31.  ],
       [17.  ],
       [19.  ],
       [ 2.  ],
       [18.  ],
       [22.  ],
       [52.  ],
       [34.  ],
       [31.  ],
       [36.  ],
       [29.  ],
       [18.  ],
       [63.  ],
       [22.  ],
       [28.  ],
       [50.  ],
       [22.  ],
       [20.  ],
       [22.  ],
       [48.  ],
       [40.  ],
       [42.  ],
       [22.  ],
       [22.  ],
       [

In [126]:
si.transform(X_test)

array([[ 1.  ],
       [22.  ],
       [30.  ],
       [61.  ],
       [27.  ],
       [46.  ],
       [40.  ],
       [27.  ],
       [18.  ],
       [29.  ],
       [ 9.  ],
       [28.  ],
       [52.  ],
       [24.5 ],
       [50.  ],
       [ 0.75],
       [58.  ],
       [22.  ],
       [45.  ],
       [29.  ],
       [22.  ],
       [ 8.  ],
       [39.  ],
       [ 4.  ],
       [ 2.  ],
       [25.  ],
       [22.  ],
       [17.  ],
       [19.  ],
       [22.  ],
       [ 2.  ],
       [40.  ],
       [34.  ],
       [26.  ],
       [19.  ],
       [11.  ],
       [42.  ],
       [51.  ],
       [24.  ],
       [40.  ],
       [14.  ],
       [22.  ],
       [22.  ],
       [63.  ],
       [16.  ],
       [25.  ],
       [39.  ],
       [42.  ],
       [20.  ],
       [24.  ],
       [22.  ],
       [ 6.  ],
       [20.5 ],
       [35.  ],
       [24.  ],
       [22.  ],
       [16.  ],
       [18.  ],
       [29.  ],
       [14.  ],
       [33.  ],
       [18.  ],
       [
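
The `most_frequent` strategy also works on string data, which `mean` and `median` do not. A minimal sketch, assuming a made-up column of port codes stored as an object array (the values and names are illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up port codes with one missing entry, stored as an object array
ports = np.array([['S'], ['C'], ['S'], [np.nan], ['Q'], ['S']], dtype=object)

mode_imputer = SimpleImputer(strategy='most_frequent')
ports_filled = mode_imputer.fit_transform(ports)

# 'S' is the most common value, so it replaces the missing entry
print(ports_filled.ravel())
```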

## Or instead, use a Pipeline and make life easier 😁

#### Make a pipeline object 

Fill null values with the median using `SimpleImputer`, then pass the result to a logistic regression.

In [130]:
fill_pipe = make_pipeline(
                     SimpleImputer(strategy='median'),
                     LogisticRegression())

#### Fit the pipeline on the training set

In [131]:
fill_pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='median',
                               verbose=0)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

### Score the pipeline on the train set

In [133]:
fill_pipe.score(X_train, y_train)

0.6137724550898204

### Score the pipeline on the test set

In [132]:
fill_pipe.score(X_test, y_test)

0.6233183856502242

### We could use grid search to hyperparameter tune the SimpleImputer and LogisticRegression arguments
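
A minimal sketch of what that grid search could look like, assuming synthetic data in place of the Titanic split and an arbitrary choice of grid values. The step names `simpleimputer` and `logisticregression` are the ones `make_pipeline` generates from the class names, as the fitted pipeline's output above shows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an age-like feature with missing values (not the Titanic data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=30, scale=10, size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=10, size=200) > 30).astype(int)
X_demo[rng.random(200) < 0.2] = np.nan   # knock out roughly 20% of the values

demo_pipe = make_pipeline(SimpleImputer(), LogisticRegression())

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'simpleimputer__strategy': ['mean', 'median', 'most_frequent'],
    'logisticregression__C': [0.1, 1.0, 10.0],
}

grid = GridSearchCV(demo_pipe, param_grid, cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

Because the imputer lives inside the pipeline, it is refit on the training portion of every cross-validation fold, so the validation fold never influences the fill value.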

## Use different imputation strategies on different columns

To use different imputation strategies on different columns, the easiest thing to do is use sklearn's `make_column_transformer`.

## make_column_transformer

`make_column_transformer` is a convenience function to make a `ColumnTransformer` object, similar to how `make_pipeline` is a convenience function to make a `Pipeline` object.

A ColumnTransformer object can apply different transformers to different columns! 🎉

#### Parameters

    *transformers : tuples
        
        Tuples of the form (transformer, column(s)) specifying the
        transformer objects to be applied to subsets of the data.
        
        - transformer : estimator or {'passthrough', 'drop'}
            Estimator must support :term:`fit` and :term:`transform`.
            Special-cased strings 'drop' and 'passthrough' are accepted as
            well, to indicate to drop the columns or to pass them through
            untransformed, respectively.
        
        - column(s) : string or int, array-like of string or int, slice, boolean mask array or callable
            Indexes the data on its second axis. Integers are interpreted as
            positional columns, while strings can reference DataFrame columns
            by name. A scalar string or int should be used where
            ``transformer`` expects X to be a 1d array-like (vector),
            otherwise a 2d array will be passed to the transformer.
            A callable is passed the input data `X` and can return any of the
            above.
    
    remainder : {'drop', 'passthrough'} or estimator, default 'drop'
        By default, only the specified columns in `transformers` are
        transformed and combined in the output, and the non-specified
        columns are dropped. (default of ``'drop'``).
        
        By specifying ``remainder='passthrough'``, all remaining columns that
        were not specified in `transformers` will be automatically passed
        through. This subset of columns is concatenated with the output of
        the transformers.
        
        By setting ``remainder`` to be an estimator, the remaining
        non-specified columns will use the ``remainder`` estimator. The
        estimator must support :term:`fit` and :term:`transform`.

#### Import

In [134]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

#### Let's use a larger set of columns for our X here.

In [135]:
X = df_titanic[['age', 'fare', 'embarked', 'sex']]
y = df_titanic['survived']

In [136]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       714 non-null    float64
 1   fare      891 non-null    float64
 2   embarked  889 non-null    object 
 3   sex       891 non-null    object 
dtypes: float64(2), object(2)
memory usage: 28.0+ KB


In [137]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

#### Create a ColumnTransformer object with `make_column_transformer`

In [139]:
ctf = make_column_transformer(
    (SimpleImputer(strategy='median'), ['age']),
    (SimpleImputer(strategy='most_frequent'), ['embarked']),
    remainder='passthrough'  # important: keeps fare and sex, which are dropped by default
)

### We could fit it, but we probably want to use it in a pipeline

In [None]:
# ctf.fit(X_train, y_train)

#### Pass the ColumnTransformer object to the pipeline, dummify, and use logistic regression

In [140]:
pipe = make_pipeline(
    ctf, 
    OneHotEncoder(handle_unknown='ignore'), 
    LogisticRegression()
)
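
Note that in the pipeline above, `OneHotEncoder` sits after the column transformer, so it receives all four columns and treats every distinct `age` and `fare` value as its own category. One possible alternative, sketched here on made-up rows with hypothetical names (`cat_pipe`, `ctf_alt`, `pipe_alt`), is to nest the encoder inside the column transformer so that only the string columns are dummified. This is not the pipeline used in the rest of this lesson, and it would produce different scores.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with the same four columns as X (not the real Titanic data)
X_toy = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, 4.0, np.nan, 54.0, 30.0, 19.0],
    'fare': [7.25, 8.05, 71.28, 16.70, 7.90, 51.86, 13.00, 26.00],
    'embarked': pd.Series(['S', 'S', 'C', np.nan, 'Q', 'S', 'C', 'S'], dtype=object),
    'sex': ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female'],
})
y_toy = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Impute, then encode, the string columns only
cat_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'),
)

ctf_alt = make_column_transformer(
    (SimpleImputer(strategy='median'), ['age']),
    (cat_pipe, ['embarked', 'sex']),
    remainder='passthrough',   # fare passes through unchanged
)

pipe_alt = make_pipeline(ctf_alt, LogisticRegression())
pipe_alt.fit(X_toy, y_toy)

# 1 age column + 3 embarked dummies + 2 sex dummies + 1 fare column = 7
print(ctf_alt.fit_transform(X_toy).shape)   # (8, 7)
```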

#### Fit the pipeline

In [141]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer-1',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0),
                                                  ['age']),
                                                 ('simpleimputer-2',
                                                  SimpleImputer(add

#### Score

In [142]:
pipe.score(X_test, y_test)

0.8251121076233184

Next, we could pass the pipeline and a parameter grid to GridSearchCV and optimize the hyperparameters, but let's look at evaluation now.

## Evaluate the performance

Look at the predictions and make a confusion matrix.

In [144]:
preds = pipe.predict(X_test)
preds

array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [145]:
from sklearn.metrics import confusion_matrix 

confusion_matrix(y_test, preds)

array([[124,  15],
       [ 24,  60]])

#### Unpack the confusion_matrix

In [146]:
tn, fp, fn, tp = (confusion_matrix(y_test, preds)).ravel()

#### Compute the recall

In [147]:
tp / (tp + fn)

0.7142857142857143

#### Compute the precision

In [148]:
tp / (tp + fp)

0.8
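
The same numbers can be obtained from sklearn's `recall_score` and `precision_score`. As a self-contained check, this sketch rebuilds label arrays that reproduce the counts in the confusion matrix above (the `_demo` arrays are constructed for illustration and are not the real `y_test` and `preds`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts taken from the confusion matrix above
tn, fp, fn, tp = 124, 15, 24, 60

# Build true and predicted labels that yield exactly those counts
y_true_demo = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(confusion_matrix(y_true_demo, y_pred_demo))

recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
print(recall, precision)
```

In the notebook itself, `recall_score(y_test, preds)` and `precision_score(y_test, preds)` would be the direct equivalents.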

# Summary

You've seen how to use `SimpleImputer` to fill missing values. You've seen how to use it with `make_column_transformer` and pipelines.

## Check for understanding

- Why would you want to use `make_column_transformer` with pipelines?
- What are different imputation strategies?
- What is precision?



With `GridSearchCV`, `make_pipeline`, and `make_column_transformer` you can do preprocessing and grid searching efficiently!  🎉
