# Filling missing data with SimpleImputer, pipelines, and make_column_transformer

_By Jeff Hale_

---

## Learning Objectives
By the end of this lesson students will be able to:

- Explain why you might want to use sklearn's `SimpleImputer`
- Use `SimpleImputer` to fill missing values with the mean or a constant
- Use `SimpleImputer` in a pipeline
- Use `SimpleImputer` with `make_column_transformer` and a pipeline to apply different imputation strategies to different columns

---

`SimpleImputer` is an sklearn class that learns its fill values (such as the mean) from the data it is fit on. If you fit it on `X_train` only, you don't have to worry about `X_test` information leaking into the imputed values.

It can be used in a pipeline.

#### Parameters
 
    missing_values : number, string, np.nan (default) or None
        The placeholder for the missing values. All occurrences of
        `missing_values` will be imputed.
        
    strategy : string, default='mean'
        The imputation strategy.
        - If "mean", then replace missing values using the mean along
          each column. Can only be used with numeric data.
        - If "median", then replace missing values using the median along
          each column. Can only be used with numeric data.
        - If "most_frequent", then replace missing using the most frequent
          value along each column. Can be used with strings or numeric data.
        - If "constant", then replace missing values with fill_value. Can be
          used with strings or numeric data.
      

    fill_value : string or numerical value, default=None
        When strategy == "constant", fill_value is used to replace all
        occurrences of missing_values.
        If left to the default, fill_value will be 0 when imputing numerical
        data and "missing_value" for strings or object data types.

    add_indicator : boolean, default=False
        If True, a :class:`MissingIndicator` transform will stack onto output
        of the imputer's transform. This allows a predictive estimator
        to account for missingness despite imputation. If a feature has no
        missing values at fit/train time, the feature won't appear on
        the missing indicator even if there are missing values at
        transform/test time.
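
To make the parameters above concrete, here is a minimal sketch on a small made-up array (the values and the variable names `X_toy`, `constant_imputer`, and `indicator_imputer` are assumptions for illustration, not part of the Titanic data). It shows `strategy="constant"` with a `fill_value`, and `add_indicator=True`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): two numeric columns, each with one gap
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# strategy="constant" replaces every missing value with fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
X_constant = constant_imputer.fit_transform(X_toy)

# add_indicator=True appends one 0/1 column for each feature
# that had missing values when the imputer was fit
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_indicated = indicator_imputer.fit_transform(X_toy)

print(X_constant)
print(X_indicated)
```

Here both columns contain a missing value at fit time, so the second result has two extra indicator columns.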

## Set things up with Titanic dataset

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np


from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression


In [None]:
# read in the data
df_titanic = sns.load_dataset("titanic")  # dataset names are lowercase

In [None]:
# inspect 
df_titanic.head(2)

In [None]:
df_titanic.info()

Let's say we just want to use `age` to predict survival. It has some missing data.

In [None]:
df_titanic.isna().sum()

In [None]:
# break into X and y
X = df_titanic[['age']]
y = df_titanic['survived']

In [None]:
X.head(2)

In [None]:
y.head(2)

#### Train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [None]:
X_train.head(2)

In [None]:
y_train.head(2)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

## Use SimpleImputer in a pipeline 

The `.fit_transform()` method will compute the mean (the fit step) and then fill the missing values in `X_train`'s `age` column with the mean of `X_train`'s existing `age` values (the transform step).

The `.transform()` method called on `X_test` reuses that same mean, computed from `X_train` during the fit step, so the test set never influences the fill value.

#### Instantiate a SimpleImputer object that will use the mean to fill missing values.

#### Fit it to the training data

#### Transform the training data

#### Save the transformed training data

#### Verify no nulls in X_train

#### Transform the test data
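
Taken together, the steps above might look like the following sketch. It uses small made-up frames, `X_train_toy` and `X_test_toy`, as hypothetical stand-ins for the lesson's `X_train` and `X_test`, so that it runs on its own. The same calls should apply to the Titanic split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for X_train and X_test (toy ages, not Titanic data)
X_train_toy = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})
X_test_toy = pd.DataFrame({"age": [np.nan, 50.0]})

# Instantiate an imputer that fills missing values with the mean
imputer = SimpleImputer(strategy="mean")

# Fit: learn the mean from the training column only
imputer.fit(X_train_toy)

# Transform the training data and save the result (a NumPy array)
X_train_filled = imputer.transform(X_train_toy)

# Verify that no nulls remain in the training data
print(np.isnan(X_train_filled).sum())

# Transform the test data using the mean learned from the training data
X_test_filled = imputer.transform(X_test_toy)

# The learned fill value is stored on the fitted imputer
print(imputer.statistics_)
```

Note that the missing test value is filled with the training mean, not with anything computed from the test set.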

## How would we fill with most common value?
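
One possible answer is `strategy="most_frequent"`. The sketch below uses a made-up column of port codes (the array `ports` is an assumption for illustration) to show that this strategy also works on string data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-in for a string column such as `embarked` (values are made up)
ports = np.array([["S"], ["C"], ["S"], [np.nan], ["Q"]], dtype=object)

# most_frequent fills each gap with the column's most common value
mode_imputer = SimpleImputer(strategy="most_frequent")
ports_filled = mode_imputer.fit_transform(ports)

print(ports_filled.ravel())
```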

## Or instead, use a Pipeline and make life easier 😁

#### Make a pipeline object 

Fill null values with the median using `SimpleImputer`, then fit a logistic regression.

#### Fit the pipeline on the training set

### Score the pipeline on the train set

### Score the pipeline on the test set
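
The pipeline steps above could be sketched as follows. The data here is made up (`X_train_toy`, `y_train_toy`, and so on are hypothetical stand-ins for the Titanic split), and the scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the train/test split (toy values)
X_train_toy = pd.DataFrame(
    {"age": [5.0, 8.0, np.nan, 30.0, 45.0, 60.0, np.nan, 70.0]}
)
y_train_toy = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])
X_test_toy = pd.DataFrame({"age": [np.nan, 6.0, 65.0]})
y_test_toy = pd.Series([0, 1, 0])

# Make a pipeline: median imputation, then logistic regression
pipe = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())

# Fit the pipeline on the training set; the median is learned here
pipe.fit(X_train_toy, y_train_toy)

# Score (accuracy) on the train and test sets
train_accuracy = pipe.score(X_train_toy, y_train_toy)
test_accuracy = pipe.score(X_test_toy, y_test_toy)
print(train_accuracy, test_accuracy)
```

Calling `pipe.score(X_test_toy, ...)` only transforms the test data with the median learned during `fit`, so there is no separate imputation step to remember.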

### We could use grid search to hyperparameter tune the SimpleImputer and LogisticRegression arguments
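
A minimal sketch of that idea, assuming randomly generated toy data and an arbitrary small grid (the grid values, `cv=3`, and `max_iter=1000` are illustrative choices, not recommendations). Parameter names follow the `stepname__parameter` pattern, where `make_pipeline` names each step after its lowercased class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Made-up data: random ages, a target derived from age, then some gaps
rng = np.random.default_rng(123)
ages = rng.uniform(1, 80, size=60)
y_toy = (ages < 30).astype(int)
ages[rng.choice(60, size=10, replace=False)] = np.nan
X_toy = pd.DataFrame({"age": ages})

pipe = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=1000))

# Keys are <step name>__<parameter name>
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

# Each candidate pipeline is refit per fold, so imputation values
# are learned only from that fold's training portion
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_toy, y_toy)

print(grid.best_params_)
```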

## Use different imputation strategies on different columns

To use different imputation strategies on different columns, the easiest thing to do is use sklearn's `make_column_transformer`.

## make_column_transformer

`make_column_transformer` is a convenience function that makes a `ColumnTransformer` object, similar to how `make_pipeline` is a convenience function that makes a `Pipeline` object.

A ColumnTransformer object can apply different transformers to different columns! 🎉

#### Parameters

    *transformers : tuples
        
        Tuples of the form (transformer, column(s)) specifying the
        transformer objects to be applied to subsets of the data.
        
        - transformer : estimator or {'passthrough', 'drop'}
            Estimator must support :term:`fit` and :term:`transform`.
            Special-cased strings 'drop' and 'passthrough' are accepted as
            well, to indicate to drop the columns or to pass them through
            untransformed, respectively.
        
        - column(s) : string or int, array-like of string or int, slice, boolean mask array or callable
            Indexes the data on its second axis. Integers are interpreted as
            positional columns, while strings can reference DataFrame columns
            by name. A scalar string or int should be used where
            ``transformer`` expects X to be a 1d array-like (vector),
            otherwise a 2d array will be passed to the transformer.
            A callable is passed the input data `X` and can return any of the
            above.
    
    remainder : {'drop', 'passthrough'} or estimator, default 'drop'
        By default, only the specified columns in `transformers` are
        transformed and combined in the output, and the non-specified
        columns are dropped. (default of ``'drop'``).
        
        By specifying ``remainder='passthrough'``, all remaining columns that
        were not specified in `transformers` will be automatically passed
        through. This subset of columns is concatenated with the output of
        the transformers.
        
        By setting ``remainder`` to be an estimator, the remaining
        non-specified columns will use the ``remainder`` estimator. The
        estimator must support :term:`fit` and :term:`transform`.

#### Import

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

#### Let's use a larger set of columns for our X here.

In [None]:
X = df_titanic[['age', 'fare', 'embarked', 'sex']]
y = df_titanic['survived']

In [None]:
X.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

#### Create a ColumnTransformer object with `make_column_transformer`
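
Here is one way this could look, sketched on a four-row made-up frame with the same column names as `X`. The choice of strategy per column is an assumption for illustration. The name `ctf` matches the commented-out cell in this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Toy frame with the lesson's column names (values are made up)
X_toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
    "sex": pd.Series(["male", "female", "female", "male"], dtype=object),
})

# Each tuple is (transformer, columns); columns not listed are
# passed through untouched because of remainder="passthrough"
ctf = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age"]),
    (SimpleImputer(strategy="median"), ["fare"]),
    (SimpleImputer(strategy="most_frequent"), ["embarked"]),
    remainder="passthrough",
)

# Output columns follow the transformer order, then the remainder
X_filled = ctf.fit_transform(X_toy)
print(X_filled)
```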

### We could fit it, but we probably want to use it in a pipeline

In [None]:
# ctf.fit(X_train, y_train)

#### Pass the ColumnTransformer object to the pipeline, dummify, and use logistic regression

#### Fit the pipeline

#### Score
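
The sketch below shows one possible arrangement of these steps, not necessarily the one used in class. It assumes randomly generated toy data with the lesson's column names and a made-up target, and it nests the one-hot encoding inside the column transformer so that only the string columns are dummified.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the lesson's column names
rng = np.random.default_rng(123)
n = 200
X_toy = pd.DataFrame({
    "age": rng.uniform(1, 80, size=n),
    "fare": rng.uniform(5, 100, size=n),
    "embarked": pd.Series(rng.choice(["S", "C", "Q"], size=n), dtype=object),
    "sex": pd.Series(rng.choice(["male", "female"], size=n), dtype=object),
})
y_toy = (X_toy["sex"] == "female").astype(int)  # made-up target

# Knock out some values so there is something to impute
X_toy.loc[rng.choice(n, size=30, replace=False), "age"] = np.nan
X_toy.loc[rng.choice(n, size=5, replace=False), "embarked"] = np.nan

X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, random_state=123
)

# Different imputation per column; string columns are also one-hot encoded
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
ctf = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age"]),
    (SimpleImputer(strategy="median"), ["fare"]),
    (categorical, ["embarked", "sex"]),
)

# Pass the ColumnTransformer to the pipeline, then fit and score
pipe = make_pipeline(ctf, LogisticRegression(max_iter=1000))
pipe.fit(X_train_toy, y_train_toy)
test_accuracy = pipe.score(X_test_toy, y_test_toy)
print(test_accuracy)
```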

Next, we could pass the pipeline and a parameter grid to GridSearchCV and optimize the hyperparameters, but let's look at evaluation now.

## Evaluate the performance

Look at the predictions and make a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix 



#### Unpack the confusion_matrix

#### Compute the recall

#### Compute the precision
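
The three steps above can be sketched with hand-written labels (`y_true` and `y_pred` below are made up for illustration; in the lesson they would come from `y_test` and the pipeline's predictions). For binary labels, `confusion_matrix(...).ravel()` returns the counts in the order `tn, fp, fn, tp`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for a binary problem
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Unpack the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: of the actual positives, what fraction did we catch?
recall = tp / (tp + fn)

# Precision: of the predicted positives, what fraction were correct?
precision = tp / (tp + fp)

print(tn, fp, fn, tp)     # 4 2 1 3
print(recall, precision)  # 0.75 0.6
```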

# Summary

You've seen how to use `SimpleImputer` to fill missing values. You've seen how to use it with `make_column_transformer` and pipelines.

## Check for understanding

- Why would you want to use `make_column_transformer` with pipelines?
- What are different imputation strategies?
- What is precision?



With `GridSearchCV`, `make_pipeline`, and `make_column_transformer` you can do preprocessing and grid searching efficiently!  🎉
