# Imputation

Sometimes our features are missing observations. What can we do?

## What we will accomplish

In this notebook we will:
- Discuss a method to deal with missing values,
- Demonstrate that method on a penguin data set,
- Illustrate various approaches to imputation and
- Show how to integrate imputation into a train test split procedure.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

## Missing data

There will be times when you need to work with a data set that has missing data. We can see one such example with this edited `seaborn` penguins data set, <a href="https://seaborn.pydata.org/examples/scatterplot_matrix.html">https://seaborn.pydata.org/examples/scatterplot_matrix.html</a>.

In [None]:
penguins = pd.read_csv("../../Data/penguins_w_nas.csv")

penguins.loc[penguins.sex=='Male', 'sex'] = 0
penguins.loc[penguins.sex=='Female', 'sex'] = 1

In [None]:
penguins.info()

Using `.info()` above we can see that the data set has 344 entries, and that some observations are missing values for one or more columns. If there are NAs in features that you plan to use in a model, you will be unable to use those observations in training or validation sets unless you deal with the missing values first.
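To quantify the missingness more directly, it can help to count the NAs per column and per row. The sketch below uses a small made-up data frame (the name `toy` and its values are illustrative assumptions, not the penguins data) so that it runs on its own; the same calls should carry over to `penguins`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins data frame
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, np.nan],
    "sex": [0, 1, np.nan, 1],
})

# Number of missing values in each column
na_per_column = toy.isna().sum()
print(na_per_column)

# Number of rows with at least one missing value
rows_with_na = toy.isna().any(axis=1).sum()
print(rows_with_na)

# Rows that survive if we simply drop every incomplete observation
complete_rows = toy.dropna()
print(len(complete_rows))
```

In this toy example dropping incomplete rows throws away most of the data, which is the main motivation for imputing instead.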

So what can we do?

## Imputation

The process of replacing missing values in data is known as <i>imputation</i>. There are a few different ways you can impute missing data.

### Imputing a preset constant value

The simplest approach is to impute a constant value. For example, suppose scientists have already observed an average body mass for penguins, say `4207`. You could then impute this value for all missing `body_mass_g` values.

In [None]:
## Make a copy to demonstrate impute strategy
penguins_constant_impute = penguins.copy()

## Replace the missing data
## .isna() checks for missing data
penguins_constant_impute.loc[penguins_constant_impute.body_mass_g.isna(), 'body_mass_g'] = 4207

In [None]:
penguins_constant_impute.info()

### Imputing based on a sample statistic

A common strategy is to impute missing values using sample statistics from the non-missing values in a column. For example, we can replace NAs with the mean, median or mode of the column.
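As a minimal sketch of the idea, the following computes the mean, median and mode of a column by hand with `pandas` and uses each to fill in the NAs. The data frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame(
    {"body_mass_g": [3000.0, 4000.0, np.nan, 5200.0, np.nan, 4000.0]}
)

# Sample statistics are computed from the non-missing values only
# (pandas skips NaN by default)
col_mean = toy["body_mass_g"].mean()
col_median = toy["body_mass_g"].median()
col_mode = toy["body_mass_g"].mode().iloc[0]

# fillna returns a new Series, leaving the original column untouched
mean_imputed = toy["body_mass_g"].fillna(col_mean)
median_imputed = toy["body_mass_g"].fillna(col_median)
mode_imputed = toy["body_mass_g"].fillna(col_mode)

print(mean_imputed.tolist())
print(median_imputed.tolist())
print(mode_imputed.tolist())
```

The mean is pulled toward the one large value in this toy column while the median and mode are not, which is one reason the choice of statistic can matter for skewed features.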

While we can do this by hand using `numpy` or `pandas` it may be easier, and more easily used in predictive modeling, to use `sklearn`'s `SimpleImputer`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html</a>.

In [None]:
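from sklearn.impute import SimpleImputer

## make a SimpleImputer object
## strategy determines how the imputation happens
## options are 'mean', 'median', 'most_frequent' and 'constant'
## ('median' is used here as one example choice)
impute = SimpleImputer(strategy='median')

## fit the impute object
impute.fit(penguins[['body_mass_g']])

## Show the transformed data
impute.transform(penguins[['body_mass_g']])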

### Building a model to impute

We can also build a model to impute the missing values. Let's regress `body_mass_g` on all other columns to get imputed values for the missing data.

Note that we cannot do this for observations that are missing multiple features. For now we will ignore those rows, but in practice we would have to come up with a separate imputation strategy for each column (and sometimes for each row).

In [None]:
penguins_na = penguins.loc[penguins.body_mass_g.isna()].dropna(subset=['bill_length_mm', 
                                                                       'bill_depth_mm', 
                                                                       'flipper_length_mm', 
                                                                       'sex']).copy()
penguins_non_na = penguins.dropna().copy()

In [None]:
## Species indicators via direct comparison (works even if a subset has no Adelie or Gentoo rows)
for df in (penguins_na, penguins_non_na):
    df['Adelie'] = (df.species == 'Adelie').astype(int)
    df['Gentoo'] = (df.species == 'Gentoo').astype(int)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression(copy_X=True)

reg.fit(penguins_non_na[['bill_length_mm', 
                         'bill_depth_mm', 
                         'flipper_length_mm', 
                         'sex', 
                         'Adelie', 
                         'Gentoo']].values,
        penguins_non_na.body_mass_g.values)

In [None]:
reg.predict(penguins_na[['bill_length_mm', 
                         'bill_depth_mm', 
                         'flipper_length_mm', 
                         'sex', 
                         'Adelie', 
                         'Gentoo']].values)

For missing values that occur in rows where modeling is impossible, you would have to fall back on one of the earlier strategies.
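One way to put these pieces together is sketched here on made-up data: fit the regression on the complete rows, write its predictions into the rows it can handle, and fall back to a sample statistic for the rest. The data frame `toy`, the single feature and the choice of the mean as the fallback are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: body_mass_g is exactly 20 * flipper_length_mm,
# purely so that the imputed values are easy to check by eye
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, 200.0, 210.0, 205.0, np.nan],
    "body_mass_g": [3600.0, 3800.0, 4000.0, 4200.0, np.nan, np.nan],
})

features = ["flipper_length_mm"]
target = "body_mass_g"

# Rows where the target is missing but every feature is present
can_model = toy[target].isna() & toy[features].notna().all(axis=1)

# Fit the imputation model on the complete rows only
complete = toy.dropna()
reg = LinearRegression()
reg.fit(complete[features].values, complete[target].values)

# Write the predictions back into the rows the model can handle
imputed = toy.copy()
imputed.loc[can_model, target] = reg.predict(toy.loc[can_model, features].values)

# Fall back to the observed mean for rows the model cannot handle
imputed[target] = imputed[target].fillna(complete[target].mean())

print(imputed)
```

The last row is still missing its flipper length, so that feature would need its own imputation step before the row could be used in a model.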

#### A note about predictive modeling

If the end goal for your data set is to build a predictive model, you <b>cannot</b> use the column that you are trying to predict in an imputation model. For example, if we were looking to build a model that predicted the `species` of the penguin, we would not be allowed to use `species` (or the `Adelie` and `Gentoo` indicators derived from it) in any feature imputation model.

## Imputation in predictive modeling projects

Recall that when we scale data in predictive modeling projects we have to fit the scaler on the training data and then use that fit scaler on the test data. Importantly, we do <b>not</b> refit the scaler on the test or holdout data. We have to take a similar approach to any imputation technique.

For example:
- Imputation using any sample statistic approach must use the sample statistic computed on the training data to impute missing values in the test, validation or holdout sets and
- Models trained to impute missing values must be fit on the training data, and that fitted model is what is used on the test data.
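Done by hand, the first point might look like the following sketch, where `toy_train` and `toy_test` are made-up stand-ins for a real split. The mean is computed once on the training set and then reused, unchanged, on the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy train and test sets
toy_train = pd.DataFrame({"body_mass_g": [3000.0, 4000.0, np.nan, 5000.0]})
toy_test = pd.DataFrame({"body_mass_g": [6000.0, np.nan]})

# The statistic comes from the training set only
train_mean = toy_train["body_mass_g"].mean()

# The same training statistic fills the NAs in both sets
train_imputed = toy_train.fillna({"body_mass_g": train_mean})
test_imputed = toy_test.fillna({"body_mass_g": train_mean})

print(train_mean)
print(test_imputed)
```

Note that the missing test value is filled with the training mean, not with anything computed from the test set itself.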

`sklearn`'s `SimpleImputer` object is nicely set up for this.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
penguins_train, penguins_test = train_test_split(penguins, 
                                                    shuffle=True,
                                                    random_state=32,
                                                    test_size=.2)

In [None]:
penguins_train.info()

In [None]:
penguins_test.info()

In [None]:
from sklearn.impute import SimpleImputer

## Define an imputer with the 'mean' strategy
impute = SimpleImputer(strategy='mean')

## fitting the imputer on the training set
impute.fit(penguins_train[['bill_length_mm', 
                          'bill_depth_mm', 
                          'flipper_length_mm',
                          'body_mass_g']])

## Imputing the training set
impute.transform(penguins_train[['bill_length_mm', 
                              'bill_depth_mm', 
                              'flipper_length_mm',
                              'body_mass_g']])


In [None]:
## Imputing the test set
impute.transform(penguins_test[['bill_length_mm', 
                          'bill_depth_mm', 
                          'flipper_length_mm',
                          'body_mass_g']])

`SimpleImputer` (and `sklearn`'s other imputer objects, <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute">https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute</a>) can be incorporated into a pipeline just like `StandardScaler` or any other `sklearn` preprocessing object.
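A minimal sketch of such a pipeline follows. The arrays are made-up toy data, and the median strategy, the step names and the final `LinearRegression` step are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features (with missing entries) and target
X_train = np.array([[180.0, 18.0],
                    [190.0, np.nan],
                    [np.nan, 17.0],
                    [210.0, 15.0],
                    [200.0, 16.0]])
y_train = np.array([3600.0, 3800.0, 3900.0, 4200.0, 4000.0])
X_test = np.array([[np.nan, 16.5],
                   [205.0, np.nan]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("reg", LinearRegression()),
])

# fit learns the medians, the scaling and the regression from the training data
pipe.fit(X_train, y_train)

# predict reuses the training medians to fill in the test set before predicting
test_predictions = pipe.predict(X_test)

# The fitted imputer step exposes the statistics it learned
train_medians = pipe.named_steps["impute"].statistics_
print(train_medians)
```

Because the imputer lives inside the pipeline, calling `fit` on training data and `predict` on test data keeps the imputation statistics tied to the training set without any extra bookkeeping.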

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)