## Imputing Values for Machine Learning
## By Jeff Hale

Most machine learning algorithms in sklearn require that there be no missing data. Some machine learning algorithms implemented outside sklearn can handle missing data automatically. In this post we're going to assume you want to learn all the sklearn goodness such as pipelines, so we're going to make sure our pandas data frames don't have any missing data.

There are a variety of ways to deal with missing data in machine learning. 

One option is to delete observations or variables with any missing data. If a very small proportion of observations or variables have missing data, dropping them might not have much of an effect on a model's performance. But generally we don't want to throw away data that isn't erroneous - it might include some bit of meaning that could theoretically help a machine learning model perform better.

For example, in the popular [Ames Housing Dataset](http://ww2.amstat.org/publications/jse/v19n3/decock.pdf) very few of the samples have a value for the Pool Quality variable, yet we might expect that a property with a pool in good shape is more valuable than a similar property with a pool in poor shape. A model might use that information to more accurately predict the sale price of the property. Drop that column from the dataframe and that opportunity is lost.

Likewise, you don't want to drop observations just because they don't have a value for the pool quality variable, because you wouldn't have many observations left :)
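
To make the tradeoff concrete, here is a minimal sketch of the dropping options in pandas. The toy DataFrame and its column names are made up for illustration; they are not the real Ames data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "lot_area": [8450, 9600, np.nan, 9550],
    "pool_quality": [np.nan, np.nan, "Gd", np.nan],
    "sale_price": [208500, 181500, 223500, 140000],
})

# Share of missing values per column, to judge what dropping would cost.
missing_share = df.isna().mean()

# Drop every row that has at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column that has at least one missing value.
cols_dropped = df.dropna(axis=1)

# A softer option: drop only the columns that are more than half empty.
mostly_complete = df.loc[:, missing_share <= 0.5]

print(missing_share)
print(len(rows_dropped), "rows left after dropping incomplete rows")
```

In this toy example every row is missing something, so dropping incomplete rows leaves nothing behind, which is exactly the problem described above.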

Of course, the true test of the value of imputing missing data for any individual data set is whether a model performs better with the imputed data or not. The goal of the current project is to develop some processes and code snippets as guidelines or starting points that will be helpful for imputing data. We want a process to impute data in a manner that is quick and useful for machine learning pipelines.

I also hope to expose readers to a wider range of options than you might have known were possible to help you deal with missing data.

So how should we fill the missing values? Like lots of things in Data Science, there are a slew of different options.

For the absolute best results you should try several approaches. But because there is a time and effort tradeoff and there are so many aspects of a problem to look at, we want a quick way to handle missing values that does a good enough job.

Imputer options for interval features:
* Mean
* Median 
* Mode - probably only appropriate if the data is really ordinal or nominal.
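
As a rough sketch, these three options map onto the `strategy` argument of scikit-learn's `SimpleImputer`. The column names and values below are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "lot_area": [8000.0, 9000.0, np.nan, 13000.0],
    "overall_quality": [5.0, 7.0, 7.0, np.nan],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2D input, hence the double brackets.
lot_area_mean = mean_imputer.fit_transform(df[["lot_area"]])
lot_area_median = median_imputer.fit_transform(df[["lot_area"]])
quality_mode = mode_imputer.fit_transform(df[["overall_quality"]])

print(lot_area_mean.ravel())
print(lot_area_median.ravel())
print(quality_mode.ravel())
```

The median is usually the safer default when a column has outliers, since a few extreme values can pull the mean a long way.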

Options for string data:
* If one hot encoding categorical data, we can also make a one hot encoded column for missing data for a feature.
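
One way to get that extra column, assuming you encode with pandas, is the `dummy_na` argument of `pd.get_dummies`. The feature name and values here are made up for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal feature with one missing value.
df = pd.DataFrame({"garage_type": ["Attchd", "Detchd", np.nan, "Attchd"]})

# dummy_na=True adds one extra indicator column for the missing values.
encoded = pd.get_dummies(df, columns=["garage_type"], dummy_na=True)

print(encoded)
```

Missingness is then treated as just another category, so no value has to be invented for the empty cells.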

Special options for time series data:
* Fill forward
* Fill backward
* Average of backward and forward
* Any of the above options combined with a seasonality factor
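
The first three time series options can be sketched with pandas like this. The series is invented, and the seasonality adjustment is left out because it depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps.
ts = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
    index=pd.date_range("2018-01-01", periods=6, freq="D"),
)

# Carry the last known value forward.
filled_forward = ts.ffill()

# Pull the next known value backward.
filled_backward = ts.bfill()

# Average of the two: the midpoint of the neighbors on either side of a gap.
filled_average = (filled_forward + filled_backward) / 2

print(pd.DataFrame({
    "forward": filled_forward,
    "backward": filled_backward,
    "average": filled_average,
}))
```

Note that a backward fill uses information from the future, so it can leak into a forecasting model if you aren't careful.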

Options for any type of data:
* Use a model such as KNN to find similar observations and fill in the missing values from them.
* Use multiple imputation, which solves some problems and is more thorough than any single imputation. MICE (Multiple Imputation by Chained Equations) is available in Fancy Impute: `from fancyimpute import MICE`
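
Here is a minimal sketch of the KNN idea. It uses scikit-learn's `KNNImputer` rather than fancyimpute, which is my substitution for the example, and the array is toy data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second feature is missing for the third row.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Each missing value is filled with the mean of that feature across the
# n_neighbors nearest rows, measured on the features that are present.
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X)

print(X_filled)
```

Because the distances depend on feature scale, it is usually worth scaling the features before using a KNN-based imputer.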

The important thing is the type of data (interval, ordinal, or nominal) - not the original dtype. Everything is going to be made into integer or float data for use in the machine learning model.

Then after imputation, there is the additional choice:
* Create a column for each predictor variable to note whether the final value was imputed for that element or not. This is suggested by Dan Becker in Kaggle's guide [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values).
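
A minimal sketch of that idea in pandas follows. The column names and the `_was_missing` suffix are my own choices for illustration, and the mean is used as a stand-in for whatever imputation you prefer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "lot_frontage": [65.0, np.nan, 80.0],
    "garage_year": [2003.0, 1976.0, np.nan],
})

# Record where values are missing BEFORE filling them in.
indicators = df.isna().astype(int).add_suffix("_was_missing")

# Fill the gaps (column means here, as a placeholder strategy).
imputed = df.fillna(df.mean())

# Keep the filled values and the flags side by side.
df_with_flags = pd.concat([imputed, indicators], axis=1)

print(df_with_flags)
```

scikit-learn's `SimpleImputer(add_indicator=True)` offers a similar result if you would rather stay inside a pipeline.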

Here's a [great post](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4) on imputing values by Alvira Swalin. The code is in R. 

Here's more info on [MICE](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/).

In addition to MICE, the [fancyimpute package](https://github.com/iskandr/fancyimpute) also provides implementations of basic imputation, KNN, and other methods.

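Using MICE is generally better than a single imputation because it introduces some randomness into the imputed values, which better reflects the uncertainty about what the missing values really were.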
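
To illustrate the randomness, here is a sketch that uses scikit-learn's experimental `IterativeImputer`, which the scikit-learn docs describe as inspired by MICE. It is not fancyimpute's MICE class, and the data is synthetic. Setting `sample_posterior=True` makes each run draw imputed values at random, so different seeds give different completed data sets.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the third column is roughly the sum of the first two.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

# Knock out every fifth value in the third column.
X_missing = X.copy()
X_missing[::5, 2] = np.nan

# Several imputations, each with its own random draws.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]

# Simplification: average the completed data sets. A fuller multiple
# imputation workflow would fit a model on each one and pool the results.
pooled = np.mean(imputations, axis=0)

print(np.std([imp[0, 2] for imp in imputations]))
```

The spread across the five imputations gives a feel for how uncertain each filled-in value is.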

## Measurement scales

To talk about imputing we need to talk about the measurement scales of variables. See the discussion in my previous Kaggle kernel [here](https://www.kaggle.com/discdiver/measurement-scales-for-machine-learning/).

Here's a working process (also a work-in-process) for dealing with missing values. 

1. Your first task is to figure out the measurement scale of each column. Write out all your variables in a spreadsheet such as a Google Sheet, with their definitions, your thoughts, and, if discrete, their possible values. Note each variable's dtype and measurement scale. [Here's mine for the Ames Housing Dataset](https://docs.google.com/spreadsheets/d/106ZP2r97yRkkTbBqV9oEt00XNnjomhj3BvIaCNaeWlk/edit?usp=sharing). This idea came from [Pedro Marcelino's popular Kernel](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python). This sheet can be very helpful when you're seeking to understand your data and when you're looking to create new features.
2. Once you've decided what measurement scale each variable is, turn the values into numbers.
3. Use MICE to impute the missing values.
4. Create columns of values denoting whether a value was imputed.
5. One hot encode nominal and ordinal data or, ideally, try several encoding schemes with [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/). If you haven't imputed your missing values yet, Category Encoders will make a new column to indicate whether a variable's value was missing.
6. Bin/binarize as needed.
7. Proceed with preprocessing to evaluate outliers, scale and transform data, feature engineer, etc.
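
As a small sketch of the "turn the values into numbers" step for an ordinal variable, you can map the quality labels to integers with a dictionary. The column name, the labels, and the numeric codes below are assumptions for the example; use whatever ordering your own data dictionary defines.

```python
import numpy as np
import pandas as pd

# Hypothetical ordinal feature with one missing value.
df = pd.DataFrame({"pool_quality": ["Ex", "Gd", np.nan, "Fa", "TA"]})

# Assumed ordering from worst to best; the codes themselves are arbitrary.
quality_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# map() leaves missing values as NaN, ready for the imputation step.
df["pool_quality_num"] = df["pool_quality"].map(quality_order)

print(df)
```

Keeping the missing values as NaN at this stage means the imputer, not the encoding, decides how they get filled.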


Have improvements? Please share them in the comments.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.base import TransformerMixin


This basic imputer fills values differently depending on dtype. It will often need tweaking for time series, ordinal, or other data characteristics.

In [None]:
class DataFrameImputer(TransformerMixin):
    """Impute missing values in a pandas DataFrame.

    Columns of dtype object are imputed with the mode (most frequent value).

    Columns of numerical dtypes are imputed with the mean of the column.

    TODO: find a way to denote ordinal data more easily in pandas.
    """

    def fit(self, X, y=None):
        # Learn one fill value per column: the most frequent value for
        # object columns, the mean for everything else.
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() for c in X],
            index=X.columns)
        # TODO: decide how to treat boolean data.

        return self

    def transform(self, X, y=None):
        # Replace NaNs in each column with the value learned in fit().
        return X.fillna(self.fill)

    # TODO: make columns to track whether a value was imputed?
    # TODO: make a one hot encoded column that notes the data was missing?


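# Fit on the training data only, then reuse the learned fill values on the test data.
#imputer = DataFrameImputer().fit(train_data)
#train_data = imputer.transform(train_data)
#test_data = imputer.transform(test_data)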