## When do I need to care about missing data?

If some of the values in your dataset are missing, then you can't reliably

- calculate summary statistics
- run most machine learning models
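
As a quick illustration of the first point, here is a minimal sketch using a made-up array (not the case study data). **NumPy**'s plain mean returns `nan` as soon as one value is missing, while the NaN-aware version quietly ignores it.

```python
import numpy as np

# Toy data with one missing value (the numbers are invented)
sleep_hours = np.array([12.1, 17.0, np.nan, 14.9])

# The plain mean propagates the missing value
plain_mean = np.mean(sleep_hours)

# The NaN-aware mean skips it, which amounts to dropping that value
nan_aware_mean = np.nanmean(sleep_hours)

print(plain_mean, nan_aware_mean)
```

Note that **pandas** skips missing values by default in methods like `.mean()`, so it is easy to drop data without realizing it.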

## Why is data sometimes missing?

There are many reasons!

- For survey data, perhaps someone declined to answer a question.
- For website data, perhaps some users installed privacy tools so you could not track their behavior.
- For sensor data, perhaps a sensor was not working, or a signal was too small to detect.

Understanding why data is missing is crucial to deciding how best to deal with it.

## How can I handle missing data?

There are essentially two choices:

- Delete the missing data (drop the rows from your dataset)
- Make up reasonable values (imputation)

## What are the steps for handling missing data?

1. Standardize how missing values are recorded
2. Quantify how much missing data you have
3. Either delete or impute the missing values
4. Run your model (or do other analysis)
5. Check the effect of missing values on your model

## What Python packages can I use for handling missing data?

- **pandas** (used here)
- **scikit-learn** (used here)
- **PyCaret**

## Limitations

- **Time series aren't covered here**. In time series data, rows at nearby time points are related, so they have their own methods for dealing with missing data.
- **Survival analysis isn't covered here**. If values are missing because they exceed a threshold, then you have a survival analysis problem, which requires different techniques.
- **Multiple imputation isn't covered here**. The most sophisticated techniques for dealing with missing data involve multiple imputation, but that isn't well supported in scikit-learn.
- **Imputation is not valid when missing values are "missing not at random"**. If there is a pattern to the missingness caused by variables that aren't in the dataset, then the techniques discussed here aren't valid. 

## Case study: Mammalian sleep durations

Let's explore a popular dataset on mammalian sleep durations, `msleep`. The dataset was compiled from a 2007 paper by V. M. Savage and G. B. West, supplemented with data from Wikipedia, and was popularized by R's **ggplot2** package. It's available in Python via the **plotnine** package.

Since the dataset is a dataframe, we'll import **pandas** too.

In [None]:
import pandas as pd
from plotnine.data import msleep

Some extra code has been added to demonstrate issues with dirty data.

In [None]:
msleep_dirty = msleep.copy()
msleep_dirty["conservation"] = msleep_dirty["conservation"].cat.set_categories(["lc", "nt", "vu", "en", "cr", "ew", "ex", "unknown"], ordered=True).fillna(value="unknown")
msleep_dirty["sleep_rem"] = msleep_dirty["sleep_rem"].fillna(value=-999)
msleep_dirty

## Standardizing missing values

Before we can deal with missing data, we must standardize the format of the missing values. That means

- converting strings like `"N/A"` or `"unknown"` to true NAs.
- converting code numbers like `-999` to true NAs.

By a "true NA", I mean **NumPy**'s `nan` value. (**pandas** also has a special value for missing data, `NA`, but it isn't yet widely supported.)
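
One quirk of `nan` is worth seeing before we go on. The sketch below uses a made-up series to show that `nan` is a float that is never equal to anything, so missing values should be detected with `pd.isna()` (or the `.isna()` method) rather than `==`.

```python
import numpy as np
import pandas as pd

# nan is a float, and it is never equal to anything, including itself
print(type(np.nan))
print(np.nan == np.nan)

# So test for missingness with .isna() rather than ==
values = pd.Series([1.5, np.nan, 3.0])
is_missing = values.isna()
print(is_missing.tolist())

# pandas's own NA marker is also detected by pd.isna()
print(pd.isna(pd.NA))
```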

In [None]:
from numpy import nan

To standardize the missing values, we replace them using the [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method.

In [None]:
msleep = msleep_dirty.replace(
    {"conservation": "unknown", "sleep_rem": -999},
    value=nan
)
msleep

Notice that the missing values in the `conservation` and `sleep_rem` columns now display as `NaN` or `null`, depending on your Jupyter notebook editor.
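
If your data comes from a file, you can often standardize the missing values at read time instead. Here is a minimal sketch, assuming a small made-up CSV (the rows are invented for illustration, not taken from `msleep`), that uses the `na_values` argument of `pd.read_csv()`.

```python
import io
import pandas as pd

# A made-up CSV that uses "unknown" and -999 as missing value codes
csv_text = """name,conservation,sleep_rem
animal_a,lc,1.9
animal_b,unknown,-999
animal_c,nt,2.4
"""

# Map each column to the codes that should be read as NaN.
# The codes are given as strings so they match the raw text in the file.
toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"conservation": ["unknown"], "sleep_rem": ["-999"]},
)
print(toy.isna().sum())
```

Passing a dictionary keeps the codes column-specific, so a legitimate `-999` in some other column would not be converted by accident.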

## Quantifying missing values

To decide on the best technique for handling missing data, we need to know how much there is. We can find the proportion of missing data in each column by combining the [`.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) and [`.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) methods.

A value of zero means no missing data, and a value of one means all values in the column are missing.

In [None]:
msleep.isna().mean()

`vore` has a small amount of missing data, with 8% missing. `sleep_cycle` has the most missing data, with 61% missing.
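
With many columns, it can help to sort the proportions and to look at absolute counts as well. A small sketch on a made-up data frame (the column names mimic `msleep`, but the values are invented):

```python
import numpy as np
import pandas as pd

# Toy data frame (made-up values) with missing entries
toy = pd.DataFrame({
    "vore": ["carni", None, "herbi", "omni"],
    "sleep_total": [12.1, 17.0, 14.4, 14.9],
    "sleep_cycle": [np.nan, np.nan, np.nan, 0.13],
})

# Proportion missing per column, worst first
prop_missing = toy.isna().mean().sort_values(ascending=False)
print(prop_missing)

# Absolute counts are often easier to reason about in small datasets
n_missing = toy.isna().sum()
print(n_missing)
```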

## Dropping rows with missing values

The simplest way to handle missing data is to get rid of any rows that contain a missing value, using the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.

In [None]:
msleep.dropna()

In this case, we've dropped most of the dataset, going from 83 rows to 12. This is far from ideal.

Dropping data is only a suitable solution if there is only a very small amount of missing data.
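
If dropping every incomplete row is too aggressive, `.dropna()` has arguments for being more selective. The sketch below, on a made-up data frame, uses `subset` to consider only the columns you need and `thresh` to keep rows with a minimum number of non-missing values.

```python
import numpy as np
import pandas as pd

# Toy data frame (made-up values)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "sleep_total": [12.1, np.nan, 14.4, 14.9],
    "sleep_rem": [np.nan, np.nan, 2.2, 1.8],
    "sleep_cycle": [np.nan, np.nan, np.nan, 0.13],
})

# Drop rows only when the column you actually need is missing
needs_total = toy.dropna(subset=["sleep_total"])

# Keep rows that have at least 3 non-missing values
mostly_complete = toy.dropna(thresh=3)

print(len(toy.dropna()), len(needs_total), len(mostly_complete))
```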

## Separating the dataset by column data type

In **scikit-learn**, numeric data and unordered categorical data currently require different techniques for imputation. 

For ordered categorical data, you can either treat it in the same way as unordered categorical data, or transform it into integers with [`OrdinalEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html), and pretend that it is numeric. Here we'll take the former approach for simplicity.
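
For the curious, here is a rough sketch of what the "pretend it is numeric" route looks like. Note that it uses **pandas** category codes rather than `OrdinalEncoder()`, purely to keep the example short, and the four values are made up.

```python
import numpy as np
import pandas as pd

# Ordered conservation levels, from least to most threatened
levels = ["lc", "nt", "vu", "en", "cr", "ew", "ex"]
status = pd.Series(
    pd.Categorical(["lc", None, "en", "nt"], categories=levels, ordered=True)
)

# .cat.codes gives each level's position, and -1 for missing values
codes = status.cat.codes

# Convert to float and restore the missing values as NaN
status_num = codes.astype("float64").replace(-1.0, np.nan)
print(status_num.tolist())
```

The resulting column can then be imputed alongside the other numeric columns, at the cost of assuming the levels are evenly spaced.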

The next step is to split the dataset into two by column type.

In [None]:
msleep_num = msleep.select_dtypes("float")
msleep_num

In [None]:
msleep_cat = msleep.select_dtypes(["object", "category"])
msleep_cat
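
All the numeric columns in `msleep` are floats, so `"float"` is enough here. If your own dataset also has integer columns, a slightly more general split might look like this sketch (the data frame and its `n_obs` column are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b"],
    "n_obs": [3, 5],              # integer column
    "sleep_total": [12.1, 17.0],  # float column
})

# "number" matches integer and float columns alike
toy_num = toy.select_dtypes(include="number")

# Everything else goes in the other half
toy_other = toy.select_dtypes(exclude="number")

print(list(toy_num.columns), list(toy_other.columns))
```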

## Replacing with means or medians

The simplest (and stupidest) imputation approach is just to replace missing values with the mean or median of the column.

We have two options: **scikit-learn**'s [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) and **pandas**'s [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).

We'll focus on `SimpleImputer()` since it's easiest to transition to the more advanced imputers later, but check our working with `.fillna()`.

First we import the **impute** submodule of **scikit-learn**, and create a `SimpleImputer()` object.

In [None]:
import sklearn.impute as si
simp = si.SimpleImputer()

Now we call the [`.fit_transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit_transform) method. *Fitting* means calculating the mean of the columns, and *transforming* means replacing the missing values with those means.

One annoyance is that **scikit-learn** only cares about **NumPy** arrays, so we have to manually convert the result back to a data frame.

In [None]:
pd.DataFrame(simp.fit_transform(msleep_num), columns=msleep_num.columns)

Before we get into the results, let's try this again with **pandas** to see how it works. We calculate the mean of each column (fit), and fill the missing values with those means (transform).

In [None]:
column_means = msleep_num.mean()
msleep_num.fillna(column_means)

Notice that we have the same result in each case, so our code is correct!

Just like dropping data, imputing with the mean or median is only a suitable solution if

1. there is only a very small amount of missing data, and
2. you have lots of features (anecdotally, it performs better in that case).
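
The code above used the default strategy, the mean. To impute with the median instead, pass `strategy="median"`. Here is a minimal sketch on a made-up column with an outlier, where the two strategies give very different answers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with an outlier, so the mean and median differ a lot
toy = pd.DataFrame({"bodywt": [1.0, 2.0, 3.0, 1000.0, np.nan]})

median_imputer = SimpleImputer(strategy="median")
toy_median = pd.DataFrame(
    median_imputer.fit_transform(toy), columns=toy.columns
)

# Check the working with pandas, as before
toy_median_pd = toy.fillna(toy.median())

# Compare the imputed value with what the mean would have given
print(toy_median["bodywt"].iloc[-1], toy["bodywt"].mean())
```

The median is usually the safer choice for skewed columns, since a single extreme value can drag the mean a long way.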

## Replacing with the most frequent value

For categorical columns, there is no mean, so an alternative is to use the mode. That is, the most frequent value in the column.

In [None]:
simp_mf = si.SimpleImputer(strategy="most_frequent")
pd.DataFrame(simp_mf.fit_transform(msleep_cat), columns=msleep_cat.columns)

Again, this isn't ideal, but there aren't really any good alternatives built into **scikit-learn**.
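
As with the mean, we can check our working with **pandas**. This sketch uses a made-up categorical data frame and combines `.mode()` with `.fillna()`.

```python
import pandas as pd

# Toy categorical data (made-up values)
toy_cat = pd.DataFrame({
    "vore": ["herbi", "carni", None, "herbi"],
    "conservation": ["lc", None, "lc", "nt"],
})

# .mode() returns a data frame; its first row holds each column's mode
column_modes = toy_cat.mode().iloc[0]

# Fill each column's missing values with that column's mode
toy_filled = toy_cat.fillna(column_modes)
print(toy_filled)
```

When several values tie for most frequent, `.mode()` returns all of them in sorted order, so taking the first row picks the first in that order.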

## Using iterative methods to find the best replacement

A more sophisticated option is to use [`IterativeImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html). This fits a predictive model (by default, a Bayesian ridge regression) to each column, one at a time, using the other columns as predictors. Missing values start out as simple estimates (column means by default), and repeating this process several times (iterating) refines those estimates.

As with replacing using the mean, this only works with numeric columns.

`IterativeImputer()` should be your default starting point for most imputation. As of **scikit-learn** `1.0.2`, it's considered experimental, so you need to enable it before using it.

In [None]:
# Importing this module has the side effect of enabling IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
iimp = si.IterativeImputer()
pd.DataFrame(iimp.fit_transform(msleep_num), columns=msleep_num.columns)

Notice that this time, the first three missing values in `sleep_cycle` have been replaced with different values.
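
To see why this matters, here is a rough sketch on made-up data where one column is almost a linear function of the other. We hide one known value, then compare how close mean imputation and iterative imputation get to it. The variable names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Made-up data where y is roughly 2 * x + 1
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
toy = pd.DataFrame({"x": x, "y": 2 * x + 1 + noise})

# Hide the last y value, remembering what it really was
true_y = toy.loc[9, "y"]
toy.loc[9, "y"] = np.nan

# Impute the hidden value both ways (row 9, column 1 is the hidden y)
mean_fill = SimpleImputer().fit_transform(toy)[9, 1]
iter_fill = IterativeImputer(random_state=0).fit_transform(toy)[9, 1]

print(f"true: {true_y:.2f}, mean: {mean_fill:.2f}, iterative: {iter_fill:.2f}")
```

Because the iterative imputer can use `x` to predict `y`, its estimate should land much closer to the hidden value than the column mean does. Real data is rarely this tidy, so treat this as an upper bound on how much it helps.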

## Where can I learn more?

- DataCamp's [Dealing with Missing Data in Python](https://app.datacamp.com/learn/courses/dealing-with-missing-data-in-python), [Cleaning Data in Python](https://app.datacamp.com/learn/courses/cleaning-data-in-python) and [Machine Learning with scikit-learn](https://app.datacamp.com/learn/courses/machine-learning-with-scikit-learn) courses.
- pandas [Working with Missing Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) tutorial.
- scikit-learn [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html) tutorial.