# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 4)}}$

## $\color{purple}{\text{Common Treatment Practices}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import missingno
import numpy as np
plt.style.use('ggplot')
from helpers import stat_comparison
from autoimpute.imputations import SingleImputer

### $\color{purple}{\text{Load Datasets for this Lesson}}$
These datasets are used throughout most of this section.

In [None]:
pristine_df = pd.read_csv('data/full_set.csv')
mcar_df = pd.read_csv('data/mcar_set.csv')
mar_df = pd.read_csv('data/mar_set.csv')

### $\color{purple}{\text{Deletion}}$

Deletion is the most common treatment, but it is also the most susceptible to bias. It is intended for data that is missing completely at random (MCAR). In some settings, guidelines treat the bias from deletion as acceptable when the share of missing values is below a threshold.

Deletion comes in two flavors: _list deletion_ (also called listwise deletion) and _pair deletion_ (also called pairwise deletion).

#### $\color{purple}{\text{List Deletion}}$
This is the simplest approach: drop every record that has a missing value.

In [None]:
cleaned = mar_df.dropna()

In [None]:
stat_comparison(pristine_df, cleaned, 'feature a')
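To see why the missingness mechanism matters for deletion, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration and not part of the lesson's datasets: the column names `a` and `b`, the 30% MCAR rate, and the logistic rule that makes `a` more likely to be missing when `b` is large.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: 'a' is correlated with the always-observed column 'b'.
b = rng.normal(size=n)
a = b + 0.5 * rng.normal(size=n)
full = pd.DataFrame({'a': a, 'b': b})

# MCAR: every value of 'a' has the same 30% chance of being missing.
mcar = full.copy()
mcar.loc[rng.random(n) < 0.3, 'a'] = np.nan

# MAR: the chance that 'a' is missing grows with the observed value of 'b'.
mar = full.copy()
p_missing = 1 / (1 + np.exp(-2 * b))
mar.loc[rng.random(n) < p_missing, 'a'] = np.nan

# List deletion: drop every row that contains a missing value,
# then compare the mean of 'a' with the mean of the complete data.
true_mean = full['a'].mean()
mcar_bias = mcar.dropna()['a'].mean() - true_mean
mar_bias = mar.dropna()['a'].mean() - true_mean
print(f"MCAR bias: {mcar_bias:+.3f}, MAR bias: {mar_bias:+.3f}")
```

Under MCAR the surviving rows are a random subsample, so the mean stays close to the true value. Under MAR the rows with large `b` (and therefore large `a`) are removed more often, so the mean of the remaining rows is pulled downward.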

#### $\color{purple}{\text{Pair Deletion}}$
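A record is left out only of the calculations that need one of its missing values; its observed values are still used everywhere else.
For example, for the covariance: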

$$\sigma_{ij} =\sum_{k} \frac{(x_{ik}-\mu_i)(x_{jk}-\mu_j)}{N^*} $$
where the sum runs over all $k$ for which neither $x_{ik}$ nor $x_{jk}$ is missing, and $N^*$ is the number of such pairs. This is how the pandas `cov` method deals with missing data (by default it divides by $N^* - 1$ rather than $N^*$).
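The pairwise rule can be checked on a small example. The sketch below uses a made-up three-column frame (the values and the column names `x`, `y`, `z` are illustrative assumptions) and compares a hand-built pairwise covariance with `DataFrame.cov`, and with what list deletion would give.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    'x': [1.0, 2.0, np.nan, 4.0, 5.0],
    'y': [2.0, np.nan, 6.0, 8.0, 11.0],
    'z': [1.0, 0.0, 1.0, np.nan, 3.0],
})

# Pair deletion for (x, y): keep only the rows where both values are present.
pair = df_toy[['x', 'y']].dropna()
manual_xy = pair['x'].cov(pair['y'])  # sample covariance, divides by N* - 1

# DataFrame.cov applies the same rule separately to every pair of columns.
pairwise = df_toy.cov()

# List deletion instead uses only the rows that are complete in all columns.
listwise = df_toy.dropna().cov()

print(manual_xy, pairwise.loc['x', 'y'], listwise.loc['x', 'y'])
```

Three rows are complete for `x` and `y`, but only two rows are complete across all three columns, so the two approaches give different estimates for the same pair.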

### $\color{purple}{\text{Interpolation}}$

If the rows have a geometric relationship (a time series or a spatial relation) and the feature in question is believed to be continuous, interpolation can be applied.

The interpolate dataset was constructed by sampling continuous functions to simulate time-series data.

Here is a little EDA on this set.

In [None]:
df=pd.read_csv('data/interpolate.csv')
df

We'll create a `missing` column so that the plots can distinguish the interpolated points from the observed ones. We'll also create a more attractive colormap for plotting.

In [None]:
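# 1 where y is missing, 0 where it was observed
df['missing'] = df.y.isnull().astype(int)
# Observed points (0) plot in red, filled-in points (1) in deepskyblue
cmap = matplotlib.colors.ListedColormap(['red', 'deepskyblue'])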

#### $\color{purple}{\text{Back Fill/Forward Fill}}$

The simplest interpolation methods (they only apply along one dimension):

 * Forward Fill / Last Observation Carried Forward (LOCF)
 * Back Fill / Next Observation Carried Backward (NOCB)

In [None]:
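# Sort by x so that "previous" means the previous point along the x axis
ndf = df.sort_values('x').ffill()
# Take the colors from ndf so they stay aligned with the sorted rows
plt.scatter(ndf.x, ndf.y, c=ndf.missing, cmap=cmap)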
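The difference between the two fills is easiest to see on a tiny series. This is a minimal sketch with made-up values, not the lesson's `interpolate.csv` data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carry the last observed value forward
backward = s.bfill()  # pull the next observed value backward

print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward}))
```

Note that the trailing gap has no later observation, so back fill leaves it missing; forward fill would likewise leave a leading gap untouched.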

#### $\color{purple}{\text{Other Interpolation Techniques}}$

Some interpolation kinds supported by `scipy`:
* linear
* quadratic
* cubic

In [None]:
from scipy import interpolate

kind = 'linear'
known = df.dropna()  # fully observed rows, used to fit the interpolators
ndf = df.copy()
# Fit on the known points, then evaluate at every x (x must lie within the known range)
interpolator_y = interpolate.interp1d(known.x, known.y, kind=kind)
ndf['y'] = interpolator_y(ndf.x)
interpolator_z = interpolate.interp1d(known.x, known.z, kind=kind)
ndf['z'] = interpolator_z(ndf.x)
plt.scatter(ndf.y, ndf.z, c=df.missing, cmap=cmap)
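As a self-contained check of the same idea, the sketch below interpolates a toy series sampled from $y = 2x + 1$ with two interior values removed (the data and the name `toy` are assumptions for illustration). It fills only the missing entries, leaving the observed values untouched, and loops over the three kinds listed above.

```python
import numpy as np
import pandas as pd
from scipy import interpolate

# Toy series sampled from y = 2x + 1 with two interior values removed.
toy = pd.DataFrame({'x': np.arange(8, dtype=float)})
toy['y'] = 2 * toy['x'] + 1
toy.loc[[2, 5], 'y'] = np.nan

known = toy.dropna()
filled = {}
for kind in ('linear', 'quadratic', 'cubic'):
    f = interpolate.interp1d(known['x'].to_numpy(), known['y'].to_numpy(), kind=kind)
    estimates = pd.Series(f(toy['x'].to_numpy()), index=toy.index)
    # Only overwrite the missing entries; observed values stay as they are.
    filled[kind] = toy['y'].fillna(estimates)

print(pd.DataFrame(filled))
```

Because the missing points are interior, every query lies inside the range of the known `x` values. With its default settings `interp1d` raises an error for points outside that range, so gaps at the very start or end of a series need separate handling.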

### $\color{purple}{\text{Univariate Imputation}}$
* Constant fill (not recommended)
    * `df.fillna(0)`
* Mean/Median
* Mode/Frequent
* Random

#### $\color{purple}{\text{Mean or Median Imputation}}$
For any column, fill each missing value with the mean or the median of that feature.
* Easy and quick
* Does not correct the bias for MAR (or MNAR) data
* Biases the variance downward, because every filled value sits at the center of the distribution


In [None]:
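# Pass a dict so only 'feature a' is filled; a bare scalar would fill every column with this mean
imputed_df = mar_df.fillna({'feature a': mar_df['feature a'].mean()})
stat_comparison(pristine_df, imputed_df, 'feature a')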

In [None]:
# The same idea with autoimpute: each column is imputed with its own mean
imputer = SingleImputer(strategy='mean')
imputed = imputer.fit_transform(mar_df)
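The downward bias in the variance can be demonstrated on synthetic data. This is a minimal sketch under stated assumptions: a normal feature with mean 10 and standard deviation 2, with 30% of its values removed completely at random; none of these numbers come from the lesson's datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
full = pd.Series(rng.normal(loc=10, scale=2, size=5_000), name='feature')

# Remove 30% of the values completely at random (MCAR).
observed = full.copy()
observed[rng.random(len(full)) < 0.3] = np.nan

# Mean imputation: every gap receives the mean of the observed values.
mean_filled = observed.fillna(observed.mean())

print(f"std observed:    {observed.std():.3f}")
print(f"std mean-filled: {mean_filled.std():.3f}")
```

The mean is unchanged by construction, but the filled values add no spread, so the standard deviation of the imputed column is smaller than that of the observed values.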

#### $\color{purple}{\text{Mode/Frequent Imputation}}$
For any column, fill each missing value with the most frequent value of that feature.
* Has the same drawbacks as mean or median imputation has for continuous values

In [None]:
cat_mar_df = pd.read_csv('data/categorical_mar.csv')

In [None]:
# Pass a dict so only 'cat feature' is filled with its own mode
cat_mar_df.fillna({'cat feature': cat_mar_df['cat feature'].mode().iat[0]})
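Here is a small self-contained version of mode imputation. The frame and its column names `color` and `size` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'red', np.nan, 'green'],
    'size': [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
})

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = toy['color'].mode().iat[0]

# A dict restricts the fill to the named column; 'size' keeps its NaN.
filled = toy.fillna({'color': most_frequent})
print(filled)
```

The most frequent category grows from 2 of 4 observed values to 4 of 6 values after filling, which is the categorical counterpart of the distortion that mean imputation causes.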

#### $\color{purple}{\text{Random/Normal}}$
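Fill each missing value with a random draw from a distribution fitted to the observed values of the column (here, a normal distribution with the column's mean and standard deviation).
* Has an advantage over mean or median imputation: it preserves the variance
* Still does not correct the bias for MAR data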

In [None]:
col = mar_df['feature a']
# One draw per row; only the draws at missing positions are used
filler = np.random.normal(col.mean(), col.std(), len(col))
imputed = mar_df.assign(**{'feature a': col.where(col.notna(), filler)})

In [None]:

stat_comparison(pristine_df, imputed, 'feature a')
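To compare random imputation with mean imputation directly, the following sketch uses synthetic data (a normal feature with mean 10 and standard deviation 2, 30% removed completely at random; all of these are illustrative assumptions). It uses a seeded generator so the result is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
full = pd.Series(rng.normal(loc=10, scale=2, size=5_000))

observed = full.copy()
observed[rng.random(len(full)) < 0.3] = np.nan

# Draw one candidate per row from a normal fitted to the observed values,
# then use the draws only where the original value is missing.
draws = pd.Series(
    rng.normal(observed.mean(), observed.std(), size=len(observed)),
    index=observed.index,
)
random_imputed = observed.fillna(draws)
mean_imputed = observed.fillna(observed.mean())

print(f"std observed:       {observed.std():.3f}")
print(f"std random-imputed: {random_imputed.std():.3f}")
print(f"std mean-imputed:   {mean_imputed.std():.3f}")
```

The randomly imputed column keeps a spread close to that of the observed values, while the mean-imputed column is visibly narrower. Neither approach uses information from other columns, which is why neither corrects the bias under MAR.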

## $\color{purple}{\text{Takeaways}}$
* Interpolation can (and perhaps should) be used for features that are connected in space or time and are believed to be continuous
* Deletion and univariate imputation can be used for MCAR missingness, or when the expected bias can be tolerated (e.g., when only a very small amount of data is missing)

### $\color{purple}{\text{References}}$
 * Autoimpute Documentation: https://autoimpute.readthedocs.io/en/latest/index.html