# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 2)}}$

### $\color{purple}{\text{Colab Environmental Setup}}$

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/missingness_tutorial')

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import numpy as np
import pandas as pd
from helpers import clobber, stat_comparison

## $\color{purple}{\text{Deeper Dive into Missingness Mechanisms}}$

## $\color{purple}{\text{Setting Up Test Data}}$

$\color{red}{\Large{\text{ ⚠}}}$ We synthesize a statistically controlled example to illustrate the concepts more clearly. This dataset satisfies the normality condition that many of the statistical methods assume. These properties may not carry over to your datasets.

We will induce missingness in approximately 20% of the observations. This is (hopefully) more than you will experience, but the high proportion amplifies effects such as bias.

`observations` will be the size of our test set. The supplied covariance matrix `cov` has some nice characteristics, including two highly correlated features. You can also generate a completely random covariance matrix using the following:
```
A = np.random.rand(variables, row_size)
cov = np.dot(A, A.transpose())
```
where `variables` is the number of variables and `row_size` is any number greater than or equal to `variables`. This product is always positive semidefinite; choosing `row_size >= variables` also makes it full rank with probability 1.
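
As a quick sanity check on the construction above, the following sketch builds such a matrix and confirms that it is symmetric with no negative eigenvalues. The sizes and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
variables, row_size = 4, 6       # illustrative sizes with row_size >= variables

A = rng.random((variables, row_size))
cov = A @ A.T                    # Gram matrix: symmetric by construction

# A symmetric matrix is positive semidefinite when no eigenvalue is negative.
eigenvalues = np.linalg.eigvalsh(cov)
is_symmetric = np.allclose(cov, cov.T)
is_psd = bool(eigenvalues.min() >= -1e-10)   # small tolerance for rounding error
print(is_symmetric, is_psd)
```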

We draw `mean` from a normal distribution whose mean is chosen uniformly between 1 and 5 and whose standard deviation is chosen uniformly between 0 and 5.

This dataset will serve as one of the major datasets for this and subsequent notebooks.

In [None]:
# This covariance matrix has some nice properties to demonstrate. Originally this was generated at random
cov = [
    [1.6545195264181267, 0.6346001403246381, 1.573255077832285, 0.7457615955325402],
    [0.6346001403246381, 0.5636389213610075, 0.5861890592085826, 0.6638139531999303],
    [1.573255077832285, 0.5861890592085826, 1.6461885333121087, 0.4916921086792136],
    [0.7457615955325402, 0.6638139531999303, 0.4916921086792136, 1.0900299890979697],
]
mean = np.random.normal(np.random.uniform(low=1, high=5), np.random.uniform(high=5), 4)
mean

Display the sample covariance matrix and compare it to the original.
This only matters to ensure that the number of observations selected is large enough to reproduce the intended characteristics.

In [None]:
observations = 20000
df = pd.DataFrame(
    np.random.multivariate_normal(mean, cov, size=observations),
    columns=["feature a", "feature b", "feature c", "feature d"],
)
df.cov()
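
If comparing the two matrices by eye is tedious, the gap can be reduced to a single number. The sketch below is a stand-alone illustration: `target_cov`, `target_mean`, and `max_cov_error` are hypothetical names, and the 2x2 matrix is a small stand-in for the `cov` used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
target_cov = np.array([[2.0, 0.8], [0.8, 1.0]])   # small illustrative matrix
target_mean = np.array([1.0, 3.0])

def max_cov_error(n_obs):
    """Largest absolute entry-wise gap between sample and target covariance."""
    sample = pd.DataFrame(rng.multivariate_normal(target_mean, target_cov, size=n_obs))
    return float(np.abs(sample.cov().to_numpy() - target_cov).max())

# Compare a small sample with one the size used in this notebook.
errors = {n: max_cov_error(n) for n in (200, 20000)}
print(errors)
```

A small value at the chosen sample size suggests that the number of observations is sufficient.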

Now we add one variable that is completely uncorrelated with the other features and show the correlation matrix to confirm. We'll save this dataset off for later use.

In [None]:
df["uncorrelated"] = np.random.rand(observations)
df.to_csv('full_set.csv', index=False)
# Confirm that the new column is (nearly) uncorrelated with the other features
df.corr()

The two helper functions `clobber` and `stat_comparison` are used throughout this notebook. See `helpers.py` for detailed descriptions.
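
For readers who do not have `helpers.py` at hand, the sketch below shows one way a `clobber`-like function could work. It is a hypothetical stand-in, not the actual helper: the name `clobber_sketch`, the `seed` parameter, and the above-the-median rule for `depends` are all assumptions, and the real implementation may differ.

```python
import numpy as np
import pandas as pd

def clobber_sketch(df, column, fraction, depends=None, seed=None):
    """Hypothetical stand-in for `clobber`; the real helper may differ.

    Returns a copy of `df` with some values in `column` set to NaN.
    - depends=None: every row is dropped with probability `fraction`.
    - depends=[...]: only rows whose summed `depends` columns lie above
      their median are eligible, and each eligible row is dropped with
      probability `fraction`.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    drop = rng.random(len(out)) < fraction
    if depends is not None:
        driver = out[depends].sum(axis=1)
        drop &= (driver > driver.median()).to_numpy()
    out.loc[drop, column] = np.nan
    return out

demo = pd.DataFrame(np.random.default_rng(0).normal(size=(10000, 2)), columns=["x", "y"])
mcar_demo = clobber_sketch(demo, "x", 0.2, seed=1)
mar_demo = clobber_sketch(demo, "x", 0.4, depends=["y"], seed=1)
print(mcar_demo["x"].isnull().mean(), mar_demo["x"].isnull().mean())
```

Under these assumptions both calls remove roughly 20% of `x`, but in the second case only rows with a large `y` can lose their value.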

## $\color{purple}{\text{MCAR and MAR Data Set}}$
In this subsection, we induce missingness in the dataset we just constructed. This enables demonstrations of missingness mechanism tests here and of treatment techniques in subsequent notebooks.

### $\color{purple}{\text{MCAR}}$
We induce MCAR missingness in one column and then in two, saving the files for later use.

In [None]:
mcar_df = clobber(df, "feature a", 0.2)

mcar_df.to_csv('mcar_set.csv', index=False)
mcar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, mcar_df, "feature a")

In [None]:
mcar_df.cov()

Clobber a second column

In [None]:
double_mcar_df = clobber(mcar_df, "feature b", 0.2)

double_mcar_df.to_csv('double_mcar_set.csv', index=False)
double_mcar_df.isnull().sum()

### $\color{purple}{\text{MAR}}$
We induce MAR missingness in one column and then in two, saving the files for later use.

In [None]:
mar_df = clobber(df, "feature a", 0.4, depends=["feature c"])

mar_df.to_csv('mar_set.csv', index=False)
mar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, mar_df, "feature a")

In [None]:
double_mar_df = clobber(mar_df, "feature b", 0.4, depends=['feature d'])

double_mar_df.to_csv('double_mar_set.csv', index=False)
double_mar_df.isnull().sum()

## $\color{purple}{\text{Simple MAR Test}}$

The procedure is simple. For each column with missing data, construct a new column that flags the missingness of that column.

In [None]:
test_df = mcar_df.copy()

In [None]:
test_df['missingness']=test_df['feature a'].isnull()

Then test whether that new feature is "related" to any of the other columns. If it is, the missingness mechanism is MAR. We use the "eyeball" test of inspecting correlations. More statistically robust options include Student's t-test or a logistic regression that predicts the missingness from the other features.

In [None]:
test_df.corr()

Repeat for MAR
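
The same check on a MAR dataset can be sketched as follows. So that the block runs on its own, it builds a small stand-in dataset instead of reusing `mar_df`; the column names `x` and `y` and the rule that drives the missingness are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20000

# Two correlated columns standing in for "feature a" and "feature c".
y = rng.normal(size=n)
x = 0.9 * y + rng.normal(scale=0.4, size=n)
demo = pd.DataFrame({"x": x, "y": y})

# MAR: x can only go missing when the fully observed column y is positive.
missing = (demo["y"] > 0) & (rng.random(n) < 0.4)
demo.loc[missing, "x"] = np.nan

# Flag the missingness and correlate the flag with the observed column.
demo["missingness"] = demo["x"].isnull()
corr_with_y = demo["missingness"].astype(float).corr(demo["y"])
print(round(corr_with_y, 3))
```

A correlation clearly away from zero between the flag and an observed column is the signature of MAR that the eyeball test looks for.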

We can also do this when multiple columns have missing values.

In [None]:
test_df = double_mcar_df.copy()

In [None]:
test_df['missingness_a']=test_df['feature a'].isnull()
test_df['missingness_b']=test_df['feature b'].isnull()
test_df.corr()
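
The t-test mentioned earlier can replace the eyeball comparison. The sketch below computes Welch's t-statistic by hand with numpy on synthetic data, so the function `welch_t` and the two masks are illustrative assumptions rather than part of this notebook's helpers. It compares the values of a fully observed column between rows where another column is missing and rows where it is present.

```python
import numpy as np

def welch_t(values, is_missing):
    """Welch's t-statistic for `values` in the missing vs. observed group."""
    a, b = values[is_missing], values[~is_missing]
    standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float((a.mean() - b.mean()) / standard_error)

rng = np.random.default_rng(7)
n = 20000
y = rng.normal(size=n)                        # fully observed column

mcar_mask = rng.random(n) < 0.2               # missingness ignores y
mar_mask = (y > 0) & (rng.random(n) < 0.4)    # missingness depends on y

t_mcar = welch_t(y, mcar_mask)
t_mar = welch_t(y, mar_mask)
print(round(t_mcar, 2), round(t_mar, 2))
```

Under MCAR the statistic should usually stay within a couple of units of zero, while under MAR it becomes very large at this sample size.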

## $\color{purple}{\text{Poor Man's Version of Little's MCAR Test (or rather not MAR Test)}}$

The test given above is a little awkward if more than one column has missing data. Little originally proposed the following test for MCAR.

$\color{red}{\text ⚠}$ The code below demonstrates the simplified principle behind Little's MCAR Test but a lot of the statistical rigor has been relaxed.

We adopt the "eyeball" test of whether statistics match or not. In principle, some statistical assumptions are made that result in a $p$-value. In particular, Little made normality assumptions that lead to a $\chi^2$ distribution.
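
For reference, the statistic from Little (1988) is usually written as shown below. The notation is introduced here for illustration and is not used elsewhere in this notebook: $J$ is the number of missingness patterns, $n_j$ the number of observations in pattern $j$, $\bar{y}_j$ the sample mean of the variables observed in pattern $j$, and $\hat{\mu}_j$ and $\hat{\Sigma}_j$ the maximum likelihood estimates of the mean and covariance restricted to those variables.

$$
d^2 = \sum_{j=1}^{J} n_j \, (\bar{y}_j - \hat{\mu}_j)^{\top} \, \hat{\Sigma}_j^{-1} \, (\bar{y}_j - \hat{\mu}_j)
$$

Under MCAR and normality, $d^2$ is asymptotically $\chi^2$ distributed with $\sum_{j=1}^{J} p_j - p$ degrees of freedom, where $p_j$ is the number of variables observed in pattern $j$ and $p$ is the total number of variables.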

First, the observations are segregated by missingness pattern. In our case there are only two patterns: the observation is complete, or the observation is missing `feature a`.

In [None]:
pattern1 = mar_df.dropna(subset=["feature a"])
pattern2 = mar_df[mar_df["feature a"].isnull()]

The formal version of Little's test uses maximum likelihood estimation to estimate statistical features of each group and compares them. If they are statistically the same, the missingness mechanism is declared MCAR.
Here we use the eyeball test.

In [None]:
pd.concat([pattern1.mean(), pattern2.mean(),pattern1.mean()- pattern2.mean()], axis="columns")

You might want to look at covariances
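
A minimal sketch of such a covariance comparison follows. The function `covariance_gap` and the synthetic columns `x`, `y`, and `z` are hypothetical names chosen for this illustration.

```python
import numpy as np
import pandas as pd

def covariance_gap(df, column):
    """Covariances of rows observing `column` minus those of rows missing it.

    `column` itself is left out because it has no observed values in the
    missing pattern.
    """
    others = df.drop(columns=[column])
    is_missing = df[column].isnull()
    return others[~is_missing].cov() - others[is_missing].cov()

rng = np.random.default_rng(3)
n = 20000
demo = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x", "y", "z"])
# MAR: x can only go missing when y is positive.
demo.loc[(demo["y"] > 0) & (rng.random(n) < 0.4), "x"] = np.nan

gap = covariance_gap(demo, "x")
print(gap.round(3))
```

Entries far from zero point to columns whose distribution differs between the two patterns; here that is the variance of `y`, while `z` is unaffected.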

Create a little helper to look at means

In [None]:
def littles_eyeball_test(df, column):
    """
    "Eyeball" version of Little's MCAR test for missingness in one column
    """
    pattern1 = df.dropna(subset=[column])
    pattern2 = df[df[column].isnull()]
    return pd.concat([pattern1.mean(), pattern2.mean(), pattern1.mean() - pattern2.mean()], axis="columns")

In [None]:
littles_eyeball_test(mar_df,'feature a')

The "double missingness" sets exhibit 4 patterns so we'll expand our experiment to 4

In [None]:
def littles_eyeball_test_double(df):
    """
    "Eyeball" version of Little's MCAR test for missingness in 2 columns. 4 patterns in all
    """
    pattern1 = df[df['feature a'].isnull() & df['feature b'].isnull()]
    pattern2 = df[df['feature a'].isnull() & ~df['feature b'].isnull()]
    pattern3 = df[~df['feature a'].isnull() & df['feature b'].isnull()]
    pattern4 = df[~df['feature a'].isnull() & ~df['feature b'].isnull()]
    return pd.concat([pattern1.mean(), pattern2.mean(), pattern3.mean(), pattern4.mean()], axis="columns")

In [None]:
littles_eyeball_test_double(double_mcar_df)
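
Writing out the patterns by hand stops scaling once more columns have missing values. As a hedged generalization, the sketch below groups rows by whichever missingness pattern they exhibit. The function `means_by_pattern` and the synthetic columns are hypothetical, and the labeling assumes string column names.

```python
import numpy as np
import pandas as pd

def means_by_pattern(df):
    """Column means for every missingness pattern present in `df`.

    A pattern is labeled by its missing columns joined with '+', or
    'complete' when nothing is missing.
    """
    cols = df.columns.to_numpy()
    labels = pd.Series(
        ["+".join(cols[mask]) or "complete" for mask in df.isnull().to_numpy()],
        index=df.index,
    )
    # One column per pattern, one row per feature.
    return df.groupby(labels).mean().T

rng = np.random.default_rng(5)
n = 5000
demo = pd.DataFrame(rng.normal(size=(n, 3)), columns=["a", "b", "c"])
demo.loc[rng.random(n) < 0.2, "a"] = np.nan
demo.loc[rng.random(n) < 0.2, "b"] = np.nan

table = means_by_pattern(demo)
print(table.round(3))
```

A feature that is missing in a pattern shows up as NaN in that pattern's column, so the table reads the same way as the output of `littles_eyeball_test_double`.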

## $\color{purple}{\text{MNAR: The missingness you don't want}}$


### $\color{purple}{\text{How NOT to synthesize MNAR Missingness}}$

This is how "MNAR" is often done in the literature to demonstrate imputation techniques. As you will see this is not true MNAR.

In [None]:
fmnar_df = clobber(df, "feature a", 0.4, depends=["feature a"])
fmnar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, fmnar_df, "feature a")

In [None]:
littles_eyeball_test(fmnar_df, 'feature a')

This is actually MAR. Recall from the definition in the previous section that MNAR means "the probability of data being missing depends on the unobserved data, **even after conditioning on the observed data**". It turns out that `feature a` is highly correlated with `feature c`, so there is a statistical dependency on `feature c` even though we constructed the missingness from `feature a`.
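
The strength of that correlation can be read directly off the covariance matrix defined at the start of this notebook. The sketch below converts it to a correlation matrix; the entry for `feature a` and `feature c` comes out at roughly 0.95.

```python
import numpy as np

cov = np.array([
    [1.6545195264181267, 0.6346001403246381, 1.573255077832285, 0.7457615955325402],
    [0.6346001403246381, 0.5636389213610075, 0.5861890592085826, 0.6638139531999303],
    [1.573255077832285, 0.5861890592085826, 1.6461885333121087, 0.4916921086792136],
    [0.7457615955325402, 0.6638139531999303, 0.4916921086792136, 1.0900299890979697],
])

# corr[i, j] = cov[i, j] / (std[i] * std[j])
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

corr_a_c = corr[0, 2]   # "feature a" is row 0, "feature c" is row 2
print(round(corr_a_c, 3))
```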

### $\color{purple}{\text{A true MNAR missingness}}$

For the missingness to be MNAR, it must be uncorrelated with the visible data.

In [None]:
mnar_df = clobber(df, "uncorrelated", 0.4, depends=["uncorrelated"])

mnar_df.to_csv('mnar_set.csv', index=False)
mnar_df["uncorrelated"].isnull().sum()

In [None]:
littles_eyeball_test(mnar_df, 'uncorrelated')

### $\color{purple}{\text{Final thoughts on MNAR}}$
You'll often see it said that there is no statistical test for MNAR. That is true, but a better statement is that there is no statistical way to distinguish MCAR from MNAR. You can only test whether the missingness is MAR or not.
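
The point can be illustrated on synthetic data. In the sketch below, the column names `observed` and `hidden` and the helper `indicator_corr` are hypothetical. One copy of the data loses values completely at random, the other loses them depending on the hidden value itself, yet the missingness flag looks equally unrelated to the visible column in both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 20000
full = pd.DataFrame({
    "observed": rng.normal(size=n),   # always visible
    "hidden": rng.random(n),          # independent of "observed"
})

mcar = full.copy()
mcar.loc[rng.random(n) < 0.2, "hidden"] = np.nan

mnar = full.copy()
# Missingness depends only on the value that ends up hidden.
mnar.loc[(full["hidden"] > 0.5) & (rng.random(n) < 0.4), "hidden"] = np.nan

def indicator_corr(df):
    """Correlation between the missingness flag and the visible column."""
    return float(df["hidden"].isnull().astype(float).corr(df["observed"]))

corr_mcar, corr_mnar = indicator_corr(mcar), indicator_corr(mnar)
# The bias is only measurable because we still hold the full data.
bias_mnar = mnar["hidden"].mean() - full["hidden"].mean()
print(round(corr_mcar, 3), round(corr_mnar, 3), round(bias_mnar, 3))
```

Both correlations sit near zero, so the observed data cannot tell the two mechanisms apart, even though the MNAR copy has a clearly biased mean.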

### $\color{purple}{\text{Conclusion on the Theory Section}}$

* There is no way to distinguish between MCAR and MNAR from the data alone.
* The so-called MCAR tests are really "not MAR" tests.
  * Most of those tests assume you have already excluded MNAR.
* If the missingness is not MAR, we recommend assuming the worst and treating it as MNAR.
* If the missingness is MAR, you should use multivariate imputation, not deletion.
* Be careful when synthesizing MNAR missingness for benchmarking.


### $\color{purple}{\text{References}}$
 * Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. _Journal of the American Statistical Association_, 83, 1198–1202. https://doi.org/10.1080/01621459.1988.10478722