# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 2)}}$

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/missingness_tutorial')

## $\color{purple}{\text{Setting Up Test Data}}$

$\color{red}{\Large{\text{ ⚠}}}$ We synthesize a statistically controlled example to illustrate the concepts more clearly. This dataset satisfies the normality condition that many of the statistical tests assume. Your own datasets may not.

We will cause missingness in approximately 20% of the observations. This may (hopefully) be more than you will experience in practice, but the high proportion amplifies effects such as bias.

`observations` will be the size of our test set. The supplied covariance matrix `cov` has some convenient characteristics, including two highly correlated features. You can also generate a completely random covariance matrix using the following:
```
A = np.random.rand(variables, row_size)
cov = np.dot(A, A.transpose())
```
where `variables` is the number of variables and `row_size` is any number greater than or equal to `variables`. The product is always positive semidefinite; the size condition also makes it full rank (with probability 1 for random entries).
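
As a quick sanity check, the sketch below builds such a matrix and confirms it is symmetric with no negative eigenvalues. The seed and the value of `row_size` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

variables = 4  # number of features
row_size = 6   # assumed value; anything >= variables works

A = rng.random((variables, row_size))
cov = A @ A.T  # Gram matrix: symmetric and positive semidefinite

# A symmetric matrix is positive semidefinite when no eigenvalue is negative.
eigenvalues = np.linalg.eigvalsh(cov)
is_symmetric = bool(np.allclose(cov, cov.T))
is_psd = bool(np.all(eigenvalues >= -1e-10))  # small tolerance for round-off
print(is_symmetric, is_psd)
```

If `is_psd` came back false, `np.random.multivariate_normal` would warn that the covariance is not positive semidefinite.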

We draw `mean` from a normal distribution whose mean is chosen uniformly between 1 and 5 and whose standard deviation is chosen uniformly between 0 and 5.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# This covariance matrix has some nice properties to demonstrate. Originally this was generated at random
cov = [
    [1.6545195264181267, 0.6346001403246381, 1.573255077832285, 0.7457615955325402],
    [0.6346001403246381, 0.5636389213610075, 0.5861890592085826, 0.6638139531999303],
    [1.573255077832285, 0.5861890592085826, 1.6461885333121087, 0.4916921086792136],
    [0.7457615955325402, 0.6638139531999303, 0.4916921086792136, 1.0900299890979697],
]
mean = np.random.normal(np.random.uniform(low=1, high=5), np.random.uniform(high=5), 4)

In [None]:
mean

In [None]:
os.listdir(os.getcwd())

Compute the covariance matrix of the sample and compare it to the original.
This only matters to ensure that the number of observations is large enough to reproduce the intended characteristics.

In [None]:
observations = 20000
df = pd.DataFrame(
    np.random.multivariate_normal(mean, cov, size=observations),
    columns=["feature a", "feature b", "feature c", "feature d"],
)
df.cov()
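
To make "sufficient" concrete, the following sketch measures the largest gap between the sample covariance and the target for a small and a large sample. The 2×2 target matrix, the mean vector, and the helper name `max_cov_error` are assumptions for illustration, not the tutorial's values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small illustrative target; these values are assumptions.
target_cov = np.array([[1.0, 0.8], [0.8, 2.0]])
target_mean = np.array([1.0, 3.0])


def max_cov_error(n_obs):
    """Largest absolute gap between the sample and target covariance."""
    sample = pd.DataFrame(
        rng.multivariate_normal(target_mean, target_cov, size=n_obs),
        columns=["x", "y"],
    )
    return float(np.abs(sample.cov().to_numpy() - target_cov).max())


errors = {n: max_cov_error(n) for n in (200, 20000)}
print(errors)
```

The error shrinks roughly with the square root of the sample size, which is why a large `observations` value is used here.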

Now we add one variable that is completely uncorrelated with the other features and show the correlation matrix to confirm.

In [None]:
df["uncorrelated"] = np.random.rand(observations)
df.to_csv('full_set.csv', index=False)

In [None]:
df.corr()

In [None]:
# Cause missingness in `column`. Each eligible row goes missing with the given
# probability; when `depends` is given, only rows above the median of every
# listed column are eligible.
def clobber(df, column, probability, depends=()):
    eligible = df[column].notnull()
    for dep_column in depends:
        eligible &= df[dep_column] > df[dep_column].median()
    threshold = eligible * probability  # per-row probability of going missing
    rand = np.random.uniform(0, 1, size=len(df))
    ret = df.copy()  # We copy to avoid clobbering the original
    ret[column] = np.where(threshold < rand, df[column], np.nan)
    return ret

In [None]:
def stat_comparison(original, missing, column):
    df = pd.DataFrame.from_dict(
        dict(
            mean=[original[column].mean(), missing[column].mean()],
            median=[original[column].median(), missing[column].median()],
            stdev=[original[column].std(), missing[column].std()],
        ),
        orient="index",
        columns=["Original", "With Missing Data"],
    )
    df["difference"] = (df["Original"] - df["With Missing Data"]).abs()
    df["percentage"] = df["difference"] / df["Original"] * 100
    return df

## $\color{purple}{\text{MCAR and MAR Data Set}}$

In [None]:
mcar_df = clobber(df, "feature a", 0.2)

mcar_df.to_csv('mcar_set.csv', index=False)
mcar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, mcar_df, "feature a")

In [None]:
df.cov()

In [None]:
(df.cov() - mcar_df.cov()).abs()/df.cov()*100

In [None]:
mar_df = clobber(df, "feature a", 0.4, depends=["feature c"])

mar_df.to_csv('mar_set.csv', index=False)
mar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, mar_df, "feature a")

In [None]:
df.cov()

In [None]:
(df.cov() - mar_df.cov()).abs()/df.cov()*100

## $\color{purple}{\text{Simple MAR Test}}$

The procedure is simple. For each column with missing data, construct a new indicator column that records whether that column's value is missing.

In [None]:
test_df = mar_df.copy()

In [None]:
test_df['missingness']=test_df['feature a'].isnull()

Then test whether that new feature is "related" to any of the other columns. If it is, the missingness mechanism is MAR. Here we use the "eyeball" test based on correlation. Statistically robust alternatives include Student's t-test or a logistic regression that predicts the missingness from the other features.

In [None]:
test_df.corr()
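
As one possible version of the more robust check mentioned above, the sketch below runs Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) on each other column, split by whether the target value is missing. The toy frame and its column names `a`, `c`, and `u` are assumptions so that the example is self-contained.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 5000

# Toy data (assumed): "a" and "c" are correlated, "u" is unrelated to both.
c = rng.normal(size=n)
a = c + rng.normal(scale=0.3, size=n)
u = rng.random(n)
toy = pd.DataFrame({"a": a, "c": c, "u": u})

# MAR: "a" goes missing 40% of the time, but only where "c" exceeds its median.
drop = (toy["c"] > toy["c"].median()) & (rng.random(n) < 0.4)
toy.loc[drop, "a"] = np.nan

missing = toy["a"].isnull()
p_values = {}
for col in ["c", "u"]:
    # Does `col` differ between rows with and without a value for "a"?
    result = stats.ttest_ind(
        toy.loc[missing, col], toy.loc[~missing, col], equal_var=False
    )
    p_values[col] = result.pvalue
print(p_values)
```

A very small p-value for a column suggests the missingness depends on that column, which points to MAR. A large p-value does not prove independence.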

## $\color{purple}{\text{Poor Man's Version of Little's MCAR Test (or rather not MAR Test)}}$

The test given above is a little awkward if more than one column has missing data. Originally, Little proposed the following test for MCAR. 

$\color{red}{\text{⚠}}$ The code below demonstrates the simplified principle behind Little's MCAR Test, but a lot of the statistical rigor has been relaxed.

We adopt the "eyeball" test of whether the statistics match or not. In the formal test, statistical assumptions are made that result in a $p$-value. In particular, Little made normality assumptions that result in a $\chi^2$ distribution.

First, the observations are separated by missingness pattern. In our case there are only two patterns: the observation is complete, or the observation is missing "feature a".

In [None]:
pattern1 = mar_df.dropna(subset=["feature a"])
pattern2 = mar_df[mar_df["feature a"].isnull()]

The formal version of Little's Test uses maximum likelihood estimation to estimate statistical features of each group and compares them. If they are statistically the same, the missingness mechanism is declared MCAR.
Here we use the eyeball test.

In [None]:
pattern1 = mcar_df.dropna(subset=["feature a"])
pattern2 = mcar_df[mcar_df["feature a"].isnull()]

In [None]:
pd.concat([pattern1.mean(), pattern2.mean()], axis="columns")

In [None]:
def littles_eyeball_test(df, column):
    # Column means for rows where `column` is present vs. rows where it is missing
    pattern1 = df.dropna(subset=[column])
    pattern2 = df[df[column].isnull()]
    return pd.concat([pattern1.mean(), pattern2.mean()], axis="columns")

In [None]:
littles_eyeball_test(mar_df,'feature a')

In [None]:
pattern1.drop(columns=["feature a"]).cov()

In [None]:
pattern2.drop(columns=["feature a"]).cov()

In [None]:
(
    pattern1.drop(columns=["feature a"]).cov()
    - pattern2.drop(columns=["feature a"]).cov()
).abs()
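
For readers who want a number instead of an eyeball comparison, here is a much simplified stand-in for the formal test. It only compares the means of the fully observed columns across the two patterns, and it uses plain sample estimates instead of maximum likelihood, so it is not Little's test itself. The function name `mean_pattern_test` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def mean_pattern_test(df, column):
    """Chi-square test that the other (fully observed) columns have the same
    mean in rows where `column` is present and rows where it is missing."""
    others = df.drop(columns=[column])
    grand_mean = others.mean().to_numpy()
    cov_inv = np.linalg.inv(others.cov().to_numpy())
    missing = df[column].isnull()
    stat = 0.0
    for mask in (missing, ~missing):
        group = others[mask]
        diff = group.mean().to_numpy() - grand_mean
        stat += len(group) * (diff @ cov_inv @ diff)
    dof = others.shape[1]  # (2 patterns - 1) * number of compared columns
    return float(stat), float(stats.chi2.sf(stat, dof))


rng = np.random.default_rng(0)
n = 5000
c = rng.normal(size=n)
toy = pd.DataFrame(
    {"a": c + rng.normal(scale=0.3, size=n), "c": c, "u": rng.random(n)}
)

mcar = toy.copy()
mcar.loc[rng.random(n) < 0.2, "a"] = np.nan

mar = toy.copy()
mar.loc[(toy["c"] > toy["c"].median()) & (rng.random(n) < 0.4), "a"] = np.nan

mcar_stat, mcar_p = mean_pattern_test(mcar, "a")
mar_stat, mar_p = mean_pattern_test(mar, "a")
print(mcar_p, mar_p)
```

A small p-value rejects "the patterns share the same means", which is evidence against MCAR. As with the eyeball version, a large p-value cannot rule out MNAR.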

## $\color{purple}{\text{MNAR: The real painful situation}}$


### $\color{purple}{\text{How NOT to synthesize MNAR Missingness}}$

In [None]:
fmnar_df = clobber(df, "feature a", 0.4, depends=["feature a"])
fmnar_df["feature a"].isnull().sum()

In [None]:
stat_comparison(df, fmnar_df, "feature a")

In [None]:
littles_eyeball_test(fmnar_df, 'feature a')

In [None]:
mnar_df = clobber(df, "uncorrelated", 0.4, depends=["uncorrelated"])

mnar_df.to_csv('mnar_set.csv', index=False)
mnar_df["uncorrelated"].isnull().sum()

In [None]:
pattern1 = mnar_df.dropna(subset=["uncorrelated"])
pattern2 = mnar_df[mnar_df["uncorrelated"].isnull()]

In [None]:
littles_eyeball_test(mnar_df, 'uncorrelated')

In [None]:
pd.concat([pattern1.mean(), pattern2.mean()], axis="columns")

In [None]:
pattern1.drop(columns=["uncorrelated"]).cov()

In [None]:
pattern2.drop(columns=["uncorrelated"]).cov()

In [None]:
(
    pattern1.drop(columns=["uncorrelated"]).cov()
    - pattern2.drop(columns=["uncorrelated"]).cov()
)

In [None]:
fmnar_df.corr()

In [None]:
from sklearn.covariance import EmpiricalCovariance

In [None]:
mcar_df.mean()

In [None]:
mcar_df.dropna().mean()

In [None]:
mcar_df[mcar_df["feature a"].isnull()].mean()

In [None]:
mar_df.dropna().mean()

In [None]:
mar_df.mean()

In [None]:
mar_df[mar_df["feature a"].isnull()].mean()

Conclusion on the Theory Section

* There is no way to distinguish between MCAR and MNAR from the data alone.
* The so-called MCAR tests are really "not MAR" tests.
  * Most of those tests assume you have already excluded MNAR.
* Recommendation: if the missingness is not MAR, assume the worst and treat it as MNAR.
* If the missingness is MAR, use multivariate imputation rather than deletion.
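
The last point can be illustrated with a small sketch. It compares listwise deletion with a single regression imputation, used here as a deliberately simple stand-in for a full multivariate method such as MICE. The toy columns `a` and `c` and all parameter values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000

# Toy MAR data (assumed): "a" tracks "c" and goes missing only when "c" is high.
c = rng.normal(loc=2.0, size=n)
full = pd.DataFrame({"a": c + rng.normal(scale=0.3, size=n), "c": c})
mar = full.copy()
mar.loc[(full["c"] > full["c"].median()) & (rng.random(n) < 0.4), "a"] = np.nan

true_mean = full["a"].mean()

# Option 1: listwise deletion keeps only the complete rows.
complete = mar.dropna()
deleted_mean = complete["a"].mean()

# Option 2: regress "a" on "c" using the complete rows, then predict the gaps.
slope, intercept = np.polyfit(complete["c"], complete["a"], deg=1)
imputed = mar.copy()
gaps = imputed["a"].isnull()
imputed.loc[gaps, "a"] = intercept + slope * imputed.loc[gaps, "c"]
imputed_mean = imputed["a"].mean()

print(abs(deleted_mean - true_mean), abs(imputed_mean - true_mean))
```

Deletion shifts the mean because the missing rows are concentrated where `c` is large. The regression fill recovers the mean, although a deterministic fill like this one understates the variance, which is one reason multiple imputation is preferred.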


# $\color{red}{\text{⚠}}$


In [None]:
import autoimpute

Setup (10 minutes)

Before the tutorial, participants will be provided with a link to a GitHub repository containing all the datasets and Jupyter notebooks and, for those who have Docker, a Docker image prebuilt with all the necessary libraries. Using the Docker image is encouraged because it contains the correct and compatible versions of the libraries.

Overview (10 minutes)

The overview will outline the agenda for the workshop, briefly touching on the theory, detection, and treatment of missing data.

Theory of Missing Data and Missingness Mechanisms (40 minutes)

Often missing data is considered an inconvenience and swept away without much consideration. In this section of the workshop we will discuss the statistical basis of missing data including the three missingness mechanisms: MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random).

In addition, using these principles participants will construct from existing complete data sets, data sets exhibiting missing data under these mechanisms for use later in the workshop.

Visualizing the Degree of Missing Data in a Dataset (30 minutes)

In this section, participants will use tools such as missingno to visualize the degree of missingness in a data set. Learning to interpret and assess these visualizations will help determine the usefulness of a dataset, answering questions like “Is there too much data missing?” or “Is the bulk of the missing data in columns I care about?”
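
Before plotting anything, the same two questions can be answered numerically with plain pandas. The sketch below is an assumption about how one might do that; the toy frame and its missing rates are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x", "y", "z"])
toy.loc[rng.random(1000) < 0.3, "x"] = np.nan   # roughly 30% missing
toy.loc[rng.random(1000) < 0.05, "y"] = np.nan  # roughly 5% missing

# Share of missing values per column, worst first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Share of rows that would survive listwise deletion.
complete_rows = float(toy.notnull().all(axis=1).mean())

print(missing_share)
print(complete_rows)
```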

-- break --

Common Treatment Practices and Pitfalls (30 minutes)

Using the generated incomplete datasets, we will discuss some of the common ways missing data is dealt with and use the exercises to expose the pitfalls of these techniques. The most common practices include dropping missing values, backfilling, and zero filling. The exercises will statistically compare the datasets with missing values, the treated datasets, and the original datasets.
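
A minimal sketch of that comparison is shown below, assuming a single normally distributed column with MCAR gaps. The column, its parameters, and the missing rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10000
original = pd.Series(rng.normal(loc=5.0, scale=1.0, size=n), name="x")
with_gaps = original.mask(rng.random(n) < 0.2)  # MCAR, roughly 20% missing

treatments = {
    "original": original,
    "dropped": with_gaps.dropna(),
    "backfilled": with_gaps.bfill(),
    "zero filled": with_gaps.fillna(0),
}

# One row per treatment, with the mean and standard deviation side by side.
summary = pd.DataFrame(
    {name: [s.mean(), s.std()] for name, s in treatments.items()},
    index=["mean", "stdev"],
).T
print(summary)
```

Zero filling drags the mean toward zero and inflates the spread. Dropping and backfilling look harmless here only because the gaps are MCAR and the rows are unordered; under MAR or MNAR they can bias the result as well.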

-- break --

Imputation Techniques (80 minutes)

This section takes a deeper dive into conditioning datasets with missing data. Participants will employ a variety of single and multiple imputation techniques to the datasets and apply various metrics to evaluate their properties. 

The simplest single imputation techniques are mean and median imputation. These are followed by two more complex multiple imputation techniques: multivariate imputation by chained equations (MICE) and the family of autoencoder-based imputation techniques, including one published by the speaker (http://www.ibai-publishing.org/html/proceedings_2020/pdf/proceedings_book_MLDM_2020.pdf p. 197).

Since off-the-shelf libraries are not available for some of these techniques, the workshop will demonstrate and provide code pieces to implement them.

-- break --

Categorical Data (30 minutes)

Categorical data has always posed a challenge in the numerically dominated data science world. The same is true with respect to imputation. Participants will employ encoding techniques as well as imputation techniques specifically meant for categorical data including predictive mean matching and hot deck imputation.
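
As a taste of what hot deck imputation looks like, here is a minimal sketch of its simplest variant, a random hot deck, where each gap is filled with a value drawn from the observed entries of the same column. The function name `random_hot_deck` and the toy `colors` column are assumptions; practical hot deck methods usually pick donors from similar rows and not from the whole column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
colors = pd.Series(
    rng.choice(["red", "green", "blue"], size=1000, p=[0.5, 0.3, 0.2])
)
with_gaps = colors.mask(rng.random(1000) < 0.2)  # roughly 20% missing


def random_hot_deck(series, rng):
    """Fill each gap with a value drawn at random from the observed entries."""
    filled = series.copy()
    gaps = filled.isnull()
    donors = filled[~gaps].to_numpy()
    filled[gaps] = rng.choice(donors, size=int(gaps.sum()))
    return filled


filled = random_hot_deck(with_gaps, rng)
print(filled.value_counts(normalize=True))
```

Because donors are real observed values, the filled column never contains a category that does not exist in the data, which is a problem that mean-style imputation on encoded categories can introduce.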

Conclusion (10 minutes)

We’ll recap the lessons learned and provide references.

