# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 2)}}$

### $\color{purple}{\text{Colab Environmental Setup}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [1]:
import numpy as np
import pandas as pd

## $\color{purple}{\text{Setting Up Test Data}}$

$\color{red}{\Large{\text{ ⚠}}}$ We synthesize a statistically controlled example to illustrate the concepts more clearly. This dataset satisfies the normality condition assumed by many statistical methods. That condition may not hold for your datasets.

We will cause missingness in approximately 20% of the observations. This is (hopefully) more than you will encounter in practice, but the high proportion amplifies effects such as bias.

`observations` will be the size of our test set. The covariance matrix `cov` supplied shows some nice characteristics with two highly correlated features. But you can generate a completely random covariance matrix using the following:
```
A = np.random.rand(variables, row_size)
cov = np.dot(A, A.transpose())
```
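where `variables` is the number of variables and `row_size` is any number greater than or equal to `variables`. The product `A @ A.T` is always positive semidefinite; choosing `row_size >= variables` additionally makes it full rank (almost surely), so it is a valid, non-degenerate covariance matrix.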
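
A quick sanity check on a matrix generated this way is to confirm that it is symmetric and has no negative eigenvalues. The sketch below assumes a seeded generator and the sizes `variables = 4` and `row_size = 6`, which are illustrative choices rather than values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
variables, row_size = 4, 6      # illustrative sizes, row_size >= variables

A = rng.random((variables, row_size))
cov = A @ A.T                   # same construction as np.dot(A, A.transpose())

# A valid covariance matrix is symmetric with non-negative eigenvalues
is_symmetric = np.allclose(cov, cov.T)
min_eigenvalue = np.linalg.eigvalsh(cov).min()
is_psd = min_eigenvalue >= -1e-10  # small tolerance for floating-point error

print(is_symmetric, is_psd)
```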

We draw `mean` from a normal distribution whose own mean is chosen uniformly between 1 and 5 and whose standard deviation is chosen uniformly between 0 and 5.

This dataset will serve as one of the major datasets for this and subsequent notebooks.

In [2]:
# This covariance matrix has some nice properties to demonstrate. Originally this was generated at random
cov = [
    [1.6545195264181267, 0.6346001403246381, 1.573255077832285, 0.7457615955325402],
    [0.6346001403246381, 0.5636389213610075, 0.5861890592085826, 0.6638139531999303],
    [1.573255077832285, 0.5861890592085826, 1.6461885333121087, 0.4916921086792136],
    [0.7457615955325402, 0.6638139531999303, 0.4916921086792136, 1.0900299890979697],
]
mean = np.random.normal(np.random.uniform(low=1, high=5), np.random.uniform(high=5), 4)
mean

array([2.22612584, 3.68112922, 3.57906565, 3.57187154])

List the sample covariance matrix and compare it to the original.
This matters only to ensure that the number of observations is large enough to reproduce the intended characteristics.

In [3]:
observations = 20000
df = pd.DataFrame(
    np.random.multivariate_normal(mean, cov, size=observations),
    columns=["feature a", "feature b", "feature c", "feature d"],
)
df.cov()

Unnamed: 0,feature a,feature b,feature c,feature d
feature a,1.633559,0.625852,1.553296,0.735798
feature b,0.625852,0.560295,0.576702,0.660979
feature c,1.553296,0.576702,1.624997,0.483581
feature d,0.735798,0.660979,0.483581,1.084855
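
To see why the number of observations matters, the following sketch measures how far the sample covariance drifts from the true one at two sample sizes. It uses a small 2x2 matrix and a seeded generator purely for illustration; `true_cov` and `max_cov_error` are hypothetical names, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])  # illustrative, not the notebook's matrix
mean = np.zeros(2)

def max_cov_error(n):
    """Largest absolute entry-wise gap between sample and true covariance."""
    sample = rng.multivariate_normal(mean, true_cov, size=n)
    return np.abs(np.cov(sample, rowvar=False) - true_cov).max()

errors = {n: max_cov_error(n) for n in (200, 20000)}
print(errors)
```

The error shrinks roughly with the square root of the sample size, which is why a large `observations` value gives a sample covariance close to the one supplied.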


Now we add one variable that is completely uncorrelated with the other features and show the correlation matrix to confirm. We'll save this dataset off for later use.

In [4]:
df["uncorrelated"] = np.random.rand(observations)
df.to_csv('full_set.csv', index=False)

In [5]:
df.corr()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
feature a,1.0,0.654178,0.953368,0.55272,0.004156
feature b,0.654178,1.0,0.60439,0.8478,-0.009361
feature c,0.953368,0.60439,1.0,0.364214,0.0017
feature d,0.55272,0.8478,0.364214,1.0,-0.006177
uncorrelated,0.004156,-0.009361,0.0017,-0.006177,1.0


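Add two helper functions that will be universally useful. I've added these to `helpers.py` for use by later notebooks.

The first, `clobber`, returns a copy of the dataset in which values of one column are randomly replaced with `NaN`. With no `depends` columns, every row is blanked with the given `probability`, which yields MCAR missingness. With `depends` columns, only rows whose value in each of those columns is above that column's median can be blanked, so the missingness depends on other data.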

In [6]:
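# A function to cause missingness in a given column, optionally depending on other columns
def clobber(df, column, probability, depends=()):
    """Return a copy of `df` with values in `column` randomly set to NaN.

    Each eligible row is blanked with the given `probability`. With no
    `depends`, every row is eligible. Otherwise only rows whose value in
    every `depends` column exceeds that column's median are eligible.
    """
    eligible = pd.Series(True, index=df.index)
    for dep_column in depends:
        eligible &= df[dep_column] > df[dep_column].median()
    clob = eligible * probability  # per-row probability of going missing
    rand = np.random.uniform(0, 1, size=len(clob))
    ret = df.copy()  # We copy to avoid clobbering the original
    ret[column] = np.where(clob < rand, df[column], np.nan)
    return ret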

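The second helper, `stat_comparison`, compares the mean, median, and standard deviation of one column before and after missingness is induced. It reports the absolute difference and that difference as a percentage of the original value.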

In [7]:
def stat_comparison(original, missing, column):
    """Compare mean, median, and stdev of `column` in two datasets.

    Returns a DataFrame with both sets of statistics, their absolute
    difference, and the difference as a percentage of the original value.
    """
    df = pd.DataFrame.from_dict(
        dict(
            mean=[original[column].mean(), missing[column].mean()],
            median=[original[column].median(), missing[column].median()],
            stdev=[original[column].std(), missing[column].std()],
        ),
        orient="index",
        columns=["Original", "With Missing Data"],
    )
    df["difference"] = (df["Original"] - df["With Missing Data"]).abs()
    df["percentage"] = df["difference"] / df["Original"] * 100
    return df

## $\color{purple}{\text{MCAR and MAR Data Set}}$
In this subsection, we induce missingness in the dataset we just constructed. This enables demonstrations of missingness-mechanism tests here and of treatment techniques in subsequent notebooks.

### $\color{purple}{\text{MCAR}}$
We induce MCAR missingness in one and two columns, then save the files for later use.

In [8]:
mcar_df = clobber(df, "feature a", 0.2)

mcar_df.to_csv('mcar_set.csv', index=False)
mcar_df["feature a"].isnull().sum()

4040

In [9]:
stat_comparison(df, mcar_df, "feature a")

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.225971,2.224831,0.00114,0.051214
median,2.223995,2.221182,0.002813,0.126502
stdev,1.278108,1.278788,0.000681,0.05325


In [10]:
mcar_df.cov()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
feature a,1.635299,0.627553,1.555728,0.737193,0.00188
feature b,0.627553,0.560295,0.576702,0.660979,-0.002024
feature c,1.555728,0.576702,1.624997,0.483581,0.000626
feature d,0.737193,0.660979,0.483581,1.084855,-0.001858
uncorrelated,0.00188,-0.002024,0.000626,-0.001858,0.083437


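The next cell computes, entry by entry, the absolute difference between the original and MCAR covariance matrices as a percentage of the original covariance. Only entries involving `feature a` change; the rest differ by floating-point noise. The 22.5% entry for `uncorrelated` looks large only because the original covariance there is close to zero (about 0.0015). The negative signs appear because the denominator is the signed covariance, not its absolute value.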

In [11]:
(df.cov() - mcar_df.cov()).abs()/df.cov()*100

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
feature a,0.106528,0.2717151,0.1565769,0.1895896,22.50525
feature b,0.271715,5.548193e-13,3.465222e-13,3.527294e-13,-3.214156e-13
feature c,0.156577,3.465222e-13,1.77636e-13,1.492294e-13,3.930611e-12
feature d,0.18959,3.527294e-13,1.492294e-13,5.730948e-13,-1.516808e-13
uncorrelated,22.505254,-3.214156e-13,3.930611e-12,-1.516808e-13,4.324473e-13


Next, clobber a second column, `feature b`, on top of the MCAR set.

In [12]:
double_mcar_df = clobber(mcar_df, "feature b", 0.2)

double_mcar_df.to_csv('double_mcar_set.csv', index=False)
double_mcar_df.isnull().sum()

feature a       4040
feature b       4034
feature c          0
feature d          0
uncorrelated       0
dtype: int64

### $\color{purple}{\text{MAR}}$
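We induce MAR missingness in one and two columns, then save the files for later use. The probability is 0.4 rather than 0.2 because `clobber` only blanks rows whose dependency value is above its median (about half of the rows), which keeps the overall missingness near 20%.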

In [13]:
mar_df = clobber(df, "feature a", 0.4, depends=["feature c"])

mar_df.to_csv('mar_set.csv', index=False)
mar_df["feature a"].isnull().sum()

3965

In [14]:
stat_comparison(df, mar_df, "feature a")

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.225971,1.981865,0.244105,10.966242
median,2.223995,1.921194,0.302801,13.615194
stdev,1.278108,1.25224,0.025868,2.023903


For reference, here is the covariance matrix of the original dataset, now including the `uncorrelated` column.

In [15]:
df.cov()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
feature a,1.633559,0.625852,1.553296,0.735798,0.001534
feature b,0.625852,0.560295,0.576702,0.660979,-0.002024
feature c,1.553296,0.576702,1.624997,0.483581,0.000626
feature d,0.735798,0.660979,0.483581,1.084855,-0.001858
uncorrelated,0.001534,-0.002024,0.000626,-0.001858,0.083437


The expression below computes the same percentage difference in covariance as before, this time between the original and the MAR datasets.

(df.cov() - mar_df.cov()).abs()/df.cov()*100

In [16]:
double_mar_df = clobber(mar_df, "feature b", 0.4, depends=['feature d'])

double_mar_df.to_csv('double_mar_set.csv', index=False)
double_mar_df.isnull().sum()

feature a       3965
feature b       4041
feature c          0
feature d          0
uncorrelated       0
dtype: int64

## $\color{purple}{\text{Simple MAR Test}}$

The procedure is simple. For any column with missing data, construct a new indicator column that records whether that column is missing.

In [17]:
test_df = mcar_df.copy()

In [18]:
test_df['missingness']=test_df['feature a'].isnull()

Then test whether that new feature is "related" to any of the other columns. If it is, the missingness mechanism is MAR. We will use the "eyeball" test by looking at correlations. More statistically robust options include Student's t-test, or a logistic regression that predicts the missingness from the other features. Note that the correlation between `feature a` and its own missingness indicator is `NaN`, because the indicator is constant wherever `feature a` is observed.

In [19]:
test_df.corr()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,missingness
feature a,1.0,0.654402,0.953773,0.553083,0.005086,
feature b,0.654402,1.0,0.60439,0.8478,-0.009361,0.001734
feature c,0.953773,0.60439,1.0,0.364214,0.0017,0.004364
feature d,0.553083,0.8478,0.364214,1.0,-0.006177,-0.002379
uncorrelated,0.005086,-0.009361,0.0017,-0.006177,1.0,-0.00481
missingness,,0.001734,0.004364,-0.002379,-0.00481,1.0


Repeat for the MAR dataset.

In [20]:
test_df = mar_df.copy()

In [21]:
test_df['missingness']=test_df['feature a'].isnull()

In [22]:
test_df.corr()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,missingness
feature a,1.0,0.645647,0.951092,0.551391,0.003427,
feature b,0.645647,1.0,0.60439,0.8478,-0.009361,0.245944
feature c,0.951092,0.60439,1.0,0.364214,0.0017,0.404307
feature d,0.551391,0.8478,0.364214,1.0,-0.006177,0.146959
uncorrelated,0.003427,-0.009361,0.0017,-0.006177,1.0,0.000189
missingness,,0.245944,0.404307,0.146959,0.000189,1.0
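
As a step beyond eyeballing correlations, the sketch below computes Welch's t statistic for one feature, comparing rows where the target column is missing against rows where it is observed. It is a minimal illustration on a small synthetic frame with a seeded generator; `welch_t`, `demo`, and the column names `x`, `y`, `z` are hypothetical and not part of this notebook or `helpers.py`.

```python
import numpy as np
import pandas as pd

def welch_t(df, target, other):
    """Welch's t statistic for `other`, split by whether `target` is missing."""
    missing = df[target].isnull()
    a = df.loc[missing, other].dropna()
    b = df.loc[~missing, other].dropna()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(0)
n = 20000
demo = pd.DataFrame({"x": rng.normal(size=n), "z": rng.normal(size=n)})
demo["y"] = demo["x"] + rng.normal(scale=0.5, size=n)

# MAR: y goes missing with probability 0.4 only when x is above its median
drop = (demo["x"] > demo["x"].median()) & (rng.uniform(size=n) < 0.4)
demo.loc[drop, "y"] = np.nan

t_x = welch_t(demo, "y", "x")  # large: missingness of y depends on x
t_z = welch_t(demo, "y", "z")  # near zero: z is unrelated to the missingness
print(t_x, t_z)
```

A statistic far from zero for any observed feature is evidence that the missingness depends on that feature.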


We can do the same when multiple columns have missing data.

In [23]:
test_df = double_mcar_df.copy()

In [24]:
test_df['missingness_a']=test_df['feature a'].isnull()
test_df['missingness_b']=test_df['feature b'].isnull()
test_df.corr()

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,missingness_a,missingness_b
feature a,1.0,0.656268,0.953773,0.553083,0.005086,,-0.002164
feature b,0.656268,1.0,0.606866,0.848452,-0.010137,-0.000534,
feature c,0.953773,0.606866,1.0,0.364214,0.0017,0.004364,0.002447
feature d,0.553083,0.848452,0.364214,1.0,-0.006177,-0.002379,-0.001941
uncorrelated,0.005086,-0.010137,0.0017,-0.006177,1.0,-0.00481,0.003029
missingness_a,,-0.000534,0.004364,-0.002379,-0.00481,1.0,-0.00058
missingness_b,-0.002164,,0.002447,-0.001941,0.003029,-0.00058,1.0


## $\color{purple}{\text{Poor Man's Version of Little's MCAR Test (or rather not MAR Test)}}$

The test given above is a little awkward if more than one column has missing data. Originally, Little proposed the following test for MCAR. 

$\color{red}{\text ⚠}$ The code below demonstrates the simplified principle behind Little's MCAR Test but a lot of the statistical rigor has been relaxed.

We adopt the "eyeball" test of whether the statistics match or not. A formal test makes statistical assumptions that yield a $p$-value. In particular, Little made normality assumptions that result in a $\chi^2$ distribution.
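
For reference, Little's statistic is commonly written as follows. This is a summary of the standard form, not something computed in this notebook. Here a "pattern" is a group of observations that share the same set of missing columns, $J$ is the number of patterns, $n_j$ is the number of observations in pattern $j$, $\bar{y}_j$ is the sample mean of the columns observed in pattern $j$, and $\hat{\mu}_j$ and $\hat{\Sigma}_j$ are the maximum likelihood estimates of the overall mean and covariance restricted to those same columns.

$$d^2 = \sum_{j=1}^{J} n_j \, (\bar{y}_j - \hat{\mu}_j)^\top \, \hat{\Sigma}_j^{-1} \, (\bar{y}_j - \hat{\mu}_j)$$

Under MCAR and normality, $d^2$ is asymptotically $\chi^2$ distributed with $\sum_{j} p_j - p$ degrees of freedom, where $p_j$ is the number of columns observed in pattern $j$ and $p$ is the total number of columns.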

First, the observations are segregated into their missingness patterns. In our case there are only two patterns: the observation is complete, or the observation is missing `feature a`.

In [25]:
pattern1 = mar_df.dropna(subset=["feature a"])
pattern2 = mar_df[mar_df["feature a"].isnull()]

The formal version of Little's test uses maximum likelihood to estimate the statistical features of each group and compares them. If they are statistically the same, the missingness mechanism is declared MCAR.
Here we use the eyeball test.

In [26]:
pd.concat([pattern1.mean(), pattern2.mean(),pattern1.mean()- pattern2.mean()], axis="columns")

Unnamed: 0,0,1,2
feature a,1.981865,,
feature b,3.592969,4.05472,-0.46175
feature c,3.320484,4.613192,-1.292708
feature d,3.501761,3.885684,-0.383923
uncorrelated,0.495927,0.496064,-0.000137


You might also want to look at the covariances of each pattern.

In [27]:
pattern1.drop(columns=["feature a"]).cov()

Unnamed: 0,feature b,feature c,feature d,uncorrelated
feature b,0.551477,0.549309,0.654647,-0.002076
feature c,0.549309,1.550097,0.463217,0.000691
feature d,0.654647,0.463217,1.081674,-0.001982
uncorrelated,-0.002076,0.000691,-0.001982,0.083493


In [28]:
pattern2.drop(columns=["feature a"]).cov()

Unnamed: 0,feature b,feature c,feature d,uncorrelated
feature b,0.425117,0.20896,0.544591,-0.001865
feature c,0.20896,0.588234,0.168066,0.000222
feature d,0.544591,0.168066,0.97979,-0.001402
uncorrelated,-0.001865,0.000222,-0.001402,0.083232


Create a small helper to look at the means.

In [29]:
def littles_eyeball_test(df, column):
    """Column means where `column` is observed (0), missing (1), and their difference (2)."""
    pattern1 = df.dropna(subset=[column])
    pattern2 = df[df[column].isnull()]
    return pd.concat([pattern1.mean(), pattern2.mean(), pattern1.mean() - pattern2.mean()], axis="columns")

In [30]:
littles_eyeball_test(mar_df,'feature a')

Unnamed: 0,0,1,2
feature a,1.981865,,
feature b,3.592969,4.05472,-0.46175
feature c,3.320484,4.613192,-1.292708
feature d,3.501761,3.885684,-0.383923
uncorrelated,0.495927,0.496064,-0.000137


The "double missingness" sets exhibit 4 patterns, so we'll expand our experiment to cover all 4.

In [31]:
def littles_eyeball_test_double(df):
    """Column means for the four missingness patterns of `feature a` and `feature b`.

    Output columns: 0 = both missing, 1 = only a missing,
    2 = only b missing, 3 = neither missing.
    """
    pattern1 = df[df['feature a'].isnull() & df['feature b'].isnull()]
    pattern2 = df[df['feature a'].isnull() & ~df['feature b'].isnull()]
    pattern3 = df[~df['feature a'].isnull() & df['feature b'].isnull()]
    pattern4 = df[~df['feature a'].isnull() & ~df['feature b'].isnull()]
    return pd.concat([pattern1.mean(), pattern2.mean(), pattern3.mean(), pattern4.mean()], axis="columns")

In [32]:
littles_eyeball_test_double(double_mcar_df)

Unnamed: 0,0,1,2,3
feature a,,,2.219328,2.226222
feature b,,3.684886,,3.685884
feature c,3.637705,3.575252,3.569152,3.575182
feature d,3.572827,3.57298,3.57411,3.580387
uncorrelated,0.497588,0.492085,0.497722,0.496383


In [33]:
littles_eyeball_test_double(double_mar_df)

Unnamed: 0,0,1,2,3
feature a,,,2.5671,1.846344
feature b,,3.937308,,3.469348
feature c,4.721765,4.575289,3.687987,3.235383
feature d,4.507087,3.668753,4.395672,3.294761
uncorrelated,0.494836,0.496492,0.494785,0.496191
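
The two helpers above hard-code which columns can be missing. One possible generalization, sketched here under the assumption that every distinct combination of missing columns should form its own pattern, labels each row by its pattern and groups on that label. `pattern_means` is a hypothetical helper and is not part of `helpers.py`.

```python
import numpy as np
import pandas as pd

def pattern_means(df):
    """Column means per missingness pattern.

    Each pattern label has one character per column, in column order:
    "O" for observed and "M" for missing.
    """
    labels = pd.Series(
        ["".join("M" if m else "O" for m in row) for row in df.isnull().to_numpy()],
        index=df.index,
        name="pattern",
    )
    # Transpose so features are rows and patterns are columns
    return df.groupby(labels).mean().T

# Tiny illustrative frame: column "a" is missing in two of four rows
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [10.0, 20.0, 30.0, 40.0]})
result = pattern_means(toy)
print(result)
```

Applied to a frame such as `double_mar_df`, this would produce one column per pattern that actually occurs, without listing the patterns by hand.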


## $\color{purple}{\text{MNAR: The missingness you don't want}}$


### $\color{purple}{\text{How NOT to synthesize MNAR Missingness}}$

This is how "MNAR" is often synthesized in the literature to demonstrate imputation techniques. As you will see, this is not true MNAR.

In [34]:
fmnar_df = clobber(df, "feature a", 0.4, depends=["feature a"])
fmnar_df["feature a"].isnull().sum()

3932

In [35]:
stat_comparison(df, fmnar_df, "feature a")

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.225971,1.97675,0.249221,11.19606
median,2.223995,1.903549,0.320446,14.408575
stdev,1.278108,1.251113,0.026994,2.112052


In [36]:
littles_eyeball_test(fmnar_df, 'feature a')

Unnamed: 0,0,1,2
feature a,1.97675,,
feature b,3.589478,4.072865,-0.483387
feature c,3.339796,4.545123,-1.205327
feature d,3.465267,4.038037,-0.57277
uncorrelated,0.495141,0.499274,-0.004133


This is actually MAR. Recall from the definition in the previous section that MNAR means "the probability of data being missing depends on the unobserved data, **even after conditioning on the observed data**". It turns out that `feature a` is highly correlated with `feature c`, so there is a statistical dependency on `feature c` even though we constructed the missingness based on `feature a`.

### $\color{purple}{\text{A true MNAR missingness}}$

To be MNAR, the missingness must be uncorrelated with the observed data.

In [37]:
mnar_df = clobber(df, "uncorrelated", 0.4, depends=["uncorrelated"])

mnar_df.to_csv('mnar_set.csv', index=False)
mnar_df["uncorrelated"].isnull().sum()

3913

In [38]:
littles_eyeball_test(mnar_df, 'uncorrelated')

Unnamed: 0,0,1,2
feature a,2.225122,2.229458,-0.004336
feature b,3.684429,3.684849,-0.00042
feature c,3.57777,3.572624,0.005146
feature d,3.575599,3.587227,-0.011628
uncorrelated,0.435574,,


### $\color{purple}{\text{Final thoughts on MNAR}}$
You'll see it said that there is no statistical test for MNAR, which is true, but a better statement is that there is no statistical way to distinguish MCAR from MNAR. You can only test whether the missingness is MAR or not.

### $\color{purple}{\text{Conclusion on the Theory Section}}$

* There is no way to distinguish between MCAR and MNAR from the data alone.
* The so-called MCAR tests are really "not MAR" tests.
  * Most of those tests assume you have already excluded MNAR.
* If the missingness is not MAR, we recommend assuming the worst and treating it as MNAR.
* If the missingness is MAR, use multivariate imputation rather than deletion.
* Be careful when synthesizing MNAR missingness for benchmarking.
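
To illustrate the point about imputation versus deletion under MAR, here is a minimal sketch on synthetic data. It is not the treatment used in later notebooks: it assumes a single predictor `x`, a linear relationship, and a simple least-squares fit via `np.polyfit` as a stand-in for a full multivariate imputer. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
true_mean = y.mean()

# MAR: y is dropped with probability 0.4 when x is above its median
missing = (x > np.median(x)) & (rng.uniform(size=n) < 0.4)

# Deletion: estimate the mean from observed values of y only
deleted_mean = y[~missing].mean()

# Regression imputation: fit y ~ x on complete rows, predict the missing rows
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
y_imputed = np.where(missing, intercept + slope * x, y)
imputed_mean = y_imputed.mean()

print(true_mean, deleted_mean, imputed_mean)
```

Because the missingness depends only on the observed `x`, the regression fitted on complete rows remains valid for the missing rows, so the imputed mean stays close to the true mean while the deletion estimate is biased downward.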
