To support our effort to understand missing data, let us generate a complete dataset and then add columns in which values are removed by each missing-data mechanism.

# PRELIMINARIES

In [1]:
import numpy as np
import pandas as pd

# NumPy's random module; the standard-library `random` is not used in this notebook
from numpy import random

Assume our dataset comes from a weight-management program where initial weights are taken (in lbs). Let's not debate how much weight a person can realistically and safely lose; for illustration purposes, let us assume that all of these values are possible. Weights are then measured again in each of the next two months.

### INITIAL WEIGHTS

In [2]:
random.seed(19)
initial_weights = random.normal(150,30,10)
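
As a side note, the same kind of draw can be sketched with NumPy's newer `Generator` interface instead of the global seed. This is only an alternative under the assumption that reproducing the exact numbers in this article is not required: a `Generator` seeded with 19 produces different values from `random.seed(19)`. The name `initial_weights_alt` is made up for this sketch.

```python
import numpy as np

# Assumption: a local Generator replaces the global random.seed(...) call.
rng = np.random.default_rng(19)
initial_weights_alt = rng.normal(loc=150, scale=30, size=10)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(19)
same_draws = np.allclose(initial_weights_alt, rng_again.normal(150, 30, 10))
print(same_draws)
```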

### FIRST MONTH WEIGH-IN

Assume that after the initial weigh-in, some of our participants successfully lost weight and some gained weight.

In [3]:
random.seed(20)
first_weigh_in = random.normal(145,20,10)

### SECOND MONTH WEIGH-IN

At this point, let us assume that those who lost weight are determined to keep going and therefore have a higher probability of losing more weight (around 3 to 5 lbs); if they do gain weight, the gain is small (around 1 to 2 lbs). Those who gained, however, were either demotivated and gained more, or were inspired by those who lost weight and started to lose some weight themselves.

In [4]:
first_month_diff = first_weigh_in - initial_weights

second_weigh_in = [None] * 10

random.seed(21)
for i in range(len(first_month_diff)):
    if first_month_diff[i] > 0:
        second_weigh_in[i] = first_weigh_in[i] + random.randint(-3,7)
    else:
        second_weigh_in[i] = first_weigh_in[i] + random.randint(-5,3)
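
The loop above can also be sketched in vectorized form. The arrays below are short illustrative values (not the article's dataset), and the sketch assumes the `Generator.integers` method, whose upper bound is exclusive just like the legacy `randint`. Because it draws both sets of changes up front, it will not reproduce the exact numbers of the loop.

```python
import numpy as np

rng = np.random.default_rng(21)

# Illustrative weights in lbs (hypothetical, not the article's dataset)
initial = np.array([156.6, 139.8, 132.7, 137.9, 131.9])
first = np.array([162.7, 148.9, 152.2, 98.1, 123.3])

gained = (first - initial) > 0

# The upper bound is exclusive: -3..6 for gainers, -5..2 for losers
change_if_gained = rng.integers(-3, 7, size=first.size)
change_if_lost = rng.integers(-5, 3, size=first.size)

# Pick the change that matches each participant's first-month outcome
change = np.where(gained, change_if_gained, change_if_lost)
second = first + change
print(second)
```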
        

In [5]:
df = pd.DataFrame({"Initial":initial_weights,
                  "First Month": first_weigh_in,
                  "Second Month": second_weigh_in})
df

Unnamed: 0,Initial,First Month,Second Month
0,156.630098,162.677862,168.677862
1,139.78605,148.9173,153.9173
2,132.667544,152.15073,153.15073
3,137.879053,98.134762,93.134762
4,131.901317,123.303348,118.303348
5,142.684432,156.193926,161.193926
6,181.060312,163.789387,161.789387
7,167.422602,125.430379,124.430379
8,157.572412,155.061937,152.061937
9,106.123621,153.128289,151.128289


## SIMULATING MISSING DATA MECHANISM

To make sense of this simulation, we will apply the three different mechanisms to the "Second Month" observations. Thus, for each mechanism, the initial and first-month weigh-in observations remain fully available.

### MISSING COMPLETELY AT RANDOM (MCAR)

Missing Completely at Random is a mechanism where data is missing for completely random reasons; there is no specific structure as to why a value might be missing. For example, a participant may happen to be sick during the second-month weigh-in and simply miss it. The cause may also be something completely unrelated to the phenomenon you are studying or measuring, such as a car breakdown on the way to the gym. Other reasons include:

  *  **Data Management** - for example, an accidental deletion.

In [6]:
random.seed(25)
df["MCAR"] = random.binomial(1,0.5, size=10)*df['Second Month']
df["MCAR"] = df["MCAR"].replace(0, np.nan)

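As you can see in the code, we have generated a vector of Bernoulli random variables and used them as indicators to determine which observations will be missing. Multiplying by the indicator zeroes out the dropped observations, and those zeros are then replaced with NaN.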
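
The same idea can be sketched without the multiply-then-replace step by using the indicator directly as a mask. The values below are the second-month weights rounded from the tables in this article; the `Generator` seed and the names `observed` and `mcar` are assumptions of this sketch, so the pattern of missing values will differ from the one shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(25)

# Second-month weights, rounded to two decimals
second_month = pd.Series(
    [168.68, 153.92, 153.15, 93.13, 118.30,
     161.19, 161.79, 124.43, 152.06, 151.13],
    name="Second Month",
)

# One Bernoulli(0.5) draw per participant: True means the value is observed
observed = rng.binomial(1, 0.5, size=len(second_month)).astype(bool)

# where() keeps values where the mask is True and puts NaN elsewhere
mcar = second_month.where(observed)
print(mcar)
```

A mask also avoids the edge case where a genuine value of 0 would be turned into NaN by `replace(0, np.nan)`, although that cannot happen with weights.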

### MISSING AT RANDOM (MAR)

Suppose, for example, that people who gained weight in the first month instead of losing it got demotivated and purposely did not show up for the second-month weigh-in.

That is, and this is the important piece for MAR: **the observations in the initial and first-month weigh-ins determine whether the observation in the second month is missing**. Note that the missingness does not depend on the values of the second-month weigh-in themselves. For example, look at person number 10 (index 9): he lost some weight by the second weigh-in, but because missingness depends only on the initial and first weigh-in information, he never had the chance to find out and chose not to have this measurement taken.

This systematic relationship can be coded as follows:

In [7]:
random.seed(22)

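# Participants who gained weight in the first month keep their second-month
# value with probability 0.2; everyone else is always observed.
df["MAR"] = [
    df["Second Month"][i] * random.binomial(1, 0.2)
    if df["First Month"][i] - df["Initial"][i] > 0
    else df["Second Month"][i]
    for i in range(10)
]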

df["MAR"] = df["MAR"].replace(0, np.nan)
df

Unnamed: 0,Initial,First Month,Second Month,MCAR,MAR
0,156.630098,162.677862,168.677862,168.677862,
1,139.78605,148.9173,153.9173,153.9173,
2,132.667544,152.15073,153.15073,,
3,137.879053,98.134762,93.134762,,93.134762
4,131.901317,123.303348,118.303348,,118.303348
5,142.684432,156.193926,161.193926,,161.193926
6,181.060312,163.789387,161.789387,161.789387,161.789387
7,167.422602,125.430379,124.430379,,124.430379
8,157.572412,155.061937,152.061937,152.061937,152.061937
9,106.123621,153.128289,151.128289,,
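
A vectorized sketch of the same MAR rule is given here, using weights rounded from the table above. The `Generator` seed and the names `df_demo`, `gained_first`, `keep_prob` and `keep` are assumptions of this sketch, so the rows that end up missing will not match the table exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(22)

# Weights rounded to two decimals
df_demo = pd.DataFrame({
    "Initial": [156.63, 139.79, 132.67, 137.88, 131.90,
                142.68, 181.06, 167.42, 157.57, 106.12],
    "First Month": [162.68, 148.92, 152.15, 98.13, 123.30,
                    156.19, 163.79, 125.43, 155.06, 153.13],
    "Second Month": [168.68, 153.92, 153.15, 93.13, 118.30,
                     161.19, 161.79, 124.43, 152.06, 151.13],
})

# Missingness is driven only by observed columns: Initial and First Month
gained_first = df_demo["First Month"] > df_demo["Initial"]

# Keep probability: 0.2 for those who gained, 1.0 for everyone else
keep_prob = np.where(gained_first, 0.2, 1.0)
keep = rng.random(len(df_demo)) < keep_prob

df_demo["MAR"] = df_demo["Second Month"].where(keep)
print(df_demo)
```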


### MISSING NOT AT RANDOM (MNAR)

Now, this is where the mechanism becomes a little bit tricky. Suppose that people who gained weight during the second month purposely did not show up for the second-month weigh-in.

In this scenario, the probability of a value being missing is directly related to the missing value itself. We call such data "Missing Not at Random", or MNAR.

Unlike MAR, where the probability of missingness is related to the **other observed data**, MNAR has a structure that is directly related to the **missing observations** themselves.

This structure can be coded as follows:

In [8]:
random.seed(34)
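# Participants who gained weight in the second month keep their value with
# probability First Month / (4 * Second Month); everyone else is always observed.
df["MNAR"] = [
    df["Second Month"][i]
    * random.binomial(1, (1 / (df["Second Month"][i] * 4 / df["First Month"][i])))
    if df["Second Month"][i] - df["First Month"][i] > 0
    else df["Second Month"][i]
    for i in range(10)
]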

df["MNAR"] = df["MNAR"].replace(0, np.nan)
df

Unnamed: 0,Initial,First Month,Second Month,MCAR,MAR,MNAR
0,156.630098,162.677862,168.677862,168.677862,,
1,139.78605,148.9173,153.9173,153.9173,,153.9173
2,132.667544,152.15073,153.15073,,,
3,137.879053,98.134762,93.134762,,93.134762,93.134762
4,131.901317,123.303348,118.303348,,118.303348,118.303348
5,142.684432,156.193926,161.193926,,161.193926,
6,181.060312,163.789387,161.789387,161.789387,161.789387,161.789387
7,167.422602,125.430379,124.430379,,124.430379,124.430379
8,157.572412,155.061937,152.061937,152.061937,152.061937,152.061937
9,106.123621,153.128289,151.128289,,,151.128289


Of the three mechanisms we have considered, MNAR creates the most difficult situation to overcome. If you look closely at the relationship we have modeled, the greater the weight gained in the second month relative to the first, the higher the probability that the second-month weigh-in is missing. The tricky part is this: the data scientist cannot know this relationship, because the values that drive it have not been observed.
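
To make the modeled relationship concrete, the keep probability used in the MNAR code, First Month / (4 × Second Month) for those who gained, can be tabulated directly. This is a sketch on weights rounded from the tables in this article; the names `keep_prob` and `miss_prob` are introduced only for illustration.

```python
import pandas as pd

# First- and second-month weights, rounded to two decimals
first = pd.Series([162.68, 148.92, 152.15, 98.13, 123.30,
                   156.19, 163.79, 125.43, 155.06, 153.13])
second = pd.Series([168.68, 153.92, 153.15, 93.13, 118.30,
                    161.19, 161.79, 124.43, 152.06, 151.13])

gained_second = second > first

# 1 / (second * 4 / first) simplifies to first / (4 * second);
# participants who did not gain are always observed (probability 1.0)
keep_prob = (first / (4 * second)).where(gained_second, 1.0)
miss_prob = 1 - keep_prob

print(pd.DataFrame({"gained": gained_second,
                    "keep_prob": keep_prob.round(3),
                    "miss_prob": miss_prob.round(3)}))
```

Because Second Month exceeds First Month for everyone who gained, the keep probability always stays below 0.25 and shrinks as the relative gain grows.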

So this is the challenge in distinguishing MAR from MNAR: to classify data as MNAR, one must establish a relationship between the missing values themselves and the probability that they are missing, whereas for MAR one can establish the relationship by looking at the observed, available data alone.
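
For the MAR column, the relationship can indeed be checked from observed data alone. A minimal sketch is to cross-tabulate the first-month outcome against the missingness indicator, using the rounded values from the tables in this article. The labels and the names `first_month` and `mar_status` are introduced only for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
df_demo = pd.DataFrame({
    "Initial": [156.63, 139.79, 132.67, 137.88, 131.90,
                142.68, 181.06, 167.42, 157.57, 106.12],
    "First Month": [162.68, 148.92, 152.15, 98.13, 123.30,
                    156.19, 163.79, 125.43, 155.06, 153.13],
    "MAR": [nan, nan, nan, 93.13, 118.30,
            161.19, 161.79, 124.43, 152.06, nan],
})

# Both series are built only from information the analyst can actually see
first_month = pd.Series(
    np.where(df_demo["First Month"] > df_demo["Initial"], "gained", "lost"),
    name="first_month",
)
mar_status = pd.Series(
    np.where(df_demo["MAR"].isna(), "missing", "observed"),
    name="MAR",
)

table = pd.crosstab(first_month, mar_status)
print(table)
```

Every missing MAR value belongs to someone who gained weight in the first month. The same check is not possible for MNAR, because the second-month gain that drives the missingness is exactly what is unobserved.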

## BONUS: TABLE VISUALIZATION

In [12]:
#Update to show only two decimals
s = df.style.format('{:.2f}', na_rep="N/A")


# #Hover
cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover', #td for cell
    'props': [('background-color', '#ffffb3')]
}

s.set_table_styles([cell_hover])

# #Applying Function to Show NA will apply to the entire dataframe
#s.applymap(lambda x: 'color:blue;background-color:yellow' if pd.isnull(x) else '')

#Customized to highlight per cell (tedious but will do for this article)
s.set_table_styles([  # create internal CSS classes
    {'selector': '.true', 'props': 'background-color: #e6ffe6;'},
    {'selector': '.false', 'props': 'background-color: red;color:yellow'},
], overwrite=False)

cell_color = pd.DataFrame([['true','false', 'false'],
                           ['true','false', 'true'],
                           ['false','false', 'false'],
                           ['false','true', 'true'],
                           ['false','true', 'true'],
                           ['false','true', 'false'],
                           ['true','true', 'true'],
                           ['false','true', 'true'],
                           ['true','true', 'true'],
                           ['false','false', 'true']],
                          index=df.index,
                          columns=df.columns[3:])
s.set_td_classes(cell_color)

s

Unnamed: 0,Initial,First Month,Second Month,MCAR,MAR,MNAR
0,156.63,162.68,168.68,168.68,,
1,139.79,148.92,153.92,153.92,,153.92
2,132.67,152.15,153.15,,,
3,137.88,98.13,93.13,,93.13,93.13
4,131.9,123.3,118.3,,118.3,118.3
5,142.68,156.19,161.19,,161.19,
6,181.06,163.79,161.79,161.79,161.79,161.79
7,167.42,125.43,124.43,,124.43,124.43
8,157.57,155.06,152.06,152.06,152.06,152.06
9,106.12,153.13,151.13,,,151.13
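
The hand-typed `cell_color` frame above is tedious to maintain. As a sketch, the same 'true'/'false' classes can be derived from the missingness pattern itself. The values are rounded from the table above, and the name `mechanisms` is introduced only for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
mechanisms = pd.DataFrame({
    "MCAR": [168.68, 153.92, nan, nan, nan,
             nan, 161.79, nan, 152.06, nan],
    "MAR": [nan, nan, nan, 93.13, 118.30,
            161.19, 161.79, 124.43, 152.06, nan],
    "MNAR": [nan, 153.92, nan, 93.13, 118.30,
             nan, 161.79, 124.43, 152.06, 151.13],
})

# 'true' marks an observed cell, 'false' a missing one
cell_color = pd.DataFrame(
    np.where(mechanisms.notna(), "true", "false"),
    index=mechanisms.index,
    columns=mechanisms.columns,
)
print(cell_color)
```

The resulting frame has the same shape and labels as the hand-typed one, so it could be passed to `s.set_td_classes(cell_color)` in the same way.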


In [10]:
# tuples = [("Weights", "Initial"),
#          ("Weights", "First Month"),
#          ("Weights", "Second Month"),
#          ("Mechanism", "MCAR"),
#          ("Mechanism", "MAR"),
#          ("Mechanism", "MNAR")]

# idx = pd.MultiIndex.from_tuples(tuples)
# df.columns = idx
# df

In [11]:
pd.__version__

'1.3.0'

### REFERENCES

https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html