In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# set defaults
plt.style.use('seaborn-white')   # seaborn custom plot style; matplotlib >= 3.6 names it 'seaborn-v0_8-white'
plt.rc('figure', dpi=100, figsize=(7, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

### Lecture 8 - Part 1

# Missingness Mechanisms

## Imperfect Data

<img src="imgs/image_0.png">

* The "true" (probability) model is an idealized approximation of the Data Generating Process.
* The data generating process is the phenomenon we want to understand.
* The recorded data is *supposed* to "well represent" the data generating process.

## Imperfect Data

<img src="imgs/image_1.png">

* Problem 1: your data is not representative (poor sample of events).
* Problem 2: some of the entries are missing (incomplete measurements).

These are only problems when there is *systematic bias* in the result!

## Imperfect Data

* Non-representative samples are identified with domain research!
    - Does the description of the data look like your understanding of the data generating process? (e.g., salary data that covers only tech workers)
* Incomplete measurements are missing data!
    - Understanding how *portions* of your data are missing affects the quality of your sample.
    - There are techniques to understand when missing measurements are representative of the rest of the sample.

# Types of Missingness

* Missing by Design (MD)
    
* Ignorable Missing Data:
    - Unconditionally ignorable (Missing Completely at Random: MCAR)
    - Conditionally ignorable (Missing at Random: MAR)
    
* Non-Ignorable Missing Data (Not Missing at Random: NMAR)

Important for determining how to *handle* missing data (next lecture).

[(see wikipedia synopsis)](https://en.wikipedia.org/wiki/Missing_data)

## Missing by Design (MD)

<div class="image-txt-container">
    
* Missingness was an intentional choice by designers of the data collection process.
* We can predict when/why a value is missing from only the other columns.
    - e.g., Column X is missing if and only if Column A, B, C are ...


<img src=./imgs/households.png width=50%>

    
[(reference)](https://stats.stackexchange.com/questions/201782/meaning-of-missing-by-design-in-longitudinal-studies)

## Missing by Design


<img src="./imgs/Skiplogic.png"/>

</div>


## Other types of missingness

- Missing Completely at Random
    - Chance of missingness is totally independent of other columns and the actual missing value
- Missing at Random
    - Chance of missingness depends on other columns, but **not** the actual, missing value
- Not Missing at Random
    - Chance of missingness depends on the actual, missing value
    - Weird name, because it's still random

## The dog ate my data

- Surveyed people and asked for "favorite color" and "birthday month"
- Wrote answers on index card:
    - left side: color, right side: birthday month
- Dog grabs 10 cards off the top of the stack, chews off right side (birthday month)
- Now ten people are missing birthday month

<img width=50% src="imgs/doggo.jpg">

## Discussion question

Birthday month is now missing. What is the type of missingness if:

1. cards were sorted by favorite color?
2. cards were sorted by birthday month?
3. cards were shuffled?

Remember:

- Missing Completely at Random
    - Chance of missingness is totally independent of other columns and the actual missing value
- Missing at Random
    - Chance of missingness depends on other columns, but **not** the actual, missing value
- Not Missing at Random
    - Chance of missingness depends on the actual, missing value

## Examples

- Cards were sorted by favorite color
    - The fact that a card is missing a month is related to the favorite color.
    - Missing at Random
- Cards were sorted by birthday month
    - The fact that a card is missing a month is related to the missing month.
    - Not Missing at Random
- Cards were shuffled.
    - The fact that a card is missing a month is related to nothing.
    - Missing Completely at Random
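
The three scenarios can be simulated. The sketch below is a toy version of the story, not part of the original survey: the stack size, the number of chewed cards, and the color list are all made-up values. It marks the top of each stack as chewed and reports which colors and months lost data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cards, n_chewed = 1200, 100   # hypothetical sizes, larger than the slide's 10 cards

cards = pd.DataFrame({
    'color': rng.choice(['red', 'green', 'blue'], size=n_cards),
    'month': rng.integers(1, 13, size=n_cards),
})

def chew_top_of_stack(stack, n=n_chewed):
    """Mark the first n cards of the stack as having lost their birthday month."""
    stack = stack.reset_index(drop=True)
    return stack.assign(month_missing=stack.index < n)

stacks = {
    'sorted by color': cards.sort_values('color'),
    'sorted by month': cards.sort_values('month'),
    'shuffled': cards.sample(frac=1, random_state=0),
}
results = {name: chew_top_of_stack(stack) for name, stack in stacks.items()}

for name, chewed in results.items():
    lost = chewed[chewed['month_missing']]
    print(name)
    print('  colors that lost data:', sorted(lost['color'].unique()))
    print('  months that lost data:', sorted(lost['month'].unique()))
```

In the color-sorted stack only one color loses data (MAR: missingness is explained by an observed column). In the month-sorted stack only the earliest months lose data (NMAR). In the shuffled stack the losses are spread over all colors and months (MCAR).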

## The Real World...

- ...is messy.
- Sometimes requires domain knowledge to understand what might influence missingness.
- Sometimes can be borderline ("weakly" NMAR).

# Unconditionally ignorable (MCAR: missing completely at random)

<div class="image-txt-container">

* Chance of missingness is totally independent of other columns and the actual missing value
* Example 1: randomly-chosen subset of survey questions
* Example 2: Water damage to paper forms prior to entry (assuming shuffled forms).
* **Non**-example: optional question, "how often do you donate to charity?"
    - people who don't donate are likely to leave blank (NMAR)
<img src="imgs/water.jpg" width="50%"/>

</div>

# Non-Ignorable (NMAR: Not missing at Random), can't model from the data

* A missing value depends on the value of the (actual, unreported) variable that's missing.
* Example 1: people with high income are less likely to report income.
* Example 2: a person doesn't take a drug test because they took drugs the day before.
* This phenomenon cannot be determined from the observed data; it must be reasoned from domain expertise on the data-generating process.


# Conditionally Ignorable (MAR: Missing at Random)

<div class="image-txt-container">

* Chance of missingness depends on other columns, but *not* the actual, missing value
* Example 1: Missing blood test result. For really sick patients, clinicians may not draw blood for routine labs.
* Example 2: Missing income. People working in a Service Industry are less likely to report.

<img src="imgs/tip.jpg" width="50%"/>

</div>

# But wait...

* Can't I argue that these are NMAR (the missingness depends on the value of the missing data)?
- Example:
    - Missing blood test result. Sick patients will have lower blood oxygen level, for instance.
    - So missingness *does* depend on actual, missing value.
- Yes. But then almost everything is NMAR.
- What is the *main effect*?
- If the other columns *mostly* explain the missing value and missingness, treat it as MAR.

### Discussion Questions

For each of the following datasets, decide whether they are MD, MCAR, MAR, NMAR:

* A table (for a medical study) with column `gender` and column `age`. Age has missing values.
* Measurements from the Hubble Space Telescope (dropped data during transmission).
* SAT scores reported by an institution for College Ranking scores.
* A table with a single column: self-reported education (with missing values).
* Midterm report with three columns (`ver.1`, `ver.2`, `ver.3`). ⅔ of the entries in the report are `NaN`.



### Diagnosis of missingness:

* Depends on the dataset and its attributes.
* Depends on the population / data generating process under consideration.
* Requires understanding the severity and effect of each possible type of missingness.

A dataset with missing values is likely not a representative sample of the true population! 

##  Missing Summary

* **MCAR**: Data is *Unconditionally Ignorable* or *Missing Completely at Random* if there is no relationship between the missingness of the data and any values, observed or missing.
    - MCAR doesn't bias the observed data.

* **MAR**:  Data is *conditionally ignorable* or *Missing at Random* if there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data. 
    - MAR biases the observed data, but is fixable.

* **NMAR**: Data is *non-ignorable* or *"Not Missing at Random"* if there is a relationship between the propensity of a value to be missing and its values.
    - non-ignorable missing data biases the observed data in unobservable ways.
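
The summary above can be checked on simulated data. The following is a sketch under invented assumptions: the `sector` and `income` columns, their distributions, and the missingness probabilities are all hypothetical. It compares the naive mean of the observed values with a mean adjusted by the observed `sector` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical population: income depends on an always-observed column, `sector`.
sector = rng.choice(['service', 'office'], size=n)
income = np.where(sector == 'service', 40.0, 50.0) + rng.normal(0, 10, size=n)
df = pd.DataFrame({'sector': sector, 'income': income})

# Probability that `income` is missing under each mechanism (assumed values).
p_missing = {
    'MCAR': np.full(n, 0.3),                                           # constant
    'MAR': np.where(df['sector'] == 'service', 0.5, 0.1),              # observed column only
    'NMAR': np.where(df['income'] > df['income'].median(), 0.5, 0.1),  # the value itself
}

true_mean = df['income'].mean()
weights = df['sector'].value_counts(normalize=True)   # sector shares in the full data

rows = {}
for name, p in p_missing.items():
    observed = df[rng.uniform(size=n) >= p]           # rows whose income survived
    naive = observed['income'].mean()
    # per-sector means of the observed rows, reweighted to the full-data sector shares
    adjusted = (observed.groupby('sector')['income'].mean() * weights).sum()
    rows[name] = {'naive': naive, 'adjusted by sector': adjusted}

summary = pd.DataFrame(rows).T
summary['true'] = true_mean
print(summary.round(2))
```

Under these assumptions the naive mean is close to the truth only for MCAR. Conditioning on `sector` repairs the MAR case but not the NMAR case, where the bias comes from values that were never observed.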

## Unconditionally Ignorable (MCAR) formal definition:

Suppose we have:
- a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- a parameter $\psi$ independent of the dataset.

**MCAR**: Data is *Unconditionally ignorable* if 

$$Pr({\rm data\ is\ present\ } | Y_{obs}, Y_{mis}, \psi) = Pr({\rm data\ is\ present\ } |\ \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

## Conditionally ignorable (MAR) formal definition

Suppose we have:
- a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- a parameter $\psi$ independent of the dataset.


**MAR**: Data is *Conditionally ignorable* if 

$$Pr({\rm data\ is\ present\ } | Y_{obs}, Y_{mis}, \psi) = Pr({\rm data\ is\ present\ } |\ Y_{obs}, \psi)$$

That is, *MAR data is actually MCAR, conditional on $Y_{obs}$*

## Non-ignorable missing data (NMAR) formal definition

Suppose we have:
- a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- a parameter $\psi$ independent of the dataset.


**NMAR**: Data is *non-ignorably missing* if 

$$Pr({\rm data\ is\ present\ }| Y_{obs}, Y_{mis}, \psi)$$

does not simplify. That is, in *NMAR* data, missingness is dependent on the missing value itself.

### Part 2

# Assessing Missingness

## Assessing Missingness

- Suppose I believe that missingness is (MCAR, NMAR, MAR).
- Can I check whether this is true?

## How to assess the mechanism of missingness: NMAR
* Cannot determine NMAR from the data alone; it depends on the unobserved.
* Must be reasoned by the data generating process, or more data should be collected.
* The strength of the dependence on $Y_{mis}$ determines how strongly NMAR the data is.
    - If the dependence on the missing values is weak, then *most* of the missingness is explainable by observed values!

### Discussion Question

* Consider a dataset of survey data of people's self-reported happiness.
    - The data contain an identifier and happiness score; nothing else.
* Is the data likely NMAR?

## How to assess the mechanism of missingness: MAR

* Data are MAR if missingness only depends on *observed* data.
* Data is MAR if it's determined to not be NMAR (assumption on data generating process).
* Adding further measurements may reduce the effect of NMAR.
    - income in census is NMAR; less so when adding geography, education, race...

## How to assess the mechanism of missingness: MCAR

- Say we have two variables: screen size and weight.
- Some sizes are missing.
- They are MCAR iff the distribution of weight when size is missing is the same as when size is not missing.
- A/B test!
    - Do the two distributions come from the same underlying distribution?

## How to assess the mechanism of missingness: MCAR

Assuming that the data is not NMAR, you can test if data are MCAR.

A column `c_test` is MCAR if its missingness $R$ is independent of the data.
* For each column `c`, check that the missingness rates of `c_test` are the same across values of `c`.
* That is, the distribution of `c` when `c_test.isnull()` is 'the same' as the distribution of `c` when `c_test.notnull()`.
* The phrase 'the same' needs to be made statistically precise!
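
One way to make 'the same' precise for a categorical column `c` is a permutation test with the total variation distance (TVD). The helper below is a hypothetical sketch, not a library function. The demo columns `group` and `value` are synthetic, and `c` is assumed to have no missing values of its own.

```python
import numpy as np
import pandas as pd

def tvd_by_missingness(values, is_null):
    """TVD between the distribution of `values` where is_null is True vs. False."""
    dist = pd.crosstab(is_null, values, normalize='index')   # each row sums to 1
    return dist.diff().iloc[-1].abs().sum() / 2

def missingness_permutation_test(df, c_test, c, n_repetitions=200, seed=0):
    """Permutation p-value for 'missingness of c_test does not depend on c'."""
    rng = np.random.default_rng(seed)
    is_null = df[c_test].isnull().to_numpy()
    values = df[c].to_numpy()
    observed = tvd_by_missingness(values, is_null)
    simulated = np.array([
        tvd_by_missingness(rng.permutation(values), is_null)
        for _ in range(n_repetitions)
    ])
    return observed, np.mean(simulated >= observed)

# Synthetic demo data (made up for illustration)
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    'group': rng.choice(['a', 'b', 'c'], size=n),
    'value': rng.normal(size=n),
})

mcar = demo.copy()
mcar.loc[rng.uniform(size=n) < 0.3, 'value'] = np.nan          # same rate everywhere

mar = demo.copy()
p = mar['group'].map({'a': 0.6, 'b': 0.2, 'c': 0.1}).to_numpy()
mar.loc[rng.uniform(size=n) < p, 'value'] = np.nan             # rate depends on group

obs_mcar, p_mcar = missingness_permutation_test(mcar, 'value', 'group')
obs_mar, p_mar = missingness_permutation_test(mar, 'value', 'group')
print('MCAR: tvd=%.3f, p-value=%.3f' % (obs_mcar, p_mcar))
print('MAR:  tvd=%.3f, p-value=%.3f' % (obs_mar, p_mar))
```

A small p-value is evidence that the missingness of `c_test` depends on `c`, so the data are not MCAR. A large p-value only means the test found no such dependence.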

### Checking data are MCAR: heights data
* Start with complete dataset of child heights, gender of the child, and parent heights.
* Blank out rows to create MCAR data.

In [None]:
heights = pd.read_csv('data/midparent.csv')

heights['child'] = heights.childHeight
heights = heights.drop(['family', 'midparentHeight', 'children', 'childNum', 'childHeight'], axis=1)
heights.head()

In [None]:
heights.isnull().mean()

In [None]:
# What are the data types?
# Gender: categorical
# father, mother and child: numerical

In [None]:
# distribution of heights
pd.plotting.scatter_matrix(heights.drop('gender', axis=1));

## Simulating missing data

In [None]:
# create missing data
# How was it created?
np.random.seed(42)

heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan   # np.NaN was removed in NumPy 2.0

In [None]:
heights_mcar.isnull().mean()

### Verifying that child heights are MCAR in `heights_mcar`
* Check the data look the 'same' when `child` is null vs not-null
    - Is the empirical distribution of gender similar for null/not-null?
    - Is the empirical distribution of parent heights similar for null/not-null?

In [None]:
heights_mcar.sample(n=10)

In [None]:
# conditional empirical distribution of gender by null and not-null

distr = (
    heights_mcar
    .assign(is_null=heights_mcar.child.isnull())
    .pivot_table(index='is_null', columns='gender', aggfunc='size')
)
distr = (distr.T / distr.sum(axis=1)).T
distr

# rows add up to 1, proportion of male/female

### Comparing Null vs. Non-Null (`child`) distributions: `gender`

* Are the distributions 'similar enough'? 
    - If yes, then missingness of `child` is *not* dependent on `gender`
* We have 
    - two groups: missing and not missing
    - We have distribution for these two groups
    - Are these distributions similar?
* Use a permutation test to assess whether the two distributions are similar.
* For categorical distributions, what test statistic should we use?

In [None]:
distr.T.plot(kind='bar');

In [None]:
n_repetitions = 500

tvds = []
for _ in range(n_repetitions):
    
    # shuffle the gender column
    shuffled_col = (
        heights_mcar['gender']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        heights_mcar
        .assign(**{
            'gender': shuffled_col,
            'is_null': heights_mcar['child'].isnull()
        })
    )
    
    # compute the tvd
    shuffled = (
        shuffled
        .pivot_table(index='is_null', columns='gender', aggfunc='size')
        .apply(lambda x:x / x.sum(), axis=1)
    )
    
    tvd = shuffled.diff().iloc[-1].abs().sum() / 2
    # add it to the list of results
    
    tvds.append(tvd)

In [None]:
obs = distr.diff().iloc[-1].abs().sum() / 2
obs

In [None]:
# The similarity is very high
pval = np.mean(np.array(tvds) >= obs)   # convert the list before comparing elementwise
pd.Series(tvds).plot(kind='hist', density=True, alpha=0.8, title='p-value: %f' % pval)
plt.scatter(obs, 0, color='red', s=40);

### Comparing Null vs. Non-Null (`child` height) distributions: `father` (height)

* Are the distributions 'similar enough'? 
    - If yes, then missingness of `child` is *not* dependent on height of `father`.
    - If no, then e.g. taller fathers are more likely to not report child height.
* In this case, it's 'clear' the distributions are similar.
* Assess with permutation test.
* As test statistic, use something like difference in means.
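
A permutation test with the absolute difference in means could look like the sketch below. It runs on synthetic father heights so that it is self-contained; the height distribution and both missingness patterns are assumptions, not the lecture data.

```python
import numpy as np

def abs_diff_in_means(values, is_null):
    """Absolute difference between the group means (null vs. not-null)."""
    return abs(values[is_null].mean() - values[~is_null].mean())

def permutation_pvalue(values, is_null, n_repetitions=1000, seed=0):
    """Shuffle `values` against the fixed missingness labels and compare statistics."""
    rng = np.random.default_rng(seed)
    observed = abs_diff_in_means(values, is_null)
    simulated = np.array([
        abs_diff_in_means(rng.permutation(values), is_null)
        for _ in range(n_repetitions)
    ])
    return observed, np.mean(simulated >= observed)

# Synthetic stand-in for the `father` column (assumed mean and spread, in inches)
rng = np.random.default_rng(1)
n = 900
father = rng.normal(69, 2.5, size=n)

# Two hypothetical missingness patterns for `child`
null_mcar = rng.uniform(size=n) < 0.3                      # unrelated to father height
null_mar = (father > 72) & (rng.uniform(size=n) < 0.5)     # tall fathers report less

obs_mcar, p_mcar = permutation_pvalue(father, null_mcar)
obs_mar, p_mar = permutation_pvalue(father, null_mar)
print('MCAR: diff=%.3f, p-value=%.3f' % (obs_mcar, p_mcar))
print('MAR:  diff=%.3f, p-value=%.3f' % (obs_mar, p_mar))
```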

In [None]:
# heights: counts
# how much child data is missing?
# Shape? 
(
    heights_mcar
    .assign(is_null=heights_mcar.child.isnull())
    .groupby('is_null')
    .father
    .plot(kind='hist', legend=True, title='father height by missingness of child height')
);

In [None]:
# heights: distributions
(
    heights_mcar
    .assign(is_null=heights_mcar.child.isnull())
    .groupby('is_null')
    .father
    .plot(kind='kde', legend=True, title='father height by missingness of child height')
);

### Child heights data: MAR
* MAR is an *assumption* about the data
    - Is it reasonable to assume that a missing `child` height is explainable using the gender of the child and the height of the parents?
* Once MAR is assumed, can show data is *not* MCAR
    - Show how missingness of `child` depends on other columns.

In [None]:
# build MAR dataset
# blank rows based on conditions


heights_mar = heights.copy()
for i, row in heights.iterrows():
    rand = np.random.uniform()
    if (row['father'] > 72) and rand < 0.5:
        heights_mar.loc[i, 'child'] = np.nan   # np.NaN was removed in NumPy 2.0
    elif (row['gender'] == 'female') and rand > 0.7:
        heights_mar.loc[i, 'child'] = np.nan


In [None]:
# different missingness rates -- not MCAR

distr = (
    heights_mar
    .assign(is_null=heights_mar.child.isnull())
    .pivot_table(index='is_null', columns='gender', aggfunc='size')
    .apply(lambda x:x / x.sum(), axis=1)
)
distr.T.plot(kind='bar', title='Distribution of Gender when child height is null/not-null');

In [None]:
# How is it going to affect the stats of the child heights?
# It is going to create bias upward (larger than usual), 
# because women are on average shorter than men

In [None]:
# Different missingness rates -- not MCAR
# Distribution of fathers heights when child's height is null/not null 
# Why is the right side of the graph different?

(
    heights_mar
    .assign(is_null=heights_mar.child.isnull())
    .groupby('is_null')
    .father
    .plot(kind='kde', legend=True, title='father height by missingness of child height')
);

### Missingness of `child` attribute: MAR case

* The distributions above are clearly different.
* What if their similarity was harder to determine? 
    - Permutation Test: are the distributions of column `X` when `child` is Null/Not-Null different?
* For dependence on a categorical attribute: use TVD for the test-statistic.
* For dependence on a quantitative attribute: ???

### Part 3

# Kolmogorov-Smirnov Test Statistics

## A/B Tests

- Do empirical distribution A and empirical distribution B actually come from the same underlying distribution?
    - test this with a permutation test
- If A and B are *categorical* distributions, use TVD.
- If A and B are *quantitative* distributions, use, e.g., (absolute) difference in means.

## Difference in means

In [None]:
N = 1000 # number of samples for each distribution

# Distribution 'A'
distr1 = pd.Series(np.random.normal(0, 1, size=N//2))

# Distribution 'B'
distr2 = pd.Series(np.random.normal(3, 1, size=N//2))

In [None]:
data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})


In [None]:
mA, mB = data.groupby('group')['data'].mean().tolist()
title = 'mean of A: %f\n mean of B: %f' % (mA, mB)

data.groupby('group')['data'].plot(kind='kde', legend=True, title=title);

### Discussion Question

* We determined that two distributions were likely different because their means were different.
* Can you think of two *different* distributions with the same mean?
* What would our permutation test say about these distributions?

### Different distributions; same mean

Can a permutation test using the test statistic "difference in means" differentiate between distributions with similar means?

In [None]:
N = 1000 # number of samples for each distribution

# Distribution 'A'
a = pd.Series(np.random.normal(0, 1, size=N//2))
b = pd.Series(np.random.normal(4, 1, size=N//2))
distr1 = pd.concat([a,b], ignore_index=True)

# Distribution 'B'
distr2 = pd.Series(np.random.normal(distr1.mean(), distr1.std(), size=N))

In [None]:
data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})


In [None]:
mA, mB = data.groupby('group')['data'].mean().tolist()
title = 'mean of A: %f\n mean of B: %f' % (mA, mB)

data.groupby('group')['data'].plot(kind='kde', legend=True, title=title);

In [None]:
n_repetitions = 500

means = []
for _ in range(n_repetitions):
    
    # shuffle the data column
    shuffled_col = (
        data['data']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        data
        .assign(**{
            'data': shuffled_col,
        })
    )
    
    # compute the differences in means
    mean = shuffled.groupby('group')['data'].mean().diff().abs().iloc[-1]
    
    means.append(mean)
    
    
obs = data.groupby('group')['data'].mean().diff().abs().iloc[-1]

pval = np.mean(np.array(means) >= obs)   # convert the list before comparing elementwise

pd.Series(means).plot(kind='hist', density=True, alpha=0.8, title='p-value: %f' % pval)

plt.scatter(obs, 0, color='red', s=40);

### Telling quantitative distributions apart

* Difference in means works for A/B testing *only* if the two distributions have similar shapes
    - It actually tests to see if one is a shifted version of the other
* Need a better test-statistic to differentiate between shape of distributions!
* Need a 'distance' between quantitative distributions:
    - Measure the (absolute) difference between probabilities for nearby events?
    - Why can't we use TVD?  (It is not categorical)

In [None]:
data.groupby('group')['data'].plot(kind='kde', legend=True);

## KS-Statistic
    
* Kolmogorov-Smirnov test-statistic: a distance between two distributions (larger values mean less similar).
* Defined using the *Cumulative Distribution Function* instead of the density function.
* The KS statistic roughly measures the largest difference between two empirical CDFs.
* Python library: `scipy.stats.ks_2samp`

<img src=./imgs/KS2_Example.png width=50%>
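
To see what "largest difference between two empirical CDFs" means, the statistic can be computed by hand and compared with `scipy.stats.ks_2samp`. This is a minimal sketch: `ks_statistic` is a hypothetical helper name, and the two samples are synthetic (a bimodal sample and a normal sample with matching mean and standard deviation).

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(x, y):
    """Largest vertical gap between the empirical CDFs of x and y."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])   # the gap can only change at observed points
    cdf_x = np.searchsorted(x, grid, side='right') / len(x)
    cdf_y = np.searchsorted(y, grid, side='right') / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

rng = np.random.default_rng(0)
a = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])   # two modes
b = rng.normal(a.mean(), a.std(), 1000)                               # one mode, same mean/std

print('difference in means:', abs(a.mean() - b.mean()))
print('KS by hand:         ', ks_statistic(a, b))
print('KS from scipy:      ', ks_2samp(a, b).statistic)
```

The means nearly agree, yet the KS statistic is clearly above zero because the two CDFs have different shapes.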

In [None]:
from scipy.stats import ks_2samp

In [None]:
help(ks_2samp)

In [None]:
gpA = data.loc[data['group'] == 'A', 'data']
gpB = data.loc[data['group'] == 'B', 'data']

obs = ks_2samp(gpA, gpB).statistic
obs

In [None]:
n_repetitions = 500

ks_list = []
for _ in range(n_repetitions):
    
    # shuffle the data column
    shuffled_col = (
        data['data']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        data
        .assign(**{
            'data': shuffled_col,
        })
    )
    
    # compute the KS
    grps = shuffled.groupby('group')['data']
    ks = ks_2samp(grps.get_group('A'), grps.get_group('B')).statistic
    
    ks_list.append(ks)
    

pval = np.mean(np.array(ks_list) > obs)

pd.Series(ks_list).plot(kind='hist', density=True, alpha=0.8, title='p-value: %f' % pval)

plt.scatter(obs, 0, color='red', s=40);

### The `ks_2samp` function

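* `scipy.stats.ks_2samp` actually returns *both* the statistic *and* a p-value.
* The p-value is computed from the theoretical (exact or asymptotic) null distribution of the KS statistic rather than by simulation, so it should be close to, but not identical to, the permutation-test p-value we just computed.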

In [None]:
ks_2samp(gpA, gpB)

### Part 4

# Examples of Assessing Missingness

## Summary: Diagnosing Missingness Mechanisms

### Case: NMAR

* Can you make a reasonable case that the differences between missing and not-missing rows are largely explainable via *observed* data?
    - If yes, then the missing data (column) is 'missing at random' and the missing data is 'ignorable' (when handled properly).
    - If no, then the missing data is 'not missing at random' (NMAR), or 'non-ignorable'. You must explicitly model missingness using assumptions on the data generating process.

## Summary: Diagnosing Missingness Mechanisms

### Case: MAR

* If missingness is explainable via *observed* data, then the missing data is 'missing at random' (MAR).
* The distribution of missing data may still look different than the observed data!
    - MAR requires you to understand how the missingness is dependent on other attributes in your data.
* Use permutation tests to assess the dependence of missing data on other attributes.

## Summary: Diagnosing Missingness Mechanisms

### Case: MCAR

* If missingness doesn't depend on any values in the observed data, it is 'unconditionally ignorable' (MCAR).
* MCAR is equivalent to data being MAR, without dependence on any other columns.
* If permutation tests point toward similar distributions of missing vs not-missing data, for *every* other column, then the data *may* be MCAR.
    - Caution: you can't assert the data *are* MCAR, as permutation tests don't allow you to accept the null hypothesis!
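
Checking *every* other column can be automated. The sketch below uses a hypothetical helper, `assess_missingness`, and synthetic demo data. It applies a permutation test with TVD to categorical columns and the `ks_2samp` p-value to numeric columns; that split is a choice made here for illustration, not a prescribed rule.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def tvd(a, b):
    """Total variation distance between the category frequencies of two Series."""
    pa = a.value_counts(normalize=True)
    pb = b.value_counts(normalize=True)
    return pa.subtract(pb, fill_value=0).abs().sum() / 2

def assess_missingness(df, c_test, n_repetitions=200, seed=0):
    """p-value per column for 'missingness of c_test does not depend on this column'."""
    rng = np.random.default_rng(seed)
    is_null = df[c_test].isnull().to_numpy()
    pvals = {}
    for c in df.columns.drop(c_test):
        col = df[c]
        if pd.api.types.is_numeric_dtype(col):
            # quantitative column: KS test between the null and not-null groups
            pvals[c] = ks_2samp(col[is_null].dropna(), col[~is_null].dropna()).pvalue
        else:
            # categorical column: permutation test with TVD
            observed = tvd(col[is_null], col[~is_null])
            sims = []
            for _ in range(n_repetitions):
                shuffled = pd.Series(rng.permutation(col.to_numpy()))
                sims.append(tvd(shuffled[is_null], shuffled[~is_null]))
            pvals[c] = np.mean(np.array(sims) >= observed)
    return pd.Series(pvals)

# Synthetic demo: `child` is missing more often for one gender (assumed rates)
rng = np.random.default_rng(3)
n = 1000
demo = pd.DataFrame({
    'gender': rng.choice(['male', 'female'], size=n),
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})
p_missing = np.where(demo['gender'] == 'female', 0.4, 0.1)
demo.loc[rng.uniform(size=n) < p_missing, 'child'] = np.nan

pvals = assess_missingness(demo, 'child')
print(pvals)
```

A single small p-value is enough to argue against MCAR. When many columns are tested, some small p-values will appear by chance, so the significance level should be chosen with that in mind.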

## Example: Assessing Missingness

* Data on ticketed cars: VIN, Make, Year, Color
* Does the missingness of car color depend on car year?
    * Are the distributions of year similar when color is null vs not null?
    * How similar is similar enough?
    
Use a permutation test!

In [None]:
cars = pd.read_csv('./data/cars.csv')
cars.head()

In [None]:
# proportion of car color missing
cars.car_color.isnull().mean()

In [None]:
cars['car_color_isnull'] = cars.car_color.isnull()

In [None]:
(
    cars
    .pivot_table(index='car_year', columns='car_color_isnull', values=None, aggfunc='size')
    .fillna(0)
    .apply(lambda x:x/x.sum())
    .plot(title='distribution of car years by color=missing/not missing')
);

### Example: assessing missingness of car make on color

* "Are the two distributions (missing/not missing) of car make generated from the same distribution?"
* Car make is categorical. How to measure similarity?
    - use total variation distance

In [None]:
cars['car_make'].isnull().mean()

In [None]:
cars['car_make_isnull'] = cars.car_make.isnull()

In [None]:
cars.head()

In [None]:
emp_distributions = (
    cars
    .pivot_table(columns='car_make_isnull', index='car_color', values=None, aggfunc='size')
    .fillna(0)
    .apply(lambda x:x/x.sum())
)

emp_distributions.plot(kind='bar', title='distribution of car colors');

In [None]:
observed_tvd = np.sum(np.abs(emp_distributions.diff(axis=1).iloc[:,-1])) / 2
observed_tvd

In [None]:
n_repetitions = 500

car_make_color = cars.copy()[['car_color', 'car_make_isnull']]
tvds = []
for _ in range(n_repetitions):
    
    # shuffle the colors
    shuffled_colors = (
        car_make_color['car_color']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        car_make_color
        .assign(**{'Shuffled Color': shuffled_colors})
    )
    
    # compute the tvd
    shuffed_emp_distributions = (
        shuffled
        .pivot_table(columns='car_make_isnull', index='Shuffled Color', values=None, aggfunc='size')
        .fillna(0)
        .apply(lambda x:x/x.sum())
    )
    
    tvd = np.sum(np.abs(shuffed_emp_distributions.diff(axis=1).iloc[:,-1])) / 2
    # add it to the list of results
    
    tvds.append(tvd)

In [None]:
#: visualize
pd.Series(tvds).plot(kind='hist', density=True, alpha=0.8)
plt.scatter(observed_tvd, 0, color='red', s=40);

### Example: assessing missingness in payments data

* Payment information for purchases: credit card type, credit card number, date of birth.
* Does the missingness of the credit card number depend on the type of card?

In [None]:
payments = pd.read_csv('data/payment.csv')
payments['cc_isnull'] = payments.credit_card_number.isnull()

In [None]:
payments.head()

In [None]:
emp_distributions = (
    payments
    .pivot_table(columns='cc_isnull', index='credit_card_type', aggfunc='size')
    .fillna(0)
    .apply(lambda x:x / x.sum())
)

emp_distributions.plot(kind='bar', title='distribution of card types');

In [None]:
observed_tvd = np.sum(np.abs(emp_distributions.diff(axis=1).iloc[:,-1])) / 2
observed_tvd

In [None]:
n_repetitions = 500

payments_type = payments.copy()[['credit_card_type', 'cc_isnull']]
tvds = []
for _ in range(n_repetitions):
    
    # shuffle the card types
    shuffled_types = (
        payments_type['credit_card_type']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        payments_type
        .assign(**{'Shuffled Types': shuffled_types})
    )
    
    # compute the tvd
    shuffed_emp_distributions = (
        shuffled
        .pivot_table(columns='cc_isnull', index='Shuffled Types', values=None, aggfunc='size')
        .fillna(0)
        .apply(lambda x:x/x.sum())
    )
    
    tvd = np.sum(np.abs(shuffed_emp_distributions.diff(axis=1).iloc[:,-1])) / 2
    # add it to the list of results
    
    tvds.append(tvd)

### Example: assessing missingness in payments data

* Does the missingness of the credit card number depend on the type of card?
* As always, set significance level **beforehand**:
    - How important is the column in the modeling process?
    - How many null values are there?
* Consideration: how important is a faithful imputation?

In [None]:
#: visualize
pd.Series(tvds).plot(kind='hist', density=True, alpha=0.8)
plt.scatter(observed_tvd, 0, color='red', s=40);

In [None]:
# p-value
np.count_nonzero(np.array(tvds) >= observed_tvd) / len(tvds)   # share of simulated TVDs at least as extreme as observed

### Example: assessing missingness in payments data

* Does the missingness of the credit card number depend on the age of the shopper?
* For quantitative distributions, we've compared means of two groups.

In [None]:
payments['date_of_birth'] = pd.to_datetime(payments.date_of_birth)
payments['age'] = (2019 - payments.date_of_birth.dt.year)

In [None]:
# are the distributions similar?
# Where are the differences? Are they noise, or real?
payments.groupby('cc_isnull').age.plot(kind='kde', title='distribution of ages by missingness of CC', legend=True);

In [None]:
ks_2samp?

In [None]:
ks_2samp(
    payments.groupby('cc_isnull')['age'].get_group(True),
    payments.groupby('cc_isnull')['age'].get_group(False)
)