# Aside: More Missingness Examples

This notebook provides more examples of how to identify missingness mechanisms from data.

In [None]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'

from scipy.stats import ks_2samp

# Used for plotting examples.
def create_kde_plotly(df, group_col, group1, group2, vals_col, title=''):
    fig = ff.create_distplot(
        hist_data=[df.loc[df[group_col] == group1, vals_col], df.loc[df[group_col] == group2, vals_col]],
        group_labels=[group1, group2],
        show_rug=False, show_hist=False,
        colors=['#ef553b', '#636efb'],
    )
    return fig.update_layout(title=title)

## Example: Cars 🚗

### Example: Cars

* We have data on cars that were given tickets.
* For each car, we have its `'vin'`, `'car_make'`, `'car_year'`, and `'car_color'`.
* **Question:** Is `'car_color'` missing at random, **dependent on `'car_year'`**?
    * Is the distribution of `'car_year'` similar when color is missing vs. not missing?
    * How similar is similar enough?
    
Let's use a permutation test!

In [None]:
cars = pd.read_csv(os.path.join('data', 'cars.csv'))
cars.head()

In [None]:
# Proportion of car colors missing.
cars['car_color'].isna().mean()

In [None]:
cars['color_missing'] = cars['car_color'].isna()

In [None]:
cars.head()

In [None]:
(
    cars
    .pivot_table(index='car_year', columns='color_missing', values=None, aggfunc='size')
    .fillna(0)
    .apply(lambda x: x / x.sum())
    .plot(title='Distribution of Car Years by Missingness of Color')
)

- These distributions look pretty similar. We won't run the permutation test here, but if we did, we'd fail to reject the null. It doesn't seem like the missingness of `'car_color'` depends on `'car_year'`.
- To figure out if the missingness of `'car_color'` is MCAR, we'd need to do a similar analysis for all other columns.
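
For reference, a permutation test of this kind could be sketched as follows. This is only a sketch under stated assumptions: the helper name `permutation_test_mean_diff` is hypothetical, the absolute difference in group means is just one possible test statistic, and the data is synthetic (it is not the cars dataset).

```python
import numpy as np
import pandas as pd

def permutation_test_mean_diff(df, group_col, vals_col, n_repetitions=500, seed=0):
    """Permutation test using the absolute difference in group means.

    `group_col` must be a boolean column. Returns (observed statistic, p-value).
    """
    rng = np.random.default_rng(seed)
    groups = df[group_col].to_numpy()
    vals = df[vals_col].to_numpy()

    def abs_mean_diff(labels):
        # Mean of the values where the label is True minus the mean where it is False.
        return abs(vals[labels].mean() - vals[~labels].mean())

    observed = abs_mean_diff(groups)
    # Shuffle the group labels to simulate the statistic under the null.
    simulated = np.array([
        abs_mean_diff(rng.permutation(groups)) for _ in range(n_repetitions)
    ])
    return observed, np.mean(simulated >= observed)

# Synthetic stand-in for the cars data (assumption: not the real dataset).
toy_rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'car_year': toy_rng.integers(1990, 2013, size=300),
    'color_missing': toy_rng.random(300) < 0.2,
})
observed, p_value = permutation_test_mean_diff(toy, 'color_missing', 'car_year')
```

Shuffling the missingness labels rather than the values is equivalent here; either breaks any association between the two columns.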

### Missingness of `'car_color'` on `'car_make'`

Let's test whether the missingness of `'car_color'` is dependent on `'car_make'`.

In [None]:
cars.head()

In [None]:
emp_distributions = (
    cars
    .pivot_table(index='car_make', columns='color_missing', values=None, aggfunc='size')
    .fillna(0)
    .apply(lambda x: x / x.sum())
)

# There are too many makes to plot them all at once! Instead, we'll plot the first 20 (in index order).
emp_distributions.iloc[:20].plot(kind='barh', title='Distribution of Makes by Missingness of Color', 
                                 barmode='group')

In [None]:
observed_tvd = emp_distributions.diff(axis=1).iloc[:, -1].abs().sum() / 2
observed_tvd

In [None]:
shuffled = cars[['car_make', 'color_missing']].copy()

n_repetitions = 500
tvds = []

for _ in range(n_repetitions):
    
    shuffled['car_make'] = np.random.permutation(shuffled['car_make'])
    
    pivoted = (
        shuffled
        .pivot_table(index='car_make', columns='color_missing', values=None, aggfunc='size')
        .fillna(0)
        .apply(lambda x: x / x.sum())
    )
    
    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

In [None]:
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
                   x=1.08 * observed_tvd, showarrow=False, y=0.1)
fig.update_layout(yaxis_range=[0, 0.15])

In [None]:
np.mean(np.array(tvds) >= observed_tvd)

Here, we fail to reject the null that the distribution of `'car_make'` is the same whether or not `'car_color'` is missing.
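
The test statistic used above, the total variation distance (TVD), is half the sum of the absolute differences between two categorical distributions. A minimal sketch of that computation as a reusable function is below. The name `tvd_between_groups` is hypothetical, `pd.crosstab` is swapped in for the `pivot_table` approach used in this notebook, and the two small DataFrames are made up for illustration.

```python
import pandas as pd

def tvd_between_groups(df, cat_col, group_col):
    """TVD between the distributions of `cat_col` within the two groups of `group_col`."""
    # Each column of `dists` is the distribution of `cat_col` within one group.
    dists = pd.crosstab(df[cat_col], df[group_col], normalize='columns')
    return (dists.iloc[:, 0] - dists.iloc[:, 1]).abs().sum() / 2

# Identical distributions in both groups: the TVD is 0.
same = pd.DataFrame({
    'make': ['A', 'A', 'B', 'B'],
    'missing': [True, False, True, False],
})

# Completely disjoint distributions: the TVD is 1.
disjoint = pd.DataFrame({
    'make': ['A', 'A', 'B', 'B'],
    'missing': [True, True, False, False],
})

tvd_same = tvd_between_groups(same, 'make', 'missing')
tvd_disjoint = tvd_between_groups(disjoint, 'make', 'missing')
```

The TVD always lies between 0 and 1, which makes these two toy cases useful sanity checks.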

## Example: Payments 💰

### Example: Assessing missingness in payments data

* We have payment information for purchases: credit card type, credit card number, date of birth.
* Does the missingness of the credit card number depend on the type of card?

In [None]:
payments = pd.read_csv(os.path.join('data', 'payment.csv'))
payments['cc_isnull'] = payments['credit_card_number'].isna()

In [None]:
payments.head()

In [None]:
emp_distributions = (
    payments
    .pivot_table(columns='cc_isnull', index='credit_card_type', aggfunc='size')
    .fillna(0)
    .apply(lambda x: x / x.sum())
)

emp_distributions.plot(kind='barh', title='Distribution of Card Types', barmode='group')

In [None]:
observed_tvd = emp_distributions.diff(axis=1).iloc[:, -1].abs().sum() / 2
observed_tvd

In [None]:
shuffled = payments[['credit_card_type', 'cc_isnull']].copy()

n_repetitions = 500
tvds = []

for _ in range(n_repetitions):
    
    shuffled['credit_card_type'] = np.random.permutation(shuffled['credit_card_type'])
    
    pivoted = (
        shuffled
        .pivot_table(index='credit_card_type', columns='cc_isnull', values=None, aggfunc='size')
        .fillna(0)
        .apply(lambda x: x / x.sum())
    )
    
    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

### Assessing missingness in payments data

* Does the missingness of the credit card number depend on the type of card?
* As always, set the significance level **beforehand**:
    - How important is the column in the modeling process?
    - How many null values are there?
* Consideration: how important is a faithful imputation?

In [None]:
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
                   x=0.06, showarrow=False, y=0.08)
fig.update_layout(xaxis_range=[0, 0.25])

In [None]:
# Same as np.mean(np.array(tvds) >= observed_tvd).
np.count_nonzero(np.array(tvds) >= observed_tvd) / len(tvds)

### Assessing missingness in payments data

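* Does the missingness of the credit card number depend on the age of the shopper?
* For quantitative distributions, we've compared the means of two groups. Here, we'll use the Kolmogorov-Smirnov test (`ks_2samp`), which compares the full distributions rather than just their means.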
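
To illustrate why a difference in means can miss a real difference between two groups, the sketch below uses synthetic data (an assumption for illustration, not the payments data). The two samples share the same center but have very different spreads, so their means are close while `ks_2samp` still detects that the distributions differ.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two synthetic samples with the same center but different spreads.
narrow = rng.normal(loc=40, scale=2, size=1000)
wide = rng.normal(loc=40, scale=15, size=1000)

# The difference in means is small relative to the spread of the data...
mean_diff = abs(narrow.mean() - wide.mean())

# ...but the KS test compares the empirical CDFs and picks up the difference in shape.
result = ks_2samp(narrow, wide)
ks_statistic, ks_p_value = result.statistic, result.pvalue
```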

In [None]:
payments['date_of_birth'] = pd.to_datetime(payments['date_of_birth'])
payments['age'] = (2023 - payments.date_of_birth.dt.year)

Note that the age column itself has missing values.

In [None]:
create_kde_plotly(payments[['cc_isnull', 'age']].dropna(), 'cc_isnull', True, False, 'age')

In [None]:
# Drop rows with missing ages first, so that NaNs don't affect the KS test.
ages = payments[['cc_isnull', 'age']].dropna()
ks_2samp(
    ages.loc[ages['cc_isnull'], 'age'],
    ages.loc[~ages['cc_isnull'], 'age']
)