In [None]:
import pandas as pd
import numpy as np
import os

import util

import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'

# Lecture 13 – Imputation

## DSC 80, Spring 2023

### Midterm Exam Logistics

- The Midterm Exam is **in-class, in-person on Friday, May 5th**.
- It will cover Lectures 1-13, Labs 1-5, and Projects 1-2.
- You can bring a single, two-sided note sheet.
- To review problems from old exams, go to [practice.dsc80.com](https://practice.dsc80.com).
    - Also look at the [Resources](https://dsc80.com/resources) tab on the course website.

### Agenda

- Recap: Identifying missingness mechanisms.
- Overview of imputation.
- Mean imputation.
- Probabilistic imputation.

## Recap: Identifying missingness mechanisms

### Review: Missingness mechanisms

- **Missing by design (MD)**: Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict if a value will be missing given the other columns, the data is MD.
- **Not missing at random (NMAR)**: The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR)**: The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR)**: The chance that a value is missing is **completely independent** of other columns and the actual missing value.

### Deciding between MAR and MCAR

Recall, the "[missing value flowchart](https://dsc80.com/resources/lectures/lec12/lec12.html#Flowchart)" says that we should:

- First, determine whether values are **missing by design (MD)**.

- Then, reason about whether values are **not missing at random (NMAR)**.

- Finally, decide whether values are **missing at random (MAR)** or **missing completely at random (MCAR)**.

To decide between MAR and MCAR, we can look at the data itself.

### Deciding between MAR and MCAR

- If the missingness of column $X$ is explainable via the other columns in the data, then the missing data is missing at random (MAR).
    - The distribution of missing values in column $X$ may look different than the distribution of observed data in column $X$ – that's fine, as long as the missingness can be explained solely by other columns in the data.

- If the missingness of column $X$ doesn't depend on any values in the observed data, it is missing completely at random (MCAR).
    - MCAR is equivalent to data being MAR, without dependence on any other columns.

- To decide if the missingness in column $X$ looks MCAR, for every other column, compare:
    - The distribution of the other column when $X$ is missing.
    - The distribution of the other column when $X$ is not missing.


- If this pair of distributions looks similar for every other column, then the values in column $X$ _may_ be MCAR.
    - Caution: you can't **prove** that data are MCAR, as permutation tests don't allow you to accept the null hypothesis!
    - See Lab 5, Question 4.

### Example: Heights

Today, we'll use the same `heights` dataset as we did last time.

In [None]:
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
    heights
    .rename(columns={'childHeight': 'child', 'childNum': 'number'})
    .drop('midparentHeight', axis=1)
)
heights.head()

### Example: Missingness of `'child'` heights on `'father'`'s heights (MCAR)

- **Question**: Is the missingness of `'child'` heights dependent on the `'father'` column?

- To answer, we can look at two distributions:
    - The distribution of `'father'` when `'child'` is missing.
    - The distribution of `'father'` when `'child'` is not missing.

- If the two distributions look similar, then the missingness of `'child'` looks to be independent of `'father'`.
    - To test whether two distributions look similar, we use a permutation test.

Aside: In `util.py`, there are several functions that we've created to help us with this lecture. 
- `make_mcar` takes in a dataset and intentionally drops values from a column such that they are MCAR.
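- `make_mar_on_num` and `make_mar_on_cat` do the same for MAR, with missingness that depends on a numeric or a categorical column, respectively.
- You wouldn't actually do this in practice – in practice, you'll obtain a dataset with no prior knowledge of the missingness mechanism!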
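
To make the idea of "intentionally dropping values so that they are MCAR" concrete, here is a minimal sketch of what such a helper could look like. This is **not** the code in `util.py`; the function name `make_mcar_sketch`, its `rng` parameter, and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(df, col, pct=0.5, rng=None):
    """Hypothetical stand-in for util.make_mcar: return a copy of df in which
    a fraction `pct` of the rows have `col` set to NaN, chosen uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    out = df.copy()
    n_missing = int(round(pct * len(out)))
    # Rows are chosen without looking at any column's values, which is what makes this MCAR.
    missing_idx = rng.choice(out.index.to_numpy(), size=n_missing, replace=False)
    out.loc[missing_idx, col] = np.nan
    return out

# Toy data standing in for the heights DataFrame.
toy = pd.DataFrame({'father': [70.0, 68.5, 72.0, 66.0, 69.0, 71.5],
                    'child': [69.0, 67.0, 73.0, 65.5, 68.0, 70.0]})
toy_mcar = make_mcar_sketch(toy, 'child', pct=0.5, rng=np.random.default_rng(42))
print(toy_mcar['child'].isna().mean())
```

A MAR version would instead let the probability of being dropped depend on another column, such as `'father'`.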

In [None]:
# Generating MCAR data.
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mcar.isna().mean()

### Example: Missingness of `'child'` heights on `'father'`'s heights (MCAR)

In [None]:
heights_mcar['child_missing'] = heights_mcar['child'].isna()
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MCAR example)")

- To test whether the two distributions are similar, we can use a permutation test. 


- Which test statistic should we use?

### Difference in means vs. K-S statistic

- The K-S statistic measures the difference between two numeric distributions.

- It **does not** quantify if one is larger than the other on average, so there are times we still need to use the difference in means.

- Strategy: Always plot the two distributions you are comparing.
    - If the distributions have similar shapes but are centered in different places, use the difference in means (or absolute difference in means).
    - If your alternative hypothesis involves a "direction" (e.g. smoking weights are lower on average than non-smoking weights), use the difference in means.
    - If the distributions have different shapes and your alternative hypothesis is simply that the two distributions are different, use the K-S statistic.

### Example: Missingness of `'child'` heights on `'father'`'s heights (MCAR)

In [None]:
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MCAR example)")

- Since the two distributions have slightly different shapes, but roughly the same center, we'll use the K-S statistic.

The `ks_2samp` function from `scipy.stats` computes both the K-S statistic and a p-value for us, so if we want to use the K-S statistic, we don't have to write the permutation loop ourselves!

(If we want to use the difference of means, we'd have to run a `for`-loop.)

In [None]:
# 'father' when 'child' is missing.
father_ch_mis = heights_mcar.loc[heights_mcar['child_missing'], 'father']

# 'father' when 'child' is not missing.
father_ch_not_mis = heights_mcar.loc[~heights_mcar['child_missing'], 'father']

In [None]:
from scipy.stats import ks_2samp

ks_2samp(father_ch_mis, father_ch_not_mis)

- This states that if the missingness of `'child'` is truly unrelated to the distribution of `'father'`, then the chance of seeing two distributions that are as or more different than our two observed `'father'` distributions is 16.8%.

- We fail to reject the null – it looks like the missingness of `'child'` is likely unrelated to the distribution of `'father'`.
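
`ks_2samp` obtains its p-value from the null distribution of the K-S statistic rather than by shuffling. For comparison, here is a sketch of a shuffling-based p-value that uses the K-S statistic as the test statistic. The data are synthetic stand-ins (not the `heights` data), and the sample size, seed, and number of repetitions are arbitrary choices.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins: father's heights, and an MCAR indicator for whether 'child' is missing.
father = rng.normal(69, 2.5, size=400)
child_missing = rng.random(400) < 0.5  # Ignores father's height entirely.

# Observed K-S statistic between the two 'father' distributions.
observed = ks_2samp(father[child_missing], father[~child_missing]).statistic

# Shuffle the missingness labels and recompute the statistic each time.
n_repetitions = 500
stats = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = rng.permutation(child_missing)
    stats[i] = ks_2samp(father[shuffled], father[~shuffled]).statistic

# Proportion of shuffled statistics at least as large as the observed one.
p_value = (stats >= observed).mean()
print(observed, p_value)
```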

### Discussion Question

In this MCAR example, if we were to take the mean of the `'child'` column that contains missing values, is the result likely to:

1. Overestimate the true mean?
2. Underestimate the true mean?
3. Be accurate?

In [None]:
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MCAR example)")

### Example: Missingness of `'child'` heights on `'father'`'s heights (MAR)

- **Question:** Is the missingness of `'child'` heights dependent on the `'father'` column?

- We will follow the same procedure as before. The only difference is that the missing values in our simulated data are MAR dependent on `'father'`.

In [None]:
# Generating MAR data.
heights_mar = util.make_mar_on_num(heights, 'child', 'father', pct=0.75)
heights_mar.isna().mean()

### Example: Missingness of `'child'` heights on `'father'`'s heights (MAR)

In [None]:
heights_mar['child_missing'] = heights_mar['child'].isna()
util.create_kde_plotly(heights_mar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MAR example)")

- The plot above shows that missing `'child'` heights tend to belong to children with taller fathers.

- To determine whether the two distributions are significantly different, we must use a permutation test. This time, the difference in means is a good choice, since the shapes are similar but the centers are different.
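
A permutation test with the difference in means can be sketched as follows. The data below are synthetic, and the logistic missingness mechanism (taller fathers make a missing `'child'` height more likely) is an assumption chosen only to mimic the plot; it is not how `util.make_mar_on_num` is known to work.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic father's heights, with an assumed MAR mechanism that depends on them.
father = rng.normal(69, 2.5, size=n)
prob_missing = 1 / (1 + np.exp(-(father - 69)))
child_missing = rng.random(n) < prob_missing

def diff_in_means(values, is_missing):
    # Mean of 'father' when 'child' is missing, minus the mean when it is not.
    return values[is_missing].mean() - values[~is_missing].mean()

observed = diff_in_means(father, child_missing)

n_repetitions = 1000
diffs = np.empty(n_repetitions)
for i in range(n_repetitions):
    # Shuffling the labels breaks any link between missingness and father's height.
    diffs[i] = diff_in_means(father, rng.permutation(child_missing))

# One-sided p-value: how often is a shuffled difference at least as large as the observed one?
p_value = (diffs >= observed).mean()
print(observed, p_value)
```

A small p-value here would lead us to reject the null that the missingness of `'child'` is unrelated to `'father'`.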

### Discussion Question

In this MAR example, if we were to take the mean of the `'child'` column that contains missing values, is the result likely to:

1. Overestimate the true mean?
2. Underestimate the true mean?
3. Be accurate?

In [None]:
util.create_kde_plotly(heights_mar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MAR example)")

## Handling missing values

### What do we do with missing data?

- Suppose we are interested in a dataset $Y$. 
- We get to **observe** $Y_{obs}$, while the rest of the dataset, $Y_{mis}$, is **missing**.
- Issue: $Y_{obs}$ may look quite different than $Y$.
    - The mean and other measures of central tendency may be different.
    - The variance may be different.
    - The correlations between variables may be different.

### Solution 1: Dropping missing values

- If the data are MCAR (missing completely at random), then dropping the missing values entirely doesn't significantly change the data.
    - For instance, the mean of the dataset post-dropping is an unbiased estimate of the true mean.
    - This is because the observed values in MCAR data are a **random sample** of the full dataset.
    - From DSC 10, we know that random samples tend to resemble the larger populations they are drawn from.

- **If the data are not MCAR, then dropping the missing values will introduce bias.**
    - MCAR is rare!
    - For instance, suppose we asked people "How much do you give to charity?" People who give little are less likely to respond, so the average response is **biased high**.
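
The simulation below sketches the contrast between the two cases. Everything in it is made up for illustration: the donation amounts are drawn from an exponential distribution, and the assumed non-response mechanism makes smaller donors less likely to answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

# Made-up "true" donation amounts.
donations = pd.Series(rng.exponential(scale=100, size=n))

# MCAR: every response is dropped with probability 0.5, regardless of its value.
mcar = donations.where(rng.random(n) >= 0.5)

# Assumed non-MCAR mechanism: the chance of responding grows with the amount given.
prob_respond = donations.rank(pct=True)
not_mcar = donations.where(rng.random(n) < prob_respond)

# .mean() skips missing values, so these are the means after dropping.
true_mean = donations.mean()
mcar_mean = mcar.mean()
not_mcar_mean = not_mcar.mean()
print(true_mean, mcar_mean, not_mcar_mean)
```

Under these assumptions, the MCAR mean should land near the true mean, while the mean of the non-MCAR responses should come out noticeably higher.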

### Listwise deletion

- _Listwise deletion_ is the act of dropping entire rows that contain missing values.
- Issue: This can delete perfectly good data in other columns for a given row.
    - Improvement: Drop missing data only when working with the column that contains missing data.
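
A small sketch of the difference, using a made-up four-row DataFrame: listwise deletion throws away an observed `'father'` value just because `'child'` is missing in that row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'father': [70.0, 68.0, np.nan, 72.0],
    'child':  [69.0, np.nan, 66.0, 71.0],
})

# Listwise deletion: any row with a missing value is dropped entirely.
listwise = df.dropna()
father_mean_listwise = listwise['father'].mean()  # Uses only rows 0 and 3.

# Working column by column: .mean() skips only the values missing in 'father'.
father_mean_available = df['father'].mean()       # Uses rows 0, 1, and 3.

# Dropping rows based on a single column of interest.
child_only = df.dropna(subset=['child'])

print(father_mean_listwise, father_mean_available, len(child_only))
```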

To illustrate, let's generate two datasets with missing `'child'` heights – one in which the heights are MCAR, and one in which they are MAR dependent on `'gender'` (**not** `'father'`, as in our previous example).

**In practice, you'll have to run permutation tests to determine the likely missingness mechanism first!**

In [None]:
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mar = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.5)

### Listwise deletion

Below, we compute the means and standard deviations of the `'child'` column in all three datasets. Remember, `.mean()` and `.std()` ignore missing values.

In [None]:
util.multiple_describe({
    'Original': heights,
    'MCAR': heights_mcar,
    'MAR': heights_mar
})

Observations:

- The `'child'` mean (and SD) in the MCAR dataset is very close to the true `'child'` mean (and SD).

- The `'child'` mean in the MAR dataset is biased **high**.

### Solution 2: Imputation

**Imputation** is the act of filling in missing data with plausible values. Ideally, imputation:

* is quick and easy to do.
* doesn't introduce bias into the dataset.

These are hard to do at the same time!

### Kinds of imputation

- There are three main types of imputation, two of which we will focus on today:

    - **Imputation with a single value: mean, median, mode.**
    - Imputation with a single value, using a model: regression, kNN.
    - **Probabilistic imputation by drawing from a distribution.**

- Each has upsides and downsides, and **each works differently with different types of missingness**.

## Mean imputation

### Mean imputation

- Mean imputation is the act of filling in missing values in a column with the mean of the observed values in that column.
- This strategy:
    - 👍 Preserves the mean of the observed data, for all types of missingness.
    - 👎 Decreases the variance of the data, for all types of missingness.
    - 👎 Creates a biased estimate of the true mean when the data are not MCAR.

### Example: Mean imputation in the MCAR `heights` dataset

Let's look at two distributions:
- The distribution of the `'child'` column in `heights`, where we have all the data.
- The distribution of the `'child'` column in `heights_mcar`, where some values are MCAR.

In [None]:
# Look in util.py to see how multiple_kdes is defined.
util.multiple_kdes({'Original': heights, 'MCAR, Unfilled': heights_mcar})

- Since the `'child'` heights are MCAR, the <span style='color:rgb(217,95,2)'><b> orange distribution, in which some values are missing</b></span>, has roughly the same shape as the <span style='color:rgb(27,158,119)'><b>turquoise distribution, which has no missing values</b></span>.

### Mean imputation of MCAR data

Let's fill in missing values in `heights_mcar['child']` with the mean of the observed `'child'` heights in `heights_mcar['child']`.

In [None]:
heights_mcar['child'].head()

In [None]:
heights_mcar_mfilled = heights_mcar.fillna(heights_mcar['child'].mean())
heights_mcar_mfilled['child'].head()

In [None]:
df_map = {'Original': heights, 'MCAR, Unfilled': heights_mcar, 'MCAR, Mean Imputed': heights_mcar_mfilled}
util.multiple_describe(df_map)

Observations:

- The mean of the imputed dataset is the same as the mean of the subset of heights that aren't missing (which is close to the true mean).

- The standard deviation of the imputed dataset is smaller than that of the other two datasets. **Why?**

### Mean imputation of MCAR data

Let's visualize all three distributions: the original, the MCAR heights with missing values, and the mean-imputed MCAR heights.

In [None]:
util.multiple_kdes(df_map)

**Takeaway**: When data are MCAR and you impute with the mean:
- The mean of the imputed dataset is an **unbiased estimator** of the true mean.
- The variance of the imputed dataset is smaller than the variance of the full dataset.
    - Mean imputation tricks you into thinking your data are more reliable than they are!
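
One way to see why the variance shrinks is the short derivation below. It assumes the sample variance with an $n - 1$ denominator (the default for `.std()` and `.var()` in `pandas`); the symbols $k$, $\bar{x}_{obs}$, $s^2_{obs}$, and $s^2_{imp}$ are notation introduced here. Suppose $k$ of the $n$ values are observed, with mean $\bar{x}_{obs}$ and variance $s^2_{obs}$. Filling the other $n - k$ values with $\bar{x}_{obs}$ leaves the mean unchanged, and each filled value contributes $0$ to the sum of squared deviations, so

$$s^2_{imp} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x}_{obs})^2 = \frac{1}{n - 1} \sum_{i \text{ observed}} (x_i - \bar{x}_{obs})^2 = \frac{k - 1}{n - 1} \, s^2_{obs}$$

Since $k < n$, the imputed variance is always smaller than the observed variance. For example, when about half of the values are missing, $\frac{k - 1}{n - 1} \approx 0.5$, so the standard deviation shrinks by a factor of roughly $\sqrt{0.5} \approx 0.71$.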

### Example: Mean imputation in the MAR `heights` dataset

- When data are MAR, mean imputation leads to biased estimates of the mean across groups.

- The bias may be different in different groups.
    - For example: If the missingness depends on gender, then different genders will have differently-biased means.
    - The overall mean will be biased towards one group.

- Again, let's look at two distributions:
    - The distribution of the `'child'` column in `heights`, where we have all the data.
    - The distribution of the `'child'` column in `heights_mar`, where some values are MAR.

In [None]:
util.multiple_kdes({'Original': heights, 'MAR, Unfilled': heights_mar})

The distributions are not very similar!

Remember that in reality, you won't get to see the <span style='color:rgb(27,158,119)'><b>turquoise distribution, which has no missing values</b></span> – instead, you'll try to recreate it, using your sample with missing values.

### Mean imputation of MAR data

Let's fill in missing values in `heights_mar['child']` with the mean of the observed `'child'` heights in `heights_mar['child']` and see what happens.

In [None]:
heights_mar['child'].head()

In [None]:
heights_mar_mfilled = heights_mar.fillna(heights_mar['child'].mean())
heights_mar_mfilled['child'].head()

In [None]:
df_map = {'Original': heights, 'MAR, Unfilled': heights_mar, 'MAR, Mean Imputed': heights_mar_mfilled}
util.multiple_describe(df_map)

Note that the latter two means are biased **high**.

### Mean imputation of MAR data

Let's visualize all three distributions: the original, the MAR heights with missing values, and the mean-imputed MAR heights.

In [None]:
util.multiple_kdes(df_map)

Since the sample with MAR values was already biased high, mean imputation kept the sample biased – it did not bring the data **closer to the data generating process**.

With our single mean imputation strategy, the resulting female mean height is biased quite high.

In [None]:
pd.concat([
    heights.groupby('gender')['child'].mean().rename('Original'),
    heights_mar.groupby('gender')['child'].mean().rename('MAR, Unfilled'),
    heights_mar_mfilled.groupby('gender')['child'].mean().rename('MAR, Mean Imputed')
], axis=1).T

### Within-group (conditional) mean imputation

* **Improvement:** Since MAR data are MCAR within each group, we can perform group-wise mean imputation.
    - In our case, since the missingness of `'child'` is dependent on `'gender'`, we can impute separately for each `'gender'`.
    - For instance, if there is a missing `'child'` height for a `'female'` child, impute their height with the mean observed `'female'` height.

- With this technique, the overall mean remains unbiased, as do the within-group means.

- Like with "single" mean imputation, the variance of the dataset is reduced.

### `transform` returns!

- In MAR data, imputation by the overall mean gives a biased estimate of the mean of each group. 
- To obtain an unbiased estimate of the mean within each group, impute using the mean within each group.
- To perform an operation separately for each gender, we `groupby('gender')` and use the `transform` method.

In [None]:
def mean_impute(ser):
    return ser.fillna(ser.mean())

heights_mar_cond = heights_mar.groupby('gender')['child'].transform(mean_impute).to_frame()
heights_mar_cond['child'].head()

In [None]:
df_map['MAR, Conditional Mean Imputed'] = heights_mar_cond
util.multiple_kdes(df_map)

The <span style='color:rgb(231,41,138)'><b>pink distribution</b></span> does a better job of approximating the <span style='color:rgb(27,158,119)'><b>turquoise distribution</b></span> than the <span style='color:rgb(117,112,179)'><b>purple distribution</b></span>.

### Conclusion: Imputation with single values

- Imputing missing data in a column with the mean of the column:
    - faithfully reproduces the mean of the observed dataset,
    - reduces the variance, and
    - biases relationships between the column and other columns if the data are not MCAR.
    
- The same is true with other statistics (e.g. median and mode).

## Probabilistic imputation

### Imputing missing values using distributions

- So far, each missing value in a column has been filled in with a constant value.
    - This creates "spikes" in the imputed distributions.

- **Idea**: We can **probabilistically** impute missing data from a distribution.
    - We can fill in missing data by drawing from the distribution of the **non-missing** data.
    - There are 5 missing values? Pick 5 values from the data that aren't missing.
    - How? Using `np.random.choice` or `.sample`.

### Example: Probabilistic imputation in the MCAR `heights` dataset

Step 1: Determine the number of missing values in the column of interest.

In [None]:
num_null = heights_mcar['child'].isna().sum()
num_null

Step 2: Sample that number of values from the observed values in the column of interest.

In [None]:
fill_values = np.random.choice(heights_mcar['child'].dropna(), num_null)

Step 3: Fill in the missing values with the sample from Step 2.

In [None]:
heights_mcar_pfilled = heights_mcar.copy()
heights_mcar_pfilled.loc[heights_mcar_pfilled['child'].isna(), 'child'] = fill_values

Let's look at the results.

In [None]:
df_map = {'Original': heights, 
          'MCAR, Unfilled': heights_mcar, 
          'MCAR, Probabilistically Imputed': heights_mcar_pfilled}

In [None]:
util.multiple_describe(df_map)

Variance is preserved!

In [None]:
util.multiple_kdes(df_map)

No spikes!

### Observations

- With this technique, the missing values were filled in with observed values in the dataset.

- If a value was never observed in the dataset, it will never be used to fill in a missing value.
    - For instance, if the observed heights were 68, 69, and 69.5 inches, we will never fill a missing value with 68.5 inches even though it's a perfectly reasonable height.

- Solution? Create a histogram (with `np.histogram`) to bin the data, then sample from the histogram.
    - See Lab 5, Question 6.

- **Question**: How would we generalize this process for MAR data?
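
The histogram-based idea mentioned above can be sketched as follows. This is one possible approach, not necessarily the one expected in the lab: pick a bin for each missing value with probability proportional to the bin's count, then draw uniformly within that bin. The function name `impute_from_histogram` and its parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def impute_from_histogram(ser, bins=10, rng=None):
    """Fill missing values by sampling from a histogram of the observed values."""
    rng = np.random.default_rng() if rng is None else rng
    ser = ser.copy()
    is_missing = ser.isna()
    num_null = is_missing.sum()

    # Bin the observed values.
    counts, edges = np.histogram(ser.dropna(), bins=bins)

    # Choose a bin for each missing value, proportional to how many observed values it holds.
    chosen = rng.choice(len(counts), size=num_null, p=counts / counts.sum())

    # Draw uniformly between the left and right edges of each chosen bin.
    ser[is_missing] = rng.uniform(edges[chosen], edges[chosen + 1])
    return ser

heights_toy = pd.Series([68.0, np.nan, 69.0, 69.5, np.nan, np.nan])
filled = impute_from_histogram(heights_toy, bins=3, rng=np.random.default_rng(3))
print(filled)
```

With this approach, the filled-in values fall within the range of the observed heights but need not equal any observed height exactly.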

### Randomness

- Unlike mean imputation, probabilistic imputation is **random** – each time you run the cell in which imputation is performed, the results could be different.

- If we're interested in estimating some population **parameter** given our (incomplete) sample, it's best not to rely on just a single random imputation.

- **Multiple imputation**: Generate multiple imputed datasets and aggregate the results!
    - Similar to bootstrapping.

### Multiple imputation

Steps:

0. Start with observed and incomplete data. 

1. Create $m$ **imputed** versions of the data through a probabilistic procedure.
    - The imputed datasets are identical for the observed data entries.
    - They differ in the imputed values. 
    - The differences reflect our **uncertainty** about what value to impute.

2. Then, compute parameter estimates on **each** imputed dataset.
    - For instance, the mean, standard deviation, median, etc.

3. Finally, pool the $m$ parameter estimates into one estimate.

### Multiple imputation

Let's try this procedure out on the `heights_mcar` dataset.

In [None]:
heights_mcar.head()

In [None]:
# This function implements the 3-step process we studied earlier.
def create_imputed(col):
    col = col.copy()
    num_null = col.isna().sum()
    fill_values = np.random.choice(col.dropna(), num_null)
    col[col.isna()] = fill_values
    return col

Each time we run the following cell, it generates a new imputed version of the `'child'` column.

In [None]:
create_imputed(heights_mcar['child']).head()

Let's run the above procedure 100 times.

In [None]:
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()

Let's plot some of the imputed columns from the previous slide.

In [None]:
# Random sample of 15 imputed columns.
mult_imp_sample = mult_imp.sample(15, axis=1)
fig = ff.create_distplot(mult_imp_sample.to_numpy().T, list(mult_imp_sample.columns), show_hist=False, show_rug=False)
fig.update_xaxes(title='child')

Let's look at the distribution of means across the imputed columns.

In [None]:
px.histogram(pd.DataFrame(mult_imp.mean()), nbins=15, histnorm='probability',
             title='Distribution of Imputed Sample Means')

## Summary, next time

### Summary of imputation techniques

* Listwise deletion.
* Mean imputation.
* Group-wise (conditional) mean imputation.
* Probabilistic imputation.
* Multiple imputation.

### Summary: Listwise deletion

* Procedure: `df = df.dropna()`.
* If data are MCAR, listwise deletion doesn't change most summary statistics (mean, median, SD) of the data.

### Summary: Mean imputation 

* Procedure: `df[col] = df[col].fillna(df[col].mean())`.
* If data are MCAR, the resulting mean is an unbiased estimate of the true mean, but the variance is too low.
* Analogue for categorical data: imputation with the mode.

### Summary: Conditional mean imputation

* Procedure: for a column `c1`, conditional on a second categorical column `c2`:

```py
# Fill each missing value in c1 with the mean of c1 within its c2 group.
imputed = df.groupby('c2')['c1'].transform(lambda ser: ser.fillna(ser.mean()))
```

* If data are MAR, the resulting mean is an unbiased estimate of the true mean, but the variance is too low.
* This increases correlations between the columns.
* If the missingness of the column depends on *more than one* column, we can use linear regression to predict the missing values.

### Summary: Probabilistic imputation

* Procedure: draw from the distribution of **observed data** to fill in missing values.
* If data are MCAR, the resulting mean and variance are unbiased estimates of the true mean and variance.
* Extending to the MAR case: draw from **conditional empirical distributions**.
    - If the missingness depends on a single categorical column `c2`, apply the MCAR procedure separately within each group of `df.groupby('c2')`.
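
A minimal sketch of the conditional version, assuming a toy DataFrame with a `'gender'` column: each missing `'child'` value is filled with a random draw from the observed `'child'` values in the same group. The helper name `prob_impute` and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

def prob_impute(ser):
    # Fill missing values with random draws from the observed values of the same Series.
    ser = ser.copy()
    is_missing = ser.isna()
    ser[is_missing] = rng.choice(ser.dropna().to_numpy(), size=is_missing.sum())
    return ser

df = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male', 'male'],
    'child': [62.0, np.nan, 64.0, 70.0, 72.0, np.nan],
})

# transform applies prob_impute to the 'child' values of each gender separately.
df['child_imputed'] = df.groupby('gender')['child'].transform(prob_impute)
print(df)
```

Note that this sketch would fail for a group with no observed values at all, since there would be nothing to draw from.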

### Summary: Multiple imputation

* Procedure:
    - Apply probabilistic imputation multiple times, resulting in $m$ imputed datasets.
    - Compute statistics separately on the $m$ imputed datasets (e.g. compute the mean or correlation coefficient).
    - Plot the distribution of these statistics and create confidence intervals.
* If a column is missing conditional on multiple columns, your "multiple imputations" should include probabilistic imputations for each!
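
The procedure above can be sketched end to end on synthetic data. The data, the seed, the choice of $m = 100$, and the simple percentile interval are assumptions for illustration; the interval here reflects only the spread across imputations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic stand-in for a column with MCAR missing values.
child = pd.Series(rng.normal(67, 3.5, size=300))
child[rng.random(300) < 0.5] = np.nan

def create_imputed(col, rng):
    # One probabilistic imputation: fill missing values with draws from the observed values.
    col = col.copy()
    is_missing = col.isna()
    col[is_missing] = rng.choice(col.dropna().to_numpy(), size=is_missing.sum())
    return col

# Steps 1 and 2: create m imputed versions and compute the statistic on each.
m = 100
imputed_means = np.array([create_imputed(child, rng).mean() for _ in range(m)])

# Step 3: pool the m estimates into one, and summarize their spread.
pooled_mean = imputed_means.mean()
lower, upper = np.percentile(imputed_means, [2.5, 97.5])
print(pooled_mean, (lower, upper))
```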

### Next time

- Introduction to HTTP.
- Making requests.