# Missing Data Frequency Estimation

Missing data is a common occurrence in data analysis. In this notebook we will demonstrate
how to estimate contingency table frequencies when some of the records are incomplete.

In [1]:
# The usual suspects.
%matplotlib widget
import pandas as pd
import matplotlib.pyplot as mp
import numpy as np
import ipywidgets as wg
import scipy.stats as st

## Data Load
For this exercise we are using the Acute Bacterial Meningitis dataset available from the
Vanderbilt Department of Biostatistics [data page](https://hbiostat.org/data/).

Local copy of the data:
* File `Data\abm.xlsx`
* Sheet `abm`

After loading we decode the numerical levels of the categories into succinctly named fields,
and add indicator fields for missing values.
* `pathogen` - `missing`, `bacterial`, `viral`
* `gender` - `missing`, `female`, `male`

In [2]:
# Load our data
abmsource = pd.read_excel(
    "Data\\abm.xlsx",
    sheet_name = "abm"
)

# Recode values and add missing indicator
pathogen = ["viral", "bacterial", "missing"]
abmsource["pathogen"] = abmsource["abm"].fillna(2).astype(int).apply(lambda p: pathogen[p])
abmsource["pathogenmissing"] = abmsource["abm"].isna()
abmsource["gender"] = abmsource["sex"].fillna("missing")
abmsource["gendermissing"] = abmsource["sex"].isna()
abmtarget = abmsource[["pathogen", "pathogenmissing", "gender", "gendermissing"]]

## Actual Contingency Table
Because missing values are recoded as an explicit `missing` level, we can cross tabulate
gender against pathogen with `missing` as the first row and the first column.

In [3]:
# Actual record counts with missing values
missingactual = st.contingency.crosstab(
    
    # Gender is the rows
    abmtarget["gender"],
    
    # Diagnosis in the columns
    abmtarget["pathogen"],

    # Reorder the levels
    levels = [
        ["missing", "female", "male"],
        ["missing", "bacterial", "viral"]
    ]
)

# Row and column margins
missingmargins = st.contingency.margins(missingactual.count)

# Output
print("Rows")
print(missingactual.elements[0])
print(missingmargins[0].reshape(1, -1))
print("\nColumns")
print(missingactual.elements[1])
print(missingmargins[1].reshape(1, -1))
print("\nActual")
print(missingactual.count)

Rows
['missing', 'female', 'male']
[[ 81 221 279]]

Columns
['missing', 'bacterial', 'viral']
[[ 80 217 284]]

Actual
[[  0   0  81]
 [ 32 104  85]
 [ 48 113 118]]


## Omnibus Test
The omnibus test treats `missing` as a level of each variable and compares complete
independence between gender and pathogen, missing values included, to some form of dependence.
A small tail probability is evidence against independence between the observed values and the
missing values.

In [4]:
# Pearson's chi-squared test. Counts are large enough that we do not need the correction.
missingtest = st.chi2_contingency(
    missingactual.count,
    correction = False
)

# Output
print("Test")
print(f"Statistic: {missingtest.statistic}")
print(f"P-Value: {missingtest.pvalue}")
print(f"Degrees of Freedom: {missingtest.dof}")

Test
Statistic: 100.88537166563069
P-Value: 6.372994233826082e-21
Degrees of Freedom: 4
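
To make the origin of these numbers concrete, the sketch below recomputes the statistic by hand from the counts printed in the Actual table, using only the standard library. The names `observed`, `expected`, `statistic` and `tail` are illustrative and do not appear elsewhere in the notebook. The closed-form tail probability is an assumption that holds only for 4 degrees of freedom.

```python
import math

# Counts copied from the "Actual" table above
# (rows: missing, female, male; columns: missing, bacterial, viral).
observed = [
    [0, 0, 81],
    [32, 104, 85],
    [48, 113, 118],
]

# Margins and grand total
total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected counts under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper tail of a chi-squared variable; this closed form is valid only for dof = 4
tail = math.exp(-statistic / 2) * (1 + statistic / 2)

print(f"Statistic: {statistic}")
print(f"P-Value: {tail}")
print(f"Degrees of Freedom: {dof}")
```

The printed values should agree with the `st.chi2_contingency` output above. Every expected count exceeds 11 here, which is consistent with skipping the continuity correction.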


In [5]:
# Number of complete records (neither gender nor pathogen missing)
completecount = np.sum(missingactual.count[1:3, 1:3])

# Display
print(f"Complete Records: {completecount}")
print("\nJoint estimate conditioned on complete records")
print(100 * missingactual.count[1:3, 1:3] / completecount)

Complete Records: 420

Joint estimate conditioned on complete records
[[24.76190476 20.23809524]
 [26.9047619  28.0952381 ]]


## Expectation Maximization
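Update steps. Index 0 denotes `missing`; rows are gender (1 = female, 2 = male) and columns are
pathogen (1 = bacterial, 2 = viral). So $n_{10}$ counts females with a missing pathogen,
$n_{01}$ counts bacterial cases with a missing gender, and $n_{00}$ counts records missing both.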
$$
\begin{array}{rl}
\operatorname{\hat{\mathbb{P}}^{(i+1)}_{11}} & = \displaystyle\frac{1}{n_\text{tot}} \cdot \left(
    n_{11} + 
    n_{10} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{11}}}
    {\operatorname{\hat{\mathbb{P}}^{(i)}_{11}} + \operatorname{\hat{\mathbb{P}}^{(i)}_{12}}} +
    n_{01} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{11}}}
    {\operatorname{\hat{\mathbb{P}}^{(i)}_{11}} + \operatorname{\hat{\mathbb{P}}^{(i)}_{21}}} +
    n_{00} \cdot \operatorname{\hat{\mathbb{P}}^{(i)}_{11}}
\right)\\\\
\operatorname{\hat{\mathbb{P}}^{(i+1)}_{12}} & = \displaystyle\frac{1}{n_\text{tot}} \cdot \left(
    n_{12} +
    n_{10} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{12}}}
    {\operatorname{\hat{\mathbb{P}}^{(i)}_{11}} + \operatorname{\hat{\mathbb{P}}^{(i)}_{12}}} +
    n_{02} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{12}}}
    {1 - \operatorname{\hat{\mathbb{P}}^{(i)}_{11}} - \operatorname{\hat{\mathbb{P}}^{(i)}_{21}}} +
    n_{00} \cdot \operatorname{\hat{\mathbb{P}}^{(i)}_{12}}
\right)\\\\
\operatorname{\hat{\mathbb{P}}^{(i+1)}_{21}} & = \displaystyle\frac{1}{n_\text{tot}} \cdot \left(
    n_{21} +
    n_{20} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{21}}}
    {1 - \operatorname{\hat{\mathbb{P}}^{(i)}_{11}} - \operatorname{\hat{\mathbb{P}}^{(i)}_{12}}} +
    n_{01} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{21}}}
    {\operatorname{\hat{\mathbb{P}}^{(i)}_{11}} + \operatorname{\hat{\mathbb{P}}^{(i)}_{21}}} +
    n_{00} \cdot \operatorname{\hat{\mathbb{P}}^{(i)}_{21}}
\right)
\end{array}
$$
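
The array above lists three of the four cells. Assuming the same pattern carries over, the remaining cell would be updated as sketched below. Because the four numerators add up to $n_\text{tot}$, this is the same as subtracting the other three updated values from one.

$$
\begin{array}{rl}
\operatorname{\hat{\mathbb{P}}^{(i+1)}_{22}} & = \displaystyle\frac{1}{n_\text{tot}} \cdot \left(
    n_{22} +
    n_{20} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{22}}}
    {1 - \operatorname{\hat{\mathbb{P}}^{(i)}_{11}} - \operatorname{\hat{\mathbb{P}}^{(i)}_{12}}} +
    n_{02} \cdot \frac{\operatorname{\hat{\mathbb{P}}^{(i)}_{22}}}
    {1 - \operatorname{\hat{\mathbb{P}}^{(i)}_{11}} - \operatorname{\hat{\mathbb{P}}^{(i)}_{21}}} +
    n_{00} \cdot \operatorname{\hat{\mathbb{P}}^{(i)}_{22}}
\right)\\\\
& = 1 - \operatorname{\hat{\mathbb{P}}^{(i+1)}_{11}} - \operatorname{\hat{\mathbb{P}}^{(i+1)}_{12}} - \operatorname{\hat{\mathbb{P}}^{(i+1)}_{21}}
\end{array}
$$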

In [6]:
# Count the total records once
totalrecords = np.sum(missingactual.count)

# Update our estimated joint probabilities with the actual data and the
# previous estimate
def emupdate(oldestimate):

    # Initialize return
    newestimate = np.zeros(4)

    # Estimate Female, Bacterial frequency
    newestimate[0] = (
        missingactual.count[1, 1] +
        missingactual.count[1, 0] * oldestimate[0] / (oldestimate[0] + oldestimate[1]) +
        missingactual.count[0, 1] * oldestimate[0] / (oldestimate[0] + oldestimate[2]) +
        missingactual.count[0, 0] * oldestimate[0]
    ) / totalrecords

    # Estimate Female, Viral frequency
    newestimate[1] = (
        missingactual.count[1, 2] +
        missingactual.count[1, 0] * oldestimate[1] / (oldestimate[0] + oldestimate[1]) +
        missingactual.count[0, 2] * oldestimate[1] / (1 - oldestimate[0] - oldestimate[2]) +
        missingactual.count[0, 0] * oldestimate[1]
    ) / totalrecords

    # Estimate Male, Bacterial frequency
    newestimate[2] = (
        missingactual.count[2, 1] +
        missingactual.count[2, 0] * oldestimate[2] / (1 - oldestimate[0] - oldestimate[1]) +
        missingactual.count[0, 1] * oldestimate[2] / (oldestimate[0] + oldestimate[2]) +
        missingactual.count[0, 0] * oldestimate[2]
    ) / totalrecords

    # Last value (Male, Viral) is 1 minus the sum of the other three
    newestimate[3] = 1 - np.sum(newestimate[0:3])

    # Send
    return newestimate


In [12]:
# Initial estimate of the joint probabilities from the complete data
initialjoint = missingactual.count[1:3, 1:3] / completecount
initialestimate = np.zeros(4)
initialestimate[0] = initialjoint[0, 0]
initialestimate[1] = initialjoint[0, 1]
initialestimate[2] = initialjoint[1, 0]
initialestimate[3] = initialjoint[1, 1]

# Display the start
print("Step 0")
print(initialestimate)

# Update the estimate
for i in range(1, 15):
    initialestimate = emupdate(initialestimate)
    print(f"Step {i}")
    print(initialestimate)

Step 0
[0.24761905 0.20238095 0.26904762 0.28095238]
Step 1
[0.20930889 0.22944542 0.23490623 0.32633946]
Step 2
[0.20527656 0.23265689 0.22907079 0.33299575]
Step 3
[0.20481868 0.23290226 0.22816256 0.3341165 ]
Step 4
[0.2047736  0.23286952 0.22801639 0.33434049]
Step 5
[0.20477251 0.23284325 0.22799028 0.33439396]
Step 6
[0.20477398 0.23283258 0.22798481 0.33440863]
Step 7
[0.20477471 0.23282882 0.22798346 0.33441301]
Step 8
[0.20477498 0.23282757 0.22798308 0.33441437]
Step 9
[0.20477507 0.23282716 0.22798297 0.33441481]
Step 10
[0.2047751  0.23282702 0.22798293 0.33441494]
Step 11
[0.20477511 0.23282698 0.22798292 0.33441499]
Step 12
[0.20477512 0.23282696 0.22798292 0.334415  ]
Step 13
[0.20477512 0.23282696 0.22798291 0.33441501]
Step 14
[0.20477512 0.23282696 0.22798291 0.33441501]
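
The run above uses a fixed number of steps. As a hedged alternative, the sketch below writes the same update as a loop over the 2x2 cells and stops once the largest change falls below a tolerance. It uses plain Python lists so that it runs without the notebook's state. The names `em_step` and `em_fit`, and the `tolerance` and `max_steps` defaults, are assumptions made for illustration.

```python
# Counts copied from the "Actual" table; index 0 is `missing` on both axes.
counts = [
    [0, 0, 81],
    [32, 104, 85],
    [48, 113, 118],
]


def em_step(counts, p):
    """One EM update of the 2x2 joint probabilities p (rows: gender, columns: pathogen)."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in p]
    col_sums = [sum(col) for col in zip(*p)]
    new = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            new[i][j] = (
                counts[i + 1][j + 1]                        # complete records
                + counts[i + 1][0] * p[i][j] / row_sums[i]  # pathogen missing
                + counts[0][j + 1] * p[i][j] / col_sums[j]  # gender missing
                + counts[0][0] * p[i][j]                    # both missing
            ) / total
    return new


def em_fit(counts, tolerance=1e-10, max_steps=1000):
    """Iterate from the complete-record estimate until the largest change is below tolerance."""
    complete = sum(counts[i][j] for i in (1, 2) for j in (1, 2))
    p = [[counts[i][j] / complete for j in (1, 2)] for i in (1, 2)]
    for step in range(1, max_steps + 1):
        new = em_step(counts, p)
        change = max(abs(new[i][j] - p[i][j]) for i in range(2) for j in range(2))
        p = new
        if change < tolerance:
            break
    return p, step


estimate, steps = em_fit(counts)
print(f"Converged after {steps} steps")
for row in estimate:
    print(row)
```

Dividing by the row and column sums of `p` is equivalent to the `1 - ...` denominators in `emupdate`, since the four probabilities sum to one. The converged values should match the last steps printed above, read row by row.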
