# Missingness example

In [None]:
import pandas as pd
import numpy as np
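# Explicit imports make it clear where each name used below comes from.
from pace.missingness import Missingness, Col, heatmap_data, value_bar_chart_data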

In [None]:
df = pd.read_csv("../data/test_data_merged_10000.csv", low_memory=False)

In [None]:
df.tail()

## Construction from a dataframe

In [None]:
m = Missingness.from_data_frame(df)
display(m)

## Counting missingness patterns

`Missingness.combination_counts()` returns each distinct missingness pattern, along with the number of records that match the pattern (in the final column, `_count`).

In [None]:
m.combination_counts().head()
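
To make the idea of a missingness pattern concrete, the sketch below reproduces the counting step with plain pandas on a small made-up frame. It is not how `pace` implements `combination_counts`; the `toy` frame and the variable names are assumptions for illustration, and only the `_count` column name is borrowed from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
toy = pd.DataFrame({
    "DIAG_01": ["A", np.nan, "B", np.nan],
    "DIAG_02": [np.nan, np.nan, "C", np.nan],
})

# One boolean per cell: True where the value is missing.
mask = toy.isna()

# Rows of the mask that are identical share a missingness pattern,
# so grouping on every mask column and taking the group size counts them.
pattern_counts = (
    mask.groupby(list(mask.columns))
    .size()
    .reset_index(name="_count")
)
print(pattern_counts)
```

Each row of `pattern_counts` is one distinct pattern, and the `_count` values sum to the number of records in `toy`.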

`heatmap_data` is a convenience function that builds the heatmap data (based on `combination_counts`):

In [None]:
heatmap_data(m)

A helper function (based on `matches`, described below) extracts the missingness counts for each column:

In [None]:
value_bar_chart_data(m)
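
For comparison, per-column missingness counts can be computed directly in pandas. This is a minimal sketch on a made-up frame, not the implementation of `value_bar_chart_data`, whose output format may well differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
toy = pd.DataFrame({
    "DIAG_01": ["A", np.nan, "B", np.nan],
    "DIAG_02": [np.nan, np.nan, "C", np.nan],
    "Key": [1, 2, 3, 4],
})

# isna() gives a boolean frame; summing each column counts the True cells.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```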

## Selecting a subset of records by missingness pattern

The argument to `select_combinations` is a list of `pattern_key` values from the dataframe above. These could be determined by an interactive selection.

In [None]:
m.select_combinations([2,5]).combination_counts()
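
The selection by pattern key can be pictured with plain pandas as follows. This is only a sketch under assumptions: the keys here are numbered by pandas (`ngroup`, in sorted order of the mask values), which need not agree with the `pattern_key` values that `pace` assigns, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
toy = pd.DataFrame({
    "DIAG_01": ["A", np.nan, "B", np.nan],
    "DIAG_02": [np.nan, np.nan, "C", np.nan],
})

mask = toy.isna()

# Label every record with an integer key identifying its missingness pattern.
pattern_key = mask.groupby(list(mask.columns)).ngroup()

# Keep only the records whose pattern is one of the chosen keys.
selected = toy[pattern_key.isin([0, 2])]
print(selected)
```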

## Selecting a subset of columns

This will return a new Missingness object, based on the column selection. Patterns that are distinct before the selection may need to be merged, so `pattern_key` is reset.

In [None]:
m.select_columns(["Key", "Num_DIAG", "DIAG_01", "DIAG_02", "DIAG_03"]).combination_counts()
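
The merging of patterns after a column selection can be seen in a small pandas sketch. The `toy` frame and the `count_patterns` helper are hypothetical and stand in for the counting step only; they are not part of `pace`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
toy = pd.DataFrame({
    "DIAG_01": ["A", np.nan, "B", np.nan],
    "DIAG_02": [np.nan, np.nan, "C", "D"],
    "DIAG_03": [np.nan, "E", np.nan, np.nan],
})


def count_patterns(frame):
    """Count the distinct missingness patterns over the columns of `frame`."""
    mask = frame.isna()
    return mask.groupby(list(mask.columns)).size().reset_index(name="_count")


# Over all three columns every record has its own pattern.
all_columns = count_patterns(toy)

# Restricted to one column, patterns that differed only elsewhere collapse together.
subset = count_patterns(toy[["DIAG_01"]])

print(len(all_columns), len(subset))
```

Because several of the original patterns collapse into one, any keys assigned before the selection no longer identify the patterns afterwards, which is why they have to be reassigned.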

## Selecting a subset of records explicitly

In [None]:
# restrict to the first 30 records in the data frame
m.select_records(range(30)).combination_counts()

In [None]:
m.records()

## Describing missingness patterns

Particular missingness patterns can be described and matched as follows. For example, the expression below describes any pattern in which both `DIAG_02` and `OPDATE_02` are missing (ignoring the other columns). This gives a convenient way of representing any intersection.

In [None]:
Col("DIAG_02") & Col("OPDATE_02")
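
One way such composable descriptions could work is sketched below with a small hypothetical class, `MissingIn`. It is an assumption made for illustration and not the `Col` class from `pace`; it only shows how `&` can combine two descriptions into their intersection.

```python
import numpy as np
import pandas as pd


class MissingIn:
    """Hypothetical description: all of the named columns are missing."""

    def __init__(self, *columns):
        self.columns = frozenset(columns)

    def __and__(self, other):
        # Intersection of two descriptions: the columns of both must be missing.
        return MissingIn(*(self.columns | other.columns))

    def mask(self, frame):
        # True for rows where every named column is missing; other columns are ignored.
        return frame[sorted(self.columns)].isna().all(axis=1)


# Hypothetical toy data; NaN marks a missing value.
toy = pd.DataFrame({
    "DIAG_02": [np.nan, "B", np.nan],
    "OPDATE_02": [np.nan, np.nan, "2020-01-01"],
    "Key": [1, 2, 3],
})

both = MissingIn("DIAG_02") & MissingIn("OPDATE_02")
row_mask = both.mask(toy)
print(row_mask)
```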

A description can be used to select individual records from the dataframe. The index of the dataframe, `pattern_key`, refers to the particular missingness pattern, and `_index` is the index in the original dataframe.

In [None]:
# Use the same description as above: both DIAG_02 and OPDATE_02 missing.
matches = np.sort(m.matches(Col("DIAG_02") & Col("OPDATE_02")))
display(matches)

Extracting the matching records from the original dataframe:

In [None]:
df.loc[matches]
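
The match-then-extract step has a direct pandas equivalent, sketched here on a made-up frame with a non-default index to show that the result holds index labels rather than positions. This is an illustration under assumptions, not the implementation of `Missingness.matches`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with explicit index labels; NaN marks a missing value.
toy = pd.DataFrame(
    {
        "DIAG_02": [np.nan, "B", np.nan, np.nan],
        "OPDATE_02": [np.nan, np.nan, "2020-01-01", np.nan],
    },
    index=[10, 11, 12, 13],
)

# Rows where both columns are missing.
row_mask = toy["DIAG_02"].isna() & toy["OPDATE_02"].isna()

# Sorted index labels of the matching rows.
matching_labels = np.sort(toy.index[row_mask].to_numpy())

# .loc looks rows up by label, so the labels can be passed straight back in.
matched = toy.loc[matching_labels]
print(matched)
```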