# Missingness example

In [None]:
import pandas as pd
import numpy as np
# import only the names used in this notebook
from pace.missingness import Missingness, Col

In [None]:
df = pd.read_csv("../examples/datasets/Nutrients.csv", low_memory=False)

In [None]:
df.tail()

## Construction from a dataframe

In [None]:
m = Missingness.from_data_frame(df)
display(m)

## Show distinct missingness patterns

`Missingness.combinations()` returns a dataframe with one row per distinct missingness pattern.

In [None]:
m.combinations()

`Missingness.count_combinations()` returns the same patterns, plus the number of records matching each pattern in the final column, `_count`.

In [None]:
m.count_combinations()
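
For intuition, a table of this kind can be approximated in plain pandas. The sketch below is not how `pace.missingness` is implemented; it uses a small hypothetical `toy` dataframe (not the Nutrients data) and only standard pandas calls. The `_count` column name is borrowed from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "food": ["a", "b", "c", "d", "e"],
    "protein": [1.0, np.nan, 2.0, np.nan, 3.0],
    "fiber": [np.nan, np.nan, 0.5, np.nan, 0.1],
})

# True where a value is missing
mask = toy.isna()

# One row per distinct pattern of missing columns, with its record count
pattern_counts = (
    mask.groupby(list(mask.columns)).size().reset_index(name="_count")
)
print(pattern_counts)
```

Each row of `pattern_counts` is one distinct combination of missing and present columns, which is the idea behind the two methods above.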

## Selecting a subset of records by missingness pattern

The argument to `select_combinations()` is a list of `pattern_key` values from the dataframe above. These could come from an interactive selection.

In [None]:
m.select_combinations([2,5]).count_combinations()
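
As a rough illustration of what selecting by pattern means, the following plain-pandas sketch labels each record with a pattern number and keeps the records whose pattern is in a chosen list. The `toy` dataframe and `pattern_id` are hypothetical; `pattern_id` only stands in for `pattern_key`, and its numbering is not assumed to agree with the library's.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "food": ["a", "b", "c", "d", "e"],
    "protein": [1.0, np.nan, 2.0, np.nan, 3.0],
    "fiber": [np.nan, np.nan, 0.5, np.nan, 0.1],
})

mask = toy.isna()

# Number each record by the missingness pattern it belongs to
pattern_id = mask.groupby(list(mask.columns)).ngroup()

# Keep only the records whose pattern is one of the chosen ids
selected = toy[pattern_id.isin([0, 2])]
print(selected)
```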

## Selecting a subset of columns

This returns a new `Missingness` object based on the column selection. Patterns that were distinct before the selection may become identical once columns are dropped, so they are merged and `pattern_key` is reset.

In [None]:
m.select_columns(["name", "group", "protein (g)", "calcium (g)", "fiber (g)"]).count_combinations()
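
The merging of patterns can be seen in a small plain-pandas sketch. The `toy` dataframe and the `count_patterns` helper are hypothetical and are not part of `pace.missingness`; they only show why restricting the columns can reduce the number of distinct patterns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "food": ["a", "b", "c", "d", "e"],
    "protein": [1.0, np.nan, 2.0, np.nan, 3.0],
    "fiber": [np.nan, np.nan, 0.5, np.nan, 0.1],
})


def count_patterns(frame):
    """Count the records for each distinct missingness pattern in `frame`."""
    mask = frame.isna()
    return mask.groupby(list(mask.columns)).size().reset_index(name="_count")


all_columns = count_patterns(toy)
fewer_columns = count_patterns(toy[["food", "protein"]])

print(len(all_columns), "patterns over all columns")
print(len(fewer_columns), "patterns after dropping 'fiber'")
```

Records that differed only in whether `fiber` was missing fall into the same pattern once that column is dropped, so any pattern numbering has to be recomputed.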

## Selecting a subset of records explicitly

In [None]:
# restrict to the first 30 records in the data frame
m_first_30 = m.select_records(range(30))

m_first_30.count_combinations()

In [None]:
m_first_30.records()

## Select combinations matching a pattern

Particular patterns of missingness can be described and matched as follows. For example, the expression below describes any pattern with `protein (g)` and `fiber (g)` both missing (ignoring the other columns). This gives a convenient way of representing any intersection.

In [None]:
Col("protein (g)") & Col("fiber (g)")

Use the description to select matching combinations. The `combination_id` of each matching combination is returned.

In [None]:
matches = m.matching_combinations(Col("protein (g)") & Col("fiber (g)"))
matches
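
Conceptually, matching at the pattern level amounts to keeping the distinct patterns in which both columns are flagged as missing. The sketch below shows this in plain pandas on a hypothetical `toy` dataframe; the integer labels in `matching` are simply positions in the toy `patterns` table and are not the library's `combination_id` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "food": ["a", "b", "c", "d", "e"],
    "protein": [1.0, np.nan, 2.0, np.nan, 3.0],
    "fiber": [np.nan, np.nan, 0.5, np.nan, 0.1],
})

# Distinct missingness patterns, in order of first appearance
patterns = toy.isna().drop_duplicates().reset_index(drop=True)

# Patterns where both columns are missing; other columns are ignored
both_missing = (patterns["protein"] & patterns["fiber"]).to_numpy()
matching = patterns.index[both_missing]

print(patterns.loc[matching])
```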

The combinations dataframe can be indexed with the returned `combination_id`s:

In [None]:
m.combinations().loc[matches]

Get the indices of the individual records that match the pattern:

In [None]:
records = np.sort(m.matching_records(Col("protein (g)") & Col("fiber (g)")))
display(records)

Extract the matching records from the original dataframe:

In [None]:
df.loc[records]
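
For comparison, the same record-level selection can be sketched directly in pandas with a boolean mask. This uses a hypothetical `toy` dataframe and does not rely on `pace.missingness`; it assumes, as above, that the condition means both columns are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "food": ["a", "b", "c", "d", "e"],
    "protein": [1.0, np.nan, 2.0, np.nan, 3.0],
    "fiber": [np.nan, np.nan, 0.5, np.nan, 0.1],
})

# Records where both columns are missing
both_missing = toy["protein"].isna() & toy["fiber"].isna()

# Sorted index labels of the matching records, then the records themselves
record_ids = np.sort(toy.index[both_missing.to_numpy()].to_numpy())
subset = toy.loc[record_ids]
print(subset)
```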