# Tutorial example (programmatic selection)

In this notebook, we repeat the previous example (`Tutorial Example.ipynb`). Instead of selecting unexpected missing combinations in the interactive "Combination heatmap", however, we select them programmatically.

In [None]:
from pace.missingness import *
from pace.history import *
from pace.plots import *

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

Read the data into a Pandas dataframe to explore the data with PACE.

In [None]:
df = pd.read_csv("../../data/Synthetic_APC_DIAG_Fields.csv")

## Plot session

First, we create a `PlotSession` object from the dataset.

In [None]:
session = PlotSession(df)

With the `PlotSession` object, we can then get a dataframe of all missingness combinations. Each row represents one unique combination, which can be identified by its `combination_id`. Within a row, a boolean value for each column indicates whether that column is missing (`True`) or not (`False`) in this particular combination.

In [None]:
combinations = session.missingness().combinations()
combinations.head()
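To make the structure of such a combinations dataframe concrete, the following is a minimal pandas-only sketch on a small, made-up dataframe. It is not how PACE computes its result; the names `toy`, `missing`, and `toy_combinations` are hypothetical and only illustrate the idea of one boolean row per distinct missingness pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial dataset.
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", np.nan, "F", np.nan],
    "DIAG_03": [np.nan, np.nan, "G", np.nan],
})

# True where a value is missing, False where it is present.
missing = toy.isna()

# Keep one row per distinct missingness pattern.
toy_combinations = missing.drop_duplicates().reset_index(drop=True)
toy_combinations.index.name = "combination_id"  # assumed label, for illustration
print(toy_combinations)
```

Records 1 and 3 of the toy data share the same pattern, so the four records collapse into three combinations.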

## Making a programmatic selection

In our example, we are interested in unexpected missingness behaviour in the diagnosis fields DIAG_02 to DIAG_10. Hence, we look at the subset of the combinations dataframe that contains only the diagnosis fields (the selection in the next cell also includes DIAG_01 as the first column of the sequence).

In [None]:
combinations_subset = combinations[['DIAG_01','DIAG_02','DIAG_03','DIAG_04','DIAG_05','DIAG_06','DIAG_07','DIAG_08','DIAG_09','DIAG_10']]
# plt.matshow(combinations_subset)
combinations_subset.head()

As we learned in the previous example, it is expected in this dataset that if any diagnosis column from DIAG_02 to DIAG_10 is missing then all of the subsequent diagnosis columns should also be missing. Unexpected missingness combinations have gaps in the diagnosis columns.

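The next cell contains a function that takes a row of the combinations dataframe as input and returns `True` if there is a gap in this combination. Because `False` sorts before `True`, a row without gaps is a run of `False` values followed by a run of `True` values. A gap is therefore identified when the values in the row are not monotonically increasing, that is, when a `False` follows a `True`.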

In [None]:
def has_gap(row: pd.Series) -> bool:
    """Does 'row' have a 'gap' (is it non-monotonic)?"""

    return not row.is_monotonic_increasing

# Equivalent to the explicit loop:

# def has_gap(row: pd.Series) -> bool:
#    found_missing = False
#    for m in row:
#        found_missing |= m
#        if found_missing and not m:
#            return True
#    return False
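As a quick sanity check of the equivalence claimed above, the sketch below compares the two versions on every possible boolean pattern of length four. The name `has_gap_loop` is introduced here only for illustration.

```python
from itertools import product

import pandas as pd


def has_gap(row: pd.Series) -> bool:
    """True if the row is not monotonically increasing."""
    return not row.is_monotonic_increasing


def has_gap_loop(row: pd.Series) -> bool:
    """Explicit version: True if a present value follows a missing one."""
    found_missing = False
    for m in row:
        found_missing |= m
        if found_missing and not m:
            return True
    return False


# Compare both implementations on all 16 boolean patterns of length 4.
patterns = list(product([False, True], repeat=4))
agree = all(
    has_gap(pd.Series(p, dtype=bool)) == has_gap_loop(pd.Series(p, dtype=bool))
    for p in patterns
)
print(agree)

# Two individual patterns for reference.
no_gap_row = pd.Series([False, False, True, True])   # missing only at the end
gap_row = pd.Series([False, True, False, True])      # present value after a missing one
print(has_gap(no_gap_row), has_gap(gap_row))
```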

The function `has_gap()` is then applied to each row of the combinations subset dataframe, which gives us an array containing the `combination_id` of every combination with a gap.

In [None]:
combinations_with_gaps = np.where(combinations_subset.apply(has_gap, axis=1))[0]
combinations_with_gaps
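One detail worth keeping in mind: `np.where(...)[0]` returns row positions, not index labels. The two coincide when the dataframe has a default `0..n-1` index; whether that holds for the PACE combinations dataframe is an assumption here. The sketch below uses a made-up dataframe (`toy_subset`, with a deliberately non-default index) to show the difference and a label-based alternative.

```python
import numpy as np
import pandas as pd


def has_gap(row: pd.Series) -> bool:
    return not row.is_monotonic_increasing


# Hypothetical combinations subset whose index does not start at 0.
toy_subset = pd.DataFrame(
    {
        "DIAG_01": [False, False, False],
        "DIAG_02": [False, True, True],
        "DIAG_03": [False, False, True],
    },
    index=pd.Index([10, 11, 12], name="combination_id"),
)

# Boolean mask: one entry per combination.
mask = toy_subset.apply(has_gap, axis=1).to_numpy()

positions = np.where(mask)[0]               # row positions
labels = toy_subset.index[mask].to_numpy()  # index labels

print(positions, labels)
```

Here only the middle row has a gap, so `positions` holds its position while `labels` holds its index label.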

With the function `add_selection()`, we can then add a selection to the `PlotSession` object based on the identified combinations. Note that we need to give the selection a name ("gaps" in our example).

In [None]:
session.add_selection("gaps", combinations=combinations_with_gaps)

Even though we made a selection based on missingness combinations, we can retrieve the corresponding record indices of the original dataframe with the `selected_records()` function. The function requires the name of the selection.

In [None]:
gaps_records = session.selected_records("gaps")
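As a cross-check that does not rely on PACE, the records with gaps can also be found by applying the same gap test directly to the missingness mask of the data. The following is a sketch on made-up data; `toy`, `record_has_gap`, and `gap_record_indices` are hypothetical names, and the exact type returned by `selected_records()` is not assumed here.

```python
import numpy as np
import pandas as pd


def has_gap(row: pd.Series) -> bool:
    return not row.is_monotonic_increasing


# Hypothetical toy records; record 1 has DIAG_02 missing but DIAG_03 present.
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", np.nan, "F", np.nan],
    "DIAG_03": [np.nan, "H", "G", np.nan],
})

# Apply the gap test to each record's missingness pattern.
record_has_gap = toy.isna().apply(has_gap, axis=1)

# Index labels of the records whose pattern contains a gap.
gap_record_indices = toy.index[record_has_gap.to_numpy()]
print(list(gap_record_indices))
```

On the real dataset, the same approach applied to the diagnosis columns of `df` should, in principle, identify the same records as the PACE selection.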

We can visualise our selection with the `add_plot()` function. The first input argument is the name for the new plot. The second input argument is the name of the selection on which we base the plots.

In [None]:
session.add_plot("gaps_plot", "gaps")

## Explaining unexpected missing combinations – Data mining

The notebook `Information Gain Ratio.ipynb` contains helper functions to calculate the information gain ratio (IGR), which we will use for our example dataset.

In [None]:
%run 'Information Gain Ratio.ipynb'

`igr()` allows us to rank multiple columns (given by `igr_columns`) by how strongly their values are associated with whether a record is or is not a member of the selected missing combinations (`gaps`).

In [None]:
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']
igr_result = igr(df[igr_columns], gaps_records)
igr_result
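For readers unfamiliar with the measure, the following is a minimal sketch of the textbook information gain ratio for a single categorical column: the information gain about selection membership divided by the entropy of the column itself. It is not the implementation in `Information Gain Ratio.ipynb`, which may differ (for example in how it treats continuous columns such as age); the function names `entropy` and `information_gain_ratio` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the value distribution."""
    p = values.value_counts(normalize=True).to_numpy()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def information_gain_ratio(attribute: pd.Series, member: pd.Series) -> float:
    """Information gain of `attribute` about `member`, divided by the
    entropy of `attribute` (the split information)."""
    h_target = entropy(member)
    # Entropy of membership within each attribute value, weighted by group size.
    h_conditional = sum(
        (len(group) / len(member)) * entropy(group)
        for _, group in member.groupby(attribute)
    )
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # a constant column carries no information
    return (h_target - h_conditional) / split_info


# Toy example: membership in a selection, plus two candidate columns.
member = pd.Series([True, True, False, False])
informative = pd.Series(["a", "a", "b", "b"])    # perfectly separates members
uninformative = pd.Series(["x", "y", "x", "y"])  # unrelated to membership

igr_informative = information_gain_ratio(informative, member)
igr_uninformative = information_gain_ratio(uninformative, member)
print(igr_informative, igr_uninformative)
```

A column that perfectly separates members from non-members scores 1, while a column unrelated to membership scores 0.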

In [None]:
# Convert the dict views to lists so that any matplotlib version accepts them.
plt.bar(list(igr_result.keys()), list(igr_result.values()))
plt.ylabel("IGR")
plt.show()