# PACE Tutorial 2: Analysing Missingness in Synthetic APC Data – Programmatic selection

## Introduction

This tutorial notebook builds on the previous one (`Tutorial Example.ipynb`) and takes another approach to analysing data missingness in the same synthetic APC dataset.

In the previous tutorial, we used an interactive "Combination heatmap" visualization to select unexpected missingness combinations to analyse.

In this tutorial, instead of selecting these combinations interactively, we will see how the same combinations can be selected programmatically with PACE.

## Preamble: PACE and other libraries

In [None]:
from pace.missingness import *
from pace.history import *
from pace.plots import *

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

## Load the data and create a PlotSession object

As before, we read the data as a Pandas dataframe, and create a PACE `PlotSession` object from it.

In [None]:
df = pd.read_csv("../../data/Synthetic_APC_DIAG_Fields.csv")

session = PlotSession(df)

From the `PlotSession` object, we can extract all distinct missingness combinations into a dataframe.

One row of this dataframe represents one unique combination which can be identified by its `combination_id`.

Within a row, a boolean value in a column indicates if the column is missing (`True`) or not (`False`) in this particular combination.

In [None]:
combinations = session.missingness().combinations()
combinations
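To make the shape of this table concrete, the following is a minimal sketch in plain pandas of how distinct missingness combinations could be derived from a small dataframe. It does not use PACE and is not how PACE computes them internally; the `toy` data and the `count` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the APC dataframe (hypothetical values).
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", np.nan, np.nan, "F"],
    "DIAG_03": [np.nan, np.nan, np.nan, "G"],
})

# True where a value is missing, False where it is present.
missing = toy.isna()

# One row per distinct missingness pattern, plus the number of records showing it.
toy_combinations = (
    missing.groupby(list(missing.columns))
    .size()
    .reset_index(name="count")
)
print(toy_combinations)
```

Each row of `toy_combinations` plays the role of one combination: the boolean columns describe the pattern, and the row position could serve as an identifier.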

> **Note**
>
> `PlotSession.missingness()` returns an object containing the missingness data for a particular selection.
>
> Passing no argument to `missingness()`, as above, returns the missingness data for the entire dataframe (`df`).  An optional `name` argument would allow us to visualise the combinations present in a particular named selection (for example, made interactively from a plot).  See the previous tutorial for more on named selections.
>
> The call to `missingness()` returns a `Missingness` object, which supports a number of methods for querying the missingness properties of the data, including `combinations()` that we called above.  See the PACE documentation for more details.

We see that there are sixteen distinct missingness patterns in the data, including `combination_id` `0`, which has no missing fields.  We visualized these in the last tutorial (along with their counts) with the Combination heatmap plot.

## Making a programmatic selection

In our example, we are interested in unexpected missing values in the diagnosis fields `DIAG_01` to `DIAG_10`. To make the job of identifying these more straightforward, we will work with a subset of the combinations dataframe containing only these fields.

In [None]:
combinations_subset = combinations[['DIAG_01','DIAG_02','DIAG_03','DIAG_04','DIAG_05','DIAG_06','DIAG_07','DIAG_08','DIAG_09','DIAG_10']]

combinations_subset.head()

As we learned in the previous example, it is expected in this dataset that if any diagnosis column from `DIAG_02` to `DIAG_10` is missing, then all of the subsequent diagnosis columns should also be missing. Unexpected missingness combinations are those with such 'gaps' in the diagnosis columns.

The next cell contains a function that takes a row of the combinations dataframe as input and returns `True` if there is a gap in this combination. A gap is identified when the values in the row are not monotonically increasing (interpreting `True` as 1 and `False` as 0).

Conveniently, a pandas Series has an `is_monotonic_increasing` property that we can use to determine if this is the case.

In [None]:
def has_gap(row: pd.Series) -> bool:
    """Does 'row' have a 'gap' (is it non-monotonic)?"""

    return not row.is_monotonic_increasing

# Equivalent to:

# def has_gap(row: pd.Series) -> bool:
#    found_missing = False
#    for m in row:
#        found_missing |= m
#        if found_missing and not m:
#            return True
#    return False
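As a quick check of this logic, the sketch below applies the same function to two hand-written patterns. The example rows are hypothetical and stand in for rows of the combinations subset.

```python
import pandas as pd

def has_gap(row: pd.Series) -> bool:
    """Return True if a present value follows a missing one."""
    return not row.is_monotonic_increasing

# Expected pattern: once a field is missing, all later fields are missing too.
expected_pattern = pd.Series([False, False, True, True])

# Unexpected pattern: the second field is missing but the third is present.
gap_pattern = pd.Series([False, True, False, True])

no_gap = has_gap(expected_pattern)
gap = has_gap(gap_pattern)
print(no_gap, gap)
```

A row with no missing values at all, or with every value missing, is also monotonic and so is not flagged as having a gap.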

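We next apply `has_gap()` to each row of the combinations subset dataframe, giving an array that contains the `combination_id` of every combination with a gap. Strictly, `np.where` returns row positions; these match the `combination_id` values only if the combinations are numbered consecutively from 0 in row order, which we assume here.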

In [None]:
combinations_with_gaps = np.where(combinations_subset.apply(has_gap, axis=1))[0]
combinations_with_gaps
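The difference between row positions and index labels is worth seeing once. The sketch below uses a toy subset whose index deliberately does not start at 0; the index values and the assumption that `combination_id` is the dataframe index are hypothetical.

```python
import numpy as np
import pandas as pd

def has_gap(row: pd.Series) -> bool:
    return not row.is_monotonic_increasing

# Toy combinations subset with a non-default index (hypothetical ids).
toy_subset = pd.DataFrame(
    {
        "DIAG_01": [False, False, False],
        "DIAG_02": [False, True, True],
        "DIAG_03": [False, False, True],
    },
    index=pd.Index([10, 11, 12], name="combination_id"),
)

gap_mask = toy_subset.apply(has_gap, axis=1)

# np.where gives row positions (0-based), regardless of the index labels.
positions = np.where(gap_mask)[0]

# Indexing the index with the mask gives the labels instead.
labels = toy_subset.index[gap_mask.to_numpy()].to_numpy()
print(positions, labels)
```

When the index is the default 0, 1, 2, ... the two results coincide, which is the situation the cell above relies on.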

With the method `PlotSession.add_selection()`, we can then add a selection to the `PlotSession` object based on the identified combinations.

Note that we need to give the selection a name (`"gaps"` in our example).  This is similar to the `add_plot` function that we encountered last time, and allows us to refer back to the selection.

In [None]:
session.add_selection(name="gaps", combinations=combinations_with_gaps)

Even though we made a selection based on missingness combinations, we can retrieve the corresponding records of the original dataframe with the `selected_records()` function. The function takes the name of the selection as an argument and returns a boolean series indicating which records belong to it.

In [None]:
gaps_records = session.selected_records(name="gaps")
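To illustrate what such a boolean series represents, here is a minimal sketch in plain pandas that marks the records whose missingness pattern belongs to a chosen set of combinations. It is not PACE's implementation; the toy data and the names `selected_patterns` and `toy_gap_records` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the APC dataframe (hypothetical values).
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", np.nan, np.nan, "F"],
    "DIAG_03": [np.nan, "H", np.nan, "G"],
})

# Missingness pattern of every record, one tuple of booleans per row.
missing = toy.isna()
patterns = list(missing.itertuples(index=False, name=None))

# The selected combinations: here, only the pattern with a gap at DIAG_02.
selected_patterns = {(False, True, False)}

# Boolean series aligned with the original dataframe.
toy_gap_records = pd.Series(
    [pattern in selected_patterns for pattern in patterns],
    index=toy.index,
)

# The series can be used directly as a row filter.
print(toy[toy_gap_records])
```

In the tutorial itself, the analogous filter would be `df[gaps_records]`, assuming the returned series is aligned with the rows of `df`.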

We can visualise our selection with the `add_plot()` function.  We call it below with two arguments:
  - The argument `name` is the name for the new plot (and can be used to refer to any more refined selection that we make interactively within it).
  - The argument `based_on` is the name of the selection from which we take the data to plot.  Notice that the plot below shows only the combinations that we selected in `"gaps"`.

In [None]:
session.add_plot(name="gaps_plot", based_on="gaps")

This Value bar chart highlights a very different pattern of missingness compared with the same visualization of all of the data (see the first plot in the previous tutorial notebook).

From this plot, the number of missing values in `DIAG_03` immediately indicates a problem, because it is missing more often than the subsequent diagnosis fields.

## Explaining unexpected missing combinations – Data mining

Now that we have selected the records with gaps, we continue as in the previous tutorial notebook.

The notebook `Information Gain Ratio.ipynb` contains helper functions to calculate the information gain ratio (IGR), which we will use with our example dataset.

In [None]:
%run 'Information Gain Ratio.ipynb'

`igr()` allows us to rank multiple columns (in our case, those we name in `igr_columns` below) by how strongly their values are associated with whether or not a record belongs to the selected missingness combinations (`gaps`).

In [None]:
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']
igr_result = igr(df[igr_columns], gaps_records)
igr_result
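For readers unfamiliar with the measure, the following is a minimal sketch of an information gain ratio for a single categorical column against a boolean target, using the standard definition (information gain divided by the entropy of the column). It is not the code in `Information Gain Ratio.ipynb`, whose details may differ; the names `entropy` and `igr_sketch` are hypothetical.

```python
import numpy as np
import pandas as pd

def entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the empirical distribution of 'values'."""
    p = values.value_counts(normalize=True).to_numpy()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def igr_sketch(column: pd.Series, target: pd.Series) -> float:
    """Information gain ratio of 'column' with respect to 'target'."""
    h_target = entropy(target)
    # Conditional entropy H(target | column): per-group entropy weighted by group size.
    weights = column.value_counts(normalize=True)
    h_conditional = sum(
        weight * entropy(target[column == value])
        for value, weight in weights.items()
    )
    split_information = entropy(column)
    if split_information == 0:
        return 0.0  # a constant column carries no information
    return (h_target - h_conditional) / split_information

# Toy target and columns (hypothetical values).
target = pd.Series([True, True, False, False])
informative = pd.Series(["x", "x", "y", "y"])      # perfectly predicts the target
uninformative = pd.Series(["x", "y", "x", "y"])    # independent of the target

igr_informative = igr_sketch(informative, target)
igr_uninformative = igr_sketch(uninformative, target)
print(igr_informative, igr_uninformative)
```

A higher value indicates that knowing the column's value tells us more about selection membership, relative to how finely the column splits the data. This sketch ignores missing values in the column, which `value_counts` drops by default.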

In [None]:
igr_result_sorted = dict(sorted(igr_result.items(), key=lambda kv: kv[1], reverse=True))

plt.bar(igr_result_sorted.keys(), igr_result_sorted.values())
plt.xlabel("Column")
plt.ylabel("Information Gain Ratio (IGR)")
plt.show()