# PACE Tutorial 1: Analysing Missingness in Synthetic APC Data – Interactive selection

## Introduction

This tutorial example focuses on the high-level functionality of PACE and uses a synthetic dataset that mirrors the missingness patterns that were found in an extract of an Admitted Patient Care (APC) dataset from Hospital Episode Statistics (HES).

**Objectives for this tutorial:**
  - Get a basic overview of PACE functionality for exploring missingness in data
  - Get introduced to the interactive workflow and `PlotSession`
  - Explain unexpected patterns of missing data by using data mining techniques

<div class="alert alert-success"><b>Note: </b>To produce the expected results, this tutorial notebook requires you to interact with it in a few places.  When this is necessary, a box like this will let you know.

Given its partly interactive nature, this notebook is intended to be executed **cell by cell**. If you run the whole notebook in one go, you won't get the expected results.
</div>

<div class="alert alert-info"><b>Solution: </b>After an interactive part, you will find a box like this before the cells that contain hints or solutions.</div>

## Preamble

### Includes: PACE and other libraries

In [None]:
from pace.membership import *
from pace.history import *
from pace.plots import *

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading the data

Read the data into a Pandas dataframe, which can be read by PACE to explore the missingness of the data.

In [None]:
df = pd.read_csv("../examples/datasets/Synthetic_APC_DIAG_Fields.csv")

In [None]:
df.tail()

## PACE Plotting session

The first step in using PACE is to create a `PlotSession` object.

`PlotSession` is the core PACE class that provides the functionality to analyse and explore the missingness patterns found in a dataset. This can be done:
  - **programmatically**, via several methods of the class;
  - with several **interactive plots** designed for use within a Jupyter notebook;
  - using a combination of these modes.

As this tutorial example will show, a `PlotSession` can be used to slice and select data and to create interactive plots. Moreover, the current session (including the interactive selections and active plots) can be saved and re-loaded, so that these do not have to be re-made when the notebook is restarted or shared.

A `PlotSession` can be constructed from a Pandas dataframe, as shown in the cell below:

In [None]:
session = PlotSession(df)

To visualise the dataset, call `PlotSession.add_plot`, providing a name.

Naming the plot is important: it allows any interactive selection made in the plot to be referred to later.

The result is a [Bokeh](https://docs.bokeh.org/en/latest/index.html) widget with a number of tabs, each with a different visualisation of the data missingness.

The tab that is selected when a plot is first added is the **Value bar chart**. It displays the number of missing values (y-axis) in each column (x-axis). Like any other visualisation in PACE, it supports interactive selections via the tap tool (left-click or `Shift` + left-click) and via box selection (click, drag and release).
In the case of our APC dataset, this chart shows that, as expected, the primary diagnosis field (`DIAG_01`) and the five categorical fields (admission age (`ADMIAGE`), admission method (`ADMIMETH`), mortality status (`Mortality`), health care provider (`PROCODE3`) and patient's sex (`SEX`)) are never missing. Further, the secondary diagnosis fields are missing progressively more often from `DIAG_02` to `DIAG_10`.

In [None]:
session.add_plot(name="all data")
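
The per-column counts shown in the Value bar chart can be cross-checked directly with pandas. The following is a minimal sketch on a small made-up dataframe; `toy` and its values are hypothetical and only stand in for `df`, and the snippet does not use PACE itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the APC dataframe `df`
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", None, "F", None],
    "DIAG_03": [None, None, "G", None],
})

# Number of missing values per column (the quantity on the chart's y-axis)
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applying the same expression to `df` should give the bar heights of the chart, assuming missing values were read in as `NaN` by `pd.read_csv`.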

## Making a selection

To investigate missingness, we take a closer look at the **Combination heatmap**. First we make a new plot called "combinations" and switch to the heatmap tab.
The Combination heatmap displays a matrix with the fields on the x-axis and the missing combinations on the y-axis. The number of records associated with each missing combination is encoded using a sequential colour map (a darker colour indicates that a combination occurs in more records).

It is expected in this dataset that if any diagnosis field from `DIAG_02` to `DIAG_10` is missing, then all of the subsequent diagnosis fields are also missing. The Combination heatmap shows that this is indeed true for most of the records with missing values. However, there are also 7 unexpected missing combinations, which have gaps in the diagnosis fields.
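
To make the idea of a "missing combination" concrete, here is a minimal pandas sketch that labels each record with the set of fields missing in it and counts how many records share each label. The dataframe `toy` is hypothetical, and this is only an illustration of the concept, not how PACE computes the heatmap.

```python
import pandas as pd

# Hypothetical toy data standing in for the APC dataframe `df`
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D", "E"],
    "DIAG_02": ["F", None, None, None, "G"],
    "DIAG_03": ["H", None, None, "I", None],
})

is_missing = toy.isna()

def combination_label(row: pd.Series) -> str:
    """Name the fields that are missing in one record."""
    missing_fields = row.index[row.to_numpy()]
    return ", ".join(missing_fields) or "(none)"

# One label per record, then the number of records per combination
labels = is_missing.apply(combination_label, axis=1)
combination_counts = labels.value_counts()
print(combination_counts)
```

In this toy example, the combination consisting of `DIAG_02` alone is the kind of gap the text describes: a later field (`DIAG_03`) is present although an earlier one is missing.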

<div class="alert alert-success"><b>Try it</b>: 
<ul>
  <li> Run the code cell below and switch to the 'Combination heatmap' tab.</li>
  <li> Use one of the interactive tools to select all rows (combinations) that have unexpected gaps.</li>
</ul>

</div>

In [None]:
session.add_plot(name="combinations")

<div class="alert alert-block alert-info"> <b>Solution:</b> If you made the suggested selection, the plot will look like the solution below. </div>

The following `PlotSession` is pre-populated with the suggested selection of combinations for the "combinations" plot.

In [None]:
# load solution
session_solution = PlotSession(df, session_file="tutorial_example_session_solution.json")
# plot solution
session_solution.add_plot(name="combinations")

We can retrieve the records present in the selection with `PlotSession.selected_records()`, passing the name of the selection.  Recall that our plot was named "combinations".

Notice that:
  - Even though we selected particular *missingness combinations* in the plot, `selected_records()` identifies the *records* in the dataframe that are present in the selection.  The result is indexed like the original dataframe (`df` in our case).
  - The function takes a name of a selection as its argument: in this case it is the name of the plot where the selection was made.

In [None]:
gaps = session.selected_records("combinations")

`selected_records()` returns a boolean Pandas series which is `True` when the record is included in the selection and `False` otherwise:

In [None]:
gaps

<div class="alert alert-block alert-info"> <b>Solution:</b> Check if you made the right selection by running the next code cell. </div>

In [None]:
solution_gaps = session_solution.selected_records("combinations")

if (solution_gaps == gaps).all():
    print(f"Correct selection. {sum(gaps)} of {len(gaps)} records are included in the selection.")
else:
    print("Selection not quite right. Try it again.")

We can extract the full records that correspond to the selection from the original Pandas dataframe straightforwardly, if required:

In [None]:
df[gaps]
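
An interactive selection like this one can also be cross-checked programmatically. The sketch below flags records in which a missing diagnosis field is followed by a non-missing one. It runs on a hypothetical dataframe `toy` with three diagnosis columns and uses only pandas and NumPy; the names `diag_cols` and `has_gap` are illustrative and not part of PACE.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the diagnosis columns of `df`
toy = pd.DataFrame({
    "DIAG_01": ["A", "B", "C", "D"],
    "DIAG_02": ["E", None, None, "F"],
    "DIAG_03": ["G", None, "H", None],
})
diag_cols = ["DIAG_01", "DIAG_02", "DIAG_03"]  # must be in field order

is_missing = toy[diag_cols].isna().to_numpy()

# True from the first missing field onwards, moving left to right
seen_missing = np.maximum.accumulate(is_missing, axis=1)

# A gap is a present field that appears after a missing one
has_gap = pd.Series((seen_missing & ~is_missing).any(axis=1), index=toy.index)
print(has_gap)
```

With the real data, a mask built this way over `DIAG_01` to `DIAG_10` could be compared against `gaps`, provided the expectation described earlier is the only rule that defines a gap.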

## Explaining unexpected missing combinations – Data mining

This section contains an example of how PACE can be combined with data mining methods to gain further insight into the missing data. 
The notebook `Information Gain Ratio.ipynb` contains helper functions to calculate the information gain ratio (IGR), which we will use for our example dataset. The notebook also includes a brief description of IGR. These helper functions are not part of the core PACE functionality, and illustrate how PACE can work alongside other libraries. 

In [None]:
%run "Information Gain Ratio.ipynb"

We use the function `igr()` loaded from `Information Gain Ratio.ipynb`. 

In [None]:
help(igr)

We want to investigate how homogeneous selected fields are, in terms of the presence (or absence) of the records showing unexpected gaps (patterns of missingness). We will compute the IGR for each column/field specified in `igr_columns` using the function `igr()`. `gaps` represents our target.

In [None]:
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']
igr_result = igr(df[igr_columns], gaps)

In [None]:
igr_result

<div class="alert alert-block alert-info"> <b>Solution:</b> If you made the suggested selection from the combination heatmap, you should have obtained the following values for the IGR of each column. </div>

In [None]:
# solution records for all gaps
igr_result_solution = igr(df[igr_columns], solution_gaps)
igr_result_solution

The values obtained using `igr()` allow us to rank the columns in `igr_columns` by how strongly their values are associated with whether or not a record belongs to the selected missing combinations (`gaps`).

In [None]:
igr_result_sorted = dict(sorted(igr_result.items(), key=lambda kv: kv[1], reverse=True))

plt.bar(igr_result_sorted.keys(), igr_result_sorted.values())
plt.xlabel("Column")
plt.ylabel("Information Gain Ratio (IGR)")
plt.show()
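
For readers who do not have the helper notebook at hand, the following is a minimal sketch of the textbook information gain ratio for a single categorical feature and a boolean target: the information gain divided by the entropy of the feature itself. The function names `_entropy` and `igr_sketch` are hypothetical, and the helper `igr()` used above may differ in details such as the logarithm base or its return type.

```python
import numpy as np
import pandas as pd

def _entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the value distribution of a series."""
    p = values.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def igr_sketch(feature: pd.Series, target: pd.Series) -> float:
    """Information gain ratio of `target` with respect to `feature`."""
    # Entropy of the target within each feature value, weighted by frequency
    weights = feature.value_counts(normalize=True)
    conditional = target.groupby(feature).apply(_entropy)
    information_gain = _entropy(target) - float((weights * conditional).sum())
    # Normalise by the entropy of the feature (the "split information")
    split_information = _entropy(feature)
    return information_gain / split_information if split_information > 0 else 0.0

# Hypothetical toy data: one feature separates the target perfectly, one not at all
target = pd.Series([True, True, False, False])
informative = pd.Series(["a", "a", "b", "b"])
uninformative = pd.Series(["a", "b", "a", "b"])

igr_informative = igr_sketch(informative, target)
igr_uninformative = igr_sketch(uninformative, target)
print(igr_informative, igr_uninformative)
```

A ratio close to 1 means that knowing the feature value almost determines whether a record is in the selection; a ratio close to 0 means the feature carries little information about it.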

## Identifying the cause of the missing data for a particular combination

Next, we want to learn more about the cause of the unexpected missingness. For this, we visualise only the data records with unexpected gaps that we identified earlier.

Note how the plots we obtain in doing so highlight a very different pattern of missingness compared to the visualisation of the full missing data.

We select the most common missing combination from the heatmap, retrieve the corresponding records and apply data mining methods on them.

<div class="alert alert-success"><b>Try it:</b><br />
Run the code cell below and in the plot that pops up do the following:
<ul>
  <li> Select the 'Combination heatmap' tab</li>
  <li> Use one of the interactive tools to select the most common combination (the row with the darkest coloured boxes in the heatmap)</li>
</ul>
</div>

In [None]:
session.add_plot(name="gaps", based_on="combinations")

<div class="alert alert-block alert-info"> <b>Solution:</b> If you selected the suggested most common combination, your plot should match the one below. </div>

In [None]:
# solution plot
session_solution.add_plot(name="gaps", based_on="combinations")

Having selected the most common combination with an unexpected missingness pattern, we can recover the corresponding records.

In [None]:
gaps_most_freq = session.selected_records("gaps", base_selection="combinations")

Next, we compute the entropy of each admission method, i.e. each value in the field `ADMIMETH`, for the selected missing combination `gaps_most_freq`, using the provided function `entropy_table()`.

In [None]:
def entropy_table(feature: pd.Series, target: pd.Series) -> pd.DataFrame:
    """
    Calculate the entropy for a given feature.
    
    Parameters
    ----------
    feature: pd.Series
        The features for which the entropy will be calculated
    target: pd.Series
        The target for which the entropy given the feature will be calculated
    Returns
    -------
    pd.DataFrame
        Returns number of records with a particular feature value (count), how many of these records 
        are in the target class (sum) and the entropy.
    """
    df_target = pd.DataFrame({
        feature.name: feature,
        "_target": target,
    })
    df_split = (
        df_target
        .groupby(feature.name)
        .agg({"_target": ["sum", "count"]})
    )

    df_split[("_target", "p")] = (
        df_split[("_target", "sum")] / df_split[("_target", "count")]
    )

    p = df_split[("_target", "p")]
    df_split[("_target", "entropy")] = (p * np.log(1/p)).fillna(0.0)

    df_split.columns = df_split.columns.get_level_values(1)
        
    return df_split[["sum", "count", "entropy"]]

In each row of the entropy table, `count` contains the number of records that have this particular feature value (here, the admission method). The field `sum` indicates how many of these records are in the target class, i.e. in our case how many have the most common combination with an unexpected missingness pattern.
A closer look at the column `sum` reveals that all 6 records with the selected unexpected missing combination have the same admission method.

In [None]:
entropy_table(df.loc[gaps, "ADMIMETH"], gaps_most_freq)
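
To help read the `entropy` column, the short sketch below evaluates the same expression that `entropy_table()` uses, `p * log(1/p)` with the natural logarithm, for a few hypothetical values of `p` (the fraction of records with a given feature value that fall in the target class). It only illustrates the formula and uses no data from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical fractions of records in the target class
p = pd.Series([0.0, 0.1, 0.5, 1.0])

# Same expression as in entropy_table(); p == 0 yields NaN (0 * inf),
# which is replaced by 0
with np.errstate(divide="ignore", invalid="ignore"):
    entropy_term = (p * np.log(1 / p)).fillna(0.0)

print(pd.DataFrame({"p": p, "entropy": entropy_term}))
```

The term is 0 both when none of the records with that feature value are in the target class and when all of them are, and it is positive for mixed groups.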

<div class="alert alert-block alert-info"> <b>Solution:</b> Your entropy table given the feature ADMIMETH should look like this. </div>

In [None]:
# solution records for most common gap
gaps_most_freq_solution = session_solution.selected_records("gaps", base_selection="combinations")
# solution entropy table 1
entropy_table(df.loc[solution_gaps, "ADMIMETH"], gaps_most_freq_solution)

We repeat this step for the `PROCODE3` field.

We learn that all records with the selected unexpected missing combination not only share the same admission method, but were also submitted by a single provider.

In [None]:
entropy_table(df.loc[gaps, "PROCODE3"], gaps_most_freq)

<div class="alert alert-block alert-info"> <b>Solution:</b> Your entropy table given the feature PROCODE3 should look like this. </div>

In [None]:
# solution entropy table 2
entropy_table(df.loc[solution_gaps, "PROCODE3"], gaps_most_freq_solution)

## Saving the session

If you were to re-run the notebook at this point, the interactive selection made in the Combination Heatmap above ("combinations") would be lost.  To avoid this, PACE can save any user-made selections to a file, and load it to restore the state of the session.

The following cell writes the current selections of every plot in the session to the indicated JSON file.

This file can be re-loaded into a PlotSession object, or shared with others along with the notebook and any data it uses.

In [None]:
session.save("tutorial_example_session.json")

<div class="alert alert-success"><b>Try it:</b><br />
Replace the code in the cell that creates the session (<tt>session = PlotSession(df)</tt>) with
    
    session = PlotSession(df, session_file="tutorial_example_session.json")
    
then restart the notebook kernel and re-run it from the beginning.  The interactive selections you made within the plot will be restored.
</div>