# PACE Tutorial 1: Analysing Missingness in Synthetic APC Data – Interactive selection

## Introduction

This tutorial example focuses on the high-level functionality of PACE and uses a synthetic dataset that mirrors the missingness patterns that were found in an extract of an Admitted Patient Care (APC) dataset from Hospital Episode Statistics (HES).

<div class="alert alert-success"><b>Note: </b>To produce the expected results, this tutorial notebook requires you to interact with it in a few places.  When this is necessary, a box like this will let you know.</div>

## Preamble

### Includes: PACE and other libraries

In [None]:
from pace.missingness import *
from pace.history import *
from pace.plots import *

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading the data

Read the data into a Pandas dataframe, which can be read by PACE to explore the missingness of the data.

In [None]:
df = pd.read_csv("../../data/Synthetic_APC_DIAG_Fields.csv")

In [None]:
df.tail()

## PACE Plotting session

The first step in using PACE is to create a `PlotSession` object.  A `PlotSession` can be constructed from a Pandas dataframe, as shown in the cell below:

In [None]:
session = PlotSession(df)

`PlotSession` is the core PACE class that provides the functionality to analyse and explore the missingness patterns found in a dataset. This can be done:
  - **programmatically**, via several methods of the class;
  - with several **interactive plots** designed for use within a Jupyter notebook;
  - using a combination of these modes.

As this tutorial example will show, a `PlotSession` can be used to slice and select data and to create interactive plots. Moreover, the current session (including the interactive selections and active plots) can be saved and re-loaded, so that these do not have to be re-made when the notebook is restarted or shared.

To visualise the dataset, call `PlotSession.add_plot`, providing a name.

Naming the plot is important: it allows any interactive selection made in the plot to be referred to later.

In [None]:
session.add_plot(name="all data")

The result is a [Bokeh](https://docs.bokeh.org/en/latest/index.html) widget with a number of tabs, each with a different visualisation of the data missingness. All plots allow the user to make interactive selections. 

The tab that is selected when a plot is first added is the **Value bar chart**. It displays the number of missing values (y-axis) in each column (x-axis). Like every other visualisation in PACE, it supports selection via the tap tool (left-click or `Shift` + left-click) and via box selection (click, drag and release).

This chart shows us that, as expected, the primary diagnosis field (`DIAG_01`) and the five categorical fields (`ADMIAGE`, `ADMIMETH`, `Mortality`, `PROCODE3` and `SEX`) are never missing. Further, the secondary diagnosis fields are missing progressively more often from `DIAG_02` to `DIAG_10`.
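The counts shown in the Value bar chart can be cross-checked directly in Pandas. The sketch below uses a small made-up dataframe (`toy`, not the tutorial data) and assumes that missing values are encoded as `NaN`/`None`; it does not use any PACE functionality.

```python
import pandas as pd

# Hypothetical toy data; the column names are borrowed from the tutorial for readability.
toy = pd.DataFrame({
    "DIAG_01": ["A01", "B02", "C03", "D04"],
    "DIAG_02": ["E05", None, "F06", None],
    "DIAG_03": [None, None, "G07", None],
    "SEX": [1, 2, 2, 1],
})

# Number of missing values in each column (the quantity on the chart's y-axis).
missing_counts = toy.isna().sum()
print(missing_counts)
```

Applying the same expression to `df` should give one count per column of the tutorial dataset.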

## Making a selection

To investigate missingness in more detail, we make a new plot and call it "combinations".
The *Combination heatmap* displays a matrix of fields on the x-axis and missing combinations on the y-axis. The number of records associated with each missing combination is encoded using a sequential colour map (a darker colour indicates that more records share that combination).
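To make the idea of a "missing combination" concrete, the following sketch counts how many records share each pattern of missing fields, using plain Pandas on a made-up dataframe (`toy` is hypothetical). It only illustrates the quantity that the heatmap encodes and is not how PACE computes it internally.

```python
import pandas as pd

# Hypothetical toy data, not the tutorial dataset.
toy = pd.DataFrame({
    "DIAG_01": ["A01", "B02", "C03", "D04", "E05"],
    "DIAG_02": ["E05", None, "F06", None, "H08"],
    "DIAG_03": [None, None, "G07", None, None],
})

# One boolean column per field: True where the value is missing.
is_missing = toy.isna()

# Each distinct row of `is_missing` is one missing combination;
# count how many records share each combination.
combination_counts = (
    is_missing.groupby(list(is_missing.columns)).size().rename("records")
)
print(combination_counts)
```

Each row of the result corresponds to one row of the heatmap, and the `records` value corresponds to its colour intensity.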

<div class="alert alert-success"><b>Try it</b>: Run the code cell below and switch to the 'Combination heatmap' tab.</div>

In [None]:
session.add_plot(name="combinations")

It is expected in this dataset that if any diagnosis field from `DIAG_02` to `DIAG_10` is missing, then all of the subsequent diagnosis fields should also be missing. The Combination heatmap above shows that this is indeed true for most of the records with missing values. However, there are also 7 unexpected missing combinations, which have gaps in the diagnosis fields.
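The rule described above can also be expressed as a check in Pandas: a record has an unexpected gap if a diagnosis field is present after an earlier one is missing. The sketch below is one possible formulation on a made-up dataframe (`toy` and its three columns are hypothetical); it is offered as a cross-check on the interactive selection, not as a replacement for it.

```python
import pandas as pd

# Hypothetical toy data with the diagnosis columns in their natural order.
diag_columns = ["DIAG_02", "DIAG_03", "DIAG_04"]
toy = pd.DataFrame({
    "DIAG_02": ["A", None, "C", None],
    "DIAG_03": ["B", None, None, "D"],
    "DIAG_04": [None, None, "E", None],
})

is_missing = toy[diag_columns].isna()

# True from the first missing field onwards, reading left to right.
seen_missing = is_missing.astype(int).cummax(axis=1).astype(bool)

# A gap: a field that is present although an earlier field was missing.
has_gap = (seen_missing & ~is_missing).any(axis=1)
print(has_gap)
```

Here the first two records follow the expected pattern, while the last two contain gaps.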

<div class="alert alert-success"><b>Try it:</b> In the Combination heatmap plot above, use one of the interactive tools to select all rows (combinations) that have unexpected gaps.</div>

> <b>Hint</b>: If you made the suggested selection, the plot will look like the one in the Solution section at the end of this notebook.

We can retrieve the records present in the selection with `PlotSession.selected_records()`, passing the name of the selection.  Recall that our plot was named "combinations".

Notice that:
  - Even though we selected particular *missingness combinations* in the plot, `selected_records()` returns the indices of *records* in the dataframe that are present in the selection.  The indices returned refer to the original dataframe (`df` in our case).
  - The function takes a name of a selection as its argument: in this case it is the name of the plot where the selection was made.

In [None]:
gaps = session.selected_records("combinations")

`selected_records()` returns a boolean Pandas series which is `True` when the record is included in the selection and `False` otherwise:

In [None]:
gaps

In [None]:
print(f"{sum(gaps)} of {len(gaps)} records are included in the selection.")

We can extract the full records from the original Pandas dataframe straightforwardly, if required:

In [None]:
df[gaps]

## Explaining unexpected missing combinations – Data mining

This section contains an example of how PACE can be combined with data mining methods to gain further insight into the missing data. 
The notebook `Information Gain Ratio.ipynb` contains helper functions to calculate the information gain ratio (IGR), which we will use for our example dataset. These helper functions are not part of the core PACE functionality, and illustrate how PACE can work alongside other libraries.

In [None]:
%run 'Information Gain Ratio.ipynb'

We use the function `igr()` loaded from `Information Gain Ratio.ipynb`.

In [None]:
help(igr)

`igr()` allows us to rank multiple columns (given by `igr_columns`) by how strongly their values are associated with whether or not a record is a member of the selected missing combinations (`gaps`).
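For readers unfamiliar with the measure, the sketch below shows one common definition of the information gain ratio: the information gain of the target given a feature, divided by the entropy of the feature. This is an assumption about the general technique; the `igr()` helper in `Information Gain Ratio.ipynb` may differ in detail (for example, in how it treats continuous columns). The function names `entropy` and `information_gain_ratio` are illustrative only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain_ratio(feature, target):
    """Information gain of `target` given `feature`, divided by the entropy of `feature`."""
    total = len(target)
    # Conditional entropy H(target | feature): weighted entropy within each feature value.
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += (count / total) * entropy(subset)
    gain = entropy(target) - conditional
    split_info = entropy(feature)
    # A constant feature carries no information; avoid dividing by zero.
    return gain / split_info if split_info > 0 else 0.0


# Toy example: membership of a selection, and two candidate columns.
in_selection = [True, True, False, False]
perfect = ["a", "a", "b", "b"]        # fully determines membership
uninformative = ["a", "b", "a", "b"]  # tells us nothing about membership

print(information_gain_ratio(perfect, in_selection))
print(information_gain_ratio(uninformative, in_selection))
```

A column that fully determines membership of the selection scores 1, and a column that is unrelated to it scores 0; real columns fall in between.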

In [None]:
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']
igr_result = igr(df[igr_columns], gaps)

If the suggested combinations have been selected from the combination heatmap, then `igr_result` should contain the following values for the IGR of each column:

 | Column      | IGR    |
 |-------------|--------|
 | `ADMIAGE`   | 0.0852 |
 | `ADMIMETH`  | 0.0508 |
 | `Mortality` | 0.0352 |
 | `PROCODE3`  | 0.0999 |
 | `SEX`       | 0.0330 |

In [None]:
igr_result

In [None]:
igr_result_sorted = dict(sorted(igr_result.items(), key=lambda kv: kv[1], reverse=True))

plt.bar(igr_result_sorted.keys(), igr_result_sorted.values())
plt.xlabel("Column")
plt.ylabel("Information Gain Ratio (IGR)")
plt.show()

## Saving the session

If you were to re-run the notebook at this point, the interactive selection made in the Combination Heatmap above ("combinations") would be lost.  To avoid this, PACE can save any user-made selections to a file, and load it to restore the state of the session.

The following cell will write the current selections in every plot in the session to the indicated JSON file.

In [None]:
session.save("tutorial_example_session.json")

This file can be re-loaded into a `PlotSession` object, or shared with others along with the notebook and any data it uses.
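The general idea of persisting a selection can be illustrated with the standard library alone. The structure used below (a mapping from selection name to record indices) and the file name are hypothetical; this tutorial does not describe PACE's actual session file format, which may be quite different.

```python
import json
import os
import tempfile

# Hypothetical selections: selection name -> indices of the selected records.
selections = {"combinations": [2, 3, 7]}
n_records = 10

# Write the selections to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_selections.json")
with open(path, "w") as f:
    json.dump(selections, f)

# Read them back, as one would after restarting the kernel.
with open(path) as f:
    restored = json.load(f)

# Rebuild a boolean mask over all records from the stored indices.
selected = set(restored["combinations"])
mask = [i in selected for i in range(n_records)]
print(mask)
```

Storing the selection rather than the selected data keeps the file small and lets it be shared alongside the notebook.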

<div class="alert alert-success">
Try replacing the code in the input cell that creates the session (the cell named '<tt>In [4]</tt>' if the notebook has been run from the start) with
    
    session = PlotSession(df, session_file="tutorial_example_session.json")
    
then restarting the notebook kernel and re-running it from the beginning.  The interactive selections you made within the plot will be restored.
</div>

## Solution

The following `PlotSession` is pre-populated with the suggested selection of combinations for the "combinations" heatmap plot.

In [None]:
solution_session = PlotSession(df, session_file="tutorial_example_session_solution.json")

In [None]:
solution_session.add_plot(name="combinations")

In [None]:
solution_gaps = solution_session.selected_records("combinations")
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']

In [None]:
igr_result = igr(df[igr_columns], solution_gaps)
igr_result