In [None]:
from pace.missingness import *
from pace.history import *
from pace.plots import *
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

Read the data into a Pandas dataframe to explore the data with PACE.

In [None]:
df = pd.read_csv("../../data/Synthetic_APC_DIAG_Fields.csv")

## Plot session

The first step in using PACE is to create a `PlotSession` object. `PlotSession` is the core PACE class: it lets the user analyse the missingness of a dataset and create plots. `PlotSession()` takes a Pandas dataframe as input and computes the missingness of the provided data. As the example in this tutorial shows, the `PlotSession` object is used to slice or select data and to create interactive plots. The current session, including selections and active plots, can also be saved and re-loaded for later use.



In [None]:
session = PlotSession(df)

To visualise the dataset, use `add_plot()` and provide a name. The resulting view has several tabs, each showing a plot that visualises a different aspect of the data missingness. All plots allow the user to make interactive selections.

The first tab shown when a plot is added is the **Value bar chart**. It displays the number of missing values (y-axis) in each field (x-axis). Like any other visualisation in PACE, it supports selection via the tap tool (left-click or `Shift` + left-click) and via box selection (hold left-click, drag and then release left-click).

In [None]:
session.add_plot(name="all data")

This chart shows us that, as expected, the primary diagnosis field (DIAG_01) and the 5 categorical fields (ADMIAGE, ADMIMETH, Mortality, PROCODE3 and SEX) are never missing. Further, the secondary diagnosis fields are missing progressively more often from DIAG_02 to DIAG_10.
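
The counts shown in the bar chart can be cross-checked directly with Pandas. The following is a minimal sketch on a small made-up dataframe; the name `toy` and all of its values are illustrative assumptions and are not taken from the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mimic the tutorial dataset.
toy = pd.DataFrame({
    "DIAG_01": ["A01", "B02", "C03", "D04"],
    "DIAG_02": ["E05", np.nan, "F06", np.nan],
    "DIAG_03": [np.nan, np.nan, "G07", np.nan],
    "SEX": [1, 2, 1, 2],
})

# Number of missing values per field, i.e. the quantity a
# missingness bar chart plots on its y-axis.
missing_per_field = toy.isnull().sum()
print(missing_per_field)
```

Applying the same expression to `df` should give one count per field, which can be compared against the bar heights.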

## Making a selection

To investigate missingness further, we add a new plot and call it "combinations".
The "Combination heatmap" displays a matrix with fields on the x-axis and missing combinations on the y-axis. The number of records associated with each missing combination is encoded using a sequential colour map (a darker colour indicates that more records share that combination).

**Run the code cell below and switch to the "Combination heatmap" tab.**

In [None]:
session.add_plot(name="combinations")

In this dataset we expect that if any diagnosis field from DIAG_02 to DIAG_10 is missing, then all of the subsequent diagnosis fields are also missing. The "Combination heatmap" shows that this is indeed true for most of the records with missing values. However, there are also 7 unexpected missing combinations, which have gaps in the diagnosis fields.
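
The expectation described above can also be expressed in plain Pandas and NumPy. The sketch below counts how many records share each missing combination and flags records with a gap, meaning a missing field followed by a later field that is present. The dataframe `toy` and the names `combination_counts` and `has_gap` are hypothetical and only serve as an illustration; this is not how PACE computes its heatmap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "DIAG_02": ["E05", "F06", np.nan, np.nan],
    "DIAG_03": ["G07", np.nan, "H08", np.nan],
    "DIAG_04": [np.nan, np.nan, np.nan, np.nan],
})

# True where a value is missing.
missing = toy.isnull()

# Number of records sharing each missing combination.
combination_counts = missing.groupby(list(missing.columns)).size()
print(combination_counts)

# A record has a gap if a field is present after an earlier field was missing.
m = missing.to_numpy()
seen_missing = np.logical_or.accumulate(m, axis=1)
has_gap = pd.Series((seen_missing & ~m).any(axis=1), index=toy.index)
print(has_gap)
```

Here only the third record (DIAG_02 missing, DIAG_03 present) is flagged. The column order matters, so the diagnosis fields must be passed in ascending order.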

**In the "Combination heatmap" use one of the interactive tools to select all combinations that have unexpected gaps.**

If you made the right selection, the "Combination heatmap" should look like the one in the **Solution** section.

Even though we selected combinations in the "Combination heatmap", we can identify the corresponding records of the original dataframe with the `selected_records()` function. The function requires the name of the plot in which the selection was made.

In [None]:
gaps = session.selected_records("combinations")

`selected_records()` returns a boolean Pandas series which is `True` when the record is included in the selection and `False` otherwise.

In [None]:
gaps

In [None]:
# Summing a boolean series counts its True entries.
print(f"{gaps.sum()} of {len(gaps)} records selected")
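
Because the selection is an ordinary boolean Pandas series, it can be used directly as a mask on the dataframe. A minimal sketch, using a made-up dataframe `toy` and a hand-written mask `toy_selection` in place of `df` and `gaps` (both names and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for `df` and `gaps`.
toy = pd.DataFrame({"SEX": [1, 2, 1, 2], "ADMIAGE": [34, 71, 58, 90]})
toy_selection = pd.Series([False, True, False, True], index=toy.index)

# Keep only the rows where the mask is True.
selected = toy[toy_selection]

# Count and share of selected records.
n_selected = int(toy_selection.sum())
share_selected = float(toy_selection.mean())

print(selected)
print(n_selected, share_selected)
```

With the real objects, `df[gaps]` would return the records that belong to the selected missing combinations.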

## Explaining unexpected missing combinations – Data mining

This section contains an example of how PACE can be combined with data mining methods to gain further insight into the missing data. 
The notebook `Information Gain Ratio.ipynb` contains helper functions to calculate the information gain ratio (IGR), which we will use for our example dataset. These helper functions are not part of the core PACE functionality and can be replaced by other data mining methods.

In [None]:
%run 'Information Gain Ratio.ipynb'

We use the function `igr()` loaded from `Information Gain Ratio.ipynb`.

In [None]:
help(igr)

`igr()` allows us to rank multiple columns (given by `igr_columns`) by how strongly their values are associated with whether a record is or is not a member of the selected missing combinations (`gaps`).
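
For intuition, the sketch below implements the commonly used definition of the information gain ratio: the information gain of a column with respect to the selection, divided by the entropy of the column itself. The helper in `Information Gain Ratio.ipynb` may differ in its details; `entropy()` and `igr_sketch()` are hypothetical names introduced here, and the data is made up.

```python
import numpy as np
import pandas as pd


def entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the observed value distribution."""
    p = values.value_counts(normalize=True).to_numpy()
    p = p[p > 0]  # ignore categories that never occur
    return float(-(p * np.log2(p)).sum())


def igr_sketch(feature: pd.Series, target: pd.Series) -> float:
    """Information gain of `feature` about `target`, divided by the entropy of `feature`."""
    # Entropy of the target that remains after splitting by feature value.
    conditional = sum(
        len(group) / len(target) * entropy(group)
        for _, group in target.groupby(feature)
    )
    info_gain = entropy(target) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Made-up example: `informative` determines the selection, `uninformative` does not.
selection = pd.Series([True, True, False, False])
informative = pd.Series(["a", "a", "b", "b"])
uninformative = pd.Series(["a", "b", "a", "b"])

igr_informative = igr_sketch(informative, selection)
igr_uninformative = igr_sketch(uninformative, selection)
print(igr_informative, igr_uninformative)
```

A column whose values fully determine membership in the selection scores high, while a column that is unrelated to the selection scores close to zero.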

In [None]:
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']
igr_result = igr(df[igr_columns], gaps)

Assuming the correct combinations have been selected from the "Combination heatmap", `igr_result` contains the following values for the IGR:

In [None]:
igr_result

In [None]:
# Convert the keys and values to lists before plotting.
plt.bar(list(igr_result.keys()), list(igr_result.values()))
plt.ylabel("IGR")
plt.show()

## Save session

If you re-run the notebook, the interactive selection made in the "Combination heatmap" will be lost. But it is possible to save and re-load the session.

The following cell writes the current session, including the interactive plot selections, to a JSON file.

In [None]:
session.save("tutorial_example_session.json")

This file can be re-loaded into a PlotSession object. 

**Try replacing the code in cell 3 with**

`session = PlotSession(df, session_file="tutorial_example_session.json")`

## Solution

In [None]:
solution_session = PlotSession(df, session_file="tutorial_example_session_solution.json")

In [None]:
solution_session.add_plot(name="combinations")

In [None]:
solution_gaps = solution_session.selected_records("combinations")
igr_columns = ['ADMIAGE','ADMIMETH','Mortality','PROCODE3','SEX']

In [None]:
igr_result = igr(df[igr_columns], solution_gaps)
igr_result