# Predictive Process Monitoring - Data Loading & Exploration

This notebook is part of the [starter package](https://github.com/fmannhardt/starter-predictive-process-monitoring) for predictive process monitoring. It contains examples for data loading from event logs for the purpose of developing and applying predictive process monitoring techniques.

## Setup
The following Python libraries are used. Please refer to the installation instructions to prepare your environment:

* [PM4Py](https://pm4py.fit.fraunhofer.de/)
* [Pandas](https://pandas.pydata.org/)

In [None]:
import pandas as pd
import pm4py

## Event Log 

We are using a publicly available event log called Sepsis Cases as an example.

### CSV File
Event logs can be loaded from CSV files with Pandas and then converted to a PM4Py event log by specifying the columns for the three main requirements of an event log: case identifier, activity identifier, and timestamp.

In [None]:
sepsis = pd.read_csv("../data/sepsis.csv", sep=';')
sepsis_log = pm4py.format_dataframe(sepsis, case_id='case_id', activity_key='activity', timestamp_key='timestamp')
sepsis_log = pm4py.convert_to_event_log(sepsis_log)

Ignore the Pandas warning, which is due to an internal issue in PM4Py. Let us check whether the import succeeded:

In [None]:
len(sepsis_log)
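
To make the three required columns concrete, here is a minimal plain-Python sketch of what turning a flat CSV into an event log amounts to: rows are grouped by case identifier and ordered by timestamp. The rows below are made-up illustrative values, not taken from the Sepsis file, and `group_into_traces` is a hypothetical helper, not a PM4Py function.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up rows in the same semicolon-separated shape as the CSV above.
RAW = """case_id;activity;timestamp
A;ER Registration;2014-10-22 11:15:41
B;ER Registration;2014-10-23 09:00:00
A;ER Triage;2014-10-22 11:27:00
B;Leucocytes;2014-10-23 09:30:00
A;CRP;2014-10-22 11:34:00
"""

def group_into_traces(text, case_col="case_id", act_col="activity", ts_col="timestamp"):
    """Group rows by case and order each case's events by timestamp."""
    cases = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        ts = datetime.strptime(row[ts_col], "%Y-%m-%d %H:%M:%S")
        cases[row[case_col]].append((ts, row[act_col]))
    # Keep only the activity names, sorted by the event timestamp.
    return {
        case: [act for _, act in sorted(events, key=lambda e: e[0])]
        for case, events in cases.items()
    }

traces = group_into_traces(RAW)
print(len(traces))  # number of cases
print(traces["A"])  # ordered activities of case A
```

Without any one of the three columns this grouping is not possible, which is why they are the minimum needed for the conversion.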

### XES File

Alternatively, PM4Py can load the event log directly from a file in the standardized XES format for event logs.

In [None]:
from urllib.request import urlretrieve
import os

# download from 4tu.nl
urlretrieve('https://data.4tu.nl/file/33632f3c-5c48-40cf-8d8f-2db57f5a6ce7/643dccf2-985a-459e-835c-a82bce1c0339', 'sepsis0.xes.gz')
sepsis_log = pm4py.read_xes('sepsis0.xes.gz')
os.unlink('sepsis0.xes.gz') # clean up

In [None]:
len(sepsis_log)
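
For orientation, the sketch below shows how traces and events are nested in an XES document and reads them with the standard library only. The XML fragment is hand-written for illustration and is not taken from the Sepsis file. It assumes a file without an XML namespace, which real XES files often declare, and `read_traces` is a hypothetical helper. In practice `pm4py.read_xes` handles all of this.

```python
import xml.etree.ElementTree as ET

# Hand-written XES fragment for illustration only.
XES = """<log xes.version="1.0">
  <trace>
    <string key="concept:name" value="A"/>
    <event>
      <string key="concept:name" value="ER Registration"/>
      <date key="time:timestamp" value="2014-10-22T11:15:41.000+02:00"/>
    </event>
    <event>
      <string key="concept:name" value="ER Triage"/>
      <date key="time:timestamp" value="2014-10-22T11:27:00.000+02:00"/>
    </event>
  </trace>
</log>"""

def read_traces(xes_text):
    """Return {case id: [activity, ...]} from an un-namespaced XES string."""
    result = {}
    for trace in ET.fromstring(xes_text).findall("trace"):
        # The trace-level concept:name attribute holds the case identifier.
        case_id = trace.find("string[@key='concept:name']").get("value")
        # The event-level concept:name attribute holds the activity name.
        result[case_id] = [
            event.find("string[@key='concept:name']").get("value")
            for event in trace.findall("event")
        ]
    return result

print(read_traces(XES))
```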

## Data Exploration

It is a good idea to explore the data to investigate the properties of the event log. Note that ProM offers many more capabilities, so do not limit yourself to PM4Py.

In [None]:
# number of distinct trace variants
len(pm4py.get_variants_as_tuples(sepsis_log))

In [None]:
# how does the process start
pm4py.get_start_activities(sepsis_log)

In [None]:
# how does the process end
pm4py.get_end_activities(sepsis_log)
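
To see what these three statistics measure, here is a minimal plain-Python sketch on a made-up toy log, where each trace is simply the ordered list of activity names of one case. It is an illustration of the concepts, not PM4Py's implementation.

```python
from collections import Counter

# Made-up toy log: one list of activity names per case.
toy_log = [
    ["Register", "Triage", "CRP", "Release"],
    ["Register", "Triage", "CRP", "Release"],
    ["Register", "CRP", "Triage", "Return"],
]

# A variant is a distinct sequence of activities shared by one or more cases.
variants = Counter(tuple(trace) for trace in toy_log)

# Start and end activities are the first and last activity of each case.
start_activities = Counter(trace[0] for trace in toy_log if trace)
end_activities = Counter(trace[-1] for trace in toy_log if trace)

print(len(variants))           # 2
print(dict(start_activities))  # {'Register': 3}
print(dict(end_activities))    # {'Release': 2, 'Return': 1}
```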

In [None]:
from pm4py.objects.log.util.log import project_traces

def print_nth(log, index):
    # project each trace onto its activity names and print the one at `index`
    print(str(project_traces(log)[index]))

In [None]:
print_nth(sepsis_log, 0)

In [None]:
print_nth(sepsis_log, 1)

Let us look at the directly-follows graph of the whole event log:

In [None]:
dfg, start_activities, end_activities = pm4py.discover_dfg(sepsis_log)
pm4py.view_dfg(dfg, start_activities, end_activities)

The process is quite messy. PM4Py offers many filtering options that allow you to focus on a subset of the data. For example, what happens if we remove the very frequently repeating activities Leucocytes, CRP, and LacticAcid?

In [None]:
sepsis_log_filtered = pm4py.filter_event_attribute_values(
    sepsis_log,
    attribute_key='concept:name',  # special column for the activity name always added by PM4Py
    values=['LacticAcid', 'CRP', 'Leucocytes'],
    level='event',   # we want to keep all traces and modify events
    retain=False)    # remove matching events

In [None]:
dfg, start_activities, end_activities = pm4py.discover_dfg(sepsis_log_filtered)
pm4py.view_dfg(dfg, start_activities, end_activities)

This is a bit more comprehensible, but more filtering or pre-processing could be performed. Explore the PM4Py documentation to see what is possible:

https://pm4py.fit.fraunhofer.de/documentation#filtering
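
To make explicit what the directly-follows graph and the event-level filter used above compute, here is a minimal plain-Python sketch on a made-up toy log. `directly_follows` and `drop_activities` are hypothetical helpers written for this illustration, not PM4Py functions.

```python
from collections import Counter

def directly_follows(log):
    """Count how often activity a is immediately followed by activity b."""
    return Counter(pair for trace in log for pair in zip(trace, trace[1:]))

def drop_activities(log, unwanted):
    """Remove matching events but keep every trace (event-level filter)."""
    return [[act for act in trace if act not in unwanted] for trace in log]

# Made-up toy log: one list of activity names per case.
toy_log = [
    ["Register", "CRP", "Triage", "CRP", "Release"],
    ["Register", "Triage", "Leucocytes", "Release"],
]

dfg = directly_follows(toy_log)
filtered = drop_activities(toy_log, {"CRP", "Leucocytes", "LacticAcid"})
dfg_filtered = directly_follows(filtered)

print(len(dfg))            # distinct edges before filtering
print(dict(dfg_filtered))  # edges and their counts after filtering
```

Removing the repeating activities collapses many edges into a few, which is the same effect that makes the filtered graph easier to read.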

You can also look at the trace or event attributes in an event log to generate suitable subsets.

In [None]:
pm4py.get_event_attributes(sepsis_log)

The set of values of a specific event attribute can be extracted. A 'nan' entry stands for events that lack this attribute, and in this log there are many of them:

In [None]:
pm4py.get_event_attribute_values(sepsis_log, 'Diagnose')
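
As a rough illustration of how missing attributes show up in such a count, the following plain-Python sketch tallies the values of one attribute and separately counts the events that do not carry it. The events and the values `X1` and `X2` are made up, and `attribute_values` is a hypothetical helper, not a PM4Py function.

```python
from collections import Counter

# Made-up events; only some of them carry the 'Diagnose' attribute.
events = [
    {"concept:name": "ER Registration", "Diagnose": "X1"},
    {"concept:name": "CRP"},
    {"concept:name": "Leucocytes"},
    {"concept:name": "ER Registration", "Diagnose": "X2"},
]

def attribute_values(events, key):
    """Return (value counts, number of events without the attribute)."""
    present = Counter(e[key] for e in events if key in e)
    missing = sum(1 for e in events if key not in e)
    return present, missing

present, missing = attribute_values(events, "Diagnose")
print(dict(present))
print(missing)
```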