This notebook is part of a course on Applied Process Mining. The collection of notebooks is a *living document* and subject to change. 

# Lecture 1 - 'Event Logs and Process Discovery' (Python / PM4Py)

## Setup

<img src="https://pm4py.fit.fraunhofer.de/static/assets/images/pm4py-site-logo-padded.png" alt="PM4Py" style="width: 200px;"/>

In this notebook, we are using several libraries:

* [PM4Py](https://pm4py.fit.fraunhofer.de/)
* [pandas](https://pandas.pydata.org/)
* [plotnine](https://plotnine.readthedocs.io/en/stable/)

The frequently used dependencies are imported first:

In [None]:
import pandas as pd
import pm4py
import plotnine
from plotnine import ggplot, geom_point, aes, theme_bw, coord_flip, scale_y_discrete, theme, element_text, geom_bin2d, ylab, scale_x_datetime

## Event Logs

This part introduces event logs and their unique properties that provide the basis for any Process Mining method. We use the same event logs as provided by `bupaR`. However, we need to load them from the CSV files in the `data` directory of this repository. In this lecture we are going to make use of the following datasets:

* Patients, a synthetically generated example event log in a hospital setting.
* Sepsis, a real-life event log taken from a Dutch hospital. The event log is publicly available here: https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460 and has been used in many Process Mining related publications.

### Import Patients Data

In [None]:
patients = pd.read_csv("../data/patients.csv", sep=';')
patients['time'] = pd.to_datetime(patients['time'])
num_rows = len(patients)
print("Number of rows: {}".format(num_rows))

### Import Sepsis Data

In [None]:
sepsis = pd.read_csv("../data/sepsis.csv", sep=';')
num_rows = len(sepsis)
sepsis['timestamp'] = pd.to_datetime(sepsis['timestamp'])
print("Number of rows: {}".format(num_rows))

### Exploring Event Data

Let us first explore the event data without any prior knowledge about event log structure or properties. We use standard pandas and plotnine features to do so. Any other plotting library, such as Matplotlib, could be used instead; plotnine is chosen only to deviate as little as possible from the exploration performed with ggplot2 in the R version of this lecture.

In [None]:
patients.head()

The most important ingredient of an event log is the timestamp column, here `time`. It allows us to establish a sequence of events.

In [None]:
patients_sample = patients[patients['time'] < '2017-01-31']
(ggplot(patients_sample, aes('time', 0))
 + geom_point() 
 + theme_bw()
 + ylab("Event")
 + scale_x_datetime(date_breaks = "1 days"))

We also need to have information on the kind of actions or `activities` performed:

In [None]:
patients.drop_duplicates(subset='handling')[["handling"]]

Let us have a look at what other data is available:

In [None]:
patients.drop_duplicates(subset='patient')[["patient"]].head()

Maybe the patient identifier could be a good candidate for defining a process `case` since this is an 'entity' that we would like to follow. When counting the events that occurred per individual patient it seems that there is a similar number of events for each patient, which is generally a good indicator for a process case identifier:

In [None]:
patients.groupby(['patient'])["patient"].agg(['count']).head()

Let us decide that we want to look at the process by following the patient identifier as `case identifier`:

In [None]:
patients_sample = patients[patients['time'] < '2017-01-31']
(ggplot(patients_sample, aes('time', 'patient', color = 'handling'))
 + geom_point() 
 + theme_bw()
 + scale_x_datetime(date_breaks = "7 days"))

The scatterplot above is known as a `Dotted Chart` in the process mining community and provides an 'at a glance' overview of the events and their temporal relation when grouped by case. It seems that each of the sequences of events (also known as `traces`) starts with the `Registration` event. Let us have a look at the event data sorted by patient identifier and by time:

In [None]:
patients.sort_values(['patient', 'time']).head(14)
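
The observation that every trace starts with `Registration` can also be checked directly rather than read off the chart. The following is a minimal sketch on a made-up toy log (the rows and the name `toy` are illustrative assumptions, only the column names mirror the patients data); the same chain of calls could be applied to `patients`.

```python
import pandas as pd

# Made-up toy event log; only the column names mirror the patients data.
toy = pd.DataFrame({
    "patient": [1, 1, 2, 2, 3],
    "handling": ["Registration", "Triage and Assessment",
                 "Registration", "X-Ray", "Registration"],
    "time": pd.to_datetime([
        "2017-01-02 11:41", "2017-01-02 12:40",
        "2017-01-04 01:16", "2017-01-05 12:59",
        "2017-01-04 16:07",
    ]),
})

# Order the events of each case by time, then take the first activity per trace.
first_activity = (toy.sort_values(["patient", "time"])
                     .groupby("patient")["handling"]
                     .first())

# If all traces start with the same activity, only one value is counted here.
print(first_activity.value_counts())
```

A single entry in the resulting counts would confirm the impression from the Dotted Chart; several entries would point to traces with a different first event.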

An individual process execution (e.g., for patient 1) consists of several activities that are done in a sequence. However, we have more information available than simply the sequence of events. For each occurrence of an activity we have two events: a `start` event and a `complete` event as captured in the column `registration_type`. These events refer to the lifecycle of an activity and allow us to capture the `duration` of an activity. Much more complex activity lifecycles are possible; a general model is described here: http://bupar.net/creating_eventlogs.html#Transactional_life_cycle
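
How a `duration` follows from the two lifecycle events can be sketched with plain pandas. The toy rows below are made up, and the sketch assumes that each activity occurs at most once per patient so that the pair (`patient`, `handling`) identifies one activity occurrence; this assumption need not hold in a real event log.

```python
import pandas as pd

# Made-up toy data with one start and one complete event per activity.
toy = pd.DataFrame({
    "patient": [1, 1, 1, 1],
    "handling": ["Registration", "Registration", "Triage", "Triage"],
    "registration_type": ["start", "complete", "start", "complete"],
    "time": pd.to_datetime([
        "2017-01-02 11:41:53", "2017-01-02 12:40:20",
        "2017-01-02 12:40:20", "2017-01-02 22:32:25",
    ]),
})

# One row per activity occurrence, one column per lifecycle state.
# Assumption: (patient, handling) is unique per lifecycle state in this toy log.
instances = toy.pivot(index=["patient", "handling"],
                      columns="registration_type",
                      values="time")

# The duration is the time between the start and the complete event.
instances["duration"] = instances["complete"] - instances["start"]
print(instances)
```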

### Further resources

* [XES Standard](http://xes-standard.org/)
* [Importing CSV event logs](https://pm4py.fit.fraunhofer.de/documentation#item-import-csv)
* [Importing XES event logs](https://pm4py.fit.fraunhofer.de/documentation#item-impoort-xes)

#### Reflection Questions

* What could be the reason a column `.order` is included in this dataset?
* How could the column `employee` be used?
* What is the use of the column `handling_id` and in which situation is it required?

## Basic Process Visualization

### Set of Traces 

Exploring traces as a set visualization is currently not implemented in PM4Py.
**Challenge:** Implement a visualization similar to that in bupaR with Python and open a pull request. Here are some reference implementations of a 'trace explorer':

* http://bupar.net/trace_explorer.html
* https://fmannhardt.de/blog/software/prom/explorer (ProM)


In [None]:
# implement a view of the event log as a set of traces 
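
A possible starting point for the challenge is to collapse each case into its sequence of activities and count how often each distinct sequence occurs. This is only a tabular sketch on a made-up toy log, not a replacement for the graphical trace explorer; the names `toy`, `traces` and `variants` are illustrative.

```python
import pandas as pd

# Made-up toy event log with three cases.
toy = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2, 3, 3],
    "handling": ["Registration", "Triage", "Check-out",
                 "Registration", "Triage", "Check-out",
                 "Registration", "Check-out"],
    "time": pd.to_datetime([
        "2017-01-02 10:00", "2017-01-02 11:00", "2017-01-02 12:00",
        "2017-01-03 09:00", "2017-01-03 10:00", "2017-01-03 11:00",
        "2017-01-04 09:00", "2017-01-04 09:30",
    ]),
})

# Build one trace per case: the time-ordered activities joined into a string.
traces = (toy.sort_values(["patient", "time"])
             .groupby("patient")["handling"]
             .agg(" -> ".join))

# Count identical traces; each distinct trace is one element of the set of traces.
variants = traces.value_counts()
print(variants)
```

The resulting counts could then be fed into a plotting library to draw one row per distinct trace.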

### Dotted Chart

The `Dotted Chart` adds the timing aspect of the individual traces and visualizes all of them at a glance. It can be configured in many different ways and provides a good insight into time-related aspects of the process behavior. PM4Py provides a basic Dotted Chart visualization:

In [None]:
patients_log = pm4py.format_dataframe(patients, case_id='patient', activity_key='handling', timestamp_key='time')
pm4py.view_dotted_chart(pm4py.filter_time_range(patients_log, "1970-01-01 00:00:00", "2017-01-31 00:00:00", mode='events'))

Alternatively, the same view can be reproduced with plotnine, which allows more customization of the visuals and more flexibility in choosing the perspectives:

In [None]:
# Necessary pre-processing on the data frame
patients_sorted = patients.sort_values(['time'])
# Creating categories for the case identifier
patients_sorted['patient'] = pd.Categorical(patients_sorted['patient'], 
                                            categories = patients_sorted['patient'].drop_duplicates().tolist()[::-1], ordered= True)

#### Absolute Time Dimension

In [None]:
(ggplot(patients_sorted[patients_sorted['time'] < '2017-01-31'], 
        aes('time', 'patient', color = 'handling'))
 + geom_point()
 + theme_bw()
 + scale_y_discrete(labels = "")
 + theme(axis_text_x=element_text(rotation=45, hjust=1)))

#### Relative Time Dimension

We need to make the time relative to the first event of each case and add a new column `time_relative` for that purpose:

In [None]:
patients_sorted['time_relative'] = patients_sorted['time'].sub( patients_sorted.groupby('patient')['time'].transform('first'))

In [None]:
(ggplot(patients_sorted, aes('time_relative', 'patient', color = 'handling'))
 + geom_point()
 + theme_bw()
 + scale_y_discrete(labels = "")
 + theme(axis_text_x=element_text(rotation=45, hjust=1)))

We still need to sort by the overall duration to replicate the `bupaR` dotted chart:

In [None]:
patients_sorted['duration'] = patients_sorted.groupby('patient')['time_relative'].transform('max')

In [None]:
patients_sorted_duration = patients_sorted.sort_values(['duration'])
patients_sorted_duration['patient'] = pd.Categorical(patients_sorted_duration['patient'], categories = patients_sorted_duration['patient'].drop_duplicates().tolist()[::-1], ordered= True)

(ggplot(patients_sorted_duration, aes('time_relative', 'patient', color = 'handling'))
 + geom_point()
 + theme_bw()
 + scale_y_discrete(labels = "")
 + theme(axis_text_x=element_text(rotation=45, hjust=1)))

Check out other basic process visualization options using PM4Py:

* [Basic Process Statistics](https://pm4py.fit.fraunhofer.de/documentation#statistics)

## Process Map Visualization

Again, there is no built-in precedence matrix visualization in PM4Py, but it can be replicated easily:

In [None]:
patients_sorted['antecedent'] = patients_sorted.groupby(["patient"])['handling'].shift(1).fillna("Start")
patients_sorted['consequent'] = patients_sorted['handling']

In [None]:
(ggplot(patients_sorted, aes('consequent', 'antecedent', ))
 + geom_bin2d() 
 + theme_bw()
 + theme(axis_text_x=element_text(rotation=45, hjust=1)))
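
The heatmap encodes how often one activity is directly preceded by another. The underlying counts can also be inspected as a table, for example with `pd.crosstab`. The sketch below repeats the antecedent/consequent construction from above on a made-up toy log (the data and the name `matrix` are illustrative assumptions).

```python
import pandas as pd

# Made-up toy event log with three cases.
toy = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2, 3, 3],
    "handling": ["Registration", "Triage", "Check-out",
                 "Registration", "Triage", "Check-out",
                 "Registration", "Check-out"],
    "time": pd.to_datetime([
        "2017-01-02 10:00", "2017-01-02 11:00", "2017-01-02 12:00",
        "2017-01-03 09:00", "2017-01-03 10:00", "2017-01-03 11:00",
        "2017-01-04 09:00", "2017-01-04 09:30",
    ]),
}).sort_values("time")

# Same construction as for the plot: the previous activity within the case,
# with an artificial "Start" marker for the first event of each case.
toy["antecedent"] = toy.groupby("patient")["handling"].shift(1).fillna("Start")
toy["consequent"] = toy["handling"]

# Rows are antecedents, columns are consequents, cells are frequencies.
matrix = pd.crosstab(toy["antecedent"], toy["consequent"])
print(matrix)
```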

### Directly-follows Graph / Process Map

The process map or directly-follows graph visualization in PM4Py cannot yet deal with `activity instances`, so we focus only on the `complete` events.

In [None]:
patients_log = patients_log[patients_log['registration_type'] == 'complete']

In [None]:
dfg, sa, ea = pm4py.discover_directly_follows_graph(patients_log)

In [None]:
pm4py.view_dfg(dfg, sa, ea)
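
To make explicit what the three results stand for, here is a conceptual sketch in plain Python that counts directly-follows pairs, start activities and end activities for a few made-up traces. It is not PM4Py's implementation, and the exact data structures returned by PM4Py may differ between versions.

```python
from collections import Counter

# Made-up traces: case identifier -> time-ordered list of activities.
traces = {
    1: ["Registration", "Triage", "Check-out"],
    2: ["Registration", "Triage", "Check-out"],
    3: ["Registration", "Check-out"],
}

dfg = Counter()               # (a, b) -> how often b directly follows a
start_activities = Counter()  # activity -> number of traces starting with it
end_activities = Counter()    # activity -> number of traces ending with it

for trace in traces.values():
    start_activities[trace[0]] += 1
    end_activities[trace[-1]] += 1
    # Pair each activity with its direct successor in the same trace.
    for a, b in zip(trace, trace[1:]):
        dfg[(a, b)] += 1

print(dfg)
print(start_activities)
print(end_activities)
```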

In [None]:
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.visualization.dfg import visualizer as dfg_visualization

dfg = dfg_discovery.apply(patients_log)

dfg = dfg_discovery.apply(patients_log, variant=dfg_discovery.Variants.PERFORMANCE)
gviz = dfg_visualization.apply(dfg, log=patients_log, variant=dfg_visualization.Variants.PERFORMANCE)
dfg_visualization.view(gviz)
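
The performance view annotates each edge with the time that passes between two directly following events. A rough pandas sketch of that idea on made-up data is shown below; it assumes a mean aggregation per edge, which may not match the aggregation PM4Py applies in a given version, and the column names `previous` and `waiting` are illustrative.

```python
import pandas as pd

# Made-up toy log containing only complete events.
toy = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "handling": ["Registration", "Triage", "Registration", "Triage"],
    "time": pd.to_datetime([
        "2017-01-02 10:00", "2017-01-02 11:00",
        "2017-01-03 09:00", "2017-01-03 12:00",
    ]),
}).sort_values(["patient", "time"])

# Previous activity and elapsed time since the previous event of the same case.
toy["previous"] = toy.groupby("patient")["handling"].shift(1)
toy["waiting"] = toy.groupby("patient")["time"].diff()

# Mean elapsed time per directly-follows pair (assumed aggregation: mean).
performance = (toy.dropna(subset=["previous"])
                  .groupby(["previous", "handling"])["waiting"]
                  .mean())
print(performance)
```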

### Further Perspectives and Animation

No such feature in PM4Py yet.

## Real-life Processes

In [None]:
sepsis

In [None]:
sepsis_sorted = sepsis.sort_values(['timestamp'])
sepsis_sorted['timestamp'] = pd.to_datetime(sepsis_sorted['timestamp'])

In [None]:
sepsis_sorted['antecedent'] = sepsis_sorted.groupby(["case_id"])['activity'].shift(1).fillna("Start")
sepsis_sorted['consequent'] = sepsis_sorted['activity']

In [None]:
(ggplot(sepsis_sorted, aes('consequent', 'antecedent', ))
 + geom_bin2d() 
 + theme_bw()
 + theme(axis_text_x=element_text(rotation=45, hjust=1)))