This notebook is part of an Applied Process Mining module. The collection of notebooks is a *living document* and subject to change. 

# Lecture 1 - 'Event Logs and Process Visualization' (R / bupaR)

## Setup

<img src="http://bupar.net/images/logo_text.PNG" alt="bupaR" style="width: 200px;"/>

In this notebook, we need the `tidyverse` and `bupaR` packages, along with `processmapR` and `processanimateR`. If you run this notebook in the recommended Docker environment, there is no need to install any packages. Otherwise, you may need to install the requirements that are commented out below:

In [None]:
## Perform the commented out commands below in a separate R session
# install.packages("tidyverse")
# install.packages("bupaR")
# install.packages("processmapR")
# install.packages("processanimateR")

We set up some convenience options for the notebook and import the dependencies:

In [None]:
# for larger and readable plots
options(jupyter.plot_scale=1.25)

In [None]:
# the initial execution of these may give you warnings that we can safely ignore
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(bupaR))
library(processmapR)
library(processanimateR)

## Event Logs

This part introduces event logs and the properties that form the basis of any Process Mining method. Several event logs are distributed together with `bupaR` and can be loaded without further processing.
In this lecture we are going to make use of the following datasets:

* Patients, a synthetically generated example event log in a hospital setting.
* Sepsis, a real-life event log taken from a Dutch hospital. The event log is publicly available here: https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460 and has been used in many Process Mining related publications.

### Exploring Event Data

Let us first explore the event data without any prior knowledge about event log structure or properties. We convert the `patients` event log below to a standard `tibble` (https://tibble.tidyverse.org/) and inspect the first rows.

In [None]:
patients %>%
    as_tibble() %>%
    head()

The most important ingredient of an event log is the timestamp column, here `time`. It allows us to establish the order of events.

In [None]:
patients %>% 
  filter(time < '2017-01-31') %>% 
  ggplot(aes(time, "Event")) + 
  geom_point() + 
  theme_bw()

We also need information on the kind of actions, or `activities`, performed. In this log they are stored in the column `handling`:

In [None]:
patients %>%
    as_tibble() %>% 
    distinct(handling)

Let us have a look at what other data is available:

In [None]:
patients %>%
    as_tibble() %>% 
    distinct(patient)  %>% 
    head()

The patient identifier could be a good candidate for defining a process `case`, since it is an 'entity' that we would like to follow. Counting the events per patient shows a similar number of events for each patient, which is generally a good indicator for a process case identifier:

In [None]:
patients %>%
    as_tibble() %>% 
    count(patient) %>% 
    head()

Let us decide that we want to look at the process by following the patient identifier as the `case identifier`:

In [None]:
patients %>% 
  filter(time < '2017-01-31') %>% 
  ggplot(aes(time, patient, color = handling)) + 
  geom_point() + 
  theme_bw()

The scatterplot above is known as a `Dotted Chart` in the process mining community and provides an 'at a glance' overview of the events and their temporal relation when grouped by case. It seems that each sequence of events (also known as a `trace`) starts with the `Registration` event. Let us have a look at the event data sorted by patient identifier and by time:

In [None]:
patients %>% 
  as_tibble() %>% 
  arrange(patient, time) %>% 
  head(14)

An individual process execution (e.g., for patient 1) consists of several activities that are performed in sequence. However, we have more information available than simply the sequence of events. For each occurrence of an activity we have two events: a `start` event and a `complete` event, as captured in the column `registration_type`. These events refer to the lifecycle of an activity and allow us to compute the `duration` of an activity. Much more complex activity lifecycles are possible; a general model is described here: http://bupar.net/creating_eventlogs.html#Transactional_life_cycle
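
To illustrate how a duration can be derived from paired lifecycle events, here is a minimal sketch on a small hand-made table. The data, the column names (`case`, `activity`, `instance`, `lifecycle`) and the timestamps are made up for this example and are not taken from the `patients` log; the pairing via a per-occurrence identifier is an assumption of the sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two activity occurrences for one case,
# each with a start and a complete event
toy_events <- tibble(
  case      = c("1", "1", "1", "1"),
  activity  = c("Registration", "Registration", "Triage", "Triage"),
  instance  = c("i1", "i1", "i2", "i2"),   # assumed identifier of one activity occurrence
  lifecycle = c("start", "complete", "start", "complete"),
  time      = as.POSIXct(c("2017-01-02 10:00:00", "2017-01-02 10:30:00",
                           "2017-01-02 10:45:00", "2017-01-02 11:45:00"),
                         tz = "UTC")
)

# Put the start and complete timestamps of each occurrence side by side,
# then take their difference in minutes
durations <- toy_events %>%
  pivot_wider(id_cols = c(case, activity, instance),
              names_from = lifecycle,
              values_from = time) %>%
  mutate(duration_mins = as.numeric(difftime(complete, start, units = "mins")))

durations
```

The same idea should carry over to a real log as long as every activity occurrence can be identified unambiguously.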

### Further resources

* [XES Standard](http://xes-standard.org/)
* [Creating event logs from CSV files in bupaR](http://bupar.net/creating_eventlogs.html)
* [Changing the case, activity notions in bupaR](http://bupar.net/mapping.html)

### Reflection Questions

* What could be the reason a column `.order` is included in this dataset?
* How could the column `employee` be used?
* What is the use of the column `handling_id` and in which situation is it required?

## Basic Process Visualization

There are several generic visualizations that can be used to get a basic understanding of the process behavior.

### Set of Traces

Since a process, in our basic definition, is a set of event sequences or traces, we can simply visualize the set of distinct trace variants. A `trace variant` captures only the order in which activities are executed, disregarding any other aspect (timing, lifecycles).

In [None]:
patients %>% 
  trace_explorer(coverage = 1.0, abbreviate = TRUE) # abbreviated here due to poor Jupyter notebook output scaling
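
To make the notion of a trace variant concrete, the following sketch derives the variants by hand with `dplyr` on a small made-up log. The table `toy_log` and its column names are hypothetical and serve only as an illustration; this is not how `trace_explorer` is implemented.

```r
library(dplyr)

# Hypothetical toy log: three cases, two of which follow the same activity order
toy_log <- tibble(
  case     = c("A", "A", "A", "B", "B", "B", "C", "C"),
  activity = c("Registration", "Triage", "Discharge",
               "Registration", "Triage", "Discharge",
               "Registration", "Discharge"),
  time     = as.POSIXct("2017-01-01 08:00:00", tz = "UTC") +
             3600 * c(0, 1, 2, 0, 1, 2, 0, 1)
)

# Collapse each case into its ordered activity sequence,
# then count how many cases share the same sequence
variants <- toy_log %>%
  arrange(case, time) %>%
  group_by(case) %>%
  summarise(trace = paste(activity, collapse = " > "), .groups = "drop") %>%
  count(trace, sort = TRUE, name = "n_cases")

variants
```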

### Dotted Chart

The `Dotted Chart` adds the timing aspect of the individual traces and visualizes all of them at a glance. It can be configured in many different ways and provides good insight into time-related aspects of the process behavior.

In [None]:
patients %>%
    filter(time < '2017-01-31') %>% 
    dotted_chart(add_end_events = TRUE)

In [None]:
patients %>%
    dotted_chart("relative", add_end_events = TRUE)
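
As a rough illustration of what a relative time axis means, the sketch below computes, for each event, the time elapsed since the first event of its case. It assumes that this is the idea behind the `"relative"` option; the table `toy_log` and its columns are made up for the example.

```r
library(dplyr)

# Hypothetical toy log: two cases that start on different days
toy_log <- tibble(
  case     = c("A", "A", "B", "B"),
  activity = c("Registration", "Discharge", "Registration", "Discharge"),
  time     = as.POSIXct(c("2017-01-01 08:00:00", "2017-01-01 10:00:00",
                          "2017-01-03 09:00:00", "2017-01-03 12:00:00"),
                        tz = "UTC")
)

# Shift every case so that its first event is at time zero
relative_log <- toy_log %>%
  group_by(case) %>%
  mutate(hours_since_case_start =
           as.numeric(difftime(time, min(time), units = "hours"))) %>%
  ungroup()

relative_log
```

Aligning all cases at a common origin makes differences in case duration visible, independent of when each case started.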

We can also use `plotly` to get an interactive visualization:

In [None]:
patients %>%
    dotted_chart("relative", add_end_events = TRUE, plotly = TRUE)

In [None]:
sepsis %>% 
    dotted_chart("relative_day",
                 sort = "start_day", 
                 units = "hours")

Check out other process visualization options using bupaR:

* [Further Dotted Charts](http://bupar.net/dotted_chart.html)
* [Exploring Time, Resources, Structuredness](http://bupar.net/exploring.html)

## Process Map Visualization

In [None]:
patients %>% 
    precedence_matrix() %>% 
    plot()
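
A precedence matrix counts how often one activity is directly followed by another within a case. The sketch below reproduces this counting by hand on a small made-up log, assuming a simple directly-follows relation; the table `toy_log` is hypothetical, and the `bupaR` implementation may differ in details, for example by adding artificial start and end nodes.

```r
library(dplyr)

# Hypothetical toy log: three cases with slightly different activity orders
toy_log <- tibble(
  case     = c("A", "A", "A", "B", "B", "B", "C", "C"),
  activity = c("Registration", "Triage", "Discharge",
               "Registration", "Triage", "Discharge",
               "Registration", "Discharge"),
  time     = as.POSIXct("2017-01-01 08:00:00", tz = "UTC") +
             3600 * c(0, 1, 2, 0, 1, 2, 0, 1)
)

# Pair each event with the next event of the same case
# and count the resulting (antecedent, consequent) pairs
directly_follows <- toy_log %>%
  arrange(case, time) %>%
  group_by(case) %>%
  mutate(consequent = lead(activity)) %>%
  ungroup() %>%
  filter(!is.na(consequent)) %>%
  count(antecedent = activity, consequent, name = "n")

directly_follows
```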

In [None]:
patients %>% 
    process_map()

In [None]:
patients %>% 
    process_map(type = performance(units = "hours"))

## Real-life Processes

In [None]:
sepsis %>% 
  precedence_matrix() %>% 
  plot()