# Applied Process Mining Module

This notebook is part of an Applied Process Mining module. The collection of notebooks is a *living document* and subject to change. 

# Assignment - BPI Challenge 2020

## Setup

<img src="http://bupar.net/images/logo_text.PNG" alt="bupaR" style="width: 200px;"/>

In this notebook, we need the `tidyverse`, `bupaR`, and `processanimateR` packages. The data loading code further below also calls functions from `R.utils` and `xesreadR`.

In [None]:
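## Run the commented-out commands below in a separate R session
# install.packages("tidyverse")
# install.packages("bupaR")
# install.packages("processanimateR")
# install.packages("xesreadR")  # used below to read XES files
# install.packages("R.utils")   # used below to decompress .gz files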

In [None]:
# for larger and readable plots
options(jupyter.plot_scale=1.25)

In [4]:
# the initial execution of these may give you warnings that you can safely ignore
library(tidyverse)

library(bupaR)
library(processanimateR)


Attaching package: 'bupaR'


The following object is masked from 'package:stats':

    filter


The following object is masked from 'package:utils':

    timestamp




## Assignment

In the first hands-on session, you are going to explore a real-life dataset and apply what was presented in the lecture about event logs and basic process mining visualizations. The objective is to explore your dataset as an event log, with the process mining visualizations from the lecture in mind.

* Analyse basic properties of the process (business process or other process) that has generated it.
    * What are possible case notions / what is the or what are the case identifiers?
    * What are the activities? Are all activities on the same abstraction level?
    * Can activities or actions be derived from other (non-activity) data?
* Discover a map of the process (or a sub-process) behind it.
    * Are there multiple processes that can be discovered?
    * What is the effect of taking a subset of the data? 
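
The questions above can be tried out on a tiny, hand-made log before turning to the real data. The following is only a sketch: the data frame, its column names (`declaration`, `step`, `time`) and its values are hypothetical and are not taken from the BPI Challenge dataset. It assumes that the `bupaR`, `edeaR` and `processmapR` packages are installed.

```r
library(bupaR)

# Hypothetical toy data: two declarations, one of which is rejected once.
events <- data.frame(
  declaration = c("d1", "d1", "d1", "d2", "d2", "d2", "d2", "d2"),
  step = c("submitted", "approved", "paid",
           "submitted", "rejected", "submitted", "approved", "paid"),
  time = as.POSIXct("2018-01-01 09:00:00", tz = "UTC") + 3600 * (1:8),
  stringsAsFactors = FALSE
)

# Choosing the case notion: here every declaration is one case.
toy_log <- simple_eventlog(events,
                           case_id = "declaration",
                           activity_id = "step",
                           timestamp = "time")

n_cases(toy_log)          # number of distinct case identifiers
activity_labels(toy_log)  # activities observed in the log
traces(toy_log)           # distinct activity sequences and their frequencies

# Taking a subset: keep only the cases that contain a rejection.
rejected_log <- edeaR::filter_activity_presence(toy_log, "rejected")
n_cases(rejected_log)

# Discover a process map; print `toy_map` in the notebook to render it.
toy_map <- processmapR::process_map(toy_log)
```

Picking a different column as `case_id` changes what counts as a case, and therefore changes the traces and the resulting map. That is the effect the case notion question asks you to investigate.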

## Dataset

The proposed real-life dataset to investigate is the *BPI Challenge 2020* dataset. The dataset is captured from the travel reimbursement process of Eindhoven University of Technology and has been collected for use in the BPI Challenge. The BPI Challenge is a yearly event in the process mining research community in which an event log is released along with some business questions to be addressed with process analytics techniques.

Here is more information on the dataset, with download links to the data files:

* [Overview of the Case](https://icpmconference.org/2020/bpi-challenge/)
* [Dataset](https://doi.org/10.4121/uuid:52fb97d4-4588-43c9-9d04-3604d4613b51)

On the BPI Challenge 2020 website above, there are several reports (including those of the challenge winners) that describe and analyze the dataset in detail. However, we suggest that you first try to explore the dataset without reading the reports. The business questions and a description of the process flow can also be found on the BPI Challenge 2020 website. We repeat them here for convenience:

### Process Flow

The various declaration documents (domestic and international declarations, pre-paid travel costs and requests for payment) all follow a similar process flow. After submission by the employee, the request is sent for approval to the travel administration. If approved, the request is then forwarded to the budget owner and after that to the supervisor. If the budget owner and supervisor are the same person, then only one of these steps is taken. In some cases, the director also needs to approve the request.

In all cases, a rejection leads to one of two outcomes. Either the employee resubmits the request, or the employee also rejects the request.

If the approval flow has a positive result, the payment is requested and made.

The travel permits follow a slightly different flow as there is no payment involved. Instead, after all approval steps a trip can take place, indicated with an estimated start and end date. These dates are not exact travel dates, but rather estimated by the employee when the permit request is submitted. The actual travel dates are not recorded in the data, but should be close to the given dates in most cases.

After the end of a trip, an employee receives several reminders to submit a travel declaration.

After a travel permit is approved, but before the trip starts, employees can ask for a reimbursement of pre-paid travel costs. Several requests can be submitted independently of each other. After the trip ends, an international declaration can be submitted, although sometimes multiple declarations are seen for specific cases.

It’s important to realize that the process described above is the process for 2018. For 2017, there are some differences as this was a pilot year and the process changed slightly on several occasions.

### Business Questions

The following questions are of interest:

* What is the throughput of a travel declaration from submission (or closing) to paying?
* Is there a difference in throughput between national and international trips?
* Are there differences between clusters of declarations, for example between cost centers/departments/projects etc.?
* What is the throughput in each of the process steps, i.e. the submission, judgement by various responsible roles and payment?
* Where are the bottlenecks in the process of a travel declaration?
* Where are the bottlenecks in the process of a travel permit (note that there can be multiple requests for payment and declarations per permit)?
* How many travel declarations get rejected in the various processing steps and how many are never approved?

Then there are more detailed questions:

* How many travel declarations are booked on projects?
* How many corrections have been made for declarations?
* Are there any double payments?
* Are there declarations that were not preceded properly by an approved travel permit? Or are there even declarations for which no permit exists?
* How many travel declarations are submitted by the traveler and how many by a mandated person?
* How many travel declarations are first rejected because they are submitted more than 2 months after the end of a trip and are then re-submitted?
* Is this different between departments?
* How many travel declarations are not approved by budget holders in time (7 days) and are then automatically rerouted to supervisors?
* Next to travel declarations, there are also requests for payments. These are specific for non-TU/e employees. Are there any TU/e employees that submitted a request for payment instead of a travel declaration?

Similar to the task at the BPI challenge, we are aware that not all questions can be answered on this dataset and we encourage you to come up with new and interesting insights.
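
As a starting point for the throughput questions, here is a minimal sketch on a hypothetical two-case log (the column names and timestamps are made up for illustration). It uses `throughput_time()` from the `edeaR` package, which measures the time between the first and the last event of a case. This only approximates "from submission to paying" when those are indeed the first and last events of each case.

```r
library(bupaR)

# Hypothetical toy data: two declarations with a submission and a payment.
events <- data.frame(
  declaration = c("d1", "d1", "d2", "d2"),
  step = c("submitted", "paid", "submitted", "paid"),
  time = as.POSIXct(c("2018-03-01 10:00:00", "2018-03-05 10:00:00",
                      "2018-03-02 10:00:00", "2018-03-12 10:00:00"),
                    tz = "UTC"),
  stringsAsFactors = FALSE
)

toy_log <- simple_eventlog(events,
                           case_id = "declaration",
                           activity_id = "step",
                           timestamp = "time")

# Throughput time per case in days (first event to last event).
tt <- edeaR::throughput_time(toy_log, level = "case", units = "days")
tt

# Summary statistics over all cases.
edeaR::throughput_time(toy_log, level = "log", units = "days")
```

Once the real logs are loaded, the same calls can be applied to them, for example to `dom_decl_data` and `int_decl_data`, to compare domestic and international declarations.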

## Data Loading

Several datasets have been released as part of the BPI Challenge. Like many organizations, Eindhoven University of Technology (TU/e) has procedures in place for arranging travel and for reimbursing its costs: TU/e staff travel a lot to conferences or to other universities for project meetings and to meet up with colleagues in the field. The data is split into travel permits and several request types, namely domestic declarations, international declarations, prepaid travel costs and requests for payment, where the latter refers to expenses which should not be related to trips (think of representation costs, hardware purchased for work, etc.).

To make your life a bit easier, we provide initial code that loads the datasets, which are already stored in the [XES format](http://xes-standard.org/) for event logs.

In [50]:
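# Download a gzipped XES file, decompress it and read it as an event log.
# Requires the R.utils and xesreadR packages.
read_xes_gzip <- function(xes_url) {
    temp <- tempfile(fileext = ".xes.gz")
    temp_xes <- tempfile()
    # remove both temporary files on exit, even if reading fails
    on.exit(unlink(c(temp, temp_xes)), add = TRUE)
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(xes_url, temp, mode = "wb")
    R.utils::gunzip(temp, temp_xes)
    xesreadR::read_xes(temp_xes)
}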

In [51]:
# some warnings are expected here (bupaR needs updating)
rfp_data <- read_xes_gzip("https://data.4tu.nl/ndownloader/files/24061154")
ptc_data <- read_xes_gzip("https://data.4tu.nl/ndownloader/files/24043835")
int_decl_data <- read_xes_gzip("https://data.4tu.nl/ndownloader/files/24023492")
dom_decl_data <- read_xes_gzip("https://data.4tu.nl/ndownloader/files/24031811")

"No lifecycle transition id specified in xes-file"
"No activity instance identifier specified in xes-file. By default considered each event as a different activity instance. Please check!"
"No lifecycle transition id specified in xes-file"
"No activity instance identifier specified in xes-file. By default considered each event as a different activity instance. Please check!"
"No lifecycle transition id specified in xes-file"
"No activity instance identifier specified in xes-file. By default considered each event as a different activity instance. Please check!"
"No lifecycle transition id specified in xes-file"
"No activity instance identifier specified in xes-file. By default considered each event as a different activity instance. Please check!"


In [52]:
rfp_data %>% summary()

Number of events:  36796
Number of cases:  6886
Number of traces:  89
Number of distinct activities:  19
Average trace length:  5.343596

Start eventlog:  2017-01-09 08:17:18
End eventlog:  2019-08-08 12:57:18



 CASE_concept_name  CASE_Activity      CASE_Cost Type    
 Length:36796       Length:36796       Length:36796      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 CASE_OrganizationalEntity CASE_Project       CASE_RequestedAmount
 Length:36796              Length:36796       Length:36796        
 Class :character          Class :character   Class :character    
 Mode  :character          Mode  :character   Mode  :character    
                                                                  
                                                                  
                                                                  
                                                                  


In [53]:
ptc_data %>% summary()

Number of events:  18246
Number of cases:  2099
Number of traces:  202
Number of distinct activities:  29
Average trace length:  8.692711

Start eventlog:  2017-01-09 13:48:43
End eventlog:  2019-02-21 10:11:10



 CASE_concept_name  CASE_Activity      CASE_Cost Type    
 Length:18246       Length:18246       Length:18246      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 CASE_OrganizationalEntity CASE_Permit ActivityNumber CASE_Permit BudgetNumber
 Length:18246              Length:18246               Length:18246            
 Class :character          Class :character           Class :character        
 Mode  :character          Mode  :character           Mode  :character        
                                                                              
                                                                              
                                                              

In [54]:
int_decl_data %>% summary()

Number of events:  72151
Number of cases:  6449
Number of traces:  753
Number of distinct activities:  34
Average trace length:  11.18794

Start eventlog:  2016-10-04 22:00:00
End eventlog:  2020-05-09 22:00:00



 CASE_concept_name  CASE_AdjustedAmount CASE_Amount        CASE_BudgetNumber 
 Length:72151       Length:72151        Length:72151       Length:72151      
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
 CASE_DeclarationNumber   CASE_id          CASE_OriginalAmount
 Length:72151           Length:72151       Length:72151       
 Class :character       Class :character   Class :character   
 Mode  :character       Mode  :character   Mode  :character   
                                                              
                                                             

In [55]:
dom_decl_data %>% summary()

Number of events:  56437
Number of cases:  10500
Number of traces:  99
Number of distinct activities:  17
Average trace length:  5.374952

Start eventlog:  2017-01-09 08:49:50
End eventlog:  2019-06-17 15:30:58



 CASE_concept_name  CASE_Amount        CASE_BudgetNumber 
 Length:56437       Length:56437       Length:56437      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 CASE_DeclarationNumber   CASE_id         
 Length:56437           Length:56437      
 Class :character       Class :character  
 Mode  :character       Mode  :character  
                                          
                                          
                                          
                                          
                                   activity_id         id           
 Declaration SUBMITTED by EMPLOYEE       :11531   Length:56437      
 Declaration FINAL_APPROVED by SUPERVISOR:10131   Clas