In [None]:
import pm4py
pm4py.__version__

# Decision Points

> Investigate how patients are referred for further treatment by means of a decision tree. Describe the factors that you observe.

## Data loading

Import the original log and modify column names and datatypes for the following analysis.

In [1]:
import os
import pandas as pd
import numpy as np
from pm4py.objects.log.importer.xes import factory as xes_import_factory
from pm4py.objects.petri.importer import factory as pnml_importer
from pm4py.objects.conversion.log import factory as log_converter
from pm4py.util import constants

PROJ_ROOT = os.path.abspath(os.path.pardir)

# load csv from disk
df_log_Q4 = pd.read_csv(PROJ_ROOT+"/data/log.csv")

# convert timestamp columns to datetime friendly format
df_log_Q4['Timestamp'] = pd.to_datetime(df_log_Q4['Timestamp'])
df_log_Q4['start_timestamp'] = pd.to_datetime(df_log_Q4['start_timestamp'])

# rename some columns for better algorithm compatibility
df_log_Q4 = df_log_Q4.rename(columns={"Age": "case:Age", "Insurance": "case:Insurance", "PatientName": "case:PatientName", "Timestamp": "time:timestamp"})


---

# a)
> Create a decision tree of reasonable complexity using the available attributes in the log. 

In this section we are interested in the kind of treatment a patient will undergo based on certain properties. To this end, we create decision trees from case attributes that can give us deeper insight into which patients receive which treatment.

To do so, we need to know which cases contain certain treatment-related events. PM4Py offers a decision tree module that builds a decision tree predicting the end event of a case from the case properties. Since we know from our previous analysis that all patients are eventually discharged, most cases end with a discharge event, from which we cannot infer which treatment was performed before the discharge. Therefore, before creating the decision tree, we cut off every trace at the first occurrence of one of a set of events that describe the treatment of the patient. The events in this set are listed below. They are related either to a certain kind of treatment or to a discharge. The discharge events are included for cases that do not receive any treatment.

In [2]:
NEW_END_ACTIVITIES = ["Treatment A1", "Treatment A2", "Treatment B", "Discharge", "Discharge Test", "Discharge Init Exam"]

In the following step, the original event log is filtered so that, within each case, all events that occur after the first of the above events are discarded.

In [4]:
#remove all events after a final decision was made (slow)
treatment_df = pd.DataFrame(index=np.arange(0, len(df_log_Q4)), columns=["case:concept:name", "concept:name", "org:resource", "case:PatientName", "case:Age", "case:Insurance", "start_timestamp", "Timestamp", "@@duration"])

current_id = -1
keep = True
count = 0

# go through the sorted events on a case basis
# if an event from the new end activities is found, discard all further events from that case
for row in df_log_Q4.itertuples():
    if current_id != row[1]:
        current_id = row[1]
        keep = True

    if keep:
        if row[2] in NEW_END_ACTIVITIES:
            keep = False
            
        treatment_df.loc[count] = row[1:]
        count += 1         

# drop nil rows that will occur because after filtering there are fewer events
treatment_df = treatment_df.dropna()
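
The row-by-row loop above is marked as slow. As a hedged alternative, the same truncation can be sketched with a grouped cumulative sum in pandas. The helper name `truncate_after_first_end` and the toy data are hypothetical and not part of the original notebook. Like the loop, the sketch assumes the events of each case are already sorted by time.

```python
import pandas as pd

# hypothetical subset of end activities, for illustration only
END_ACTIVITIES = {"Treatment B", "Discharge"}

def truncate_after_first_end(df, end_activities,
                             case_col="case:concept:name",
                             act_col="concept:name"):
    """Keep each case's events up to and including its first end activity."""
    is_end = df[act_col].isin(end_activities).astype(int)
    # number of end activities seen in the same case strictly before this row
    seen_before = is_end.groupby(df[case_col]).cumsum() - is_end
    return df[seen_before == 0]

# toy log: case 1 continues after "Treatment B", case 2 has no end activity
toy = pd.DataFrame({
    "case:concept:name": [1, 1, 1, 1, 2, 2],
    "concept:name": ["Register", "Treatment B", "Discharge", "Bill",
                     "Register", "Initial Exam"],
})

truncated = truncate_after_first_end(toy, END_ACTIVITIES)
print(truncated)
```

Cases without any end activity are kept in full, which matches the behaviour of the loop; they are removed in a later filtering step.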

In [5]:
treatment_df.head(13)

Unnamed: 0,case:concept:name,concept:name,org:resource,case:PatientName,case:Age,case:Insurance,start_timestamp,Timestamp,@@duration
0,1,Register,Alexander,Hermann the 1.,51,STAT,2020-06-01 06:00:00,2020-06-01 06:08:53,533
1,1,Initial Exam,Anna,Hermann the 1.,51,STAT,2020-06-01 06:10:48,2020-06-01 06:25:43,895
2,1,Initial Exam Decision,"Amelie,Anna",Hermann the 1.,51,STAT,2020-06-01 06:26:43,2020-06-01 06:31:52,309
3,1,Inform about Isolation,Alexander,Hermann the 1.,51,STAT,2020-06-01 06:33:45,2020-06-01 06:33:45,0
4,1,Test III,Anna,Hermann the 1.,51,STAT,2020-06-01 06:35:35,2020-06-01 07:03:47,1692
5,1,Test III Decision,"Adrian,Anna",Hermann the 1.,51,STAT,2020-06-01 07:03:47,2020-06-01 07:08:06,259
6,1,Inform Authority Fill Form,Alina,Hermann the 1.,51,STAT,2020-06-01 07:09:57,2020-06-01 07:17:47,470
7,1,Referral,Adrian,Hermann the 1.,51,STAT,2020-06-01 07:09:57,2020-06-01 07:18:48,531
8,1,Inform Authority Send Form,Alina,Hermann the 1.,51,STAT,2020-06-01 07:17:57,2020-06-01 07:20:07,130
9,1,Register Facility,Bernhard,Hermann the 1.,51,STAT,2020-06-01 08:21:45,2020-06-01 08:31:18,573


We can see that case 1 now ends with Treatment B instead of a discharge event. In the next step we convert the log into a PM4Py event log, as seen before.

In [6]:
#convert to event log
# map dataset columns to PM4Py keys
param_keys_Q4 = {constants.PARAMETER_CONSTANT_CASEID_KEY: 'case:concept:name',
            constants.PARAMETER_CONSTANT_RESOURCE_KEY: 'org:resource', 
            constants.PARAMETER_CONSTANT_ACTIVITY_KEY: 'concept:name',
            constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: 'time:timestamp',
            constants.PARAMETER_CONSTANT_START_TIMESTAMP_KEY: 'start_timestamp'}

treatment_log = log_converter.apply(treatment_df, parameters=param_keys_Q4)

Since not all cases contain one of the new end activities, the event log still holds traces that do not end with one of the specified activities. To keep only the cases that end with one of the specified events, we use the end activity filter provided by PM4Py.

In [7]:
from pm4py.algo.filtering.log.end_activities import end_activities_filter

treatment_log = end_activities_filter.apply(treatment_log, NEW_END_ACTIVITIES)
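
To make the effect of this filter concrete, here is a minimal pandas sketch of the same idea on a DataFrame: keep only the cases whose last event is one of the allowed activities. It is not PM4Py's implementation. The helper name `keep_cases_ending_in` and the toy data are hypothetical, and the rows are assumed to be sorted by time within each case.

```python
import pandas as pd

def keep_cases_ending_in(df, end_activities,
                         case_col="case:concept:name",
                         act_col="concept:name"):
    """Keep only the cases whose last event is one of the end activities."""
    # broadcast the last activity of each case to all of its rows
    last_activity = df.groupby(case_col)[act_col].transform("last")
    return df[last_activity.isin(end_activities)]

# toy log: case 1 ends with a treatment, case 2 never reaches an end activity
toy_log = pd.DataFrame({
    "case:concept:name": [1, 1, 2, 2],
    "concept:name": ["Register", "Treatment B", "Register", "Initial Exam"],
})

filtered = keep_cases_ending_in(toy_log, {"Treatment B", "Discharge"})
print(filtered)
```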

After preprocessing the log we can create a decision tree from it. First, we use PM4Py to create the data, targets and classes that are passed to the sklearn decision tree algorithm. In this step we need to specify which properties of the event log should be used to build the decision tree. We can specify both trace-based and event-based attributes.

For our first iteration we include all sensible log attributes (the patient name, for example, is left out because it provides no information).

In [53]:
from pm4py.objects.log.util import get_log_representation
from pm4py.objects.log.util import get_class_representation

# preprocess the log for decision tree mining
str_trace_attributes = ["Insurance"]
str_event_attributes = ["org:resource", "concept:name"]
num_trace_attributes = ["Age"]
num_event_attributes = ["@@duration"]

data, feature_names = get_log_representation.get_representation(treatment_log, str_trace_attributes, str_event_attributes,
                                                              num_trace_attributes, num_event_attributes)

target, classes = get_class_representation.get_class_representation_by_str_ev_attr_value_value(treatment_log, "concept:name")

In [56]:
from sklearn import tree

#calculate the decision tree

# 0 treatmentB, 1 treatment A1, 2 discharge test, 3 discharge init, 4 treatment a2
classifier = tree.DecisionTreeClassifier()
classifier.fit(data, target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [57]:
from pm4py.visualization.decisiontree import factory as dt_vis_factory
#visualize the obtained decision tree

decision_tree_vis = dt_vis_factory.apply(classifier, feature_names, classes)

figures_dir = os.path.join(PROJ_ROOT, 'report', 'figures')
decision_tree_vis.render(os.path.join(figures_dir, 'q4_tree_all'),
                 format='pdf',
                 view=True)

'/Users/Tom/Documents/Uni/4. Semester M/Advanced Process Mining/Assignments/Assignment 1/APM-A1/report/figures/q4_tree_all.pdf'

The obtained tree classifies all cases perfectly, which is unexpected. A closer look shows that the tree simply uses the target events themselves to classify the cases, which of course is not wanted. The same holds for the resource attribute, because some events are always performed by the same resources. These two attributes are therefore not suited for a sensible decision tree for this task.

Furthermore, the tree splits once on the duration of an event. Since the tree holds no information about which event this duration belongs to, we cannot derive meaningful results from this split. The attribute is therefore removed as well. With the remaining age and insurance attributes, we create a new decision tree in the following.
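
The leakage described above can be reproduced on a small scale. The following sketch uses made-up data and assumes that event attributes enter the feature matrix as one-hot columns of the form "this activity occurs in the trace". The column prefix and all values are hypothetical. A shallow tree that sees those columns fits the labels perfectly and ignores the age, while a tree of the same depth restricted to the age does not.

```python
import pandas as pd
from sklearn import tree

# made-up cases; ages are chosen so that age alone does not separate the classes
cases = pd.DataFrame({
    "Age": [20, 70, 60, 30, 25, 65],
    "end_activity": ["Discharge Init Exam", "Treatment B", "Discharge Init Exam",
                     "Treatment B", "Discharge Test", "Discharge Test"],
})
y = cases["end_activity"]

# one-hot columns for the end activity mimic event-based activity features
leaky = pd.get_dummies(cases["end_activity"], prefix="event:concept:name").astype(int)
X_leaky = pd.concat([cases[["Age"]], leaky], axis=1)

clf_leaky = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_leaky, y)
clf_age = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(cases[["Age"]], y)

leaky_score = clf_leaky.score(X_leaky, y)
age_score = clf_age.score(cases[["Age"]], y)
age_importance = clf_leaky.feature_importances_[0]  # importance of the Age column

print("accuracy with leaked target columns:", leaky_score)
print("accuracy with Age only:", age_score)
print("importance of Age in the leaky tree:", age_importance)
```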

In [60]:
# preprocess the log for decision tree mining
str_trace_attributes = ["Insurance"]
str_event_attributes = []
num_trace_attributes = ["Age"]
num_event_attributes = []

data, feature_names = get_log_representation.get_representation(treatment_log, str_trace_attributes, str_event_attributes,
                                                              num_trace_attributes, num_event_attributes)

target, classes = get_class_representation.get_class_representation_by_str_ev_attr_value_value(treatment_log, "concept:name")


Since the _Discharge Init Exam_ event is disproportionately more frequent than all other classes, we need to weight the classes in order to obtain a sensible tree (otherwise, all nodes of a small tree would carry the same label). Additionally, since the tree can no longer classify all cases perfectly, we limit its depth and its maximum number of leaf nodes in order to obtain readable results.
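
The weights used in the next cell are chosen by hand. As a point of comparison, the sketch below shows how the class counts can be inspected and how scikit-learn derives "balanced" weights, which are inversely proportional to the class frequencies. The toy target vector is made up and does not reflect the real log.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# made-up target vector: class 3 dominates, as Discharge Init Exam does in the log
toy_target = np.array([3] * 6 + [0] * 2 + [1] * 2)

# how often does each class occur?
classes_present, counts = np.unique(toy_target, return_counts=True)
print(dict(zip(classes_present.tolist(), counts.tolist())))

# "balanced" weights are n_samples / (n_classes * count of the class)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes_present, y=toy_target)
balanced = dict(zip(classes_present.tolist(), weights.tolist()))
print(balanced)
```

A dictionary like `balanced` could be passed as `class_weight`, or the string `"balanced"` could be passed directly. Whether this gives a more useful tree than the hand-picked weights would have to be checked on the actual data.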

In [61]:
#calculate the decision tree

# 0 treatmentB, 1 treatment A1, 2 discharge test, 3 discharge init, 4 treatment a2
classifier = tree.DecisionTreeClassifier(max_depth=7,max_leaf_nodes=8,class_weight={0: 0.6, 1: 1, 2: 1, 3: 0.5, 4: 2})
classifier.fit(data, target)

DecisionTreeClassifier(class_weight={0: 0.6, 1: 1, 2: 1, 3: 0.5, 4: 2},
                       criterion='gini', max_depth=7, max_features=None,
                       max_leaf_nodes=8, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=None, splitter='best')

In [63]:
#visualize the obtained decision tree

decision_tree_vis = dt_vis_factory.apply(classifier, feature_names, classes)

figures_dir = os.path.join(PROJ_ROOT, 'report', 'figures')
decision_tree_vis.render(os.path.join(figures_dir, 'q4_tree_min'),
                 format='pdf',
                 view=True)

'/Users/Tom/Documents/Uni/4. Semester M/Advanced Process Mining/Assignments/Assignment 1/APM-A1/report/figures/q4_tree_min.pdf'

The new decision tree with a reduced attribute set is much easier to interpret. For example, we see that patients older than 63.5 years with state insurance are more often discharged after the initial exam than those with private insurance. The privately insured patients in this age group receive Treatment B more frequently.

Furthermore, we can observe that very young patients (<= 15.5 years) are more often discharged after having been tested, while patients between 15.5 and 39.5 years are more often discharged directly after the initial exam.

In general, we can observe that privately insured patients receive any kind of treatment more frequently and are less often discharged without treatment.
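
Split thresholds such as the ones quoted above can also be read from a textual dump of the tree instead of the rendered PDF. The sketch below uses `sklearn.tree.export_text` (available in scikit-learn 0.21 and later) on made-up data. The feature name `Insurance@PRIV` and all values are hypothetical and only illustrate the output format.

```python
import numpy as np
from sklearn import tree

# made-up feature matrix: column 0 is the age, column 1 flags private insurance
X = np.array([[10, 0], [12, 1], [30, 0], [35, 1], [70, 0], [75, 1]])
y = ["Discharge Test", "Discharge Test", "Discharge Init Exam",
     "Discharge Init Exam", "Discharge Init Exam", "Treatment B"]

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# one line per node, with the split threshold or the predicted class
rules = tree.export_text(clf, feature_names=["Age", "Insurance@PRIV"])
print(rules)
```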


---

# b)
> Since it is likely that the resources at the treatment facilities are limited, implement a function that assigns a(n) (estimate) of the number of patients at each facility to each event. To this end, you have to decide which event occurs at which facility based on your analysis in question 2. Create a decision tree of reasonable complexity using this derived attribute.