# Predictive Process Monitoring - Data Preparation

This notebook is part of the [starter package](https://github.com/fmannhardt/starter-predictive-process-monitoring) for predictive process monitoring. It contains examples for prefix extraction and prefix encoding from event logs for the purpose of developing and applying predictive process monitoring techniques.

## Setup
The following Python libraries are used. Please refer to the installation instructions to prepare your environment:

* [PM4Py](https://pm4py.fit.fraunhofer.de/)
* [Pandas](https://pandas.pydata.org/)
* [NumPy](https://numpy.org/)

In [None]:
import pandas as pd
import pm4py
import numpy as np

## Event Log & Data Loading

We continue with the data loaded in the [previous notebook](./0_data_loading.ipynb).

In [None]:
from urllib.request import urlretrieve
import os

# download from 4tu.nl
urlretrieve('https://data.4tu.nl/file/33632f3c-5c48-40cf-8d8f-2db57f5a6ce7/643dccf2-985a-459e-835c-a82bce1c0339', 'sepsis1.xes.gz')
sepsis_log = pm4py.read_xes('sepsis1.xes.gz')
os.unlink('sepsis1.xes.gz') # clean up

## Prefix Extraction

Many different prediction tasks are possible based on an event log. Often, the assumption is made that only a prefix of a trace is known and that a prediction on some future state of the process instance represented by that trace should be made.

The first step is to generate suitable prefixes of the traces contained in the event log to be used as training samples. As a *simple example*, we may be interested in predicting whether the patient in the process returns to the emergency room, which is indicated by the event *Return ER* as the last event. Since the event *Return ER* is part of the event log, we need to remove that event and remember in which trace it occurred.

In [None]:
sepsis_returns = [any(e["concept:name"] == "Return ER" for e in trace) for trace in sepsis_log]

In [None]:
# check if this worked
print(sepsis_log[3][-1])
print(sepsis_returns[3])

print(sepsis_log[0][-1])
print(sepsis_returns[0])
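
The two steps described above (remember which traces contain *Return ER*, then remove that event) can be sketched without PM4Py on a toy log. This is a minimal illustration, assuming traces are plain lists of event dictionaries; the helper names `has_activity` and `drop_activity` and the toy traces are hypothetical and not part of PM4Py.

```python
# Toy stand-in for an event log: each trace is a list of event dicts.
toy_log = [
    [{"concept:name": "ER Registration"}, {"concept:name": "ER Triage"}, {"concept:name": "Return ER"}],
    [{"concept:name": "ER Registration"}, {"concept:name": "Release A"}],
]

def has_activity(trace, activity):
    """Return True if any event in the trace carries the given activity label."""
    return any(event["concept:name"] == activity for event in trace)

def drop_activity(trace, activity):
    """Return a copy of the trace without events of the given activity."""
    return [event for event in trace if event["concept:name"] != activity]

# The labels are computed before the event is removed, one label per trace.
labels = [has_activity(trace, "Return ER") for trace in toy_log]
stripped = [drop_activity(trace, "Return ER") for trace in toy_log]

print(labels)                      # [True, False]
print([len(t) for t in stripped])  # [2, 2]
```

The order of `labels` follows the order of the traces, which is what later allows label `i` to be paired with feature vector `i`.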

At the same time, we may be interested in how well we can predict whether a patient returns for different prefix sizes. For example, we can generate a new event log that keeps only a prefix of at most 10 events of each trace (*10-prefix*).

**Note that this is just a simple example with 10 chosen as an arbitrary prefix length. In the general case, you generate not only prefixes of one specific size but prefixes of variable or all sizes. Also, some traces are less than 10 events long, in which case we would use the full trace for the prediction, which would not be very useful in practice.**

In [None]:
# remove Return ER event (the values argument expects a collection of activity names)
sepsis_log = pm4py.filter_event_attribute_values(sepsis_log, "concept:name", ["Return ER"], level="event", retain=False)

from pm4py.objects.log.obj import EventLog, Trace
# generate prefixes, note that we need to add the casts to EventLog and Trace to make sure that the result is a PM4Py EventLog object
sepsis_prefixes = EventLog([Trace(trace[0:10], attributes = trace.attributes) for trace in sepsis_log])

In [None]:
# check the trace length
print([len(trace) for trace in sepsis_log][0:15])
print([len(trace) for trace in sepsis_prefixes][0:15])
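
As noted above, in the general case one would generate prefixes of several or all lengths rather than a single fixed size. A minimal sketch of that idea follows, assuming a trace is a plain Python list; the function `extract_prefixes` and its parameters are hypothetical names chosen for illustration. The sketch returns only proper prefixes, so at least one event always remains unseen, which avoids predicting from complete traces.

```python
def extract_prefixes(trace, min_length=1, max_length=None):
    """Return all proper prefixes of `trace` with a length between
    min_length and max_length (inclusive). The full trace is excluded."""
    upper = len(trace) - 1
    if max_length is not None:
        upper = min(upper, max_length)
    return [trace[:k] for k in range(min_length, upper + 1)]

toy_trace = ["A", "B", "C", "D"]
prefixes = extract_prefixes(toy_trace, min_length=2)
print(prefixes)                   # [['A', 'B'], ['A', 'B', 'C']]
print(extract_prefixes(["A"]))    # [] (too short to yield a proper prefix)
```

Every prefix generated from a trace would inherit the label of that trace, so a single trace can contribute several training samples.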

## Prefix Encoding

For training a prediction model, the traces or sequences of events often need to be transformed into a vector representation. We show how to compute three basic encodings using the built-in PM4Py [feature selection and processing](https://pm4py.fit.fraunhofer.de/documentation#decision-trees) functionality.

Of course, more complex encodings such as representing each trace as a sequence of features are possible, e.g., for sequential models such as LSTMs. This is left as an exercise.
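
As a starting point for the sequence encoding mentioned above, the following sketch maps each activity to an integer index and pads all traces to a common length. It is one possible approach under stated assumptions, not a PM4Py feature: traces are plain lists of activity labels, index 0 is reserved for padding, and the helper names are hypothetical.

```python
def build_vocabulary(traces):
    """Map each activity to a positive integer in order of first appearance.
    Index 0 is reserved for padding."""
    vocabulary = {}
    for trace in traces:
        for activity in trace:
            vocabulary.setdefault(activity, len(vocabulary) + 1)
    return vocabulary

def encode_sequences(traces, vocabulary, length):
    """Index-encode traces, truncating to `length` or right-padding with 0."""
    encoded = []
    for trace in traces:
        indices = [vocabulary[activity] for activity in trace[:length]]
        encoded.append(indices + [0] * (length - len(indices)))
    return encoded

toy_traces = [["A", "B", "C"], ["A", "C"]]
vocab = build_vocabulary(toy_traces)
sequences = encode_sequences(toy_traces, vocab, length=4)
print(vocab)      # {'A': 1, 'B': 2, 'C': 3}
print(sequences)  # [[1, 2, 3, 0], [1, 3, 0, 0]]
```

Unlike the encodings below, this representation keeps the full order of events, which is what sequential models rely on.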

### Feature Selection \& Engineering

Before we do prefix encoding, we need to select which features we will use for the prediction. In this example, we will only use the "activity" of the events as a feature. Depending on your prediction problem, you might want to include additional trace/event attributes.

Additionally, you can also derive new trace-level features (e.g., day of week, time since case start) or log-based features (e.g., workload of resources, number of active cases at a certain time). This is left as an exercise.
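
Two of the derived features named above, time since case start and day of week, can be sketched with the standard library alone. The example assumes that the event timestamps of one case are available as an ordered list of `datetime` objects (in an XES log they would typically come from the `time:timestamp` attribute); the function name `time_features` and the toy timestamps are hypothetical.

```python
from datetime import datetime

def time_features(timestamps):
    """For each event, return (seconds since the first event of the case, weekday).
    The weekday follows datetime.weekday(): Monday is 0, Sunday is 6."""
    start = timestamps[0]
    return [((ts - start).total_seconds(), ts.weekday()) for ts in timestamps]

toy_timestamps = [
    datetime(2024, 1, 1, 8, 0),   # a Monday
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 2, 8, 0),
]
features = time_features(toy_timestamps)
print(features)  # [(0.0, 0), (5400.0, 0), (86400.0, 1)]
```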

### Encoding as Set of Events

In [None]:
from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

# log_to_features provides a flexible interface to compute features on an event and trace level
# see the documentation for more information: https://pm4py.fit.fraunhofer.de/documentation#item-7-0-2
data, feature_names = log_to_features.apply(sepsis_prefixes, parameters={"str_ev_attr": ["concept:name"]})

The standard encoding of the `concept:name` attribute (i.e., the event label) is a one-hot encoded vector. Let us have a look at the encoding. The index of the number corresponds to the index in the feature label vector.

In [None]:
from pm4py.objects.log.util.log import project_traces
def project_nth(log, index):
    print(str(project_traces(log)[index]))

In [None]:
project_nth(sepsis_prefixes, 0)

In [None]:
print(feature_names)

In [None]:
print(data[0])

The overall data shape is:

In [None]:
np.asarray(data).shape

So, PM4Py gives us a *one-hot encoding* of the so-called *set abstraction* of the event log. There are 16 distinct activities in the event log, and the feature vector simply encodes whether each activity is present in the trace or not.
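
To make the set abstraction concrete, here is a minimal sketch that computes the same kind of encoding by hand on toy traces. It assumes traces are plain lists of activity labels and orders the columns alphabetically, which need not match the column order PM4Py returns in `feature_names`; the function name `set_encoding` is hypothetical.

```python
def set_encoding(traces):
    """Encode each trace as a 0/1 vector: 1 if the activity occurs at least once."""
    activities = sorted({activity for trace in traces for activity in trace})
    vectors = []
    for trace in traces:
        present = set(trace)
        vectors.append([1 if activity in present else 0 for activity in activities])
    return activities, vectors

toy_traces = [["A", "B", "A"], ["B", "C"]]
activities, vectors = set_encoding(toy_traces)
print(activities)  # ['A', 'B', 'C']
print(vectors)     # [[1, 1, 0], [0, 1, 1]]
```

Note that the repeated `A` in the first trace leaves no mark in its vector: both order and frequency are lost in this encoding.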

Let us have a look at the distribution of these feature vectors:

In [None]:
# look at the unique vectors and their occurrence frequency/count
dist_features = np.unique(data, return_counts= True, axis = 0)
print(dist_features)

What is the most common feature vector?

In [None]:
# argmax gives us the index of the most frequent vector
dist_features[0][np.argmax(dist_features[1])]

This makes sense: almost all activities are bound to occur in this process, and there are only a few choices.
So, this encoding is a very simple one but likely not the most useful one.

### Encoding as Bi-Grams / Succession Relation

In [None]:
data_2gram, feature_names = log_to_features.apply(sepsis_prefixes, 
                                                  parameters={"str_ev_attr": [], 
                                                        "str_tr_attr": [], 
                                                        "num_ev_attr": [], 
                                                        "num_tr_attr": [], 
                                                        "str_evsucc_attr": ["concept:name"]})
feature_names

Each feature represents the succession relation (or bigram) between any two activities of the event log. We transform the features into a tensor.

In [None]:
data_2gram = np.asarray(data_2gram)

Let us, again, have a look at the encoding of the first trace.

In [None]:
# the features were computed on the prefixes, so we inspect the prefix of the first trace
project_nth(sepsis_prefixes, 0)

In [None]:
print(data_2gram[0])
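
The succession relation can also be illustrated by hand. The sketch below assumes traces are plain lists of activity labels and records only whether a directly-follows pair occurs in a trace (presence, not frequency); whether this matches the exact values PM4Py produces is an assumption, and the function name `bigram_encoding` is hypothetical.

```python
def bigram_encoding(traces):
    """Encode each trace as a 0/1 vector over all observed directly-follows pairs."""
    bigrams_per_trace = [set(zip(trace, trace[1:])) for trace in traces]
    features = sorted(set().union(*bigrams_per_trace))
    vectors = [[1 if feature in bigrams else 0 for feature in features]
               for bigrams in bigrams_per_trace]
    return features, vectors

toy_traces = [["A", "B", "C"], ["A", "C", "B"]]
features, vectors = bigram_encoding(toy_traces)
print(features)  # [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'B')]
print(vectors)   # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

Both toy traces contain the same set of activities and would be indistinguishable in the set encoding, whereas their bigram vectors differ because some ordering information is retained.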

### Encoding as Bag of Words / Multiset of Events

Another option is the encoding known as the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) in Natural Language Processing, which constructs a multiset of the one-hot encoded events so that the frequency with which each activity occurs is reflected. This encoding is not provided in PM4Py but can easily be computed with Pandas and NumPy.

We first need to transform the PM4Py event log to a Pandas data frame.

In [None]:
sepsis_df = pm4py.convert_to_dataframe(sepsis_prefixes)
sepsis_df.head(25)

We build a bag-of-words representation by grouping our data and then counting the number of events referring to the individual activities.

In [None]:
# concept:name refers to the activity
# case:concept:name refers to the case identifier
sepsis_case_act = sepsis_df.loc[:,["case:concept:name", "concept:name"]]
sepsis_case_act

In [None]:
# Count the occurrence of activities in a trace (no sorting to keep order of traces stable!)
sepsis_act_count = sepsis_case_act.groupby(["case:concept:name", "concept:name"], sort=False).size()
sepsis_act_count

We have the count of each activity for each trace and still need to convert this to a tensor format such that we have one feature vector (columns) per case (row).

In [None]:
sepsis_bag = np.asarray(sepsis_act_count.unstack(fill_value=0))
sepsis_bag

In [None]:
sepsis_bag.shape

Let us, again, have a look at the encoding of the first trace.

In [None]:
project_nth(sepsis_prefixes, 0)
print(sepsis_bag[0])

In [None]:
project_nth(sepsis_prefixes, 1)
print(sepsis_bag[1])

This already gives us much more information to work with.
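
For comparison, the same multiset encoding can be sketched with the standard library only. The example assumes traces are plain lists of activity labels and orders the columns alphabetically; the function name `bag_of_activities` is hypothetical. The rows are built by iterating over the traces directly, so row `i` is guaranteed to belong to trace `i` and lines up with label `i` (for example, `sepsis_returns[i]`).

```python
from collections import Counter

def bag_of_activities(traces, activities=None):
    """Encode each trace as a vector of activity counts, one row per trace."""
    if activities is None:
        activities = sorted({activity for trace in traces for activity in trace})
    vectors = []
    for trace in traces:
        counts = Counter(trace)  # missing activities count as 0
        vectors.append([counts[activity] for activity in activities])
    return activities, vectors

toy_traces = [["A", "B", "A"], ["B", "C"]]
activities, bag = bag_of_activities(toy_traces)
print(activities)  # ['A', 'B', 'C']
print(bag)         # [[2, 1, 0], [0, 1, 1]]
```

Passing an explicit `activities` list keeps the columns identical between training data and new prefixes, even if a new prefix lacks some activity.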