# Applied Process Mining Module

This notebook is part of an Applied Process Mining module. The collection of notebooks is a *living document* and subject to change. 

# Lecture 4 - 'Predictive Process Mining' (Python / PM4Py)

## Setup

<img src="https://pm4py.fit.fraunhofer.de/static/assets/images/pm4py-site-logo-padded.png" alt="PM4Py" style="width: 200px;"/>

In this notebook, we are using the [PM4Py library](https://pm4py.fit.fraunhofer.de/) in combination with several standard Python data science libraries:

* [pandas](https://pandas.pydata.org/)
* [PyTorch](https://pytorch.org/)

In [None]:
## Run the commented-out commands to install the dependencies
# %pip install numpy
# %pip install pandas
# %pip install matplotlib
# %pip install pm4py
# %pip install scikit-learn
# %pip install torch
# %pip install tqdm

In [None]:
import numpy as np
import pandas as pd
import pm4py
import os
import torch
import torch.nn as nn
from tqdm.autonotebook import tqdm

# Predictive Process Mining

## Event Log 

We are using the Sepsis event log as an example.

In [None]:
sepsis = pd.read_csv("../data/sepsis.csv", sep=';')
sepsis_log = pm4py.format_dataframe(sepsis, case_id='case_id', activity_key='activity', timestamp_key='timestamp')
sepsis_log = pm4py.convert_to_event_log(sepsis_log)

In [None]:
len(sepsis_log)

## Feature Extraction / Encoding

We are using the PM4Py functionality to do [feature selection and processing](https://pm4py.fit.fraunhofer.de/documentation/1.5#item-7-0-1) of the event log.

### Set of Events

In [None]:
from pm4py.algo.transformation.log_to_features import algorithm as log_to_features
from pm4py.objects.log.util.log import project_traces

data, feature_names = log_to_features.apply(sepsis_log, parameters={"str_ev_attr": ["concept:name"]})

The standard encoding of the `concept:name` attribute (i.e., the event label) is a one-hot encoded vector.

In [None]:
print("1st trace : " + str(project_traces(sepsis_log)[0]))

In [None]:
print("1st trace encoded: " + str(data[0]))

Each position in the encoded vector corresponds to the entry at the same position in the following feature label vector:

In [None]:
feature_names

Let us try again for a different trace:

In [None]:
print("2nd trace : " + str(project_traces(sepsis_log)[1]))
print("2nd trace encoded: " + str(data[1]))

The overall data shape is:

In [None]:
np.asarray(data).shape

So, PM4Py gives us a *one-hot encoding* of the so-called *set abstraction* of the event log. This means there are 16 distinct activities in the event log, and the feature vector simply encodes whether each activity is present in a trace or not.
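
To make the set abstraction concrete, here is a minimal sketch in plain Python. The traces and activity names are made up for illustration and are not taken from the Sepsis log, and the helper `encode_set` is hypothetical (it is not part of PM4Py).

```python
# Toy traces with hypothetical activity names
toy_traces = [
    ["Register", "Triage", "Triage", "Release"],
    ["Register", "Release"],
]

# Fix a column order: one column per distinct activity
toy_activities = sorted({act for trace in toy_traces for act in trace})

def encode_set(trace, activities):
    """Return 1 for every activity that occurs at least once in the trace, else 0."""
    present = set(trace)
    return [1 if act in present else 0 for act in activities]

toy_encoded = [encode_set(trace, toy_activities) for trace in toy_traces]
print(toy_activities)
print(toy_encoded)
```

Note that the repeated `Triage` event in the first toy trace leaves no mark in the encoding: the set abstraction discards both order and frequency.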

Let us have a look at the distribution of these feature vectors:

In [None]:
dist_features = np.unique(data, return_counts= True, axis = 0)
dist_features

What is the most common feature vector?

In [None]:
dist_features[0][np.argmax(dist_features[1])]

This makes sense: almost all activities are bound to occur in every case of this process, and there are only a few choices.
So, this encoding is very simple but likely not the most useful one.

### 2-grams

In [None]:
data_2gram, feature_names = log_to_features.apply(sepsis_log, parameters={"str_ev_attr": None, 
                                                        "str_tr_attr": None, 
                                                        "num_ev_attr": None, 
                                                        "num_tr_attr": None, 
                                                        "str_evsucc_attr": ["concept:name"]})

There is a bug in PM4Py (https://github.com/pm4py/pm4py-core/issues/293) that causes too many features to be returned.
So, we need to disregard the initial features that were extracted.

In [None]:
feature_names = feature_names[41:]
len(feature_names)

In [None]:
feature_names

In [None]:
data_2gram = np.asarray(data_2gram)[:,41:]

In [None]:
print("1st trace : " + str(project_traces(sepsis_log)[0]))

In [None]:
print("1st trace encoded: " + str(data_2gram[0]))
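
What a 2-gram encoding captures can be sketched by hand in plain Python. This sketch assumes that only the presence or absence of each pair of directly succeeding activities is encoded. The traces and the helpers `two_grams` and `encode_2gram` are hypothetical and only serve as an illustration.

```python
# Toy traces with hypothetical activity names
toy_traces = [
    ["Register", "Triage", "Release"],
    ["Register", "Release"],
]

def two_grams(trace):
    """All pairs of directly succeeding activities in a trace."""
    return list(zip(trace, trace[1:]))

# One column per distinct pair, in sorted order
toy_pairs = sorted({pair for trace in toy_traces for pair in two_grams(trace)})

def encode_2gram(trace, pairs):
    """Return 1 for every pair that occurs in the trace, else 0."""
    present = set(two_grams(trace))
    return [1 if pair in present else 0 for pair in pairs]

toy_encoded_2gram = [encode_2gram(trace, toy_pairs) for trace in toy_traces]
print(toy_pairs)
print(toy_encoded_2gram)
```

Unlike the set abstraction, the two toy traces now get different vectors for the shared activities because the local order of events is part of the encoding.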

### Bag of Words / Multiset

Another option is the encoding known as *bag of words* in Natural Language Processing, which constructs a multiset of the one-hot encoded events. This way, the frequency with which each activity occurs in a trace is reflected.
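
As a small illustration of the difference from the set abstraction, the following sketch counts activities with `collections.Counter`. The trace and the activity names are made up for this example.

```python
from collections import Counter

# Toy trace and column order with hypothetical activity names
toy_trace = ["Register", "Triage", "Triage", "Release"]
toy_activities = ["Register", "Release", "Triage"]

counts = Counter(toy_trace)

# Multiset encoding: how often does each activity occur?
toy_multiset_vector = [counts[act] for act in toy_activities]

# Set encoding for comparison: does each activity occur at all?
toy_set_vector = [min(c, 1) for c in toy_multiset_vector]

print(toy_multiset_vector)
print(toy_set_vector)
```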

In [None]:
print(sepsis.loc[:,["case_id"]].nunique())

We seem to have an NA value in the case attribute. Let us fix this by replacing it with a string. (In real-world data, you should look for the underlying reason.)

In [None]:
sepsis = sepsis.fillna("MISSING")

In [None]:
print(sepsis.loc[:,["case_id"]].nunique())

This looks better. Now let's build a bag-of-words representation by grouping our data by case and activity and then counting the number of events referring to each activity.

In [None]:
sepsis_multiset_pd = sepsis.loc[:,["case_id", "activity"]].groupby(["case_id", "activity"]).size().unstack(fill_value=0)
sepsis_multiset_pd

In [None]:
data_multiset = np.asarray(sepsis_multiset_pd)
data_multiset.shape

In [None]:
print("1st trace : " + str(project_traces(sepsis_log)[0]))
print("1st trace encoded: " + str(data_multiset[0]))

In [None]:
print("2nd trace : " + str(project_traces(sepsis_log)[1]))
print("2nd trace encoded: " + str(data_multiset[1]))

This already gives us more information.

# Prediction

Let us try to build a basic prediction model based on this information.

### Throughput time

We aim to predict the throughput time of a case. So let us look at the distribution of throughput time.

#### Data Pre-processing

In [None]:
from pm4py.statistics.traces.generic.log import case_statistics

durations = np.asarray(case_statistics.get_all_case_durations(sepsis_log, parameters={ case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"} ))
durations = np.expand_dims(durations, 1)
len(durations)
durations = durations / 60 / 60 / 24 # in days
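
The throughput time of a case is the time between its first and its last event. The following sketch computes it by hand for a few made-up events and keeps the case id attached to each duration, which makes it easy to check that durations and feature vectors refer to the same cases. The event data and all names are hypothetical.

```python
from datetime import datetime

# Hypothetical events: (case id, timestamp)
toy_events = [
    ("A", "2021-01-01 08:00:00"),
    ("A", "2021-01-03 20:00:00"),
    ("B", "2021-01-02 09:00:00"),
    ("B", "2021-01-02 15:00:00"),
]

first_seen, last_seen = {}, {}
for case_id, ts in toy_events:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    first_seen[case_id] = min(first_seen.get(case_id, t), t)
    last_seen[case_id] = max(last_seen.get(case_id, t), t)

# Throughput time per case: seconds converted to days
toy_durations_days = {
    case_id: (last_seen[case_id] - first_seen[case_id]).total_seconds() / 60 / 60 / 24
    for case_id in first_seen
}
print(toy_durations_days)
```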

In [None]:
pd.DataFrame(durations).boxplot().set_ylabel('Throughput time (days)')

In [None]:
durations = durations.flatten()

In [None]:
durations

Remove outliers by keeping only cases with a throughput time between 0.5 and 100 days:

In [None]:
min_duration = 0.5
max_duration = 100

# Keep only cases whose throughput time lies strictly between the two bounds
keep = (durations > min_duration) & (durations < max_duration)

data = np.asarray(data)[keep]
data_2gram = np.asarray(data_2gram)[keep]
data_multiset = data_multiset[keep]
durations = durations[keep]

In [None]:
pd.DataFrame(durations).boxplot().set_ylabel('Throughput time (days)')

Choose the encoding (here the 2-grams) and scale the data (if necessary):

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_x = MinMaxScaler()
data_scaled = scaler_x.fit_transform(data_2gram)

scaler_y = MinMaxScaler()
durations_scaled = scaler_y.fit_transform(durations.reshape(-1, 1))
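
Min-max scaling maps each value to the range [0, 1] via `(value - min) / (max - min)`. The arithmetic can be sketched by hand with made-up durations. The sketch also shows that a *value* is mapped back by multiplying with the range and adding the minimum, whereas an *error* (a difference of two values) is mapped back by multiplying with the range only.

```python
toy_durations = [2.0, 4.0, 10.0]  # hypothetical throughput times in days

d_min, d_max = min(toy_durations), max(toy_durations)
d_range = d_max - d_min

# Forward transformation to [0, 1]
toy_scaled = [(d - d_min) / d_range for d in toy_durations]

# Inverse transformation of values: multiply by the range, add the minimum
toy_restored = [s * d_range + d_min for s in toy_scaled]

# An error measured on the scaled data only needs the range, not the offset
toy_scaled_error = 0.1
toy_error_days = toy_scaled_error * d_range

print(toy_scaled)
print(toy_restored)
print(toy_error_days)
```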

We make use of PyTorch to build a simple Neural Network.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

# We need float32 data
x = torch.from_numpy(data_scaled.astype('float32'))
y = torch.from_numpy(durations_scaled.astype('float32'))

# Always check the shapes
print(x.shape)
print(y.shape)

ds = TensorDataset(x, y)
train_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

Let us check a random single sample from our data loader (always a good idea!)

In [None]:
inputs, classes = next(iter(train_dataloader))
print(inputs[0])
print(classes[0])

#### Model Definition

Let's define a simple network and try to overfit:

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear_relu_stack = nn.Sequential(            
            torch.nn.Linear(x.shape[1], 8),
            nn.BatchNorm1d(num_features=8),
            nn.LeakyReLU(),            
            torch.nn.Linear(8, 32),
            nn.BatchNorm1d(num_features=32),
            nn.LeakyReLU(),
            torch.nn.Linear(32, 64),
            nn.BatchNorm1d(num_features=64),
            nn.LeakyReLU(),
            torch.nn.Linear(64, 32),
            nn.BatchNorm1d(num_features=32),
            nn.LeakyReLU(),            
            torch.nn.Linear(32, 1)
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
model = NeuralNetwork().to(device)
print(model)

In [None]:
def train(dataloader, model, loss_fn, measure_fn, optimizer, epochs, print_interval = 10):
    
    losses = []
    size = len(dataloader.dataset)
    
    for epoch in range(epochs):    
        
        loop = tqdm(dataloader)

        for batch, (X, y) in enumerate(loop):
            X, y = X.to(device), y.to(device)

            optimizer.zero_grad()

            # Compute prediction error
            pred = model(X)
            
            loss = loss_fn(pred, y)
            measure = measure_fn(pred, y)

            # Backpropagation
            loss.backward()
            optimizer.step()
            
            losses.append([loss.item(), measure.item()])

            loop.set_description('Epoch {}/{}'.format(epoch + 1, epochs))
            loop.set_postfix(loss=loss.item(), measure=measure.item())
    
    return losses

#### Results / Evaluation

In [None]:
loss_fn = nn.MSELoss()
measure_fn = nn.L1Loss() # MAE
optimizer = torch.optim.Adam(model.parameters())

results = train(train_dataloader, model, loss_fn, measure_fn, optimizer, 200)
print("Done!")

In [None]:
results_data = pd.DataFrame(results).rolling(window=32).mean()
results_data.columns = ['loss', 'measure']
ax = results_data.plot(subplots=True);

In [None]:
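# The MAE was computed on min-max scaled durations. An error scales back with the
# data range only; inverse_transform would also add the minimum duration as an offset.
mae_scaled = results[-1][1]  # MAE of the last training batch
print("MAE (days, last batch): " + str(mae_scaled * scaler_y.data_range_[0]))

data_mean = durations.mean()
print("Data mean: " + str(data_mean))
print("MAE when simply predicting the mean: " + str((np.absolute(durations - data_mean).mean())))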

Now, extend this example with other encodings and a proper (!) evaluation for sequential event log data!
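
One possible starting point for such an evaluation is a temporal hold-out split, in which the model is trained on cases that started earlier and tested on cases that started later. The following is only a sketch under simplifying assumptions: the case data is made up, `temporal_split` is a hypothetical helper, and cases that are still running at the cut point are not treated specially.

```python
# Hypothetical cases: (case id, start date as sortable ISO string)
toy_cases = [
    ("c3", "2021-03-01"),
    ("c1", "2021-01-01"),
    ("c4", "2021-04-01"),
    ("c2", "2021-02-01"),
    ("c5", "2021-05-01"),
]

def temporal_split(cases, train_fraction=0.8):
    """Order cases by start time and cut once; earlier cases are used for training."""
    ordered = sorted(cases, key=lambda c: c[1])
    cut = int(len(ordered) * train_fraction)
    train_ids = [case_id for case_id, _ in ordered[:cut]]
    test_ids = [case_id for case_id, _ in ordered[cut:]]
    return train_ids, test_ids

toy_train_ids, toy_test_ids = temporal_split(toy_cases)
print(toy_train_ids)
print(toy_test_ids)
```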

### Binary Process Outcome

Now, we change our goal and aim to predict whether a patient is going to return to the emergency room.

#### Data Pre-processing

Let us take just the binary vector (0-1) of whether the activity `Return ER` occurred or not:

In [None]:
data_return = data[:,15]
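
The column index 15 is hard-coded above. A less brittle variant looks the column up by its feature label. The sketch below uses made-up rows and assumes labels of the form `event:concept:name@<activity>`. Note that `feature_names` was reassigned in the 2-gram section, so the labels of the set encoding would have to be kept from the first `log_to_features.apply` call.

```python
# Hypothetical feature labels and encoded rows (not the real Sepsis features)
toy_feature_labels = [
    "event:concept:name@ER Registration",
    "event:concept:name@ER Triage",
    "event:concept:name@Return ER",
]
toy_rows = [[1, 1, 0], [1, 1, 1]]

# Look up the target column by name instead of by position
target_label = "event:concept:name@Return ER"
target_idx = toy_feature_labels.index(target_label)

# Split each row into the target value and the remaining input features
toy_target = [row[target_idx] for row in toy_rows]
toy_inputs = [[v for i, v in enumerate(row) if i != target_idx] for row in toy_rows]

print(target_idx, toy_target, toy_inputs)
```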

Here are the first 3 cases:

In [None]:
data_return[0:3]

In [None]:
print("1st trace : " + str(project_traces(sepsis_log)[0]))
print("2nd trace : " + str(project_traces(sepsis_log)[1]))
print("3rd trace : " + str(project_traces(sepsis_log)[2]))

In [None]:
data_without_return = data[:,0:15]
data_without_return[0:3,:]

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_x = MinMaxScaler()
data_scaled = scaler_x.fit_transform(data_without_return)

scaler_y = MinMaxScaler()
return_scaled = scaler_y.fit_transform(data_return.reshape(-1, 1))

We make use of PyTorch to build a simple Neural Network.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

# We need float32 data
x = torch.from_numpy(data_scaled.astype('float32'))
y = torch.from_numpy(return_scaled.astype('float32'))

# Always check the shapes
print(x.shape)
print(y.shape)

ds = TensorDataset(x, y)
train_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

Let us check a random single sample from our data loader (always a good idea!)

In [None]:
inputs, classes = next(iter(train_dataloader))
print(inputs[0])
print(classes[0])

#### Model Definition

In [None]:
class NeuralNetworkBinaryOutcome(nn.Module):
    def __init__(self):
        super(NeuralNetworkBinaryOutcome, self).__init__()
        self.linear_relu_stack = nn.Sequential(            
            torch.nn.Linear(x.shape[1], 8),
            nn.BatchNorm1d(num_features=8),
            nn.LeakyReLU(),            
            torch.nn.Linear(8, 32),
            nn.BatchNorm1d(num_features=32),
            nn.LeakyReLU(),
            torch.nn.Linear(32, 64),
            nn.BatchNorm1d(num_features=64),
            nn.LeakyReLU(),
            torch.nn.Linear(64, 32),
            nn.BatchNorm1d(num_features=32),
            nn.LeakyReLU(),            
            torch.nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
model = NeuralNetworkBinaryOutcome().to(device)
print(model)

The training loop is identical:

In [None]:
def train(dataloader, model, loss_fn, measure_fn, optimizer, epochs, print_interval = 10):
    
    losses = []
    size = len(dataloader.dataset)
    
    for epoch in range(epochs):    
        
        loop = tqdm(dataloader)

        for batch, (X, y) in enumerate(loop):
            X, y = X.to(device), y.to(device)

            optimizer.zero_grad()

            # Compute prediction error
            pred = model(X)
           
            loss = loss_fn(pred, y)
            measure = measure_fn(pred, y)

            # Backpropagation
            loss.backward()
            optimizer.step()
            
            losses.append([loss.item(), measure.item()])

            loop.set_description('Epoch {}/{}'.format(epoch + 1, epochs))
            loop.set_postfix(loss=loss.item(), measure=measure.item())
    
    return losses

#### Results / Evaluation

In [None]:
loss_fn = nn.BCELoss()

def get_accuracy(y_prob, y_true):
    y_true = y_true.flatten()
    y_prob = y_prob.flatten()
    assert y_true.ndim == 1 and y_true.size() == y_prob.size()
    # Threshold the predicted probabilities at 0.5 to obtain class labels
    y_pred = (y_prob > 0.5).float()
    return (y_true == y_pred).sum() / y_true.size(0)

measure_fn = get_accuracy

optimizer = torch.optim.Adam(model.parameters())

results = train(train_dataloader, model, loss_fn, measure_fn, optimizer, 200)
print("Done!")

In [None]:
results_data = pd.DataFrame(results).rolling(window=32).mean()
results_data.columns = ['loss', 'measure']
ax = results_data.plot(subplots=True);

In [None]:
print("Accuracy: " + str(np.asarray(results[len(results)-1][1]).reshape(-1, 1)))
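
As with the mean baseline for the throughput time, an accuracy value is easier to interpret next to a trivial baseline. The following sketch compares a thresholded prediction with always predicting the majority class. The labels and probabilities are made up, and `accuracy` is a hypothetical plain-Python counterpart of `get_accuracy`.

```python
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]                  # hypothetical outcomes
toy_probs = [0.2, 0.4, 0.1, 0.7, 0.6, 0.3, 0.2, 0.1]   # hypothetical model outputs

def accuracy(probs, labels, threshold=0.5):
    """Share of cases where the thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# Baseline: always predict the most frequent class
majority_class = max(set(toy_labels), key=toy_labels.count)
baseline_acc = sum(int(y == majority_class) for y in toy_labels) / len(toy_labels)

model_acc = accuracy(toy_probs, toy_labels)
print("Model accuracy:    " + str(model_acc))
print("Baseline accuracy: " + str(baseline_acc))
```

In this toy data both values coincide, which illustrates that on imbalanced outcomes a seemingly high accuracy may not be better than the trivial baseline.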