# Applied Process Mining Module

This notebook is part of an Applied Process Mining module. The collection of notebooks is a *living document* and subject to change. 

# Hands-on 4 - 'Predictive Process Mining' (Python / PM4Py)

## Setup

<img src="https://pm4py.fit.fraunhofer.de/static/assets/images/pm4py-site-logo-padded.png" alt="PM4Py" style="width: 200px;"/>

This notebook uses the following libraries:

* [numpy](https://numpy.org/)
* [pandas](https://pandas.pydata.org/)
* [matplotlib](https://matplotlib.org/)
* [PM4Py library](https://pm4py.fit.fraunhofer.de/)
* [PyTorch](https://pytorch.org/)
* [tqdm](https://tqdm.github.io/)

In [None]:
# Uncomment and run the following commands to install the dependencies
# %pip install numpy
# %pip install pandas
# %pip install matplotlib
# %pip install pm4py
# %pip install torch
# %pip install tqdm

In [None]:
import numpy as np
import pandas as pd
import pm4py
import os
import torch
import torch.nn as nn
from tqdm import tqdm

## Event Log

In [None]:
sepsis = pd.read_csv("../data/sepsis.csv", sep=';')
sepsis_log = pm4py.format_dataframe(sepsis, case_id='case_id', activity_key='activity', timestamp_key='timestamp')
sepsis_log = pm4py.convert_to_event_log(sepsis_log)

In [None]:
len(sepsis_log)

## Feature Extraction / Encoding

We use PM4Py's feature extraction utilities to turn each trace into a fixed-length feature vector; see the documentation:

https://pm4py.fit.fraunhofer.de/documentation/1.5#item-7-0-1

### Set of Events / 2-grams

In [None]:
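# Import path as used in the PM4Py 1.5 documentation linked above;
# newer PM4Py releases may organize this module differently.
from pm4py.objects.log.util import get_log_representation

# Encode each trace by the activity names ("concept:name") of its events
data, feature_names = get_log_representation.get_representation(sepsis_log,
                                                                str_ev_attr=["concept:name"],
                                                                str_tr_attr=[],
                                                                num_ev_attr=[],
                                                                num_tr_attr=[],
                                                                str_evsucc_attr=[])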

In [None]:
feature_names

In [None]:
data[0]

In [None]:
data.shape

So, PM4Py gives us a *one-hot encoding* of the so-called *set abstraction* of the event log: there are 16 distinct activities in the event log, and each position of a trace's feature vector encodes whether the corresponding activity occurs in that trace.
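
To make the set abstraction concrete, here is a minimal sketch that builds the same kind of encoding by hand with numpy. The two toy traces and their activity names are made up for illustration, and this is not how PM4Py computes its representation internally.

```python
import numpy as np

# Hypothetical toy traces; the activity names are illustrative only.
traces = [
    ["ER Registration", "ER Triage", "Leucocytes"],
    ["ER Registration", "Leucocytes", "Leucocytes", "CRP"],
]

# Fixed, sorted vocabulary of all activities seen in the toy log.
activities = sorted({a for trace in traces for a in trace})

# One row per trace, one column per activity:
# 1 if the activity occurs at least once in the trace, 0 otherwise.
set_encoding = np.array(
    [[int(a in trace) for a in activities] for trace in traces]
)

print(activities)
print(set_encoding)
```

Note that the repeated `Leucocytes` event in the second trace still yields a single 1: the set abstraction discards both the order and the frequency of activities.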

Let us have a look at the distribution of these feature vectors:

In [None]:
# Unique feature vectors (rows) and how often each one occurs
dist_features = np.unique(data, return_counts=True, axis=0)
dist_features

What is the most common feature vector?

In [None]:
dist_features[0][np.argmax(dist_features[1])]

This makes sense: almost all activities are bound to occur in every case of this process, and there are only a few choices.
So this encoding is unlikely to be the most useful one, but let's still try it for an initial predictive model and iterate later.

### Bag of Words / Multiset

In [None]:
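# Number of distinct cases
print(sepsis.loc[:,["case_id"]].nunique())
# Count how often each activity occurs per case:
# one row per case (sorted by case_id), one column per activity
data = np.asarray(sepsis.loc[:,["case_id", "activity"]].groupby(["case_id", "activity"]).size().unstack(fill_value=0))
data.shape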
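
The `groupby(...).size().unstack(fill_value=0)` idiom above can be hard to read at first, so here is a small sketch of what it produces. The toy dataframe is made up for illustration; only the column names `case_id` and `activity` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy event data with the same column names as sepsis.csv.
toy = pd.DataFrame({
    "case_id": ["A", "A", "A", "B", "B"],
    "activity": ["Leucocytes", "CRP", "Leucocytes", "CRP", "ER Triage"],
})

# Count events per (case, activity) pair, then pivot activities into columns.
# Pairs that never occur are filled with 0.
counts = toy.groupby(["case_id", "activity"]).size().unstack(fill_value=0)
bag = counts.to_numpy()

print(counts)
```

In contrast to the set abstraction, the multiset keeps frequencies: case `A` gets a 2 in the `Leucocytes` column. The rows follow the sorted `case_id` index, which matters when pairing them with a target vector.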

## Prediction

### Throughput time

In [None]:
from pm4py.statistics.traces.log import case_statistics

# Case durations in seconds, one value per case
durations = np.asarray(case_statistics.get_all_casedurations(sepsis_log, parameters={case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"}))
# Column vector of shape (n_cases, 1) to match the network output
durations = np.expand_dims(durations, 1)
# Convert seconds to days
durations = durations / 60 / 60 / 24
len(durations)
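
For supervised learning, row *i* of the feature matrix and entry *i* of the target vector must belong to the same case. It is worth verifying that the durations returned by PM4Py follow the same case order as the rows of `data`; if they come back in a different order (for example, sorted by value), features and targets would be mismatched. As a minimal sketch of an alternative, assuming the column names `case_id` and `timestamp` from the dataframe above, the durations can be computed with pandas so that they share the sorted `case_id` index of a `groupby`. The toy data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy events; rows are deliberately not sorted by case.
toy = pd.DataFrame({
    "case_id": ["B", "A", "A", "B"],
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00",
        "2021-01-02 00:00",
        "2021-01-03 12:00",
        "2021-01-05 00:00",
    ]),
})

# First and last timestamp per case; the result is indexed by sorted case_id.
span = toy.groupby("case_id")["timestamp"].agg(["min", "max"])

# Throughput time per case in days.
durations_days = (span["max"] - span["min"]).dt.total_seconds() / (60 * 60 * 24)

print(durations_days)
```

Because both this series and a `groupby(["case_id", "activity"])` count table are indexed by sorted `case_id`, their rows line up case by case.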

In [None]:
pd.DataFrame(durations).boxplot()

In [None]:
from torch.utils.data import TensorDataset, DataLoader

data = data.astype('float32')
durations = durations.astype('float32')

print(data.shape)
print(durations.shape)

ds = TensorDataset(torch.from_numpy(data), 
                   torch.from_numpy(durations))
train_dataloader = DataLoader(ds, batch_size=64, shuffle=True)
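
To see what the `DataLoader` yields, here is a small sketch with random data. The array sizes (150 cases, 16 features) are assumptions chosen for illustration, not the dimensions of the Sepsis log.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical random features and targets, only to inspect batch shapes.
features = np.random.rand(150, 16).astype("float32")
targets = np.random.rand(150, 1).astype("float32")

toy_ds = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))
toy_loader = DataLoader(toy_ds, batch_size=64, shuffle=True)

# Each iteration yields one (X, y) batch; the last batch may be smaller.
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in toy_loader]
print(batch_shapes)
```

With `shuffle=True`, the cases are drawn in a new random order every epoch, but each feature row stays paired with its own target.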

Let's define a simple network and try to overfit:

In [None]:
class NeuralNetwork(nn.Module):
    """Fully connected regression network with a single output (duration in days)."""

    def __init__(self, n_features=16):
        super().__init__()
        # n_features must equal the number of columns of the feature matrix
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(n_features, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, x):
        # Regression output: no activation on the final layer
        return self.linear_relu_stack(x)

In [None]:
device = 'cpu' #'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
model = NeuralNetwork().to(device)
print(model)

In [None]:
def train(dataloader, model, loss_fn, measure_fn, optimizer, epochs):
    """Train `model` and return the per-batch [loss, measure] values."""
    losses = []
    model.train()

    for epoch in range(epochs):
        loop = tqdm(dataloader)

        for X, y in loop:
            X, y = X.to(device), y.to(device)

            # Compute prediction error
            pred = model(X)
            loss = loss_fn(pred, y)
            measure = measure_fn(pred, y)

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            losses.append([loss.item(), measure.item()])

            loop.set_description('Epoch {}/{}'.format(epoch + 1, epochs))
            loop.set_postfix(loss=loss.item(), measure=measure.item())

    return losses

In [None]:
loss_fn = nn.MSELoss()
measure_fn = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters())

results = train(train_dataloader, model, loss_fn, measure_fn, optimizer, 500)
print("Done!")
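
The training cell optimizes the mean squared error (`nn.MSELoss`) and reports the mean absolute error (`nn.L1Loss`) as an additional measure. A minimal numpy sketch of the two quantities, using made-up predictions and targets, shows how they differ:

```python
import numpy as np

# Hypothetical true and predicted throughput times in days.
y_true = np.array([[1.0], [2.0], [10.0]])
y_pred = np.array([[2.0], [2.0], [4.0]])

errors = y_pred - y_true

# Mean squared error: penalizes large errors quadratically (unit: days squared).
mse = float(np.mean(errors ** 2))

# Mean absolute error: average deviation (unit: days).
mae = float(np.mean(np.abs(errors)))

print(mse, mae)
```

Since the durations are expressed in days, the measure can be read directly as "off by this many days on average", while the loss is dominated by the few cases with very long throughput times.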

In [None]:
# Smooth the per-batch loss and measure with a rolling mean over 15 batches
results_data = pd.DataFrame(results).rolling(window=15).mean()
results_data.columns = ['loss', 'measure']
ax = results_data.plot(subplots=True);