# Process Mining Project
This Jupyter notebook contains the implementation of a recommendation system for process mining, based on decision tree classifiers.
We compute the predictions of the decision tree on a test set. When a prediction is negative, we use the structure of the decision tree to recommend the activities to perform in order to reach a positive outcome.

## Data Loading and Preprocessing
The following code block loads the necessary libraries and datasets, preprocesses the data, and prepares it for training the decision tree classifier.

#### List of Imports

In [None]:
# Importing libraries
import warnings
warnings.filterwarnings('ignore')

import os
import src.utils as utils
import src.plotting as plotting
from sklearn.tree import DecisionTreeClassifier
from hyperopt import hp

#### Setting Up the Environment and Global Parameters
The paths to the training and testing logs are defined below, together with the hyperparameters of the decision tree classifier and the prefix length used for trace encoding.

In [None]:
# Path to logs
TRAIN_LOG_PATH = os.path.join("logs", "Production_avg_dur_training_0-80.xes")
TEST_LOG_PATH = os.path.join("logs", "Production_avg_dur_testing_80-100.xes")

# Tree parameters
prefix_lengths = [5, 10]
prefix_length = prefix_lengths[1]

# Chosen hyperparameters
MAX_DEPTH = 3
MAX_FEATURES = 20
CRITERION = 'entropy'
RANDOM_SEED = 42

#### Import the Event Logs
Import the training event logs in XES format using the `pm4py` library.
For each trace in the log, its prefix of length `prefix_length` is created to form a pruned log used for training the decision tree classifier.
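
As a rough illustration of the prefix step, the sketch below truncates each trace to its first `prefix_length` events. It is not the code of `utils.create_prefixes_log`: the function name `take_prefixes`, the dictionary representation of the log, and the choice to keep traces shorter than the prefix length unchanged are all assumptions made for this example.

```python
from typing import Dict, List


def take_prefixes(traces: Dict[str, List[str]], prefix_length: int) -> Dict[str, List[str]]:
    """Keep only the first `prefix_length` activities of every trace.

    Assumption: traces shorter than `prefix_length` are kept whole.
    """
    return {trace_id: events[:prefix_length] for trace_id, events in traces.items()}


# Hypothetical toy log: trace id -> ordered list of activity names
toy_log = {
    "t1": ["Turning", "Milling", "Grinding", "Packing"],
    "t2": ["Turning", "Packing"],
}

toy_prefixes = take_prefixes(toy_log, prefix_length=3)
print(toy_prefixes)
```

Only the events inside the prefix are visible to the classifier, which mimics the situation of a trace that is still running.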

In [None]:
# Load the training log
log = utils.import_log(TRAIN_LOG_PATH)
print(f"Number of traces in the log: {len(log)}")

# Prune the log to the desired prefix length
pruned_log = utils.create_prefixes_log(log, prefix_length=prefix_length)
print(f"Pruned the log to length: {prefix_length}")

#### Boolean Encoding
The pruned event log is then transformed into a boolean-encoded DataFrame where each column represents
the presence or absence of a specific activity in the traces.
To do this, we:
1. Extract the unique activity names from the full training log.
2. Create a boolean DataFrame where each row corresponds to a trace and each column corresponds to an activity.
3. For each trace, set the corresponding activity columns to 1 if the activity is present in the trace, and 0 otherwise.
4. Add a target column indicating the label assigned to that trace.
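
The steps above can be sketched with `pandas` as follows. This is a minimal illustration, not the code of `utils.boolean_encode`: the function name `boolean_encode_sketch`, the toy data, and the way labels are passed in are assumptions. The column names `trace_id` and `label` mirror the ones dropped before training later in the notebook.

```python
import pandas as pd


def boolean_encode_sketch(prefixes, labels, activity_names):
    """Build one row per trace with a 0/1 column per activity and a label column."""
    rows = []
    for trace_id, events in prefixes.items():
        present = set(events)
        row = {"trace_id": trace_id}
        for activity in activity_names:
            # 1 if the activity occurs at least once in the prefix, 0 otherwise
            row[activity] = int(activity in present)
        row["label"] = labels[trace_id]
        rows.append(row)
    return pd.DataFrame(rows, columns=["trace_id", *activity_names, "label"])


# Hypothetical toy data
toy_prefixes = {"t1": ["Turning", "Milling"], "t2": ["Turning", "Packing"]}
toy_labels = {"t1": True, "t2": False}
toy_activities = ["Milling", "Packing", "Turning"]

toy_encoded = boolean_encode_sketch(toy_prefixes, toy_labels, toy_activities)
print(toy_encoded)
```

Note that this encoding records only whether an activity occurs, not how often or in which order.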

In [None]:
# Retrieve unique activity names
activity_names = utils.get_activity_names(log)
print(f"# Unique activity names in the log: {len(activity_names)}")

# Binary encode the pruned log
encoded_log = utils.boolean_encode(pruned_log, activity_names)
print(f"Encoded activities:\n{encoded_log.iloc[0].to_frame(name='Value')}")

## Decision Tree Classifier
Since recommendations must be given for traces that are still incomplete, the decision tree classifier is trained only on the prefixes of the traces.

#### Taking Optimized Hyperparameters

In [None]:
# Search space explored by the hyperparameter optimization
space = {
    'max_depth': hp.choice('max_depth', [2, 3, 4]),
    'max_features': hp.choice('max_features', ['sqrt', 'log2', None]),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'random_state': RANDOM_SEED
}

best_params = utils.hyperparameter_optimization(encoded_log, max_evals=300, space=space)

# Fixed parameters chosen by the user
chosen_params = {
    'max_depth': MAX_DEPTH,
    'max_features': MAX_FEATURES,
    'criterion': CRITERION,
    'random_state': RANDOM_SEED
}

params = best_params  # Switch to chosen_params to use the fixed values instead

# Names of the features used to train the decision tree
features = ['prefix_length'] + activity_names

#### Train the Decision Tree Classifier
The decision tree classifier is trained on the boolean-encoded DataFrame created from the pruned event log, using the specified hyperparameters. The input features are the activity columns, and the target variable is the label column.

In [None]:
# Set the hyperparameters for the decision tree
clf = DecisionTreeClassifier(
    max_depth=params['max_depth'],
    max_features=params['max_features'],
    criterion=params['criterion'],
    random_state=params['random_state'],
)

# Train the decision tree classifier
clf.fit(encoded_log.drop(['trace_id', 'label'], axis=1), encoded_log['label'])

#### Plot the Decision Tree
The trained decision tree classifier is visualized using the `plot_tree` function from `sklearn.tree`, displaying the structure of the tree, including feature names and class names.

In [None]:
plotting.plot_decision_tree(clf, features, save=False)

In the plot, blue nodes represent positive outcomes and orange nodes represent negative outcomes.
Each internal node shows the feature (activity) used for the split:
- the left child node corresponds to the case where the activity is not present (feature value = 'false');
- the right child node corresponds to the case where the activity is present (feature value = 'true').

Traversing the tree from the root to a leaf node provides the sequence of conditions leading to that outcome. 
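
To make the traversal concrete, the sketch below walks the fitted tree through scikit-learn's `tree_` attribute and collects the conditions on every root-to-leaf path. The helper name `root_to_leaf_paths`, the toy data, and the definition of confidence as the share of the majority class in the leaf are assumptions for illustration; the project's own path extraction may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def root_to_leaf_paths(clf, feature_names):
    """Return one (conditions, predicted_class, confidence) tuple per leaf."""
    tree = clf.tree_
    paths = []

    def walk(node, conditions):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:  # both children are -1 at a leaf
            counts = tree.value[node][0]
            best = int(np.argmax(counts))
            confidence = float(counts[best] / counts.sum())
            paths.append((conditions, clf.classes_[best], confidence))
            return
        name = feature_names[tree.feature[node]]
        # For 0/1 features the threshold is 0.5: left means absent, right means present
        walk(left, conditions + [(name, False)])
        walk(right, conditions + [(name, True)])

    walk(0, [])
    return paths


# Hypothetical toy data: the outcome is positive exactly when activity "A" is present
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

toy_paths = root_to_leaf_paths(toy_clf, ["A", "B"])
for conditions, outcome, confidence in toy_paths:
    print(conditions, outcome, confidence)
```

Each tuple lists the presence or absence conditions that lead to one leaf, which is the information needed later to find the paths ending in a positive outcome.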

## Evaluation of the Classifier
The performance of the decision tree classifier is evaluated on the test set using accuracy, precision, recall, and F1-score metrics.


#### Importing the Test Log
Again, the test log is imported in XES format, pruned to obtain prefixes, and boolean-encoded in the same way as the training log.

In [None]:
# Load the test log
test_log = utils.import_log(TEST_LOG_PATH)

# Create prefixes for the test log as well
test_log_prefix = utils.create_prefixes_log(test_log, prefix_length=prefix_length)

# Binary encode the test log prefixes
test_encoded_log = utils.boolean_encode(test_log_prefix, activity_names)


#### Compute the Predictions
The trained decision tree classifier is used to predict the labels of the traces in the test set.

In [None]:
predictions = clf.predict(test_encoded_log.drop(['trace_id', 'label'], axis=1))

#### Confusion Matrix
The confusion matrix is computed and visualized to show the performance of the classifier:
- True Positives (TP): positive cases correctly predicted as positive (bottom-right cell).
- True Negatives (TN): negative cases correctly predicted as negative (top-left cell).
- False Positives (FP): negative cases incorrectly predicted as positive (top-right cell).
- False Negatives (FN): positive cases incorrectly predicted as negative (bottom-left cell).
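
The cell layout described above matches the convention of `sklearn.metrics.confusion_matrix`, where rows are true labels and columns are predicted labels. The sketch below shows this on made-up labels; whether `plotting.plot_confusion_matrix` relies on this function internally is an assumption.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = negative outcome, 1 = positive outcome
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predicted labels, both ordered as [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Flattening row by row yields TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```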

In [None]:
true_labels = test_encoded_log['label'].values
plotting.plot_confusion_matrix(true_labels, predictions, save=False)

#### Metrics Calculation
Compute the accuracy, precision, recall, and F1-score based on the predictions and true labels from the test set.
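
For reference, the four metrics can be computed directly with `sklearn.metrics`, as in the sketch below. The labels are made up, and it is an assumption that `plotting.compute_all_metrics` returns comparable values; by default the scikit-learn functions treat label `1` as the positive class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 0 = negative outcome, 1 = positive outcome
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

toy_metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / all cases
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}

for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```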

In [None]:
result = plotting.compute_all_metrics(true_labels, predictions)

## Recommendations
Using the predictions of the decision tree classifier together with the structure of the tree, we can provide recommendations for traces predicted to have a negative outcome.

#### Extract Recommendations from the Decision Tree
To do this, we need to:
1. Extract the positive paths from the decision tree, which are the paths leading to leaf nodes with a positive outcome.
2. For each prefix trace predicted as negative:
    - Filter the positive paths to find the ones that are compliant with the current trace (no condition along the path is violated by the current trace);
    - From the compliant paths, take the one with the highest confidence score;
    - The recommended activities are the ones that need to be added to the current trace to follow the selected positive path.
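
The selection procedure above can be sketched in plain Python as follows. This is one possible reading, not the code of `utils.extract_recommendations`: the names `is_compliant` and `recommend`, the tuple representation of paths, and the toy paths are hypothetical. In particular, the sketch assumes that a prefix violates a path only when it already contains an activity the path requires to be absent, since a missing required activity can still be performed later.

```python
from typing import List, Optional, Set, Tuple

Condition = Tuple[str, bool]          # (activity name, True if it must be present)
Path = Tuple[List[Condition], float]  # (conditions, confidence of the positive leaf)


def is_compliant(conditions: List[Condition], present: Set[str]) -> bool:
    """Check that no 'must be absent' condition is already violated by the prefix."""
    return all(required or activity not in present for activity, required in conditions)


def recommend(positive_paths: List[Path], present: Set[str]) -> Optional[Set[str]]:
    """Return the activities to add, or None if no positive path is compliant."""
    compliant = [path for path in positive_paths if is_compliant(path[0], present)]
    if not compliant:
        return None
    # Among the compliant paths, follow the one with the highest confidence
    conditions, _ = max(compliant, key=lambda path: path[1])
    return {activity for activity, required in conditions if required and activity not in present}


# Hypothetical positive paths
toy_positive_paths = [
    ([("A", True), ("B", False)], 0.9),
    ([("A", False), ("C", True)], 0.8),
]

print(recommend(toy_positive_paths, {"B"}))       # only the second path is compliant
print(recommend(toy_positive_paths, {"D"}))       # both compliant, the first has higher confidence
print(recommend(toy_positive_paths, {"A", "B"}))  # no compliant path
```

This sketch returns only the activities to add; the conditions requiring an activity to stay absent are left implicit here.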

In [None]:
# Drop the true labels and add predicted labels for recommendation extraction
test_encoded_log_with_predictions = test_encoded_log.copy().drop('label', axis=1)
test_encoded_log_with_predictions['predicted_label'] = predictions

# Extract recommendations
recommendations = utils.extract_recommendations(clf, features, test_encoded_log_with_predictions)

#### Analysis of the Recommendations

In [None]:
plotting.print_recommendations(recommendations, max_display=5)

We can see that:
- Traces 1 and 5 are predicted as true, so their recommendation sets are empty.
- All the other traces are predicted as false.

**Visual Example**

In [None]:
# Taking one specific recommendation to plot on the tree
trace_index = 2
# Extract prefix features and recommendation
prefix_features, recommendation = list(recommendations.items())[trace_index]
# Convert prefix frozenset to regular set of activity names
prefix_trace_features = set(prefix_features)
# Convert BooleanConditions to (activity_name, value) tuples
recommended_conditions = {(cond.feature, cond.value) for cond in recommendation}

plotting.plot_recommendation_on_tree(
    clf,
    activity_names,
    prefix_trace_features=prefix_trace_features,
    recommended_conditions=recommended_conditions,
    save=False,
)

#### Evaluation of the Recommendations

In [None]:
full_trace_test_encoded_log = utils.boolean_encode(test_log, activity_names)
evaluation_metrics = utils.evaluate_recommendations(full_trace_test_encoded_log, recommendations)
plotting.print_recommendations_metrics(evaluation_metrics)