# Process Mining Project
This project predicts the outcome of running process traces and, when the predicted outcome
is negative, recommends the activities most likely to lead to a positive outcome by
leveraging the transparency of decision trees.

## Data Loading and Preprocessing

### List of Imports

In [None]:
# Importing libraries
import warnings
warnings.filterwarnings('ignore')

import os
import src.utils as utils
import src.plotting as plotting
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Setting Up the Environment

In [None]:
TRAIN_LOG_PATH = os.path.join("logs", "Production_avg_dur_training_0-80.xes")
TEST_LOG_PATH = os.path.join("logs", "Production_avg_dur_testing_80-100.xes")

### Setting Global Parameters

In [None]:
prefix_length = 5

### Import the Event Logs
Import the event logs in XES format using the `pm4py` library.
Each trace in the log is then cut at a specified index to create prefixes for analysis.

In [None]:
log = utils.import_log(TRAIN_LOG_PATH)
print(f"Number of traces in the log: {len(log)}")
log = utils.create_prefixes_log(log, prefix_length=prefix_length)
print(f"\nFirst 5 traces:\n {log[:5]}")
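
The prefix step can be illustrated without `pm4py`. The snippet below is a minimal sketch, not the implementation of `utils.create_prefixes_log`: the toy log, the activity names and the helper `cut_prefixes` are hypothetical, and it assumes that traces shorter than the prefix length are kept whole (the real helper may drop them instead).

```python
# Hypothetical stand-in for a parsed event log:
# each trace is simply a list of activity names.
toy_log = [
    ["Turning", "Milling", "Grinding", "Quality Check", "Packing", "Shipping"],
    ["Turning", "Grinding"],
    ["Milling", "Turning", "Lapping", "Quality Check", "Packing"],
]


def cut_prefixes(traces, prefix_length):
    """Keep only the first `prefix_length` events of every trace."""
    return [trace[:prefix_length] for trace in traces]


toy_prefixes = cut_prefixes(toy_log, prefix_length=5)
for prefix in toy_prefixes:
    print(len(prefix), prefix)
```

Only the events inside the prefix are visible to the classifier, which mimics predicting the outcome of a case that is still running.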

### Boolean Encoding
The event log is transformed into a boolean-encoded DataFrame where each column represents
the presence or absence of specific activities in the traces.

In [None]:
activity_names = utils.get_activity_names(log)
print(f"Unique activity names in the log: {activity_names}")
encoded_log = utils.boolean_encode(log, activity_names)
print(f"Encoded activities (first 5 cases):\n{encoded_log.head()}")
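
As a rough illustration of the encoding, the sketch below builds one row per prefix with plain dictionaries. It is an assumption about what `utils.boolean_encode` produces, not its actual code: the helper name `boolean_encode_sketch` and the toy data are hypothetical, and the `label` column used later in the notebook is omitted.

```python
def boolean_encode_sketch(prefixes, activities):
    """Return one row per prefix: True if the activity occurs at least once."""
    rows = []
    for trace_id, prefix in enumerate(prefixes):
        seen = set(prefix)
        row = {"trace_id": trace_id}
        row.update({activity: activity in seen for activity in activities})
        rows.append(row)
    return rows


# Hypothetical prefixes over three activities A, B and C.
toy_prefixes_enc = [["A", "B", "A"], ["C"], ["B", "C"]]
toy_activities = sorted({a for p in toy_prefixes_enc for a in p})
toy_rows = boolean_encode_sketch(toy_prefixes_enc, toy_activities)

for row in toy_rows:
    print(row)
```

Boolean encoding discards both the order and the frequency of activities, so `["A", "B", "A"]` and `["B", "A"]` map to the same row.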

## Decision Tree 

**Taking optimized hyperparameters**

In [None]:
best_params = utils.hyperparameter_optimization(encoded_log, max_evals=100)

chosen_params = {
    'max_depth': 5,
    'max_features': 40,
    'criterion': 'gini',
    'random_state': 42
}

params = chosen_params  # Switch to best_params to use the optimized values instead

In [None]:
clf = DecisionTreeClassifier(max_depth=params['max_depth'], max_features=params['max_features'],
                             criterion=params['criterion'], random_state=params['random_state'])
clf.fit(encoded_log.drop(['trace_id', 'label'], axis=1), encoded_log['label'])

In [None]:
plotting.plot_decision_tree(clf, activity_names)

### Evaluation

**Importing the test set**

In [None]:
test_log = utils.import_log(TEST_LOG_PATH)
# Create prefixes for the test log as well
test_log = utils.create_prefixes_log(test_log, prefix_length=prefix_length)
test_encoded_log = utils.boolean_encode(test_log, activity_names)
predictions = clf.predict(test_encoded_log.drop(['trace_id', 'label'], axis=1))

**Confusion Matrix**

In [None]:
true_labels = test_encoded_log['label'].values
plotting.plot_confusion_matrix(true_labels, predictions)
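
For reference, the four cells of a binary confusion matrix can be counted by hand. This is a minimal sketch rather than the code behind `plotting.plot_confusion_matrix`; the helper `confusion_counts`, the toy label lists and the choice of the string `"true"` as the positive class are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive="true"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1
        elif truth == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical ground truth and predictions for six cases.
toy_true = ["true", "false", "true", "true", "false", "false"]
toy_pred = ["true", "true", "false", "true", "false", "false"]
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)
```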

**Metrics**

In [None]:
result = plotting.compute_all_metrics(true_labels, predictions)
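
The notebook does not show which metrics `plotting.compute_all_metrics` reports. Assuming the usual accuracy, precision, recall and F1-score, they follow from the confusion-matrix counts as sketched below; the function name and the example counts are hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive standard binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 2 true positives, 1 false positive,
# 1 false negative, 2 true negatives.
toy_metrics = metrics_from_counts(tp=2, fp=1, fn=1, tn=2)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against empty denominators matter for small test sets, where a class may never be predicted.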

## Recommendations

**Extract recommendations from the decision tree**

In [None]:
recommendations = utils.extract_recommendations(clf, activity_names, ['false', 'true'], test_encoded_log)
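
One way to read recommendations off a decision tree is to look for a root-to-leaf path that ends in a positive leaf and is still reachable from the current prefix. The sketch below shows that idea on a hand-written tree; it is an assumption about the general approach, not the logic of `utils.extract_recommendations`. The nested-dictionary tree format, `positive_paths` and `recommend` are all hypothetical.

```python
# Hypothetical toy tree: each internal node tests whether an activity occurs.
# "absent" is the branch taken when it does not occur, "present" when it does.
toy_tree = {
    "activity": "A",
    "absent": {"leaf": "false"},
    "present": {
        "activity": "B",
        "absent": {"leaf": "true"},
        "present": {"leaf": "false"},
    },
}


def positive_paths(node, conditions=()):
    """Yield the {activity: must_occur} conditions of every positive leaf."""
    if "leaf" in node:
        if node["leaf"] == "true":
            yield dict(conditions)
        return
    activity = node["activity"]
    yield from positive_paths(node["absent"], conditions + ((activity, False),))
    yield from positive_paths(node["present"], conditions + ((activity, True),))


def recommend(prefix_row, tree):
    """Return the fewest activities still to perform, or None if no path fits."""
    best = None
    for path in positive_paths(tree):
        # An activity that already occurred in the prefix cannot be undone.
        if any(prefix_row[a] and not must for a, must in path.items()):
            continue
        todo = sorted(a for a, must in path.items() if must and not prefix_row[a])
        if best is None or len(todo) < len(best):
            best = todo
    return best


print(recommend({"A": False, "B": False}, toy_tree))
print(recommend({"A": True, "B": True}, toy_tree))
```

In this sketch a path that requires an activity to be absent is only rejected when that activity has already happened; activities that must stay absent for the rest of the case are not returned, although a fuller version would report them as activities to avoid.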

**Evaluate the recommendations**

In [None]:
evaluation_metrics = utils.evaluate_recommendations(test_encoded_log, recommendations)
plotting.print_recommendations_metrics(evaluation_metrics)
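
The notebook does not spell out how recommendations are scored. A common scheme, assumed here purely for illustration, checks whether the complete trace actually contains the recommended activities and compares that with the real outcome: a followed recommendation with a positive outcome counts as a true positive, a followed one with a negative outcome as a false positive, and so on. The helper name and the toy cases below are hypothetical and may differ from `utils.evaluate_recommendations`.

```python
def evaluate_recommendations_sketch(cases, positive="true"):
    """Score (recommended, full_trace, actual_label) triples."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for recommended, full_trace, label in cases:
        # The recommendation counts as followed if every recommended
        # activity appears somewhere in the complete trace.
        followed = set(recommended) <= set(full_trace)
        is_positive = label == positive
        if followed and is_positive:
            counts["tp"] += 1
        elif followed:
            counts["fp"] += 1
        elif is_positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical cases: (recommended activities, full trace, actual label).
toy_cases = [
    (["A"], ["A", "C"], "true"),             # followed, positive outcome
    (["A"], ["C"], "false"),                 # not followed, negative outcome
    (["A", "B"], ["A", "B", "C"], "false"),  # followed, negative outcome
    (["B"], ["A"], "true"),                  # not followed, positive outcome
]
toy_eval = evaluate_recommendations_sketch(toy_cases)
print(toy_eval)
```

Under this scheme, precision measures how often following a recommendation actually led to a positive outcome.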