**RANDOM FOREST (RF)**

**DATA LOADING AND PREPARATION**

In [1]:
import pandas as pd

In [2]:
# Read the CSV
tfidf_df = pd.read_csv("tfidf_sncb.csv", sep=",")

tfidf_df['incident_type'] = tfidf_df['incident_type'].astype('string') 

tfidf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1011 entries, 0 to 1010
Columns: 828 entries, incident_id to 998
dtypes: float64(826), int64(1), string(1)
memory usage: 6.4 MB


In [3]:
from sklearn.model_selection import train_test_split

# Features: the TF-IDF values derived from the event sequences
X = tfidf_df.drop(['incident_type', 'incident_id'], axis=1)

# Target variable: the incident-type labels
y = tfidf_df['incident_type']

# Fix random_state once and reuse it across the whole pipeline to guarantee reproducibility
r_state = 123

# Split data into training+validation and testing sets
train_val_X, test_X, train_val_y, test_y = train_test_split(X, 
                                                            y, 
                                                            train_size = 0.8, 
                                                            random_state = r_state, # setting random_state for reproducibility
                                                            stratify = y) # to respect class imbalance in the label column

print(f"The train_val_X pandas df has {len(train_val_X)} rows and {len(train_val_X.columns)} columns.")
print(f"The test_y pandas series has {len(test_y)} rows and 1 column.")

The train_val_X pandas df has 808 rows and 826 columns.
The test_y pandas series has 203 rows and 1 column.
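
To illustrate what `stratify` does in the split above, here is a minimal sketch on toy labels. The label values and class sizes are made up for illustration and are not taken from the SNCB data. With stratification, each class should keep the same share in both partitions.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class "13", 20 of class "2"
toy_y = np.array(["13"] * 80 + ["2"] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

_, _, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=123, stratify=toy_y
)

# Count how many samples of each class ended up in each partition
train_counts = Counter(toy_train_y.tolist())
test_counts = Counter(toy_test_y.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both partitions keep the 80/20 class ratio of the toy labels, which is the property the notebook relies on for its imbalanced `incident_type` column.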


In [4]:
from collections import Counter

# get the size of the smallest incident type class
value_counts = Counter(train_val_y)
min_class_setsize = min(value_counts.values())

print(f"In RepeatedStratifiedKFold(), n_splits can be at most {min_class_setsize} because of class imbalance in the label column.")

In RepeatedStratifiedKFold(), n_splits can be at most 3 because of class imbalance in the label column.
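
The following sketch shows, on hypothetical toy labels, why the smallest class size is a natural upper bound for `n_splits`. When `n_splits` equals the number of samples in the rarest class, each validation fold receives one sample of that class. It also shows that the total number of fits is `n_splits * n_repeats`. The names `toy_y`, `toy_X` and `toy_rskf` are illustrative only.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical labels where the rarest class ("7") has only 3 members
toy_y = np.array(["13"] * 12 + ["2"] * 6 + ["7"] * 3)
toy_X = np.zeros((len(toy_y), 1))

min_count = min(Counter(toy_y.tolist()).values())
toy_rskf = RepeatedStratifiedKFold(n_splits=min_count, n_repeats=34, random_state=123)

# Count how many samples of the rarest class land in each validation fold
rare_per_fold = [
    int((toy_y[val_idx] == "7").sum()) for _, val_idx in toy_rskf.split(toy_X, toy_y)
]

total_iterations = toy_rskf.get_n_splits()
print(min_count, total_iterations, set(rare_per_fold))
```

With 3 splits and 34 repeats this gives 102 train/validation iterations, the same count the notebook reports for its cross-validation run.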


**MODEL TRAINING AND VALIDATION**

In [5]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from datetime import datetime
import numpy as np

# Define the base model
model_rf = RandomForestClassifier(random_state = r_state, 
                                  n_jobs = -1,
                                  criterion = "gini", # tested ["gini", "entropy", "log_loss"]
                                  max_features = 0.12, # tested ["sqrt", "log2", 10, 100, 0.12]
                                  class_weight = None # tested ["balanced", "balanced_subsample"]
                                 )

# Set up cross-validation
rskf = RepeatedStratifiedKFold(n_splits = min_class_setsize, 
                               n_repeats = 34, 
                               random_state = r_state)

In [6]:
# Start timing
start_time = datetime.now()

# Initialize lists to store metrics
accuracy_scores = []
weighted_f1_scores = []
micro_f1_scores = []
macro_f1_scores = []
fold = 1

# Cross-validation loop
for train_idx, val_idx in rskf.split(train_val_X, train_val_y):

    # Uncomment the print statement for debugging only
    #print(f"Starting model training and validation on fold {fold}")
    
    # Split data
    fold_train_X, fold_val_X = train_val_X.iloc[train_idx], train_val_X.iloc[val_idx]
    fold_train_y, fold_val_y = train_val_y.iloc[train_idx], train_val_y.iloc[val_idx]
    
    # Train the base model
    model_rf.fit(fold_train_X, fold_train_y)
    
    # Predict on the validation fold
    fold_pred_y = model_rf.predict(fold_val_X)
    
    # Compute metrics
    accuracy_scores.append(accuracy_score(fold_val_y, fold_pred_y))
    weighted_f1_scores.append(f1_score(fold_val_y, fold_pred_y, average='weighted'))
    micro_f1_scores.append(f1_score(fold_val_y, fold_pred_y, average='micro'))
    macro_f1_scores.append(f1_score(fold_val_y, fold_pred_y, average='macro'))
    
    fold += 1

# End timing
end_time = datetime.now()
elapsed_time = end_time - start_time

# Aggregate results
print(f"The training and validation of model_rf took {elapsed_time.total_seconds():.2f} seconds across {rskf.get_n_splits()} iterations.\n")

print(f"Mean Accuracy:          {np.mean(accuracy_scores):.8f} ± {np.std(accuracy_scores):.8f}")
print(f"Mean Weighted F1-Score: {np.mean(weighted_f1_scores):.8f} ± {np.std(weighted_f1_scores):.8f}")
print(f"Mean Micro F1-Score:    {np.mean(micro_f1_scores):.8f} ± {np.std(micro_f1_scores):.8f}")
print(f"Mean Macro F1-Score:    {np.mean(macro_f1_scores):.8f} ± {np.std(macro_f1_scores):.8f}\n")

print(f"Each fold had {len(fold_train_X)} entries for training and {len(fold_val_X)} for validation.")

The training and validation of model_rf took 27.02 seconds across 102 iterations.

Mean Accuracy:          0.61655868 ± 0.02484951
Mean Weighted F1-Score: 0.58984693 ± 0.02445527
Mean Micro F1-Score:    0.61655868 ± 0.02484951
Mean Macro F1-Score:    0.29649956 ± 0.01488594

Each fold had 539 entries for training and 269 for validation.
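
The manual loop above could probably also be written with scikit-learn's `cross_validate`, which fits a fresh clone of the estimator on every fold and collects several metrics at once. The sketch below uses synthetic data from `make_classification` as a stand-in for the TF-IDF features, and smaller settings to keep it quick. The `demo_*` names are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic stand-in data (an assumption; the notebook uses TF-IDF features)
demo_X, demo_y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=123
)

demo_model = RandomForestClassifier(n_estimators=20, random_state=123)
demo_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=123)

metric_names = ["accuracy", "f1_weighted", "f1_micro", "f1_macro"]

# One call trains and scores the model on every fold with all four metrics
cv_results = cross_validate(demo_model, demo_X, demo_y, cv=demo_cv, scoring=metric_names)

for name in metric_names:
    scores = cv_results[f"test_{name}"]
    print(f"{name:12s} {scores.mean():.4f} ± {scores.std():.4f}")
```

The explicit loop remains useful when per-fold objects such as predictions or fitted models need to be inspected. `cross_validate` is mainly a shorter way to get the same kind of score arrays.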


**criterion = "gini" AND max_features = 0.12**

**The training and validation of model_rf took 26.07 seconds across 102 iterations.**

Mean Accuracy:          0.61655868 ± 0.02484951

Mean Weighted F1-Score: 0.58984693 ± 0.02445527

Mean Micro F1-Score:    0.61655868 ± 0.02484951

Mean Macro F1-Score:    0.29649956 ± 0.01488594

**Each fold had 539 entries for training and 269 for validation.**

**criterion = "gini" AND max_features = 100**

**The training and validation of model_rf took 26.04 seconds across 102 iterations.**

Mean Accuracy:          0.61478026 ± 0.02414821

Mean Weighted F1-Score: 0.58796849 ± 0.02367034

Mean Micro F1-Score:    0.61478026 ± 0.02414821

Mean Macro F1-Score:    0.29525023 ± 0.01380702

**Each fold had 539 entries for training and 269 for validation.**

**criterion = "gini" AND max_features = "sqrt"**

**The training and validation of model_rf took 32.69 seconds across 102 iterations.**

Mean Accuracy:          0.60528086 ± 0.02410240

Mean Weighted F1-Score: 0.57394776 ± 0.02473896

Mean Micro F1-Score:    0.60528086 ± 0.02410240

Mean Macro F1-Score:    0.29011765 ± 0.01563658

**Each fold had 539 entries for training and 269 for validation.**
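
In every configuration logged above, the micro F1-score is identical to the accuracy. This is expected for single-label multiclass problems: each misclassified sample counts as one false positive and one false negative, so micro-averaged precision, recall and F1 all reduce to accuracy. A minimal sketch on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical single-label multiclass ground truth and predictions
true_y = np.array(["2", "2", "13", "13", "13", "99", "4", "4"])
pred_y = np.array(["2", "13", "13", "13", "99", "99", "13", "4"])

acc = accuracy_score(true_y, pred_y)
micro_f1 = f1_score(true_y, pred_y, average="micro")
macro_f1 = f1_score(true_y, pred_y, average="macro")

print(f"accuracy = {acc:.4f}, micro F1 = {micro_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

Because of this identity, the micro F1 column adds no information beyond accuracy here. The macro F1-score is the metric that reflects performance on the rare classes.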

**TEST SET RESULTS**

In [7]:
from sklearn.metrics import classification_report, confusion_matrix

# Retrain the model on the full training+validation set
model_rf.fit(train_val_X, train_val_y)

# Predict on the test set
test_pred_y = model_rf.predict(test_X)

# Compute and display metrics
print(f"The model classified correctly {sum(test_y == test_pred_y)} entries from a total of {len(test_X)}.\n")

print(f"Accuracy on test set:          {accuracy_score(test_y, test_pred_y)}")
print(f"Weighted F1-Score on test set: {f1_score(test_y, test_pred_y, average='weighted')}\n")

print("F1-Score per class\n")

# Generate classification report
report = classification_report(test_y, test_pred_y, output_dict=True, zero_division=0)

# Display F1-score per class
for class_label, metrics in report.items():
    if isinstance(metrics, dict) and 'f1-score' in metrics:
        print(f"Class {class_label}: F1-Score = {metrics['f1-score']:.6f}")

print("\nAccuracy per class\n")

# Display recall per class
for class_label, metrics in report.items():
    if isinstance(metrics, dict) and 'recall' in metrics:
        print(f"Class {class_label}: Recall = {metrics['recall']:.6f}") # Recall is equivalent to per-class accuracy

The model classified correctly 133 entries from a total of 203.

Accuracy on test set:          0.6551724137931034
Weighted F1-Score on test set: 0.628514051676953

F1-Score per class

Class 11: F1-Score = 0.000000
Class 13: F1-Score = 0.786207
Class 14: F1-Score = 0.666667
Class 16: F1-Score = 0.000000
Class 17: F1-Score = 0.000000
Class 2: F1-Score = 0.840000
Class 3: F1-Score = 0.000000
Class 4: F1-Score = 0.444444
Class 6: F1-Score = 0.000000
Class 7: F1-Score = 0.000000
Class 9: F1-Score = 0.521739
Class 99: F1-Score = 0.514286
Class macro avg: F1-Score = 0.314445
Class weighted avg: F1-Score = 0.628514

Accuracy per class

Class 11: Recall = 0.000000
Class 13: Recall = 0.890625
Class 14: Recall = 0.633333
Class 16: Recall = 0.000000
Class 17: Recall = 0.000000
Class 2: Recall = 0.875000
Class 3: Recall = 0.000000
Class 4: Recall = 0.375000
Class 6: Recall = 0.000000
Class 7: Recall = 0.000000
Class 9: Recall = 0.521739
Class 99: Recall = 0.514286
Class macro avg: Recall = 0.31749
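
The per-class recall values can also be derived directly from a confusion matrix, which the cell imports but does not use. The sketch below, on hypothetical labels, computes recall as the diagonal divided by the row totals. It also passes an explicit `labels` order, since string labels are otherwise sorted lexicographically (which is why "Class 2" appears after "Class 17" in the output above).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical string labels, mimicking the 'incident_type' column dtype
true_y = np.array(["2", "2", "2", "13", "13", "99", "99", "99", "99"])
pred_y = np.array(["2", "2", "13", "13", "13", "99", "99", "2", "13"])

# Sort labels numerically rather than lexicographically ("2" before "13")
labels = sorted(set(true_y.tolist()), key=int)

cm = confusion_matrix(true_y, pred_y, labels=labels)

# Rows are true classes, so recall = correct predictions / row total
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Cross-check against scikit-learn's own per-class recall
recall_sklearn = recall_score(true_y, pred_y, labels=labels, average=None)

for label, rec in zip(labels, recall_per_class):
    print(f"Class {label}: Recall = {rec:.6f}")
```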

**criterion = "gini" AND max_features = 0.12**

**The model classified correctly 133 entries from a total of 203.**

Accuracy on test set:          0.6551724137931034

Weighted F1-Score on test set: 0.628514051676953

**F1-Score per class**

Class 11: F1-Score = 0.000000

Class 13: F1-Score = 0.786207

Class 14: F1-Score = 0.666667

Class 16: F1-Score = 0.000000

Class 17: F1-Score = 0.000000

Class 2: F1-Score = 0.840000

Class 3: F1-Score = 0.000000

Class 4: F1-Score = 0.444444

Class 6: F1-Score = 0.000000

Class 7: F1-Score = 0.000000

Class 9: F1-Score = 0.521739

Class 99: F1-Score = 0.514286

Class macro avg: F1-Score = 0.314445

Class weighted avg: F1-Score = 0.628514

**criterion = "gini" AND max_features = 100**

**The model classified correctly 133 entries from a total of 203.**

Accuracy on test set:          0.6551724137931034

Weighted F1-Score on test set: 0.6252145851467866

**F1-Score per class**

Class 11: F1-Score = 0.000000

Class 13: F1-Score = 0.794521

Class 14: F1-Score = 0.592593

Class 16: F1-Score = 0.000000

Class 17: F1-Score = 0.000000

Class 2: F1-Score = 0.862745

Class 3: F1-Score = 0.000000

Class 4: F1-Score = 0.416667

Class 6: F1-Score = 0.000000

Class 7: F1-Score = 0.000000

Class 9: F1-Score = 0.521739

Class 99: F1-Score = 0.540541

Class macro avg: F1-Score = 0.310734

Class weighted avg: F1-Score = 0.625215

**criterion = "gini" AND max_features = "sqrt"**

**The model classified correctly 129 entries from a total of 203.**

Accuracy on test set:          0.6354679802955665

Weighted F1-Score on test set: 0.6082172179276506

**F1-Score per class**

Class 11: F1-Score = 0.333333

Class 13: F1-Score = 0.748387

Class 14: F1-Score = 0.653846

Class 16: F1-Score = 0.000000

Class 17: F1-Score = 0.000000

Class 2: F1-Score = 0.862745

Class 3: F1-Score = 0.000000

Class 4: F1-Score = 0.434783

Class 6: F1-Score = 0.000000

Class 7: F1-Score = 0.000000

Class 9: F1-Score = 0.521739

Class 99: F1-Score = 0.417910

Class macro avg: F1-Score = 0.331062

Class weighted avg: F1-Score = 0.608217
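
The gap between the macro and weighted averages in these logs follows from how they are computed: the macro average is the unweighted mean of the per-class F1-scores, so every class with F1 = 0 pulls it down regardless of how few samples it has. The sketch below recomputes the macro average from the per-class values logged for `max_features = 0.12`.

```python
import numpy as np

# Per-class F1-scores copied from the max_features = 0.12 test-set log above
f1_per_class = {
    "11": 0.000000, "13": 0.786207, "14": 0.666667, "16": 0.000000,
    "17": 0.000000, "2": 0.840000, "3": 0.000000, "4": 0.444444,
    "6": 0.000000, "7": 0.000000, "9": 0.521739, "99": 0.514286,
}

# Macro average: plain mean over classes, ignoring class support
macro_f1 = float(np.mean(list(f1_per_class.values())))

# Classes the model never predicted correctly
zero_classes = [label for label, score in f1_per_class.items() if score == 0.0]

print(f"Macro F1 recomputed: {macro_f1:.6f}")
print(f"Classes with F1 = 0: {zero_classes}")
```

Half of the twelve classes have an F1-score of zero, which is consistent with the macro average sitting near 0.31 while the weighted average stays near 0.63.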

**SAVE AND EXPORT RESULTS**
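
The export cell in this section writes one CSV file per metric. As a possible alternative, the four score lists could be stored as columns of a single DataFrame so that the per-fold values stay aligned. The sketch below uses placeholder scores and a hypothetical file name (`rf_cv_scores.csv`) in a temporary directory. In the notebook, the lists filled in the cross-validation loop would take the place of `demo_scores`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder scores for illustration only (not results from the notebook)
demo_scores = {
    "accuracy": [0.61, 0.63, 0.60],
    "weighted_f1": [0.58, 0.60, 0.57],
    "micro_f1": [0.61, 0.63, 0.60],
    "macro_f1": [0.29, 0.31, 0.28],
}

# One row per fold, one column per metric
scores_df = pd.DataFrame(demo_scores)
scores_df.index.name = "fold"

# Hypothetical output location; a temporary directory avoids side effects
out_path = Path(tempfile.mkdtemp()) / "rf_cv_scores.csv"
scores_df.to_csv(out_path)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_path, index_col="fold")
print(reloaded.describe().loc[["mean", "std"]])
```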

In [9]:
"""
# Create DataFrame for Scores
accuracy_rf_params_df = pd.DataFrame({'score': accuracy_scores})
weighted_f1_rf_params_df = pd.DataFrame({'score': weighted_f1_scores})
micro_f1_rf_params_df = pd.DataFrame({'score': micro_f1_scores})
macro_f1_rf_params_df = pd.DataFrame({'score': macro_f1_scores})

# Export the DataFrame to a CSV file
accuracy_rf_params_df.to_csv('accuracy_rf_params.csv', index=False)
weighted_f1_rf_params_df.to_csv('weighted_f1_rf_params.csv', index=False)
micro_f1_rf_params_df.to_csv('micro_f1_rf_params.csv', index=False)
macro_f1_rf_params_df.to_csv('macro_f1_rf_params.csv', index=False)
"""