# Modeling Phase

## Introduction

This notebook performs the modeling phase using the preprocessed data and pipelines defined in `AlternativePreprocessing.ipynb`. All feature engineering and label engineering operations (including the creation of `RISK_LEVEL` via STKDE) are performed in a leakage-free manner, i.e., only after the train/test split and within each cross-validation fold.

**Dependencies:**
- Loading artifacts from `Classification (Preprocessing)` (data, pipelines, STKDE parameters, scoring_dict)
- Use of custom transformers and modular pipelines defined in `custom_transformers.py`

**Objectives:**
- Select the best model via cross-validation
- Perform hyperparameter tuning
- Evaluate generalization on the test set

# Setup

Import libraries, define paths, and load preprocessed data. All custom functions are imported from `custom_transformers.py` to ensure modularity and reusability.

## Run on Google Colab (optional)

If working locally, this cell can be ignored.

In [1]:
# Run on Google Colab (optional)
from google.colab import drive
drive.mount('/drive', force_remount=True)

Mounted at /drive


## Import libraries

Import all libraries needed for modeling and visualization. Custom functions are imported from `custom_transformers.py`.

In [2]:
!pip install xgboost
!pip install lightgbm

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib
import json
import glob
import sys

from sklearn.model_selection import StratifiedKFold, cross_validate, RandomizedSearchCV, GroupKFold, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score, accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, average_precision_score, matthews_corrcoef, classification_report, PrecisionRecallDisplay
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

project_path = '/drive/MyDrive/Data Mining and Machine Learning/Progetto'
os.chdir(project_path)
sys.path.append(project_path)

# Import custom transformers
from custom_transformers import cyclical_transform, BinarizeSinCosTransformer, STKDEAndRiskLabelTransformer, TargetEngineeringPipeline

from scipy.stats import ttest_rel
import random

random.seed(42)
np.random.seed(42)
#warnings.filterwarnings('ignore')
sns.set_style('whitegrid')



## Define paths

Define paths for loading preprocessed data and saving modeling results (models, metrics, plots).

In [3]:
base_dir = "/drive/MyDrive/Data Mining and Machine Learning/Progetto"

preprocessing_dir = os.path.join(base_dir, "Classification (Preprocessing)")
modeling_results_dir = os.path.join(base_dir, "Classification (Modeling)")
os.makedirs(modeling_results_dir, exist_ok=True)
print(f"Preprocessing artifacts will be loaded from: {preprocessing_dir}")
print(f"Modeling results will be saved to: {modeling_results_dir}")

Preprocessing artifacts will be loaded from: /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Preprocessing)
Modeling results will be saved to: /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Modeling)


## Load data and pipelines

Load the unprocessed data splits (`X_train`, `y_train`, `X_test`, `y_test`), the unfitted preprocessing pipelines, the feature names, and the scoring dictionary saved by `AlternativePreprocessing.ipynb`.

In [4]:
print("=== Loading data and pipelines ===")

X_train = np.load(os.path.join(preprocessing_dir, 'X_train.npy'), allow_pickle=True)
y_train = np.load(os.path.join(preprocessing_dir, 'y_train.npy'))
X_test = np.load(os.path.join(preprocessing_dir, 'X_test.npy'), allow_pickle=True)
y_test = np.load(os.path.join(preprocessing_dir, 'y_test.npy'))
feature_names = joblib.load(os.path.join(preprocessing_dir, 'X_feature_names.joblib'))

X_train = pd.DataFrame(X_train, columns=feature_names)
X_test = pd.DataFrame(X_test, columns=feature_names)

preprocessing_pipeline_general = joblib.load(os.path.join(preprocessing_dir, 'preprocessing_pipeline_general.joblib'))
preprocessing_pipeline_trees = joblib.load(os.path.join(preprocessing_dir, 'preprocessing_pipeline_trees.joblib'))
preprocessing_pipeline_bernoulli = joblib.load(os.path.join(preprocessing_dir, 'preprocessing_pipeline_bernoulli.joblib'))

scoring = joblib.load(os.path.join(preprocessing_dir, 'scoring_dict.joblib'))

print(f"Loaded unprocessed data: X_train {X_train.shape}, X_test {X_test.shape}, y_train {y_train.shape}, y_test {y_test.shape}")
print(f"Loaded preprocessing pipelines: general, trees, bernoulli")
print(f"Loaded scoring dictionary: {list(scoring.keys())}")

=== Loading data and pipelines ===
Loaded unprocessed data: X_train (362070, 27), X_test (113574, 27), y_train (362070,), y_test (113574,)
Loaded preprocessing pipelines: general, trees, bernoulli
Loaded scoring dictionary: ['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted', 'roc_auc_ovr_weighted', 'pr_auc_ovr_weighted', 'mcc', 'neg_log_loss']


## Target variable selection

Select the target variable for classification. In this notebook, the target variable `RISK_LEVEL` is *engineered* via STKDE within the leakage-free pipeline.

## Update Scoring with Ordinal Metrics

In [5]:
# Update the scoring dictionary
from sklearn.metrics import log_loss, roc_auc_score, average_precision_score, accuracy_score, f1_score, precision_score, recall_score, matthews_corrcoef, make_scorer

# Ensure ordinal metrics are defined (copied from earlier cell)
def ordinal_mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def severe_error_rate(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) == 2)

ordinal_mae_scorer = make_scorer(ordinal_mae, greater_is_better=False)
severe_error_scorer = make_scorer(severe_error_rate, greater_is_better=False)


# Define the scoring dictionary with appropriate averages for multiclass
scoring = {
    'accuracy': make_scorer(accuracy_score),
    # Explicitly set average for multiclass metrics
    'f1_weighted': make_scorer(f1_score, average='weighted'),
    'f1_macro': make_scorer(f1_score, average='macro'), # Added f1_macro for better insight
    'precision_weighted': make_scorer(precision_score, average='weighted', zero_division=0),
    'recall_weighted': make_scorer(recall_score, average='weighted', zero_division=0),
    'mcc': make_scorer(matthews_corrcoef), # MCC handles multiclass natively

    # Ordinal metrics
    'ordinal_mae': ordinal_mae_scorer, # Already defined above
    'severe_error': severe_error_scorer, # Already defined above

    # Probability-based metrics for multiclass
    'roc_auc_ovr_weighted': make_scorer(
        roc_auc_score,
        multi_class='ovr',
        average='weighted',
        needs_proba=True
    ),
    'pr_auc_ovr_weighted': make_scorer(
        average_precision_score,
        average='weighted',
        needs_proba=True
    ),
    'neg_log_loss': make_scorer(
        log_loss,
        greater_is_better=False,
        needs_proba=True
    )
}

print(f"Updated scoring dictionary with ordinal metrics and MCC: {list(scoring.keys())}")

Updated scoring dictionary with ordinal metrics and MCC: ['accuracy', 'f1_weighted', 'f1_macro', 'precision_weighted', 'recall_weighted', 'mcc', 'ordinal_mae', 'severe_error', 'roc_auc_ovr_weighted', 'pr_auc_ovr_weighted', 'neg_log_loss']


## Data verification

Check the dimensions and class distribution of the loaded data.

In [6]:
print("=== Data verification ===")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Scoring dict: {scoring}")

print("\nClass distribution in y_train (dummy, the real label will be engineered in the pipeline):")
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))
print("Proportions:")
print(dict(zip(unique, counts / len(y_train))))
print("\nNote: the real target variable will be created in the modeling pipeline.")

=== Data verification ===
X_train shape: (362070, 27)
y_train shape: (362070,)
X_test shape: (113574, 27)
y_test shape: (113574,)
Scoring dict: {'accuracy': make_scorer(accuracy_score, response_method='predict'), 'f1_weighted': make_scorer(f1_score, response_method='predict', average=weighted), 'f1_macro': make_scorer(f1_score, response_method='predict', average=macro), 'precision_weighted': make_scorer(precision_score, response_method='predict', average=weighted, zero_division=0), 'recall_weighted': make_scorer(recall_score, response_method='predict', average=weighted, zero_division=0), 'mcc': make_scorer(matthews_corrcoef, response_method='predict'), 'ordinal_mae': make_scorer(ordinal_mae, greater_is_better=False, response_method='predict'), 'severe_error': make_scorer(severe_error_rate, greater_is_better=False, response_method='predict'), 'roc_auc_ovr_weighted': make_scorer(roc_auc_score, response_method='predict', multi_class=ovr, average=weighted, needs_proba=True), 'pr_auc_ovr_we

## Pipelines and tuning for XGBoost/LightGBM

Definition of pipelines and placeholders for advanced models (XGBoost, LightGBM).

### Definition of models to compare

Definition of a dictionary with the classification models to be evaluated. Each model will be placed in a pipeline that includes preprocessing steps.

In [7]:
models_to_evaluate = {
    # --- Baseline ---
    'Dummy': DummyClassifier(strategy='stratified', random_state=42),

    # --- Linear Models ---
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced', n_jobs=-1),

    # --- K-Nearest Neighbors ---
    'KNN': KNeighborsClassifier(n_jobs=-1),

    # --- Support Vector Machines ---
    #'SVC_rbf': SVC(random_state=42, class_weight='balanced', probability=True, kernel='rbf'),
    'SVC_linear': SVC(random_state=42, class_weight='balanced', probability=True, kernel='linear'),

    # --- Naive Bayes ---
    'GaussianNB': GaussianNB(), # For comparison
    'BernoulliNB': BernoulliNB(), # Added

    # --- Tree-Based Models ---
    'DecisionTree': DecisionTreeClassifier(random_state=42, class_weight='balanced'),

    # --- Ensemble Methods ---
    'RandomForest': RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
}

models_to_evaluate.update({
    'XGBoost': None,  # Placeholder for XGBoost model
    'LightGBM': None  # Placeholder for LightGBM model
})
try:
    from xgboost import XGBClassifier
    models_to_evaluate['XGBoost'] = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
except ImportError:
    print("XGBoost not installed. Skipping.")

try:
    from lightgbm import LGBMClassifier
    models_to_evaluate['LightGBM'] = LGBMClassifier(random_state=42)
except ImportError:
    print("LightGBM not installed. Skipping.")

print(f"Defined models: {list(models_to_evaluate.keys())}")

Defined models: ['Dummy', 'LogisticRegression', 'KNN', 'SVC_linear', 'GaussianNB', 'BernoulliNB', 'DecisionTree', 'RandomForest', 'GradientBoosting', 'XGBoost', 'LightGBM']


### Model selection via cross-validation

Each model is evaluated using leakage-free cross-validation (GroupKFold on YEAR+MONTH if available, otherwise StratifiedKFold). The average metric results are saved for comparison.

In [8]:
print("=== Starting model evaluation with Cross-Validation ===")

from sklearn.model_selection import TimeSeriesSplit

# --- Load optimal STKDE parameters ---
optimal_stkde_params_path = os.path.join(preprocessing_dir, 'stkde_optimal_params.json')
hs_optimal = 200.0  # Default
ht_optimal = 60.0   # Default
try:
    with open(optimal_stkde_params_path, 'r') as f:
        params = json.load(f)
        hs_optimal = params.get('hs_opt', hs_optimal)
        ht_optimal = params.get('ht_opt', ht_optimal)
    print(f"Loaded optimal STKDE parameters: hs={hs_optimal}, ht={ht_optimal}")
except FileNotFoundError:
    print(f"Optimal STKDE parameters file not found: {optimal_stkde_params_path}. Using default values hs={hs_optimal}, ht={ht_optimal}.")
except Exception as e:
    print(f"Error loading STKDE parameters: {e}. Using default values hs={hs_optimal}, ht={ht_optimal}.")
# --- End loading optimal STKDE parameters ---

cv_results_all_models = {}

# --- TimeSeriesSplit logic ---
# Ensure 'YEAR', 'MONTH', 'DAY', 'HOUR' are numeric for sorting
for col in ['YEAR', 'MONTH', 'DAY', 'HOUR']:
    if col in X_train.columns:
        X_train[col] = pd.to_numeric(X_train[col], errors='coerce')
    if col in X_test.columns:
        X_test[col] = pd.to_numeric(X_test[col], errors='coerce')

# Sort X_train by time for proper TimeSeriesSplit
X_train_sorted = X_train.sort_values(['YEAR', 'MONTH', 'DAY', 'HOUR']).reset_index(drop=True)
y_train_sorted = y_train[X_train_sorted.index]

n_splits = 5
cv_strategy = TimeSeriesSplit(n_splits=n_splits)
print(f"Using TimeSeriesSplit with {n_splits} splits for time-aware cross-validation.")

# Directory for cached splits with engineered labels
split_cache_dir = os.path.join(modeling_results_dir, "split_label_cache")
os.makedirs(split_cache_dir, exist_ok=True)

for model_name, model_instance in models_to_evaluate.items():
    if model_instance is None:
        print(f"Skipping {model_name} as it is not available.")
        if model_name not in cv_results_all_models:
             cv_results_all_models[model_name] = {metric: np.nan for metric in scoring.keys()}
        continue

    print(f"\n--- Processing {model_name} ---")
    n_splits = 5
    split_metric_files = [
        os.path.join(modeling_results_dir, f"{model_name}_cv_split_{i+1}.json")
        for i in range(n_splits)
    ]
    all_splits_exist = all(os.path.exists(f) for f in split_metric_files)
    train_model_cv = not all_splits_exist

    if all_splits_exist:
        print(f"All split metric files found for {model_name}. Loading per-split results.")
        split_scores = []
        for split_file in split_metric_files:
            with open(split_file, 'r') as f:
                split_scores.append(json.load(f))
        # Aggregate metrics across all folds
        avg_scores = {}
        for metric in scoring.keys():
            metric_values = [s.get(metric, np.nan) for s in split_scores]
            avg_scores[metric] = float(np.nanmean(metric_values))
        cv_results_all_models[model_name] = avg_scores
        print(f"Aggregated CV results for {model_name}:")
        for metric_key_loaded, score_loaded in avg_scores.items():
            print(f"    {metric_key_loaded}: {score_loaded:.4f}")
    else:
        print(f"Not all split metric files found for {model_name}. Training required for missing splits.")
        # --- Setup pipeline as before ---
        current_preprocessing_pipeline_obj = None
        if model_name == 'Dummy':
            current_preprocessing_pipeline_obj = 'passthrough'
            print("No preprocessing for Dummy (using passthrough in pipeline)")
        elif model_name in ['DecisionTree', 'RandomForest', 'GradientBoosting', 'XGBoost', 'LightGBM']:
            current_preprocessing_pipeline_obj = preprocessing_pipeline_trees
            print(f"Using tree-optimized pipeline for {model_name}")
        elif model_name == 'BernoulliNB':
            current_preprocessing_pipeline_obj = preprocessing_pipeline_bernoulli
            print(f"Using BernoulliNB-optimized pipeline for {model_name}")
        else:
            current_preprocessing_pipeline_obj = preprocessing_pipeline_general
            print(f"Using general preprocessing pipeline for {model_name}")

        # --- Define STKDE and label engineering step ---
        stkde_label_generator = STKDEAndRiskLabelTransformer(
            hs=hs_optimal, ht=ht_optimal, n_classes=3, n_jobs=-1, random_state=42,
            intensity_col_name='stkde_intensity_fold',
            label_col_name='RISK_LEVEL_fold_engineered'
        )

        # --- Define feature processing and classification pipeline ---
        feature_processing_and_classifier_pipeline = Pipeline([
            ("feature_preprocessor", current_preprocessing_pipeline_obj),
            ("classifier", model_instance)
        ])

        # --- Combine into TargetEngineeringPipeline ---
        full_model_pipeline_for_cv = TargetEngineeringPipeline(
            target_engineer=stkde_label_generator,
            feature_pipeline=feature_processing_and_classifier_pipeline
        )

        # --- Custom cross-validation loop with label caching ---
        cv_scores = {f'test_{metric}': [] for metric in scoring.keys()}
        split_idx = 0
        for train_idx, val_idx in cv_strategy.split(X_train_sorted):
            split_idx += 1
            split_metric_path = os.path.join(modeling_results_dir, f"{model_name}_cv_split_{split_idx}.json")
            split_cache_path = os.path.join(split_cache_dir, f"split_{split_idx}_with_labels.csv")
            if os.path.exists(split_metric_path):
                print(f"Split {split_idx}: metrics already computed, loading from {split_metric_path}")
                with open(split_metric_path, 'r') as f:
                    split_metrics = json.load(f)
                for metric in scoring.keys():
                    cv_scores[f'test_{metric}'].append(split_metrics.get(metric, np.nan))
                continue

            # Prepare train/val splits and check for cached labels
            if os.path.exists(split_cache_path):
                print(f"Loading cached split with engineered labels: {split_cache_path}")
                split_df = pd.read_csv(split_cache_path)
                X_train_fold = split_df.loc[split_df['split'] == 'train'].drop(['split', 'engineered_label'], axis=1)
                y_train_fold = split_df.loc[split_df['split'] == 'train', 'engineered_label']
                X_val_fold = split_df.loc[split_df['split'] == 'val'].drop(['split', 'engineered_label'], axis=1)
                y_val_fold = split_df.loc[split_df['split'] == 'val', 'engineered_label']
            else:
                X_train_fold = X_train_sorted.iloc[train_idx].copy()
                X_val_fold = X_train_sorted.iloc[val_idx].copy()
                # Fit label engineer on train, transform both
                stkde_label_generator_fold = STKDEAndRiskLabelTransformer(
                    hs=hs_optimal, ht=ht_optimal, n_classes=3, n_jobs=-1, random_state=42,
                    intensity_col_name='stkde_intensity_fold',
                    label_col_name='RISK_LEVEL_fold_engineered'
                )
                stkde_label_generator_fold.fit(X_train_fold)
                X_train_fold_aug = stkde_label_generator_fold.transform(X_train_fold)
                X_val_fold_aug = stkde_label_generator_fold.transform(X_val_fold)
                y_train_fold = X_train_fold_aug['RISK_LEVEL_fold_engineered']
                y_val_fold = X_val_fold_aug['RISK_LEVEL_fold_engineered']
                # Save split with engineered labels for reuse
                train_fold_save = X_train_fold.copy()
                train_fold_save['split'] = 'train'
                train_fold_save['engineered_label'] = y_train_fold.values
                val_fold_save = X_val_fold.copy()
                val_fold_save['split'] = 'val'
                val_fold_save['engineered_label'] = y_val_fold.values
                split_df = pd.concat([train_fold_save, val_fold_save], axis=0)
                split_df.to_csv(split_cache_path, index=False)
                print(f"Saved split with engineered labels: {split_cache_path}")

            # Fit model on train fold, evaluate on val fold
            feature_pipeline_fold = Pipeline([
                ("feature_preprocessor", current_preprocessing_pipeline_obj),
                ("classifier", model_instance)
            ])
            feature_pipeline_fold.fit(X_train_fold, y_train_fold)
            y_val_pred = feature_pipeline_fold.predict(X_val_fold)
            y_val_proba = None
            # Check if predict_proba exists and is callable
            if hasattr(feature_pipeline_fold, "predict_proba") and callable(feature_pipeline_fold.predict_proba):
                try:
                    y_val_proba = feature_pipeline_fold.predict_proba(X_val_fold)
                except Exception as e:
                    print(f"Warning: Could not get probabilities for {model_name}: {e}")
                    y_val_proba = None

            # Compute metrics
            split_metrics = {}
            for metric_name, scorer in scoring.items():
                score = np.nan # Default to NaN if computation fails or proba needed but not available
                try:
                    if metric_name == 'accuracy':
                        score = accuracy_score(y_val_fold, y_val_pred)
                    elif metric_name == 'f1_weighted':
                        score = f1_score(y_val_fold, y_val_pred, average='weighted', zero_division=0)
                    elif metric_name == 'f1_macro':
                        score = f1_score(y_val_fold, y_val_pred, average='macro', zero_division=0)
                    elif metric_name == 'precision_weighted':
                        score = precision_score(y_val_fold, y_val_pred, average='weighted', zero_division=0)
                    elif metric_name == 'recall_weighted':
                        score = recall_score(y_val_fold, y_val_pred, average='weighted', zero_division=0)
                    elif metric_name == 'mcc':
                        score = matthews_corrcoef(y_val_fold, y_val_pred)
                    elif metric_name == 'ordinal_mae':
# Ensure ordinal_mae function is available
                        if 'ordinal_mae' in globals() and callable(globals()['ordinal_mae']):
                            score = ordinal_mae(y_val_fold, y_val_pred)
                        else:
                            print(f"Warning: ordinal_mae function not found for metric {metric_name}")
                            score = np.nan
                    elif metric_name == 'severe_error':
# Ensure severe_error_rate function is available
                        if 'severe_error_rate' in globals() and callable(globals()['severe_error_rate']):
                            score = severe_error_rate(y_val_fold, y_val_pred)
                        else:
                            print(f"Warning: severe_error_rate function not found for metric {metric_name}")
                            score = np.nan
                    elif metric_name == 'roc_auc_ovr_weighted':
                        if y_val_proba is not None and len(np.unique(y_val_fold)) > 1: # Check for multiple classes
                            try:
                                score = roc_auc_score(y_val_fold, y_val_proba, multi_class='ovr', average='weighted')
                            except ValueError as e:
                                print(f"Warning: Could not compute ROC AUC for {model_name}, split {split_idx}: {e}")
                                score = np.nan
                        else:
                            score = np.nan # Not applicable or probas not available
                    elif metric_name == 'pr_auc_ovr_weighted':
                        if y_val_proba is not None and len(np.unique(y_val_fold)) > 1: # Check for multiple classes
                            try:
                                score = average_precision_score(y_val_fold, y_val_proba, average='weighted')
                            except ValueError as e:
                                print(f"Warning: Could not compute PR AUC for {model_name}, split {split_idx}: {e}")
                                score = np.nan
                        else:
                            score = np.nan # Not applicable or probas not available
                    elif metric_name == 'neg_log_loss':
                        if y_val_proba is not None and len(np.unique(y_val_fold)) > 1: # Check for multiple classes
                            try:
# log_loss expects positive values, scorer has greater_is_better=False,
                                # so store the negative log loss directly.
                                score = log_loss(y_val_fold, y_val_proba)
                            except ValueError as e:
                                print(f"Warning: Could not compute log loss for {model_name}, split {split_idx}: {e}")
                                score = np.nan
                        else:
                            score = np.nan # Not applicable or probas not available
                    else:
                        # Fallback for any other scorers defined
                        # This re-introduces potential issues if the scorer's _score_func
                        # doesn't handle multiclass correctly without explicit args.
                        # A more robust approach would be to add more elif blocks
                        # for other specific scorers if needed.
                        if hasattr(scorer, '_score_func'):
                            needs_proba = getattr(scorer, 'kwargs', {}).get('needs_proba', False)
                            if needs_proba:
                                if y_val_proba is not None:
                                    score = scorer._score_func(y_val_fold, y_val_proba)
                                else:
                                    score = np.nan # Probas needed but not available
                            else:
                                score = scorer._score_func(y_val_fold, y_val_pred)
                        else:
                            print(f"Warning: Scorer object for '{metric_name}' does not have _score_func or known type.")
                            score = np.nan

                except Exception as e:
                    print(f"Error computing metric '{metric_name}' for {model_name}, split {split_idx}: {e}")
                    import traceback
                    traceback.print_exc()
                    score = np.nan # Ensure NaN is stored if computation fails

                cv_scores[f'test_{metric_name}'].append(score)
                split_metrics[metric_name] = score

            # Save metrics for this split
            with open(split_metric_path, 'w') as f:
                json.dump(split_metrics, f, indent=4)
            print(f"Saved metrics for split {split_idx} to {split_metric_path}")

        # Aggregate metrics across all folds
        avg_scores_calculated = {}
        print(f"  CV Scores for {model_name}:")
        for metric in scoring.keys():
            test_metric_key = f'test_{metric}'
            if test_metric_key in cv_scores:
                mean_score = np.nanmean(cv_scores[test_metric_key])
                std_score = np.nanstd(cv_scores[test_metric_key])
                avg_scores_calculated[metric] = mean_score
                print(f"    {metric}: {mean_score:.4f} +/- {std_score:.4f}")
            else:
                avg_scores_calculated[metric] = np.nan
                print(f"    {metric}: Not computed (NaN)")
        cv_results_all_models[model_name] = avg_scores_calculated

        json_path = os.path.join(modeling_results_dir, f"{model_name}_cv_results.json")
        with open(json_path, 'w') as f:
            json.dump(avg_scores_calculated, f, indent=4)
        print(f"CV results saved to: {json_path}")

=== Starting model evaluation with Cross-Validation ===
Loaded optimal STKDE parameters: hs=150, ht=45
Using TimeSeriesSplit with 5 splits for time-aware cross-validation.

--- Processing Dummy ---
Not all split metric files found for Dummy. Training required for missing splits.
No preprocessing for Dummy (using passthrough in pipeline)
Split 1: metrics already computed, loading from /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Modeling)/Dummy_cv_split_1.json
Split 2: metrics already computed, loading from /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Modeling)/Dummy_cv_split_2.json
Split 3: metrics already computed, loading from /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Modeling)/Dummy_cv_split_3.json
Split 4: metrics already computed, loading from /drive/MyDrive/Data Mining and Machine Learning/Progetto/Classification (Modeling)/Dummy_cv_split_4.json


KeyboardInterrupt: 

In [None]:
# Cross-validation results visualization

print("\n=== Aggregated cross-validation results ===")
if 'cv_results_all_models' in locals() and cv_results_all_models:
    cv_results_df = pd.DataFrame.from_dict(cv_results_all_models, orient='index')
    expected_metric_keys = list(scoring.keys())
    for metric_key in expected_metric_keys:
        if metric_key not in cv_results_df.columns:
            cv_results_df[metric_key] = np.nan
    print("\nCross-validation results summary (means):")
    if 'f1_weighted' in cv_results_df.columns:
        display(cv_results_df.sort_values(by='f1_weighted', ascending=False))
        plt.figure(figsize=(12, 8))
        sorted_df_plot = cv_results_df.sort_values(by='f1_weighted', ascending=False)
        sns.barplot(x=sorted_df_plot['f1_weighted'], y=sorted_df_plot.index)
        plt.title('Mean F1-Weighted (Cross-Validation)')
        plt.xlabel('F1-Weighted')
        plt.ylabel('Model')
        plt.tight_layout()
        plt.show()
    else:
        print("Displaying results without sorting by f1_weighted.")
        display(cv_results_df)
else:
    print("No cross-validation results available.")
    cv_results_df = pd.DataFrame()

### Hyperparameter tuning

The best model is selected and tuned via RandomizedSearchCV. The best model is chosen based on the main metric (f1_weighted, mcc, or accuracy).

In [None]:
cv_results_df = pd.DataFrame(cv_results_all_models).T # Ensure cv_results_df is created

if cv_results_df.empty or not any(col in cv_results_df.columns for col in ['f1_weighted', 'mcc', 'accuracy']) or cv_results_df[[col for col in ['f1_weighted', 'mcc', 'accuracy'] if col in cv_results_df.columns]].isnull().all().all():
    print("CV results are empty or key metrics (f1_weighted, mcc, accuracy) are missing/all NaN. Unable to select the best model for tuning.")
    best_model_name_for_tuning = None
    best_tuned_full_pipeline = None
else:
    primary_metric_to_sort = None
    if 'f1_weighted' in cv_results_df.columns and not cv_results_df['f1_weighted'].isnull().all():
        primary_metric_to_sort = 'f1_weighted'
    elif 'mcc' in cv_results_df.columns and not cv_results_df['mcc'].isnull().all():
        primary_metric_to_sort = 'mcc'
    elif 'accuracy' in cv_results_df.columns and not cv_results_df['accuracy'].isnull().all():
        primary_metric_to_sort = 'accuracy'

    if primary_metric_to_sort:
        best_model_name_for_tuning = cv_results_df[primary_metric_to_sort].idxmax()
        print(f"Model selected for tuning based on {primary_metric_to_sort}: {best_model_name_for_tuning}")
    else:
        print("Unable to determine a primary metric to sort models. Skipping tuning.")
        best_model_name_for_tuning = None
        best_tuned_full_pipeline = None

    # --- Load optimal STKDE parameters (as in the CV cell) ---
    optimal_stkde_params_path_tuning = os.path.join(preprocessing_dir, 'stkde_optimal_params.json')
    hs_optimal_tuning = 200.0  # Default
    ht_optimal_tuning = 60.0   # Default
    try:
        with open(optimal_stkde_params_path_tuning, 'r') as f:
            params_tuning = json.load(f)
            hs_optimal_tuning = params_tuning.get('hs_opt', hs_optimal_tuning)
            ht_optimal_tuning = params_tuning.get('ht_opt', ht_optimal_tuning)
        print(f"Optimal STKDE parameters loaded for tuning: hs={hs_optimal_tuning}, ht={ht_optimal_tuning}")
    except FileNotFoundError:
        print(f"Optimal STKDE parameters file not found for tuning. Using default values hs={hs_optimal_tuning}, ht={ht_optimal_tuning}.")
    except Exception as e:
        print(f"Error loading STKDE parameters for tuning: {e}. Using default values hs={hs_optimal_tuning}, ht={ht_optimal_tuning}.")
    # --- End loading optimal STKDE parameters ---

    # Determine the correct preprocessing pipeline for the selected model
    tuning_feature_preprocessor_obj = None
    if best_model_name_for_tuning:
        if best_model_name_for_tuning in ['DecisionTree', 'RandomForest', 'GradientBoosting', 'XGBoost', 'LightGBM']:
            tuning_feature_preprocessor_obj = preprocessing_pipeline_trees
        elif best_model_name_for_tuning == 'BernoulliNB':
            tuning_feature_preprocessor_obj = preprocessing_pipeline_bernoulli
        elif best_model_name_for_tuning == 'Dummy':
            tuning_feature_preprocessor_obj = 'passthrough'
        else: # For LogisticRegression, KNN, SVC_rbf, SVC_linear, GaussianNB
            tuning_feature_preprocessor_obj = preprocessing_pipeline_general
        print(f"Using {type(tuning_feature_preprocessor_obj).__name__ if tuning_feature_preprocessor_obj != 'passthrough' else 'passthrough'} for feature preprocessing during tuning of {best_model_name_for_tuning}")

    best_tuned_full_pipeline = None
    loaded_model_successfully = False
    if best_model_name_for_tuning and models_to_evaluate.get(best_model_name_for_tuning) is not None:
        best_tuned_model_path = os.path.join(modeling_results_dir, f'{best_model_name_for_tuning}_best_tuned_model.joblib')
        if os.path.exists(best_tuned_model_path):
            print(f"\nOptimized model found for {best_model_name_for_tuning} in: {best_tuned_model_path}")
            try:
                best_tuned_full_pipeline = joblib.load(best_tuned_model_path)
                print(f"Existing optimized model loaded successfully for {best_model_name_for_tuning}.")
                loaded_model_successfully = True
            except Exception as e:
                print(f"Error loading optimized model for {best_model_name_for_tuning}: {e}. Proceeding with tuning.")
                best_tuned_full_pipeline = None

    if not loaded_model_successfully and best_model_name_for_tuning and models_to_evaluate.get(best_model_name_for_tuning) is not None and tuning_feature_preprocessor_obj is not None:
        print(f"\nProceeding with tuning for {best_model_name_for_tuning}.")

        # Define parameter grids with correct prefixes for TargetEngineeringPipeline
        param_dist_rf = {
            'feature_pipeline__classifier__n_estimators': [100, 200, 300, 500],
            'feature_pipeline__classifier__max_features': ['sqrt', 'log2', 0.5],
            'feature_pipeline__classifier__max_depth': [10, 20, 30, None],
            'feature_pipeline__classifier__min_samples_split': [2, 5, 10],
            'feature_pipeline__classifier__min_samples_leaf': [1, 2, 4],
            'feature_pipeline__classifier__class_weight': ['balanced', 'balanced_subsample', None]
        }
        param_dist_gb = {
            'feature_pipeline__classifier__n_estimators': [100, 200, 300],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.05, 0.1],
            'feature_pipeline__classifier__max_depth': [3, 5, 7]
        }
        param_dist_lr = {
            'feature_pipeline__classifier__C': np.logspace(-3, 3, 10),
            'feature_pipeline__classifier__penalty': ['l1', 'l2'],
            'feature_pipeline__classifier__solver': ['liblinear', 'saga']
        }
        param_dist_svc = {
            'feature_pipeline__classifier__C': np.logspace(-3, 2, 7),
            'feature_pipeline__classifier__kernel': ['linear', 'rbf'],
            'feature_pipeline__classifier__gamma': np.logspace(-3, 2, 7) # For rbf
        }
        param_dist_xgb = {
            'feature_pipeline__classifier__n_estimators': [100, 200, 300],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.05, 0.1],
            'feature_pipeline__classifier__max_depth': [3, 5, 7]
        }
        param_dist_lgbm = {
            'feature_pipeline__classifier__n_estimators': [100, 200, 300],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.05, 0.1],
            'feature_pipeline__classifier__num_leaves': [31, 50, 70]
        }

        current_param_dist = None
        if best_model_name_for_tuning == 'RandomForest': current_param_dist = param_dist_rf
        elif best_model_name_for_tuning == 'GradientBoosting': current_param_dist = param_dist_gb
        elif best_model_name_for_tuning == 'LogisticRegression': current_param_dist = param_dist_lr
        elif best_model_name_for_tuning and best_model_name_for_tuning.startswith('SVC'): current_param_dist = param_dist_svc
        elif best_model_name_for_tuning == 'XGBoost': current_param_dist = param_dist_xgb
        elif best_model_name_for_tuning == 'LightGBM': current_param_dist = param_dist_lgbm

        if current_param_dist:
            # --- Define STKDE and label engineering step for tuning ---
            stkde_label_generator_tuning = STKDEAndRiskLabelTransformer(
                hs=hs_optimal_tuning, ht=ht_optimal_tuning, n_classes=3, n_jobs=-1, random_state=42,
                intensity_col_name='stkde_intensity_tune',
                label_col_name='RISK_LEVEL_tune_engineered'
            )

            # --- Define preprocessing and classification pipeline for tuning ---
            base_classifier_instance_tuning = models_to_evaluate[best_model_name_for_tuning]
            feature_processing_pipeline_tuning = Pipeline([
                ("feature_preprocessor", tuning_feature_preprocessor_obj),
                ("classifier", base_classifier_instance_tuning)
            ])

            # --- Compose into TargetEngineeringPipeline for tuning ---
            pipeline_for_tuning_estimator = TargetEngineeringPipeline(
                target_engineer=stkde_label_generator_tuning,
                feature_pipeline=feature_processing_pipeline_tuning
            )

            # Setup RandomizedSearchCV
            if 'cv_strategy' not in locals() or ('groups' not in locals() and isinstance(cv_strategy, GroupKFold)):
                 print("Warning: cv_strategy or groups for GroupKFold not found. Re-initializing for tuning.")
                 if 'YEAR' in X_train.columns and 'MONTH' in X_train.columns:
                     dt_tune = pd.to_datetime(dict(year=X_train['YEAR'], month=X_train['MONTH'], day=1))
                     groups_tune = dt_tune.dt.to_period('M').astype(int)
                     cv_strategy_tune = GroupKFold(n_splits=5)
                 else:
                     groups_tune = None
                     cv_strategy_tune = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
            else:
                 cv_strategy_tune = cv_strategy
                 groups_tune = groups if 'groups' in locals() else None

            refit_metric_tuning = None
            if 'f1_weighted' in scoring: refit_metric_tuning = 'f1_weighted'
            elif 'mcc' in scoring: refit_metric_tuning = 'mcc'
            elif 'accuracy' in scoring: refit_metric_tuning = 'accuracy'
            else:
                refit_metric_tuning = list(scoring.keys())[0]
                print(f"Warning: defaulting RandomizedSearch refit metric to {refit_metric_tuning}")


            random_search = RandomizedSearchCV(
                estimator=pipeline_for_tuning_estimator,
                param_distributions=current_param_dist,
                n_iter=20, # Reduced for speed, increase for thoroughness (50-100)
                cv=cv_strategy_tune,
                scoring=scoring,
                refit=refit_metric_tuning,
                n_jobs=-1,
                random_state=42,
                verbose=1,
                error_score=np.nan
            )

            print(f"\nStarting RandomizedSearch CV for {best_model_name_for_tuning}...")
            random_search.fit(X_train, y_train, groups=groups_tune if groups_tune is not None else None)

            print("\nBest parameters found:")
            print(random_search.best_params_)
            print(f"\nBest {refit_metric_tuning} score during search: {random_search.best_score_:.4f}")

            best_tuned_full_pipeline = random_search.best_estimator_
            joblib.dump(best_tuned_full_pipeline, best_tuned_model_path)
            print(f"\nBest optimized model saved in: {best_tuned_model_path}")

            search_results_path = os.path.join(modeling_results_dir, f'{best_model_name_for_tuning}_random_search_results.joblib')
            joblib.dump(random_search.cv_results_, search_results_path)
            print(f"RandomizedSearch results saved in: {search_results_path}")
        else:
            print(f"No specific hyperparameter grid for {best_model_name_for_tuning}, or model not typically tuned (e.g., Dummy). Fitting the base model.")
            stkde_label_generator_base = STKDEAndRiskLabelTransformer(
                hs=hs_optimal_tuning, ht=ht_optimal_tuning, n_classes=3, n_jobs=-1, random_state=42
            )
            base_classifier_instance = models_to_evaluate[best_model_name_for_tuning]
            feature_processing_pipeline_base = Pipeline([
                ("feature_preprocessor", tuning_feature_preprocessor_obj),
                ("classifier", base_classifier_instance)
            ])
            best_tuned_full_pipeline = TargetEngineeringPipeline(
                target_engineer=stkde_label_generator_base,
                feature_pipeline=feature_processing_pipeline_base
            )
            print(f"Fitting the base pipeline for {best_model_name_for_tuning} on the full training data...")
            best_tuned_full_pipeline.fit(X_train, y_train)
            joblib.dump(best_tuned_full_pipeline, best_tuned_model_path)
            print(f"Base model fitted and saved in: {best_tuned_model_path}")

    elif not best_model_name_for_tuning:
        print("Skipping tuning, no best model determined.")
        best_tuned_full_pipeline = None
    else:
        print(f"\nUsing the pre-loaded optimized model for {best_model_name_for_tuning}. Skipping tuning/fitting.")

In [None]:
# Final evaluation on the test set

# The variable from the previous cell should be 'best_tuned_full_pipeline'
if 'best_tuned_full_pipeline' in locals() and best_tuned_full_pipeline is not None:
    final_model_name_eval = "TunedModel"
    try:
        final_classifier_step = best_tuned_full_pipeline.feature_pipeline_.named_steps.get('classifier')
        if final_classifier_step:
            final_model_name_eval = final_classifier_step.__class__.__name__
        if 'best_model_name_for_tuning' in locals() and best_model_name_for_tuning:
            final_model_name_eval = best_model_name_for_tuning
    except AttributeError:
        pass # Default name

    print(f"\n=== Final evaluation of {final_model_name_eval} (from TargetEngineeringPipeline) on the test set ===")

    y_pred_test_engineered = best_tuned_full_pipeline.predict(X_test)

    y_proba_test_engineered = None
    if hasattr(best_tuned_full_pipeline, "predict_proba"):
        try:
            y_proba_test_engineered = best_tuned_full_pipeline.predict_proba(X_test)
        except Exception as e:
            print(f"Unable to obtain probabilities from best_tuned_full_pipeline: {e}")

    # --- Generate true engineered labels for the test set ---
    try:
        stkde_transformer_fitted_on_full_train = best_tuned_full_pipeline.target_engineer_
        if not isinstance(X_test, pd.DataFrame):
            if 'feature_names' in locals() and feature_names is not None:
                X_test_df_eval = pd.DataFrame(X_test, columns=feature_names)
            else:
                 raise ValueError("X_test is not a DataFrame and feature_names are not available to reconstruct it.")
        else:
            X_test_df_eval = X_test

        X_test_transformed_by_stkde = stkde_transformer_fitted_on_full_train.transform(X_test_df_eval)
        true_label_col_name = stkde_transformer_fitted_on_full_train.label_col_name
        y_test_engineered_true = X_test_transformed_by_stkde[true_label_col_name]
        print(f"Engineered true labels for the test set generated successfully: {true_label_col_name}")
        print(f"Class counts for y_test_engineered_true:\n{y_test_engineered_true.value_counts().sort_index()}")
        print(f"Class counts for y_pred_test_engineered:\n{pd.Series(y_pred_test_engineered).value_counts().sort_index()}")

    except Exception as e:
        print(f"Error generating true labels for the test set: {e}")
        import traceback
        traceback.print_exc()
        print("CRITICAL WARNING: Using original y_test for evaluation, metrics may be incorrect.")
        y_test_engineered_true = y_test

    # --- Compute and print metrics with y_test_engineered_true and y_pred_test_engineered ---
    test_metrics = {}
    print("\n--- Test set metrics (engineered target) ---")

    test_metrics['accuracy'] = accuracy_score(y_test_engineered_true, y_pred_test_engineered)
    print(f"Accuracy: {test_metrics['accuracy']:.4f}")

    test_metrics['f1_weighted'] = f1_score(y_test_engineered_true, y_pred_test_engineered, average='weighted', zero_division=0)
    print(f"F1-Weighted: {test_metrics['f1_weighted']:.4f}")
    test_metrics['f1_macro'] = f1_score(y_test_engineered_true, y_pred_test_engineered, average='macro', zero_division=0)
    print(f"F1-Macro: {test_metrics['f1_macro']:.4f}")

    test_metrics['precision_weighted'] = precision_score(y_test_engineered_true, y_pred_test_engineered, average='weighted', zero_division=0)
    print(f"Precision-Weighted: {test_metrics['precision_weighted']:.4f}")
    test_metrics['recall_weighted'] = recall_score(y_test_engineered_true, y_pred_test_engineered, average='weighted', zero_division=0)
    print(f"Recall-Weighted: {test_metrics['recall_weighted']:.4f}")

    if y_proba_test_engineered is not None:
        num_classes_engineered = len(np.unique(y_test_engineered_true))
        if num_classes_engineered == 2:
            test_metrics['roc_auc'] = roc_auc_score(y_test_engineered_true, y_proba_test_engineered[:, 1])
            print(f"ROC AUC: {test_metrics['roc_auc']:.4f}")
        elif num_classes_engineered > 2:
            try:
                test_metrics['roc_auc_ovr_weighted'] = roc_auc_score(y_test_engineered_true, y_proba_test_engineered, multi_class='ovr', average='weighted')
                print(f"ROC AUC (OvR Weighted): {test_metrics['roc_auc_ovr_weighted']:.4f}")
            except ValueError as e:
                print(f"Unable to compute ROC AUC for multiclass: {e}. Probas shape: {y_proba_test_engineered.shape}, Unique true labels: {np.unique(y_test_engineered_true)}")
        else:
             print("ROC AUC not applicable, only one class in y_test_engineered_true.")

    if 'ordinal_mae' in scoring:
        mae = ordinal_mae(y_test_engineered_true, y_pred_test_engineered)
        test_metrics['ordinal_mae'] = mae
        print(f"Ordinal MAE: {mae:.4f}")
    if 'severe_error' in scoring:
        ser = severe_error_rate(y_test_engineered_true, y_pred_test_engineered)
        test_metrics['severe_error'] = ser
        print(f"Severe Error Rate (2-tier): {ser:.4f}")

    test_metrics['mcc'] = matthews_corrcoef(y_test_engineered_true, y_pred_test_engineered)
    print(f"Matthews Correlation Coefficient: {test_metrics['mcc']:.4f}")

    print("\n--- Classification report (Test Set - Engineered Target) ---")
    unique_engineered_labels = np.unique(y_test_engineered_true)
    class_labels_engineered = [str(lbl) for lbl in sorted(unique_engineered_labels)]
    print(classification_report(y_test_engineered_true, y_pred_test_engineered, target_names=class_labels_engineered, zero_division=0))

    print("\n--- Confusion matrix (Test Set - Engineered Target) ---")
    cm = confusion_matrix(y_test_engineered_true, y_pred_test_engineered, labels=sorted(unique_engineered_labels))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_labels_engineered)
    disp.plot(cmap=plt.cm.Blues)
    plt.title(f'Confusion Matrix for {final_model_name_eval} (Engineered Target)')
    plt.show()

    print('\n--- Ordinal error analysis (Test Set - Engineered Target) ---')
    if 'ordinal_error_analysis' in locals():
         ordinal_error_analysis(y_test_engineered_true, y_pred_test_engineered)
    else:
        print("Warning: ordinal_error_analysis function not found.")

    test_metrics_path = os.path.join(modeling_results_dir, f'{final_model_name_eval}_engineered_test_set_metrics.json')
    with open(test_metrics_path, 'w') as f:
        json.dump(test_metrics, f, indent=4)
    print(f"\nTest set metrics saved in: {test_metrics_path}")

else:
    print("Skipping final evaluation on the test set: 'best_tuned_full_pipeline' is not defined or None.")

## Obsolete section: previous use of STKDE Transformer

This section is obsolete: the generation of STKDE intensity and the RISK_LEVEL label is now performed in a leakage-free way via custom transformers and the dedicated pipeline.

# Leakage-free STKDE and label engineering

All feature/label engineering operations are performed only after the split and within each cross-validation fold, to prevent any form of leakage.

In [None]:
# (remove useless placeholder cells, move explanations to markdown)

**Conceptual usage example:**

The leakage-free pipeline combines STKDEAndRiskLabelTransformer and TargetEngineeringPipeline to ensure that the label is always generated only on the training data of each fold.