# Modeling Phase

## Introduction

This notebook covers the modeling phase. It uses the preprocessed data and pipelines produced by `AlternativePreprocessing.ipynb`. All feature and label engineering, including the generation of `RISK_LEVEL` via Spatio-Temporal Kernel Density Estimation (STKDE), is designed to prevent data leakage: these operations, including the choice of STKDE intensity thresholds with Jenks natural breaks, run only after the initial train/test split and are then repeated within each cross-validation fold. The classification task distinguishes two risk categories: "LOW RISK" and "HIGH RISK".

**Dependencies:**
- Artifacts from `Classification (Preprocessing)` (including data, pipelines, STKDE parameters, and the `scoring_dict`).
- Custom transformers and modular pipelines defined in `custom_transformers.py`.

**Objectives:**
- Select the optimal model for 2-class risk classification through rigorous cross-validation.
- Perform hyperparameter tuning for the selected model.
- Evaluate the model's generalization performance on the unseen test set.

# Setup

This section handles the initial setup, including importing necessary libraries, defining file paths, and loading the preprocessed data. All custom functions are imported from `custom_transformers.py` to maintain modularity and facilitate code reuse for the 2-class risk classification.

## Optional: Run on Google Colab

This cell is for users working in a Google Colab environment. If you are running this notebook locally, you can safely ignore this section.

In [None]:
# Run on Google Colab (optional)
#from google.colab import drive
#drive.mount('/drive', force_remount=True)

## Import Libraries

Import all libraries required for the modeling process and for generating visualizations. Custom functions, critical for specific transformations and operations for the 2-class problem, are imported from the `custom_transformers.py` script.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib
import json
import glob
import sys

from sklearn.model_selection import StratifiedKFold, cross_validate, RandomizedSearchCV, GroupKFold, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.metrics import (make_scorer, f1_score, accuracy_score, precision_score, recall_score,
                               confusion_matrix, ConfusionMatrixDisplay, roc_auc_score,
                               average_precision_score, matthews_corrcoef, classification_report)
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin, clone
from IPython.display import display
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import EasyEnsembleClassifier, BalancedRandomForestClassifier, BalancedBaggingClassifier
from imblearn.combine import SMOTEENN, SMOTETomek
from typing import Any, Dict, List, Tuple

project_path = os.getcwd()
os.chdir(project_path)
sys.path.append(project_path)

# Import custom transformers
try:
    from Utilities.custom_transformers import STKDEAndRiskLabelTransformer, CustomModelPipeline 
except ImportError as e:
    print(f"Error importing custom transformers: {e}")
    raise


try:
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None
    print("XGBoost not installed. Skipping.")
try:
    from lightgbm import LGBMClassifier
except ImportError:
    LGBMClassifier = None
    print("LightGBM not installed. Skipping.")

try:
    import jenkspy
except ImportError:
    print("jenkspy not installed. Please install it using 'pip install jenkspy'.")
    jenkspy = None

from scipy.stats import ttest_rel
import random

# Set seeds and styles
random.seed(42)
np.random.seed(42)
#warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

## Define Paths

Establish and define the directory paths for loading preprocessed data and for saving the results of the modeling phase, such as trained models, performance metrics, and generated plots for the 2-class problem.

In [None]:
base_dir = os.getcwd()

preprocessing_dir = os.path.join(base_dir, "Classification (Preprocessing)")
modeling_results_dir = os.path.join(base_dir, "Classification (Modeling)")
os.makedirs(modeling_results_dir, exist_ok=True)
print(f"Preprocessing artifacts will be loaded from: {preprocessing_dir}")
print(f"Modeling results will be saved to: {modeling_results_dir}")

## Load Data and Pipelines

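Load the unprocessed data splits (`X_train`, `y_train`, `X_test`, `y_test`), the unfitted preprocessing pipelines, the list of feature names, and the STKDE parameters (`hs_optimal`, `ht_optimal`). These artifacts were previously generated and saved by the `AlternativePreprocessing.ipynb` notebook. The scoring dictionary is not loaded from disk; it is redefined later in this notebook for the 2-class problem.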
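
As a rough illustration of the loading step, the sketch below saves a small object-dtype array to a temporary directory, reloads it, and rebuilds a DataFrame after checking its shape against the feature names. The `rebuild_frame` helper, the toy values, and the `BORO` column are assumptions made for this example; only the `X_train.npy` file name is taken from this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def rebuild_frame(array_path: str, feature_names: list) -> pd.DataFrame:
    """Load a saved 2-D array and attach column names (hypothetical helper)."""
    # allow_pickle=True is required for object-dtype arrays (mixed column types).
    # Only use it on trusted files, because unpickling can execute arbitrary code.
    raw = np.load(array_path, allow_pickle=True)
    if raw.ndim != 2:
        raise ValueError(f"Expected a 2-D array, got {raw.ndim} dimension(s)")
    if raw.shape[1] != len(feature_names):
        raise ValueError(
            f"{len(feature_names)} feature names for {raw.shape[1]} columns"
        )
    # Object arrays lose per-column dtypes; infer_objects() restores them where possible.
    return pd.DataFrame(raw, columns=feature_names).infer_objects()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Toy rows standing in for the real training data.
    toy = np.array(
        [[2023, 40.71, "MANHATTAN"], [2024, 40.65, "BROOKLYN"]], dtype=object
    )
    path = os.path.join(tmp_dir, "X_train.npy")
    np.save(path, toy)
    frame = rebuild_frame(path, ["YEAR", "Latitude", "BORO"])

print(frame.dtypes)
```

Calling `infer_objects()` is optional. Without it every column keeps the `object` dtype, which is what the data quality checks in this notebook would then report as a warning.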

In [None]:
print("=== Loading data and pipelines ===")

# (Keep loading the full pipelines for legacy/compat, but do NOT use them as steps)
preprocessing_pipeline_general = joblib.load(os.path.join(preprocessing_dir, 'preprocessing_pipeline_general.joblib'))
preprocessing_pipeline_trees = joblib.load(os.path.join(preprocessing_dir, 'preprocessing_pipeline_trees.joblib'))

print(f"Loaded preprocessing pipelines: general, trees")

def load_preprocessing_artifacts(preprocessing_dir: str) -> Dict[str, Any]:
    """
    Load all preprocessing artifacts with comprehensive validation.
    
    Args:
        preprocessing_dir: Directory containing preprocessing artifacts
        
    Returns:
        Dictionary containing loaded artifacts
    """
    print("=== Loading Preprocessing Artifacts ===")
    
    if not os.path.exists(preprocessing_dir):
        raise FileNotFoundError(f"Preprocessing directory not found: {preprocessing_dir}")
    
    artifacts = {}
    
    try:
        # Load raw data
        X_train_path = os.path.join(preprocessing_dir, 'X_train.npy')
        X_test_path = os.path.join(preprocessing_dir, 'X_test.npy')
        y_train_path = os.path.join(preprocessing_dir, 'y_train.npy')
        y_test_path = os.path.join(preprocessing_dir, 'y_test.npy')
        
        if not all(os.path.exists(p) for p in [X_train_path, X_test_path, y_train_path, y_test_path]):
            raise FileNotFoundError("One or more data files missing")
        
        X_train_raw = np.load(X_train_path, allow_pickle=True)
        X_test_raw = np.load(X_test_path, allow_pickle=True)
        y_train_raw = np.load(y_train_path, allow_pickle=True)
        y_test_raw = np.load(y_test_path, allow_pickle=True)
        
        # Load feature names
        feature_names_path = os.path.join(preprocessing_dir, 'feature_names.joblib')
        if os.path.exists(feature_names_path):
            feature_names = joblib.load(feature_names_path)
        else:
            # Fallback to legacy name
            legacy_path = os.path.join(preprocessing_dir, 'X_feature_names.joblib')
            if os.path.exists(legacy_path):
                feature_names = joblib.load(legacy_path)
            else:
                raise FileNotFoundError("Feature names file not found")
        
        # Convert to DataFrames with proper validation
        if X_train_raw.ndim != 2 or X_test_raw.ndim != 2:
            raise ValueError("Training or test data has incorrect dimensions")
        
        if len(feature_names) != X_train_raw.shape[1]:
            raise ValueError(f"Feature names length ({len(feature_names)}) != data columns ({X_train_raw.shape[1]})")
        
        X_train = pd.DataFrame(X_train_raw, columns=feature_names)
        X_test = pd.DataFrame(X_test_raw, columns=feature_names)
        y_train = pd.Series(y_train_raw, name='DUMMY_TARGET')
        y_test = pd.Series(y_test_raw, name='DUMMY_TARGET')
        
        artifacts.update({
            'X_train': X_train,
            'X_test': X_test,
            'y_train': y_train,
            'y_test': y_test,
            'feature_names': feature_names
        })
        
        print(f"✓ Data loaded: X_train {X_train.shape}, X_test {X_test.shape}")
        
        # Load preprocessing pipelines
        pipeline_files = {
            # Full pipelines (might be used for reference or specific scenarios)
            'preprocessing_pipeline_general_with_pca': 'preprocessing_pipeline_general.joblib', # CT + PCA
            'preprocessing_pipeline_trees_ct_only': 'preprocessing_pipeline_trees.joblib',   # Just CT for trees

            # ColumnTransformers (these are the primary components for building new pipelines)
            'preprocessor_general_ct': 'preprocessor_full.joblib', 
            'preprocessor_trees_ct': 'preprocessor_trees.joblib'
        }
        
        for artifact_name, filename in pipeline_files.items():
            file_path = os.path.join(preprocessing_dir, filename)
            if os.path.exists(file_path):
                try:
                    artifacts[artifact_name] = joblib.load(file_path)
                    print(f"✓ Loaded {artifact_name}")
                except Exception as e:
                    print(f"⚠ Failed to load {artifact_name}: {e}")
                    artifacts[artifact_name] = None
            else:
                print(f"⚠ {artifact_name} not found")
                artifacts[artifact_name] = None
        
        # Load metadata
        metadata_path = os.path.join(preprocessing_dir, 'preprocessing_metadata.json')
        if os.path.exists(metadata_path):
            try:
                with open(metadata_path, 'r') as f:
                    metadata = json.load(f)
                artifacts['metadata'] = metadata
                
                # Extract STKDE parameters
                stkde_params = metadata.get('stkde_parameters', {})
                artifacts['hs_optimal'] = stkde_params.get('hs_optimal', 200.0)
                artifacts['ht_optimal'] = stkde_params.get('ht_optimal', 60.0)
                
                print(f"✓ Loaded metadata with STKDE params: hs={artifacts['hs_optimal']}, ht={artifacts['ht_optimal']}")
            except Exception as e:
                print(f"⚠ Failed to load metadata: {e}")
                artifacts['metadata'] = None
                artifacts['hs_optimal'] = 200.0
                artifacts['ht_optimal'] = 60.0
        else:
            print("⚠ Metadata not found, using default STKDE parameters")
            artifacts['hs_optimal'] = 200.0
            artifacts['ht_optimal'] = 60.0
            
        # Load legacy STKDE parameters if metadata not available
        if artifacts['hs_optimal'] == 200.0 and artifacts['ht_optimal'] == 60.0:
            legacy_stkde_path = os.path.join(preprocessing_dir, 'stkde_optimal_params.json')
            if os.path.exists(legacy_stkde_path):
                try:
                    with open(legacy_stkde_path, 'r') as f:
                        legacy_params = json.load(f)
                    artifacts['hs_optimal'] = legacy_params.get('hs_opt', 200.0)
                    artifacts['ht_optimal'] = legacy_params.get('ht_opt', 60.0)
                    print(f"✓ Loaded legacy STKDE params: hs={artifacts['hs_optimal']}, ht={artifacts['ht_optimal']}")
                except Exception as e:
                    print(f"⚠ Failed to load legacy STKDE params: {e}")
        
        return artifacts
        
    except Exception as e:
        print(f"❌ Error loading preprocessing artifacts: {e}")
        raise

def validate_data_quality(artifacts: Dict[str, Any]) -> None:
    """
    Validate the quality and consistency of loaded data.
    
    Args:
        artifacts: Dictionary containing loaded artifacts
    """
    print("\n=== Data Quality Validation ===")
    
    X_train, X_test = artifacts['X_train'], artifacts['X_test']
    y_train, y_test = artifacts['y_train'], artifacts['y_test']
    
    # Basic shape validation
    if X_train.empty or X_test.empty:
        raise ValueError("Training or test data is empty")
    
    # Column consistency
    if not X_train.columns.equals(X_test.columns):
        raise ValueError("Training and test data have different columns")
    
    # Required columns for STKDE
    required_cols = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'Latitude', 'Longitude']
    missing_cols = [col for col in required_cols if col not in X_train.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")
    
    # Data type validation
    for col in required_cols:
        if X_train[col].dtype == 'object':
            print(f"⚠ Column '{col}' has object dtype, may need conversion")
    
    # Missing value check
    train_missing = X_train.isnull().sum().sum()
    test_missing = X_test.isnull().sum().sum()
    
    if train_missing > 0:
        print(f"⚠ Training data has {train_missing} missing values")
    if test_missing > 0:
        print(f"⚠ Test data has {test_missing} missing values")
    
    print(f"✓ Data validation completed")
    print(f"  Training: {X_train.shape[0]} samples, {X_train.shape[1]} features")
    print(f"  Test: {X_test.shape[0]} samples, {X_test.shape[1]} features")
    print(f"  Feature types: {X_train.dtypes.value_counts().to_dict()}")
    print(f"  Feature names: {', '.join(X_train.columns)}")

# Load and validate data
try:
    artifacts = load_preprocessing_artifacts(preprocessing_dir)
    validate_data_quality(artifacts)
    
    # Extract main variables
    X_train = artifacts['X_train']
    X_test = artifacts['X_test']
    y_train = artifacts['y_train']
    y_test = artifacts['y_test']
    feature_names = artifacts['feature_names']
    hs_optimal = artifacts['hs_optimal']
    ht_optimal = artifacts['ht_optimal']
    
    # Extract preprocessing components
    preprocessor_general_ct = artifacts.get('preprocessor_general_ct')
    preprocessor_trees_ct = artifacts.get('preprocessor_trees_ct')

    
except Exception as e:
    print(f"❌ Failed to load preprocessing artifacts: {e}")
    raise

## Target Variable Selection

The primary target variable for classification in this notebook is `RISK_LEVEL`. This variable is not directly present in the initial dataset but is *engineered* using STKDE to represent two classes: "LOW RISK" and "HIGH RISK". This engineering process is integrated into a leakage-free pipeline, ensuring that the target is generated based only on information available at each stage (e.g., within each CV fold using only training data for that fold).
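
The fold-wise idea can be sketched as follows. This is a simplified stand-in, not the `STKDEAndRiskLabelTransformer` itself: the intensities are random numbers rather than STKDE output, the threshold is a plain median, and the 0/1 encoding of LOW/HIGH RISK is an assumption. The point it illustrates is that the threshold is estimated from the training part of each fold and then reused unchanged on the validation part.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
# Stand-in for per-event STKDE intensities (assumed data, for illustration only).
intensity = rng.gamma(shape=2.0, scale=1.0, size=200)

fold_thresholds = []
splitter = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in splitter.split(intensity):
    # The threshold is estimated from the training part of the fold only.
    threshold = np.quantile(intensity[train_idx], 0.5)
    # The same threshold labels both parts: 0 = LOW RISK, 1 = HIGH RISK (assumed encoding).
    y_train_fold = (intensity[train_idx] > threshold).astype(int)
    y_val_fold = (intensity[val_idx] > threshold).astype(int)
    fold_thresholds.append(float(threshold))

print(np.round(fold_thresholds, 3))
```

Because each fold sees a different training subset, the thresholds differ slightly from fold to fold. That variation is expected and is the price of keeping validation data out of the labeling step.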

## Update Scoring Dictionary for Binary Classification

This section updates the `scoring` dictionary to include metrics suitable for binary classification tasks. Ordinal-specific metrics that are not applicable to a 2-class problem (like `severe_error_rate`) are removed.
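
To make the difference between the averaging modes concrete, here is a small worked example on hand-made labels (the arrays are illustrative, not project data). It also shows that, for 0/1 labels, the mean absolute error equals the misclassification rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

f1_binary = f1_score(y_true, y_pred, average="binary")      # positive class only
f1_macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
mae = np.mean(np.abs(y_true - y_pred))                      # same formula as ordinal_mae
accuracy = accuracy_score(y_true, y_pred)

print(f"binary={f1_binary:.4f} macro={f1_macro:.4f} weighted={f1_weighted:.4f}")
print(f"mae={mae:.2f} accuracy={accuracy:.2f}")
```

With these toy labels the script prints `binary=0.5714 macro=0.6703 weighted=0.6901` and `mae=0.30 accuracy=0.70`. The three F1 variants differ whenever the classes are imbalanced or predicted with unequal quality, which is why several of them are tracked side by side.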

In [None]:
# Update the scoring dictionary
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, f1_score, precision_score, recall_score, matthews_corrcoef, make_scorer

# For 0/1 labels, ordinal MAE equals the misclassification rate (1 - accuracy).
# It is retained for continuity, but the standard binary metrics below are the primary focus.
def ordinal_mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

ordinal_mae_scorer = make_scorer(ordinal_mae, greater_is_better=False)
# severe_error_rate is not applicable for 2 classes, so it's removed.

# Define the scoring dictionary with explicit averaging modes.
# For binary classification, average='binary' reports the metric for the positive class only (pos_label=1 by default).
# 'macro' is the unweighted mean of the per-class metrics; 'weighted' is their support-weighted mean, so neither generally equals the binary value.
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='binary', zero_division=0), # Standard F1 for binary
    'f1_weighted': make_scorer(f1_score, average='weighted', zero_division=0), # Weighted F1
    'f1_macro': make_scorer(f1_score, average='macro', zero_division=0), # Macro F1
    'precision': make_scorer(precision_score, average='binary', zero_division=0), # Standard Precision
    'precision_weighted': make_scorer(precision_score, average='weighted', zero_division=0),
    'recall': make_scorer(recall_score, average='binary', zero_division=0), # Standard Recall
    'recall_weighted': make_scorer(recall_score, average='weighted', zero_division=0),
    'mcc': make_scorer(matthews_corrcoef), # MCC handles binary and multiclass

    # Ordinal MAE (retained, but less primary for binary)
    'ordinal_mae': ordinal_mae_scorer,

    # Probability-based metrics
    # Note: `needs_proba` was deprecated in scikit-learn 1.4 in favour of
    # `response_method='predict_proba'`; switch to that on newer versions.
    # These scorers also require `predict_proba`, which LinearSVC does not provide.
    'roc_auc': make_scorer(roc_auc_score, needs_proba=True), # ROC AUC from positive-class probabilities
    # average_precision_score summarises the precision-recall curve for the positive class.
    'pr_auc': make_scorer(average_precision_score, needs_proba=True),
    'neg_log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True)
}

print(f"Updated scoring dictionary for 2-class problem with keys: {list(scoring.keys())}")

## Data Verification

Perform a quick check on the loaded data to verify dimensions and the class distribution of the original target variable (before pipeline-based engineering for the 2-class problem). This helps ensure that the data has been loaded correctly.

In [None]:
print("=== Data verification ===")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Scoring dict: {scoring}")

print("\nClass distribution in y_train (dummy, the real label will be engineered in the pipeline):")
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))
print("Proportions:")
print(dict(zip(unique, counts / len(y_train))))
print("\nNote: the real target variable (2 classes) will be created in the modeling pipeline.")

def perform_data_verification(X_train: pd.DataFrame, X_test: pd.DataFrame, 
                             y_train: pd.Series, y_test: pd.Series,
                             hs_optimal: float, ht_optimal: float) -> None:
    """
    Perform comprehensive data verification for modeling phase.
    
    Args:
        X_train, X_test: Feature data
        y_train, y_test: Target data (dummy targets)
        hs_optimal, ht_optimal: STKDE parameters
    """
    print("\n=== Data Verification for Modeling ===")
    
    # Basic information
    print(f"Training data: {X_train.shape}")
    print(f"Test data: {X_test.shape}")
    print(f"Target variables: y_train {y_train.shape}, y_test {y_test.shape}")
    print(f"STKDE parameters: hs={hs_optimal}, ht={ht_optimal}")
    
    # Check for required columns
    required_cols = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'Latitude', 'Longitude']
    available_required = [col for col in required_cols if col in X_train.columns]
    print(f"Required STKDE columns available: {len(available_required)}/{len(required_cols)}")
    
    if len(available_required) != len(required_cols):
        missing = [col for col in required_cols if col not in X_train.columns]
        print(f"⚠ Missing required columns: {missing}")
    
    # Dummy target distribution (will be replaced by engineered targets)
    print(f"\nDummy target distribution (y_train):")
    if not y_train.empty:
        target_counts = y_train.value_counts()
        target_props = y_train.value_counts(normalize=True)
        for val in target_counts.index:
            print(f"  {val}: {target_counts[val]} ({target_props[val]:.3f})")
    else:
        print("  y_train is empty")
    
    print("\nNote: Real binary targets (LOW/HIGH RISK) will be engineered using STKDE in the modeling pipeline")
    
    # Data quality summary
    quality_issues = []
    
    # Check for missing values
    train_missing = X_train.isnull().sum().sum()
    test_missing = X_test.isnull().sum().sum()
    
    if train_missing > 0:
        quality_issues.append(f"{train_missing} missing values in training data")
    if test_missing > 0:
        quality_issues.append(f"{test_missing} missing values in test data")
    
    # Check data types
    object_cols = X_train.select_dtypes(include=['object']).columns
    if len(object_cols) > 0:
        quality_issues.append(f"{len(object_cols)} columns with object dtype")
    
    if quality_issues:
        print(f"\n⚠ Data quality issues detected:")
        for issue in quality_issues:
            print(f"  - {issue}")
    else:
        print(f"\n✓ No major data quality issues detected")
    
    # Sample preview
    print(f"\nSample data preview (first 3 rows):")
    sample_cols = X_train.columns[:8].tolist()  # Show first 8 columns
    print(X_train[sample_cols].head(3).to_string())
    if len(X_train.columns) > 8:
        print(f"... and {len(X_train.columns) - 8} more columns")

# Perform verification
perform_data_verification(X_train, X_test, y_train, y_test, hs_optimal, ht_optimal)

## Model Definitions and Preprocessing Pipelines

This section defines the various classification models that will be evaluated for the 2-class risk problem. It also specifies the appropriate preprocessing pipelines to be used with different types of models (e.g., tree-based models, linear models).

### Define Models for Comparison

A dictionary, `models_to_evaluate`, is created to hold the instances of the classification models that will be compared. Each model will be integrated into a pipeline that includes the necessary preprocessing steps tailored to its requirements for the 2-class problem.
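
As a minimal sketch of how a model can be paired with preprocessing that suits it, the example below scales features for a linear model and leaves them untouched for a tree, then cross-validates both. It uses synthetic data and a plain `StandardScaler` in place of the project's saved preprocessors, and the `needs_scaling` set is a hypothetical stand-in for the categorization used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (assumed; not the project dataset).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "DecisionTree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
}
needs_scaling = {"LogisticRegression"}  # hypothetical split, for illustration

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_f1 = {}
for name, model in candidates.items():
    # Only scale-sensitive models get a scaling step in front of the classifier.
    steps = [("scale", StandardScaler())] if name in needs_scaling else []
    steps.append(("clf", model))
    scores = cross_validate(Pipeline(steps), X, y, cv=cv, scoring=["f1", "accuracy"])
    mean_f1[name] = scores["test_f1"].mean()

print(mean_f1)
```

Keeping the scaler inside the pipeline means it is refit on the training part of every fold, which follows the same leakage-avoidance principle applied to the label engineering.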

In [None]:
from typing import Dict, Any, List
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
# XGBoost and LightGBM are optional. They are imported inside define_core_models()
# so that a missing package does not make this cell fail.

def define_core_models() -> Dict[str, Any]:
    """
    Define core classification models for evaluation.
    
    Returns:
        Dictionary of model name -> model instance
    """
    print("\n=== Defining Core Models for Binary Classification ===")
    
    models = {
    # --- Baseline ---
    'Dummy': DummyClassifier(strategy='stratified', random_state=42),

    # --- Linear Models ---
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced', n_jobs=-1),

    # --- K-Nearest Neighbors ---
    'KNN': KNeighborsClassifier(n_jobs=-1),

    # --- Naive Bayes ---
    'GaussianNB': GaussianNB(),

    # --- Tree-Based Models ---
    'DecisionTree': DecisionTreeClassifier(random_state=42, class_weight='balanced'),

    # --- Ensemble Methods ---
    'RandomForest': RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'ExtraTrees': ExtraTreesClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    'Bagging': BaggingClassifier(random_state=42, n_jobs=-1),
    'AdaBoost': AdaBoostClassifier(random_state=42),

    # --- Support Vector Machines ---
    'SVC_linear': LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual='auto', max_iter=5000, class_weight='balanced', random_state=42),

    # --- Advanced Ensemble Methods (from imblearn) ---
    'EasyEnsemble': EasyEnsembleClassifier(random_state=42, n_jobs=-1),
    'BalancedRandomForest': BalancedRandomForestClassifier(random_state=42, n_jobs=-1),
    'BalancedBagging': BalancedBaggingClassifier(random_state=42, n_jobs=-1),
    }
    
    # Add XGBoost if available
    try:
        from xgboost import XGBClassifier
        models['XGBoost'] = XGBClassifier(
            random_state=42,
            n_estimators=100,
            max_depth=6,
            # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
            eval_metric='logloss',
            n_jobs=-1
        )
        print("✓ Added XGBoost")
    except ImportError:
        print("⚠ XGBoost not available")
    
    # Add LightGBM if available
    try:
        from lightgbm import LGBMClassifier
        models['LightGBM'] = LGBMClassifier(
            random_state=42,
            n_estimators=100,
            max_depth=6,
            n_jobs=-1,
            verbose=-1,  # Suppress warnings
        )
        print("✓ Added LightGBM")
    except ImportError:
        print("⚠ LightGBM not available")
    
    print(f"\nDefined {len(models)} models: {list(models.keys())}")
    return models

def categorize_models_for_preprocessing(models: Dict[str, Any]) -> Dict[str, List[str]]:
    """
    Categorize models based on their preprocessing requirements.
    
    Args:
        models: Dictionary of model instances
        
    Returns:
        Dictionary with model categories
    """
    # Models that work better with tree preprocessing (no scaling, ordinal encoding)
    tree_models = [
        'DecisionTree', 'RandomForest', 'GradientBoosting', 'ExtraTrees',
        'Bagging', 'AdaBoost', 'BalancedRandomForest', 'XGBoost', 'LightGBM',
        'EasyEnsemble', 'BalancedBagging'
    ]
    
    # Models that work better with general preprocessing (scaling, one-hot encoding, PCA)
    general_models = [
        'Dummy', 'LogisticRegression', 'KNN', 'GaussianNB', 'SVC_linear'
    ]
    
    # Models that handle imbalance internally (don't need SMOTE)
    balanced_models = [
        'Dummy',  # Due to strategy='stratified'
        'LogisticRegression',  # Due to class_weight='balanced'
        'DecisionTree',  # Due to class_weight='balanced'
        'RandomForest',  # Due to class_weight='balanced'
        'ExtraTrees',  # Due to class_weight='balanced'
        'SVC_linear',  # Due to class_weight='balanced'
        'BalancedRandomForest',  # Inherently balanced
        'EasyEnsemble',  # Inherently balanced
        'BalancedBagging',  # Inherently balanced
        'XGBoost',  # Supports imbalance handling via scale_pos_weight (not set above)
        'LightGBM',  # Supports imbalance handling via is_unbalance=True (not set above)
        'GradientBoosting' # No built-in class weighting; listed here so that SMOTE is skipped
    ]
    
    # Filter by actually available models
    available_tree_models = [name for name in tree_models if name in models]
    available_general_models = [name for name in general_models if name in models]
    available_balanced_models = [name for name in balanced_models if name in models]
    
    categorization = {
        'tree_preprocessing': available_tree_models,
        'general_preprocessing': available_general_models,
        'needs_smote': [name for name in models.keys() if name not in available_balanced_models],
        'balanced_internally': available_balanced_models
    }
    
    print("\nModel categorization:")
    for category, model_list in categorization.items():
        print(f"  {category}: {model_list}")
    
    return categorization

# Define models and categorization
models_to_evaluate = define_core_models()
model_categories = categorize_models_for_preprocessing(models_to_evaluate)

### Exploratory Data Analysis (EDA) for Threshold Definition (No Longer Performed Here)

Previously, this section was dedicated to performing Exploratory Data Analysis (EDA) on STKDE intensities to identify a fixed threshold for classifying risk levels using Jenks' natural breaks. **This approach has been revised to prevent data leakage.**

**The calculation of Jenks' natural breaks (and thus the definition of thresholds for the 'fixed' strategy) is now integrated directly into the `STKDEAndRiskLabelTransformer` and is performed *within each cross-validation fold* using only the training data of that specific fold.** This ensures that information from the validation set (or the test set, in the case of final model training) does not influence the threshold selection process for the training data of that fold.

Therefore, the EDA and global Jenks calculation previously done here are no longer necessary and have been removed. The `CHOSEN_LABELING_STRATEGY` will determine if 'fixed' (with per-fold Jenks) or 'quantile' labeling is used within the pipeline.
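
For two classes, a natural-breaks split can be found by trying every cut point in the sorted data and keeping the one with the smallest within-class sum of squared deviations. The sketch below does exactly that in plain NumPy. It is an illustrative stand-in under that assumption: the `two_class_natural_break` function and the toy intensities are hypothetical, and the transformer's own implementation may differ.

```python
import numpy as np


def two_class_natural_break(values: np.ndarray) -> float:
    """Return the split that minimises the within-class sum of squared deviations."""
    ordered = np.sort(np.asarray(values, dtype=float))
    best_cost, best_break = np.inf, float(ordered[0])
    for i in range(1, len(ordered)):
        low, high = ordered[:i], ordered[i:]
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            # The break is reported as the largest value of the lower class.
            best_cost, best_break = cost, float(low[-1])
    return best_break


# Toy intensities for one fold (assumed values).
train_intensity = np.array([0.1, 0.2, 0.25, 0.3, 2.0, 2.2, 2.5])
val_intensity = np.array([0.15, 1.9, 0.4])

break_value = two_class_natural_break(train_intensity)  # fitted on training data only
val_labels = (val_intensity > break_value).astype(int)  # 0 = LOW RISK, 1 = HIGH RISK (assumed)
print(break_value, val_labels)
```

Here the break lands at 0.3, the top of the low cluster, so the validation values 0.15, 1.9 and 0.4 are labeled 0, 1 and 1. The validation data never influences where the break is placed.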

In [None]:
# CONFIGURATION: PERFORM_EDA is no longer used for global Jenks calculation.
# Jenks calculation for 'fixed' strategy is now done within STKDEAndRiskLabelTransformer per fold.
PERFORM_EDA = False # This flag is kept for compatibility but does not trigger global Jenks.

# Initialize stkde_intensities_train_for_eda - this variable is no longer populated globally.
stkde_intensities_train_for_eda = None

print("Global EDA for Jenks threshold definition is no longer performed here.")
print("If CHOSEN_LABELING_STRATEGY is 'fixed', Jenks breaks will be calculated within each CV fold.")

In [None]:
# Initialize CHOSEN_LABELING_STRATEGY and CHOSEN_FIXED_THRESHOLDS with defaults
# These will be used by the STKDEAndRiskLabelTransformer.
# If 'fixed' is chosen, the transformer will calculate Jenks per fold.
# If 'quantile' is chosen, it will use quantiles per fold.

CHOSEN_LABELING_STRATEGY = 'fixed' # Default strategy, can be 'fixed' or 'quantile'
CHOSEN_FIXED_THRESHOLDS = None       # This is now effectively a placeholder if strategy is 'fixed',
                                     # as actual thresholds are determined per-fold.
                                     # If strategy is 'fixed' and this is None, STKDE transformer will calculate them.
                                     # If strategy is 'fixed' and this is a list, it might be used as a fallback
                                     # or override IF the transformer is designed to accept pre-defined global thresholds
                                     # (current implementation calculates per-fold for 'fixed').

# The PERFORM_EDA block that calculated global Jenks thresholds is removed.
# The STKDEAndRiskLabelTransformer is now responsible for handling 'fixed' strategy by calculating Jenks per fold.

print("\nSTKDE Labeling Strategy Configuration:")
print(f"Chosen strategy for STKDEAndRiskLabelTransformer: {CHOSEN_LABELING_STRATEGY}")
if CHOSEN_LABELING_STRATEGY == 'fixed':
    print("If 'fixed' strategy is used, STKDEAndRiskLabelTransformer will calculate Jenks breaks per fold.")
    if CHOSEN_FIXED_THRESHOLDS is not None:
        print(f"Global CHOSEN_FIXED_THRESHOLDS is set to: {CHOSEN_FIXED_THRESHOLDS}, but per-fold calculation will take precedence for 'fixed' strategy.")
elif CHOSEN_LABELING_STRATEGY == 'quantile':
    print("If 'quantile' strategy is used, STKDEAndRiskLabelTransformer will use quantiles per fold.")
else:
    print(f"Warning: Unknown CHOSEN_LABELING_STRATEGY: {CHOSEN_LABELING_STRATEGY}. Behavior might be undefined.")


## Results Analysis

In [None]:
def analyze_model_results(results_df: pd.DataFrame, primary_metric: str = 'f1_mean') -> Dict[str, Any]:
    """
    Analyze model selection results and identify best performers.
    
    Args:
        results_df: DataFrame with model results
        primary_metric: Primary metric for ranking
        
    Returns:
        Dictionary with analysis results
    """
    print(f"\n=== Analyzing Model Results (Primary Metric: {primary_metric}) ===")
    
    if results_df.empty:
        print("⚠ No results to analyze")
        return {}
    
    analysis = {}
    
    # Filter out failed models (those with NaN scores)
    valid_results = results_df.dropna(subset=[primary_metric])
    
    if valid_results.empty:
        print("❌ No valid results found")
        return {'error': 'No valid results'}
    
    # Sort by primary metric
    valid_results_sorted = valid_results.sort_values(primary_metric, ascending=False)
    
    # Top performers
    top_5 = valid_results_sorted.head(5)
    analysis['top_performers'] = top_5[['model_name', primary_metric]].to_dict('records')
    
    print(f"\nTop 5 performers by {primary_metric}:")
    for i, row in top_5.iterrows():
        score = row[primary_metric]
        std_metric = primary_metric.replace('_mean', '_std')
        std_score = row.get(std_metric, 0)
        print(f"  {row['model_name']}: {score:.4f} ± {std_score:.4f}")
    
    # Best model
    best_model = valid_results_sorted.iloc[0]
    analysis['best_model'] = {
        'name': best_model['model_name'],
        'score': best_model[primary_metric],
        'preprocessing': best_model['preprocessing_type'],
        'uses_smote': best_model['uses_smote']
    }
    
    print(f"\n✓ Best model: {best_model['model_name']} ({best_model[primary_metric]:.4f})")
    
    # Performance by preprocessing type
    preprocessing_analysis = valid_results.groupby('preprocessing_type')[primary_metric].agg(['mean', 'std', 'count'])
    analysis['preprocessing_analysis'] = preprocessing_analysis.to_dict('index')
    
    print(f"\nPerformance by preprocessing type:")
    for prep_type, stats in preprocessing_analysis.iterrows():
        # iterrows() upcasts the row to float, so cast the count back to int for display
        print(f"  {prep_type}: {stats['mean']:.4f} ± {stats['std']:.4f} (n={int(stats['count'])})")
    
    # SMOTE analysis
    if 'uses_smote' in valid_results.columns:
        smote_analysis = valid_results.groupby('uses_smote')[primary_metric].agg(['mean', 'std', 'count'])
        analysis['smote_analysis'] = smote_analysis.to_dict('index')
        
        print(f"\nPerformance by SMOTE usage:")
        for uses_smote, stats in smote_analysis.iterrows():
            smote_label = "With SMOTE" if uses_smote else "Without SMOTE"
            print(f"  {smote_label}: {stats['mean']:.4f} ± {stats['std']:.4f} (n={int(stats['count'])})")
    
    return analysis

def create_results_visualization(results_df: pd.DataFrame, analysis: Dict[str, Any], 
                               primary_metric: str = 'f1_mean') -> None:
    """
    Create visualizations for model selection results.
    
    Args:
        results_df: DataFrame with model results
        analysis: Analysis results
        primary_metric: Primary metric for visualization
    """
    print("\n=== Creating Results Visualization ===")
    
    if results_df.empty or not analysis:
        print("⚠ No data to visualize")
        return
    
    # Filter valid results
    valid_results = results_df.dropna(subset=[primary_metric])
    
    if valid_results.empty:
        print("⚠ No valid results to visualize")
        return
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Model Selection Results', fontsize=16)
    
    # Plot 1: Model performance comparison
    ax1 = axes[0, 0]
    sorted_results = valid_results.sort_values(primary_metric, ascending=True)
    
    y_pos = np.arange(len(sorted_results))
    scores = sorted_results[primary_metric]
    
    bars = ax1.barh(y_pos, scores, alpha=0.7)
    ax1.set_yticks(y_pos)
    ax1.set_yticklabels(sorted_results['model_name'])
    ax1.set_xlabel(f'{primary_metric.replace("_", " ").title()}')
    ax1.set_title('Model Performance Comparison')
    ax1.grid(True, alpha=0.3)
    
    # Color top 3 performers differently
    for i, bar in enumerate(bars):
        if i >= len(bars) - 3:  # Top 3
            bar.set_color('darkgreen')
            bar.set_alpha(0.8)
    
    # Plot 2: Performance by preprocessing type
    ax2 = axes[0, 1]
    if 'preprocessing_analysis' in analysis:
        prep_data = analysis['preprocessing_analysis']
        prep_types = list(prep_data.keys())
        prep_means = [prep_data[pt]['mean'] for pt in prep_types]
        prep_stds = [prep_data[pt]['std'] for pt in prep_types]
        
        ax2.bar(prep_types, prep_means, yerr=prep_stds, alpha=0.7, capsize=5)
        ax2.set_ylabel(f'{primary_metric.replace("_", " ").title()}')
        ax2.set_title('Performance by Preprocessing Type')
        ax2.grid(True, alpha=0.3)
    
    # Plot 3: SMOTE impact
    ax3 = axes[1, 0]
    if 'smote_analysis' in analysis:
        smote_data = analysis['smote_analysis']
        smote_labels = ['Without SMOTE' if not k else 'With SMOTE' for k in smote_data.keys()]
        smote_means = [smote_data[k]['mean'] for k in smote_data.keys()]
        smote_stds = [smote_data[k]['std'] for k in smote_data.keys()]
        
        ax3.bar(smote_labels, smote_means, yerr=smote_stds, alpha=0.7, capsize=5)
        ax3.set_ylabel(f'{primary_metric.replace("_", " ").title()}')
        ax3.set_title('Impact of SMOTE')
        ax3.grid(True, alpha=0.3)
    
    # Plot 4: Top models detailed comparison
    ax4 = axes[1, 1]
    if 'top_performers' in analysis:
        top_models = analysis['top_performers'][:5]
        model_names = [m['model_name'] for m in top_models]
        model_scores = [m[primary_metric] for m in top_models]
        
        bars = ax4.bar(range(len(model_names)), model_scores, alpha=0.7)
        ax4.set_xticks(range(len(model_names)))
        ax4.set_xticklabels(model_names, rotation=45, ha='right')
        ax4.set_ylabel(f'{primary_metric.replace("_", " ").title()}')
        ax4.set_title('Top 5 Models')
        ax4.grid(True, alpha=0.3)
        
        # Highlight best model
        bars[0].set_color('gold')
        bars[0].set_alpha(0.9)
    
    plt.tight_layout()
    plt.show()
    
    print("✓ Visualization created")

print("✓ Analysis and visualization functions defined")
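
The plotting code reads `analysis['preprocessing_analysis']` and `analysis['smote_analysis']` as nested dictionaries. The following minimal sketch shows the shape that `groupby(...).agg([...]).to_dict('index')` produces; the table contents are made up for illustration and are not results from this project.

```python
import pandas as pd

# Hypothetical results table: model names and scores are invented for illustration.
results = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "preprocessing_type": ["tree", "tree", "general", "general"],
    "f1_mean": [0.80, 0.70, 0.60, 0.50],
})

# One row per preprocessing type, one column per statistic.
summary = results.groupby("preprocessing_type")["f1_mean"].agg(["mean", "std", "count"])

# Outer keys are the group labels; inner keys are the statistic names.
by_type = summary.to_dict("index")
print(by_type["tree"]["mean"], by_type["tree"]["count"])
```

A group containing a single model has a sample standard deviation of NaN, so its error bar will not be drawn in the bar charts.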

## Model Selection and Hyperparameter Tuning

Based on the cross-validation results, we select the best-performing model and tune its hyperparameters with a randomized search. Tuning follows the same leakage-free approach as model selection: the target is engineered inside each cross-validation fold.
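
The parameter grids in this section address the classifier through double-underscore paths such as `feature_pipeline__classifier__max_depth`. As a minimal sketch of how scikit-learn resolves such paths, the example below builds a stand-in nested pipeline on synthetic data. The step names `feature_pipeline` and `classifier` are taken from the grid keys; the scaler step, the data, and the reduced grid are assumptions made for illustration, since `create_model_pipeline` is defined elsewhere.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (illustrative only, not the project's data).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# Stand-in for the notebook's pipeline: an outer pipeline whose
# 'feature_pipeline' step is itself a pipeline ending in 'classifier'.
inner = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
outer = Pipeline([("feature_pipeline", inner)])

# Each key walks the nesting: outer step -> inner step -> estimator parameter.
param_distributions = {
    "feature_pipeline__classifier__n_estimators": [10, 20],
    "feature_pipeline__classifier__max_depth": [2, 4, None],
}

search = RandomizedSearchCV(
    estimator=outer,
    param_distributions=param_distributions,
    n_iter=4,                        # sample 4 of the 6 possible combinations
    cv=TimeSeriesSplit(n_splits=3),  # training rows always precede validation rows
    scoring="f1",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

A misspelled step name in a key raises an error when the search sets the parameters, which makes a small dry run like this a cheap check before launching a long search.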

In [None]:
def perform_hyperparameter_tuning(model_name: str, base_model: Any,
                                  preprocessor: Any, stkde_config: Dict[str, Any],
                                  cv_config: Dict[str, Any], use_smote: bool,
                                  results_dir: str) -> Tuple[Pipeline, Dict[str, Any]]:
    """
    Perform hyperparameter tuning for the selected model.
    
    Args:
        model_name: Name of the model
        base_model: Base model instance
        preprocessor: Preprocessing pipeline
        stkde_config: STKDE configuration
        cv_config: CV configuration
        use_smote: Whether to use SMOTE
        results_dir: Results directory
        
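    Returns:
        Tuple of (best_pipeline, tuning_results), where tuning_results holds
        'best_score', 'best_params', and 'model_name' (plus 'error' on failure)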
    """
    print(f"\n=== Hyperparameter Tuning for {model_name} ===")
    
    # Define parameter grids for different models
    param_grids = {
        'RandomForest': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [5, 10, None],
            'feature_pipeline__classifier__min_samples_split': [2, 5, 10],
            'feature_pipeline__classifier__min_samples_leaf': [1, 2, 4]
        },
        'GradientBoosting': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 6, 9],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2],
            'feature_pipeline__classifier__subsample': [0.8, 0.9, 1.0]
        },
        'LogisticRegression': {
            'feature_pipeline__classifier__C': [0.1, 1.0, 10.0],
            # 'elasticnet' is omitted: it requires solver='saga' and an l1_ratio value
            'feature_pipeline__classifier__penalty': ['l1', 'l2'],
            'feature_pipeline__classifier__solver': ['liblinear', 'saga']
        },
        'XGBoost': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 6, 9],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2],
            'feature_pipeline__classifier__subsample': [0.8, 0.9, 1.0]
        },
        'LightGBM': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 6, 9],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2],
            'feature_pipeline__classifier__subsample': [0.8, 0.9, 1.0]
        }
    }
    
    # Get parameter grid for the model
    if model_name not in param_grids:
        print(f"⚠ No parameter grid defined for {model_name}, using default parameters")
        # Create pipeline with default parameters
        pipeline = create_model_pipeline(
            model_name=model_name,
            model_instance=base_model,
            preprocessor=preprocessor,
            stkde_config=stkde_config,
            use_smote=use_smote
        )
        return pipeline, {}
    
    param_grid = param_grids[model_name]
    
    # Grid keys keep the 'feature_pipeline__classifier__' prefix with or without SMOTE,
    # so no renaming is needed here.
    
    # Create base pipeline
    pipeline = create_model_pipeline(
        model_name=model_name,
        model_instance=base_model,
        preprocessor=preprocessor,
        stkde_config=stkde_config,
        use_smote=use_smote
    )
    
    try:
        # Perform randomized search
        print(f"Performing randomized search with {len(param_grid)} parameter ranges...")
        
        random_search = RandomizedSearchCV(
            estimator=pipeline,
            param_distributions=param_grid,
            n_iter=20,  # Limit iterations for efficiency
            cv=cv_config['cv_strategy'],
            scoring='f1',  # Use F1 as primary metric
            n_jobs=1,  # Avoid nested parallelization
            random_state=42,
            verbose=1
        )
        
        # Fit the search
        random_search.fit(cv_config['X_train_sorted'], cv_config['y_train_sorted'])
        
        print(f"✓ Best F1 score: {random_search.best_score_:.4f}")
        print(f"✓ Best parameters: {random_search.best_params_}")
        
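        # Collect the results to save and return (same keys as the fallback dict below)
        tuning_results = {
            'best_score': float(random_search.best_score_),
            'best_params': random_search.best_params_,
            'model_name': model_name
        }
        
        # Save tuning results
        tuning_results_file = os.path.join(results_dir, f"{model_name}_tuning_results.json")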
        try:
            with open(tuning_results_file, 'w') as f:
                json.dump(tuning_results, f, indent=2)
            print(f"✓ Tuning results saved to: {tuning_results_file}")
        except Exception as e:
            print(f"⚠ Error saving tuning results: {e}")
        
        return random_search.best_estimator_, tuning_results
        
    except Exception as e:
        print(f"❌ Error during hyperparameter tuning: {e}")
        
        # Return the base pipeline as fallback
        base_pipeline = create_model_pipeline(
            model_name=model_name,
            model_instance=base_model,
            preprocessor=preprocessor,
            stkde_config=stkde_config,
            use_smote=use_smote
        )
        
        fallback_results = {
            'best_score': np.nan,
            'best_params': {},
            'model_name': model_name,
            'error': str(e)
        }
        
        return base_pipeline, fallback_results

def define_parameter_grids() -> Dict[str, Dict[str, Any]]:
    """
    Define parameter grids for hyperparameter tuning.
    
    Returns:
        Dictionary mapping model names to parameter grids
    """
    param_grids = {
        'LogisticRegression': {
            'feature_pipeline__classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
            'feature_pipeline__classifier__penalty': ['l1', 'l2'],
            'feature_pipeline__classifier__solver': ['liblinear', 'saga']
        },
        
        'DecisionTree': {
            'feature_pipeline__classifier__max_depth': [3, 5, 7, 10, None],
            'feature_pipeline__classifier__min_samples_split': [2, 5, 10],
            'feature_pipeline__classifier__min_samples_leaf': [1, 2, 4]
        },
        
        'RandomForest': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [5, 10, 15, None],
            'feature_pipeline__classifier__min_samples_split': [2, 5],
            'feature_pipeline__classifier__min_samples_leaf': [1, 2]
        },
        
        'GradientBoosting': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 5, 7],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2]
        },
        
        'XGBoost': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 5, 7],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2]
        },
        
        'LightGBM': {
            'feature_pipeline__classifier__n_estimators': [50, 100, 200],
            'feature_pipeline__classifier__max_depth': [3, 5, 7],
            'feature_pipeline__classifier__learning_rate': [0.01, 0.1, 0.2]
        },
        
        'KNN': {
            'feature_pipeline__classifier__n_neighbors': [3, 5, 7, 9, 11],
            'feature_pipeline__classifier__weights': ['uniform', 'distance']
        }
    }
    
    return param_grids

# Hyperparameter tuning execution
if 'best_model_name' in locals() and best_model_name is not None:
    print(f"\n=== Hyperparameter Tuning for {best_model_name} ===")
    
    # Define parameter grids
    param_grids = define_parameter_grids()
    
    if best_model_name in param_grids:
        # Get model configuration
        best_model_instance = models_to_evaluate[best_model_name]
        
        # Determine preprocessing and SMOTE settings
        if best_model_name in model_categories['tree_preprocessing']:
            best_preprocessor = preprocessors['preprocessor_trees']
            preprocessing_type = 'tree'
        else:
            best_preprocessor = preprocessors['preprocessor_general']
            preprocessing_type = 'general'
        
        uses_smote = best_model_name in model_categories.get('needs_smote', [])
        
        print(f"Tuning {best_model_name} with {preprocessing_type} preprocessing, SMOTE: {uses_smote}")
        
        # Perform tuning
        try:
            tuned_pipeline, tuning_results = perform_hyperparameter_tuning(
                model_name=best_model_name,
                base_model=best_model_instance,
                preprocessor=best_preprocessor,
                stkde_config=stkde_config,
                cv_config=cv_config,
                use_smote=uses_smote,
                results_dir=modeling_results_dir
            )
            
            print(f"✓ Hyperparameter tuning completed for {best_model_name}")
            print(f"  Best CV F1 score: {tuning_results.get('best_score', 'N/A')}")
            
        except Exception as e:
            print(f"❌ Hyperparameter tuning failed: {e}")
            tuned_pipeline = None
            tuning_results = {}
            
    else:
        print(f"⚠ No parameter grid defined for {best_model_name}")
        tuned_pipeline = None
        tuning_results = {}
        
else:
    print("⚠ No best model identified for hyperparameter tuning")
    tuned_pipeline = None
    tuning_results = {}

def evaluate_final_model_on_test_set(pipeline: Pipeline, X_train: pd.DataFrame, y_train: pd.Series,
                                   X_test: pd.DataFrame, y_test: pd.Series, 
                                   model_name: str, results_dir: str) -> Dict[str, Any]:
    """
    Evaluate the final tuned model on the test set.
    
    Args:
        pipeline: Tuned model pipeline
        X_train, y_train: Training data for fitting
        X_test, y_test: Test data for evaluation
        model_name: Name of the model
        results_dir: Directory for saving results
        
    Returns:
        Dictionary with test set evaluation results
    """
    print(f"\n=== Final Test Set Evaluation for {model_name} ===")
    
    try:
        # Fit the pipeline on full training data
        print("Fitting final model on complete training data...")
        pipeline.fit(X_train, y_train)
        
        # Make predictions on test set
        print("Making predictions on test set...")
        y_pred_test = pipeline.predict(X_test)
        
        # Get probability predictions if available
        try:
            y_proba_test = pipeline.predict_proba(X_test)
            if y_proba_test.shape[1] == 2:
                y_proba_test_pos = y_proba_test[:, 1]  # Probability of positive class
            else:
                y_proba_test_pos = y_proba_test.ravel()
        except Exception:
            y_proba_test_pos = None
            print("⚠ Probability predictions not available")
        
        # Calculate metrics
        test_results = {}
        
        # Basic classification metrics
        test_results['accuracy'] = accuracy_score(y_test, y_pred_test)
        test_results['f1'] = f1_score(y_test, y_pred_test, average='binary')
        test_results['precision'] = precision_score(y_test, y_pred_test, average='binary', zero_division=0)
        test_results['recall'] = recall_score(y_test, y_pred_test, average='binary', zero_division=0)
        test_results['mcc'] = matthews_corrcoef(y_test, y_pred_test)
        
        # Probability-based metrics
        if y_proba_test_pos is not None:
            test_results['roc_auc'] = roc_auc_score(y_test, y_proba_test_pos)
            test_results['pr_auc'] = average_precision_score(y_test, y_proba_test_pos)
        
        # Class distribution
        unique_test, counts_test = np.unique(y_test, return_counts=True)
        unique_pred, counts_pred = np.unique(y_pred_test, return_counts=True)
        
        test_results['test_class_distribution'] = dict(zip(unique_test.astype(str), counts_test.tolist()))
        test_results['pred_class_distribution'] = dict(zip(unique_pred.astype(str), counts_pred.tolist()))
        
        # Print results
        print(f"\nTest Set Performance for {model_name}:")
        print(f"  Accuracy: {test_results['accuracy']:.4f}")
        print(f"  F1-score: {test_results['f1']:.4f}")
        print(f"  Precision: {test_results['precision']:.4f}")
        print(f"  Recall: {test_results['recall']:.4f}")
        print(f"  MCC: {test_results['mcc']:.4f}")
        
        if 'roc_auc' in test_results:
            print(f"  ROC AUC: {test_results['roc_auc']:.4f}")
        if 'pr_auc' in test_results:
            print(f"  PR AUC: {test_results['pr_auc']:.4f}")
        
        # Save detailed results
        test_results_file = os.path.join(results_dir, f"{model_name}_test_results.json")
        try:
            # Convert numpy types to native Python types for JSON serialization
            serializable_results = {}
            for key, value in test_results.items():
                if isinstance(value, (np.integer, np.floating)):
                    serializable_results[key] = float(value)
                else:
                    serializable_results[key] = value
            
            with open(test_results_file, 'w') as f:
                json.dump(serializable_results, f, indent=2)
            print(f"✓ Test results saved to: {test_results_file}")
        except Exception as e:
            print(f"⚠ Error saving test results: {e}")
        
        # Save predictions
        predictions_file = os.path.join(results_dir, f"{model_name}_test_predictions.csv")
        try:
            pred_df = pd.DataFrame({
                'y_true': y_test,
                'y_pred': y_pred_test
            })
            
            if y_proba_test_pos is not None:
                pred_df['y_proba'] = y_proba_test_pos
            
            pred_df.to_csv(predictions_file, index=False)
            print(f"✓ Predictions saved to: {predictions_file}")
        except Exception as e:
            print(f"⚠ Error saving predictions: {e}")
        
        return test_results
        
    except Exception as e:
        print(f"❌ Error in test set evaluation: {e}")
        return {'error': str(e)}

# Test set evaluation
if tuned_pipeline is not None and best_model_name is not None:
    print("\n=== Final Test Set Evaluation ===")
    
    final_test_results = evaluate_final_model_on_test_set(
        pipeline=tuned_pipeline,
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        model_name=best_model_name,
        results_dir=modeling_results_dir
    )
    
    print(f"\n✅ Phase 3 - Modeling Complete!")
    print(f"   Best Model: {best_model_name}")
    if 'f1' in final_test_results:
        print(f"   Test F1-score: {final_test_results['f1']:.4f}")
    if 'accuracy' in final_test_results:
        print(f"   Test Accuracy: {final_test_results['accuracy']:.4f}")
        
else:
    print("⚠ Cannot perform test set evaluation - no tuned pipeline available")
    final_test_results = {}
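
The JSON export in `evaluate_final_model_on_test_set` casts NumPy scalars to `float` before writing. An alternative, sketched below, is to pass a fallback function to `json.dumps` through its `default` argument, which also keeps integers as integers. The helper name `to_native` and the sample values are hypothetical.

```python
import json

import numpy as np


def to_native(value):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-in types."""
    if isinstance(value, np.generic):
        return value.item()    # np.int64 -> int, np.float32 -> float
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")


# Made-up metrics for illustration.
metrics = {
    "accuracy": np.float32(0.5),
    "n_test": np.int64(3),
    "counts": np.array([1, 2]),
}

encoded = json.dumps(metrics, default=to_native)
decoded = json.loads(encoded)
print(decoded)
```

`json` only calls the fallback for objects it cannot encode itself, so plain Python values and nested dictionaries pass through unchanged.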

In [None]:
def create_final_summary_report(model_results_df: pd.DataFrame, 
                               final_test_results: Dict[str, Any],
                               best_model_name: str, 
                               results_dir: str) -> None:
    """
    Create a comprehensive final summary report of the modeling phase.
    
    Args:
        model_results_df: DataFrame with cross-validation results
        final_test_results: Test set evaluation results
        best_model_name: Name of the best performing model
        results_dir: Directory for saving the report
    """
    print("\n" + "="*60)
    print("🎯 DMML PROJECT - MODELING PHASE SUMMARY REPORT")
    print("="*60)
    
    # Phase completion status
    print("\n📋 PHASE COMPLETION STATUS:")
    print("✅ Phase 1 - Custom Transformers: COMPLETED")
    print("✅ Phase 2 - Alternative Preprocessing: COMPLETED") 
    print("✅ Phase 3 - Modeling & Evaluation: COMPLETED")
    print("📋 Phase 4 - Final Tuning & Training: READY")
    
    # Model evaluation summary
    print(f"\n🔍 MODEL EVALUATION SUMMARY:")
    if not model_results_df.empty:
        total_models = len(model_results_df)
        successful_models = len(model_results_df.dropna(subset=['f1_mean'] if 'f1_mean' in model_results_df.columns else []))
        print(f"   Total models evaluated: {total_models}")
        print(f"   Successful evaluations: {successful_models}")
        print(f"   Success rate: {successful_models/total_models*100:.1f}%")
        
        # Top 3 performers
        if 'f1_mean' in model_results_df.columns:
            valid_results = model_results_df.dropna(subset=['f1_mean'])
            if not valid_results.empty:
                top_3 = valid_results.nlargest(3, 'f1_mean')
                print(f"\n   🏆 TOP 3 PERFORMERS (by F1-score):")
                for i, (_, row) in enumerate(top_3.iterrows(), 1):
                    print(f"      {i}. {row['model_name']}: {row['f1_mean']:.4f} ± {row.get('f1_std', 0):.4f}")
    
    # Best model details
    print(f"\n🏅 BEST MODEL DETAILS:")
    if best_model_name:
        print(f"   Model: {best_model_name}")
        
        if not model_results_df.empty:
            best_cv_row = model_results_df[model_results_df['model_name'] == best_model_name]
            if not best_cv_row.empty:
                row = best_cv_row.iloc[0]
                print(f"   Preprocessing: {row.get('preprocessing_type', 'unknown')}")
                print(f"   Uses SMOTE: {row.get('uses_smote', False)}")
                print(f"   Internal balancing: {row.get('uses_balancing', False)}")
                
                # CV performance
                print(f"\n   📊 CROSS-VALIDATION PERFORMANCE:")
                cv_metrics = ['accuracy_mean', 'f1_mean', 'precision_mean', 'recall_mean', 'roc_auc_mean']
                for metric in cv_metrics:
                    if metric in row:
                        std_metric = metric.replace('_mean', '_std')
                        std_val = row.get(std_metric, 0)
                        print(f"      {metric.replace('_mean', '').upper()}: {row[metric]:.4f} ± {std_val:.4f}")
        
        # Test set performance
        if final_test_results and 'error' not in final_test_results:
            print(f"\n   🎯 TEST SET PERFORMANCE:")
            test_metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc', 'pr_auc', 'mcc']
            for metric in test_metrics:
                if metric in final_test_results:
                    print(f"      {metric.upper()}: {final_test_results[metric]:.4f}")
    
    # Technical configuration
    print(f"\n⚙️  TECHNICAL CONFIGURATION:")
    print(f"   Target engineering: STKDE-based binary risk labeling")
    print(f"   Cross-validation: TimeSeriesSplit (5 folds)")
    print(f"   Data leakage prevention: ✅ Implemented")
    print(f"   Class imbalance handling: SMOTE + balanced models")
    print(f"   Hyperparameter tuning: RandomizedSearchCV")
    
    # File outputs
    print(f"\n📁 OUTPUT FILES GENERATED:")
    print(f"   📊 Model comparison: model_selection_results.csv")
    print(f"   🔧 Hyperparameter tuning: {best_model_name}_tuning_results.json")
    print(f"   🎯 Test set evaluation: {best_model_name}_test_results.json")
    print(f"   📈 Test predictions: {best_model_name}_test_predictions.csv")
    print(f"   📂 All files saved to: {results_dir}")
    
    # Quality metrics
    if final_test_results and 'error' not in final_test_results:
        # Local names chosen so they do not shadow sklearn's f1_score inside this function
        test_f1 = final_test_results.get('f1', 0)
        test_accuracy = final_test_results.get('accuracy', 0)
        
        print(f"\n⭐ QUALITY ASSESSMENT:")
        if test_f1 >= 0.8:
            print("   🟢 Excellent performance (F1 ≥ 0.80)")
        elif test_f1 >= 0.7:
            print("   🟡 Good performance (F1 ≥ 0.70)")
        elif test_f1 >= 0.6:
            print("   🟠 Moderate performance (F1 ≥ 0.60)")
        else:
            print("   🔴 Low performance (F1 < 0.60)")
        
        if test_f1 > 0.75 and test_accuracy > 0.75:
            reliability = 'High'
        elif test_f1 > 0.6:
            reliability = 'Moderate'
        else:
            reliability = 'Low'
        print(f"   Model reliability: {reliability}")
    
    # Next steps
    print(f"\n🚀 NEXT STEPS:")
    print("   1. Proceed to Phase 4 - Final Tuning & Training")
    print("   2. Advanced hyperparameter optimization")
    print("   3. Model interpretability analysis")
    print("   4. Production deployment preparation")
    
    print("\n" + "="*60)
    print("📝 End of Modeling Phase Summary")
    print("="*60)

# Generate final summary report
try:
    create_final_summary_report(
        model_results_df=model_results_df if 'model_results_df' in locals() else pd.DataFrame(),
        final_test_results=final_test_results if 'final_test_results' in locals() else {},
        best_model_name=best_model_name if 'best_model_name' in locals() else 'Unknown',
        results_dir=modeling_results_dir
    )
except Exception as e:
    print(f"⚠ Error generating summary report: {e}")
    print("\n✅ Phase 3 - Modeling phase completed with basic functionality")

print(f"\n🎉 PHASE 3 COMPLETED SUCCESSFULLY!")
print(f"The modeling pipeline is now fully functional with:")
print(f"✓ Leakage-free STKDE target engineering")
print(f"✓ Comprehensive model evaluation framework") 
print(f"✓ Robust cross-validation with temporal considerations")
print(f"✓ Hyperparameter tuning capabilities")
print(f"✓ Final test set evaluation")
print(f"\nReady to proceed to Phase 4 - TuningAndTraining.ipynb")

## Workflow Execution

This section executes the main modeling workflow: model selection, hyperparameter tuning (if a best model is found), and final evaluation on the test set.
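
The workflow sorts the training data chronologically and evaluates with `TimeSeriesSplit`. As a minimal sketch on a toy array (12 rows, 3 splits, both chosen only for illustration), the example below shows the property this relies on: in every fold, all training rows come before all validation rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 rows that are already in chronological order.
X = np.arange(12).reshape(-1, 1)
splitter = TimeSeriesSplit(n_splits=3)

folds = list(splitter.split(X))
for train_idx, test_idx in folds:
    # The training window grows with each fold; the validation window moves forward.
    print(f"train 0..{train_idx.max()}  validate {test_idx.min()}..{test_idx.max()}")
```

Because the splitter works on row positions, the guarantee only holds if the rows are sorted by time first, which is why `X_train` is sorted before the folds are built.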

In [None]:
print("\n=== Starting Modeling Workflow ===")

# Configuration for the modeling run
# These would typically be defined earlier or passed as arguments
N_SPLITS_CV = 5  # Number of splits for TimeSeriesSplit
PRIMARY_METRIC_MODEL_SELECTION = 'f1_mean'  # Metric to select the best model
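PRIMARY_METRIC_TUNING = 'f1'  # Metric for hyperparameter tuning scoring
N_ITER_RANDOM_SEARCH = 10  # Number of iterations for RandomizedSearchCV
# Note: perform_hyperparameter_tuning, as defined above, hardcodes scoring='f1' and
# n_iter=20 and does not read the two constants above.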

# 1. Define STKDE Configuration (used by STKDEAndRiskLabelTransformer)
stkde_config = {
    'hs': hs_optimal,  # Loaded from preprocessing artifacts
    'ht': ht_optimal,  # Loaded from preprocessing artifacts
    'strategy': CHOSEN_LABELING_STRATEGY,  # 'fixed' or 'quantile', defined earlier
    'fixed_thresholds': CHOSEN_FIXED_THRESHOLDS,  # None if Jenks is computed per fold for 'fixed'
    'n_classes': 2,  # For binary classification
    'n_jobs': -1,
    'intensity_col_name': 'stkde_intensity_engineered',
    'label_col_name': 'RISK_LEVEL_engineered'
}
print(f"STKDE Configuration: {stkde_config}")

# 2. Define Cross-Validation Strategy
X_train_sorted = X_train.sort_values(by=['YEAR', 'MONTH', 'DAY', 'HOUR']).copy()
y_train_sorted = y_train.loc[X_train_sorted.index].copy()
cv_strategy = TimeSeriesSplit(n_splits=N_SPLITS_CV)

cv_config = {
    'X_train_sorted': X_train_sorted,
    'y_train_sorted': y_train_sorted,  # Placeholder: STKDE transformer will engineer true labels
    'cv_strategy': cv_strategy,
    'scoring': scoring,  # Defined earlier
    'n_jobs_cv': 1
}
print(f"CV Configuration: Using TimeSeriesSplit with {N_SPLITS_CV} splits.")

# --- Start Modifications for Score Management ---
LOAD_SCORES_IF_EXIST = True # Set to False to force recalculation
all_fold_scores = {}
fold_scores_path = os.path.join(modeling_results_dir, 'all_models_fold_scores.json')

if LOAD_SCORES_IF_EXIST and os.path.exists(fold_scores_path):
    print(f"\nℹ️ Loading existing scores from: {fold_scores_path}")
    try:
        with open(fold_scores_path, 'r') as f:
            all_fold_scores = json.load(f)
        print(f"   Loaded scores for {len(all_fold_scores)} models.")
    except Exception as e:
        print(f"\n⚠️ Error loading existing scores: {e}. Re-running all models.")
        all_fold_scores = {}
# --- End Modifications for Score Management ---

# 3. Run Model Selection
print("\n--- Running Model Selection ---")
model_results_list = []

for model_name, model_instance in models_to_evaluate.items():
    print(f"\nEvaluating model: {model_name}")

    # Determine preprocessing type and SMOTE usage
    if model_name in model_categories['tree_preprocessing']:
        preprocessor_ct = preprocessor_trees_ct
        preprocessing_type = 'tree'
    else:
        preprocessor_ct = preprocessor_general_ct
        preprocessing_type = 'general'

    if preprocessor_ct is None:
        print(f"⚠ Skipping {model_name} because its required preprocessor ({preprocessing_type}) is not available.")
        model_results_list.append({
            'model_name': model_name,
            'error': f"Preprocessor '{preprocessing_type}' not available",
            'preprocessing_type': preprocessing_type,
            'uses_smote': False,
            'uses_balancing': False
        })
        continue

    needs_smote = model_name in model_categories.get('needs_smote', [])
    uses_internal_balancing = model_name in model_categories.get('balanced_internally', [])

    print(f"  Preprocessing: {preprocessing_type}, Needs SMOTE: {needs_smote}, Internal Balancing: {uses_internal_balancing}")

    # --- Start Modifications for Loading Model Scores ---
    if LOAD_SCORES_IF_EXIST and model_name in all_fold_scores:
        print(f"  Found existing scores for {model_name}. Using stored scores.")
        cv_results_loaded = all_fold_scores[model_name]
        
        current_result_entry = {
            'model_name': model_name,
            'preprocessing_type': preprocessing_type,
            'uses_smote': needs_smote,
            'uses_balancing': uses_internal_balancing,
            'loaded_from_file': True
        }
        
        valid_scores_found = False
        # Rebuild means and standard deviations from the loaded per-fold scores
        for metric_name_loaded, scores_array_loaded in cv_results_loaded.items():
            is_test_metric = metric_name_loaded.startswith('test_')  # Per-fold score arrays
            is_timing = metric_name_loaded in ['fit_time', 'score_time']
            if not (is_test_metric or is_timing):
                continue
            # np.asarray also copes with values that are not plain lists (e.g. a scalar)
            scores_np_array = np.asarray(scores_array_loaded)
            if scores_np_array.size == 0:
                continue
            if is_test_metric:
                # Strip only the leading 'test_' prefix
                metric_key_loaded = metric_name_loaded[len('test_'):]
                current_result_entry[f"{metric_key_loaded}_mean"] = np.mean(scores_np_array)
                current_result_entry[f"{metric_key_loaded}_std"] = np.std(scores_np_array)
                valid_scores_found = True
            else:
                current_result_entry[f"{metric_name_loaded}_mean"] = np.mean(scores_np_array)
        
        if valid_scores_found:
            model_results_list.append(current_result_entry)
            # Improved Metrics Display for loaded scores
            f1_mean_loaded = current_result_entry.get('f1_mean', np.nan)
            f1_std_loaded = current_result_entry.get('f1_std', np.nan)
            print(f"  ✓ {model_name} CV scores loaded. F1: {f1_mean_loaded:.4f} ± {f1_std_loaded:.4f}")
            
            accuracy_mean_loaded = current_result_entry.get('accuracy_mean', np.nan)
            accuracy_std_loaded = current_result_entry.get('accuracy_std', np.nan)
            print(f"    Accuracy: {accuracy_mean_loaded:.4f} ± {accuracy_std_loaded:.4f}")
            
            roc_auc_mean_loaded = current_result_entry.get('roc_auc_mean', np.nan)
            roc_auc_std_loaded = current_result_entry.get('roc_auc_std', np.nan)
            if not np.isnan(roc_auc_mean_loaded):
                print(f"    ROC AUC: {roc_auc_mean_loaded:.4f} ± {roc_auc_std_loaded:.4f}")
            continue
        else:
            print(f"  ⚠️ Loaded scores for {model_name} are invalid or empty. Re-running cross-validation.")
    # --- End Modifications for Loading Model Scores ---

    stkde_transformer_instance = STKDEAndRiskLabelTransformer(**stkde_config)

    if needs_smote:
        feature_processor_pipeline = ImbPipeline([
            ('preprocessor', preprocessor_ct),
            ('smote', SMOTE(random_state=42))
        ])
    else:
        feature_processor_pipeline = Pipeline([
            ('preprocessor', preprocessor_ct)
        ])

    full_pipeline = CustomModelPipeline(
        stkde_transformer=stkde_transformer_instance,
        feature_processor=feature_processor_pipeline,
        classifier=clone(model_instance)
    )

    try:
        cv_results = cross_validate(
            full_pipeline,
            cv_config['X_train_sorted'],
            cv_config['y_train_sorted'],
            cv=cv_config['cv_strategy'],
            scoring=cv_config['scoring'],
            n_jobs=cv_config['n_jobs_cv'],
            error_score='raise'
        )

        # --- Start Modifications for Saving Fold Scores ---
        all_fold_scores[model_name] = {k_cv: (v_cv.tolist() if isinstance(v_cv, np.ndarray) else v_cv) for k_cv, v_cv in cv_results.items()}
        # --- End Modifications for Saving Fold Scores ---

        result_entry = {
            'model_name': model_name,
            'preprocessing_type': preprocessing_type,
            'uses_smote': needs_smote,
            'uses_balancing': uses_internal_balancing,
            'loaded_from_file': False
        }
        for metric_name, scores_array in cv_results.items():
            if metric_name.startswith('test_'):
                metric_key = metric_name.replace('test_', '')
                result_entry[f"{metric_key}_mean"] = np.mean(scores_array)
                result_entry[f"{metric_key}_std"] = np.std(scores_array)
            elif metric_name in ['fit_time', 'score_time']:
                result_entry[f"{metric_name}_mean"] = np.mean(scores_array)
        
        model_results_list.append(result_entry)
        
        # --- Improved Metrics Display ---
        f1_mean = result_entry.get('f1_mean', np.nan)
        f1_std = result_entry.get('f1_std', np.nan)
        print(f"  ✓ {model_name} CV completed. F1: {f1_mean:.4f} ± {f1_std:.4f}")
        
        accuracy_mean = result_entry.get('accuracy_mean', np.nan)
        accuracy_std = result_entry.get('accuracy_std', np.nan)
        print(f"    Accuracy: {accuracy_mean:.4f} ± {accuracy_std:.4f}")
        
        roc_auc_mean = result_entry.get('roc_auc_mean', np.nan)
        roc_auc_std = result_entry.get('roc_auc_std', np.nan)
        if not np.isnan(roc_auc_mean):
            print(f"    ROC AUC: {roc_auc_mean:.4f} ± {roc_auc_std:.4f}")
        # --- End Improved Metrics Display ---

    except Exception as e:
        print(f"  ❌ Error evaluating {model_name}: {e}")
        model_results_list.append({
            'model_name': model_name,
            'error': str(e),
            'preprocessing_type': preprocessing_type,
            'uses_smote': needs_smote,
            'uses_balancing': uses_internal_balancing,
            'loaded_from_file': False
        })
# --- Start Final Save of All Fold Scores ---
if all_fold_scores:
    try:
        with open(fold_scores_path, 'w') as f:
            json.dump(all_fold_scores, f, indent=4)
        print(f"\n✓ All fold scores have been saved to: {fold_scores_path}")
    except Exception as e:
        print(f"\n⚠️ Error saving all fold scores: {e}")
# --- End Final Save of All Fold Scores ---

# 4. Save and Analyze Model Selection Results
model_results_df = pd.DataFrame(model_results_list)
results_file_path = os.path.join(modeling_results_dir, 'model_selection_results.csv')
model_results_df.to_csv(results_file_path, index=False)
print(f"\n✓ Model selection results saved to: {results_file_path}")

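if not model_results_df.empty:
    # Sort only when the primary metric column exists (it is missing if every model failed)
    if PRIMARY_METRIC_MODEL_SELECTION in model_results_df.columns:
        display(model_results_df.sort_values(by=PRIMARY_METRIC_MODEL_SELECTION, ascending=False))
    else:
        display(model_results_df)
else:
    print("No model results to display.")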

if not model_results_df.empty and PRIMARY_METRIC_MODEL_SELECTION in model_results_df.columns:
    valid_model_results_df = model_results_df.dropna(subset=[PRIMARY_METRIC_MODEL_SELECTION])
    if not valid_model_results_df.empty:
        analysis_summary = analyze_model_results(valid_model_results_df, primary_metric=PRIMARY_METRIC_MODEL_SELECTION)
        create_results_visualization(valid_model_results_df, analysis_summary, primary_metric=PRIMARY_METRIC_MODEL_SELECTION)

        if analysis_summary and 'best_model' in analysis_summary and analysis_summary['best_model']:
            best_model_info = analysis_summary['best_model']
            best_model_name = best_model_info['name']
            print(f"\n🏆 Best model from selection: {best_model_name} with {PRIMARY_METRIC_MODEL_SELECTION}: {best_model_info['score']:.4f}")
        else:
            best_model_name = None
            print("\n⚠ Could not determine best model from selection results (analysis returned no best model).")
    else:
        best_model_name = None
        analysis_summary = {}
        print(f"\n⚠ No valid models with primary metric '{PRIMARY_METRIC_MODEL_SELECTION}' found after loading/running. Cannot determine best model.")
elif not model_results_df.empty:
    print(f"\n⚠ Primary metric '{PRIMARY_METRIC_MODEL_SELECTION}' not found in results. Cannot determine best model.")
    best_model_name = None
    analysis_summary = {}
else:
    print("\n⚠ Model selection produced no results. Cannot determine best model.")
    best_model_name = None
    analysis_summary = {}

# 5. Hyperparameter Tuning
tuned_pipeline = None
tuning_results_summary = {}

if best_model_name and best_model_name in models_to_evaluate:
    print(f"\n--- Hyperparameter Tuning for {best_model_name} ---")
    base_model_for_tuning = models_to_evaluate[best_model_name]
    param_grids_all = define_parameter_grids()  # Must be defined in a previous cell

    if best_model_name not in param_grids_all:
        print(f"⚠ No parameter grid defined for {best_model_name}. Skipping tuning.")
    else:
        param_grid_for_model = param_grids_all[best_model_name]

        if best_model_name in model_categories['tree_preprocessing']:
            preprocessor_ct_tuning = preprocessor_trees_ct
        else:
            preprocessor_ct_tuning = preprocessor_general_ct

        if preprocessor_ct_tuning is None:
            print(f"⚠ Preprocessor for {best_model_name} not available. Skipping tuning.")
        else:
            needs_smote_tuning = best_model_name in model_categories.get('needs_smote', [])
            stkde_transformer_tuning = STKDEAndRiskLabelTransformer(**stkde_config)

            # Adjust parameter grid keys based on whether SMOTE is in the feature_processor pipeline
            # The CustomModelPipeline structure is:
            # stkde_transformer -> feature_processor (preprocessor_ct -> smote?) -> classifier
            # So, RandomizedSearchCV params for classifier need to be prefixed by 'classifier__'
            # And params for preprocessor_ct need to be prefixed by 'feature_processor__preprocessor__'
            
            # Create the feature processing part of the pipeline first
            if needs_smote_tuning:
                feature_processor_tuning = ImbPipeline([
                    ('preprocessor', preprocessor_ct_tuning), # Params: feature_processor__preprocessor__<param>
                    ('smote', SMOTE(random_state=42))         # SMOTE usually has no tunable params here
                ])
            else:
                feature_processor_tuning = Pipeline([
                    ('preprocessor', preprocessor_ct_tuning) # Params: feature_processor__preprocessor__<param>
                ])

            # Remap the keys in param_grid_for_model to the CustomModelPipeline structure.
            # Original param_grids are defined like 'feature_pipeline__classifier__n_estimators', so:
            #   'feature_pipeline__classifier__'   -> 'classifier__'
            #   'feature_pipeline__preprocessor__' -> 'feature_processor__preprocessor__'
            key_prefix_map = {
                'feature_pipeline__classifier__': 'classifier__',
                'feature_pipeline__preprocessor__': 'feature_processor__preprocessor__',
            }
            adjusted_param_grid_for_tuning = {}
            for key, value in param_grid_for_model.items():
                new_key = key  # Keys already in the right form (e.g. 'classifier__n_estimators') are kept
                for old_prefix, new_prefix in key_prefix_map.items():
                    if key.startswith(old_prefix):
                        # Rewrite only the leading prefix, not later occurrences in the key
                        new_key = new_prefix + key[len(old_prefix):]
                        break
                adjusted_param_grid_for_tuning[new_key] = value

            print(f"Adjusted param grid for {best_model_name} tuning: {list(adjusted_param_grid_for_tuning)}")


            full_tuning_pipeline = CustomModelPipeline(
                stkde_transformer=stkde_transformer_tuning,
                feature_processor=feature_processor_tuning, # This is now an ImbPipeline or Pipeline
                classifier=clone(base_model_for_tuning)   # Params: classifier__<param>
            )
            
            try:
                print(f"Starting RandomizedSearchCV for {best_model_name} with {N_ITER_RANDOM_SEARCH} iterations.")
                random_search = RandomizedSearchCV(
                    estimator=full_tuning_pipeline,
                    param_distributions=adjusted_param_grid_for_tuning, # Use the adjusted grid
                    n_iter=N_ITER_RANDOM_SEARCH,
                    cv=cv_config['cv_strategy'],
                    scoring=PRIMARY_METRIC_TUNING,
                    n_jobs=cv_config['n_jobs_cv'], # Use 1 if STKDE is not thread-safe or for debugging
                    random_state=42,
                    verbose=1,
                    error_score='raise' # Make sure errors in CV during search are raised
                )
                # Fit on y_train_sorted (dummy target), actual target engineered inside pipeline
                random_search.fit(cv_config['X_train_sorted'], cv_config['y_train_sorted'])


                tuned_pipeline = random_search.best_estimator_
                tuning_results_summary = {
                    'best_score': random_search.best_score_,
                    'best_params': random_search.best_params_,
                    'model_name': best_model_name
                }

                print(f"✓ Hyperparameter tuning completed for {best_model_name}.")
                print(f"  Best CV {PRIMARY_METRIC_TUNING} score: {random_search.best_score_:.4f}")
                print(f"  Best parameters: {random_search.best_params_}")

                tuning_results_file = os.path.join(modeling_results_dir, f"{best_model_name}_tuning_results.json")
                # Ensure all parts of cv_results_ and best_params_ are JSON serializable
                serializable_cv_results = {}
                if hasattr(random_search, 'cv_results_'):
                    for k_cv, v_cv in random_search.cv_results_.items():
                        if isinstance(v_cv, np.ndarray):
                            serializable_cv_results[k_cv] = v_cv.tolist()
                        elif isinstance(v_cv, list) and v_cv and isinstance(v_cv[0], (np.integer, np.floating, np.bool_)):
                            # Convert NumPy scalars inside lists to native Python types
                            serializable_cv_results[k_cv] = [item.item() if hasattr(item, 'item') else item for item in v_cv]
                        else:
                            serializable_cv_results[k_cv] = v_cv
                
                serializable_best_params = {}
                if random_search.best_params_:
                    for k_bp, v_bp in random_search.best_params_.items():
                        if isinstance(v_bp, (np.integer, np.floating, np.bool_)):
                            serializable_best_params[k_bp] = v_bp.item()  # Convert NumPy scalars
                        else:
                            serializable_best_params[k_bp] = v_bp


                with open(tuning_results_file, 'w') as f:
                    json.dump({
                        'best_score': random_search.best_score_,
                        'best_params': serializable_best_params,
                        'cv_results': serializable_cv_results
                    }, f, indent=2)
                print(f"✓ Tuning results saved to: {tuning_results_file}")

            except Exception as e:
                print(f"❌ Error during hyperparameter tuning for {best_model_name}: {e}")
                import traceback
                traceback.print_exc() # Print full traceback for debugging
                tuned_pipeline = None # Ensure it's None if tuning fails
                tuning_results_summary = {'error': str(e), 'model_name': best_model_name}
else:
    print("\n⚠ No best model identified or best model not in evaluation list. Skipping hyperparameter tuning.")

# 6. Final Evaluation on Test Set
final_test_results = {} # Initialize final_test_results
if tuned_pipeline is not None and best_model_name is not None:
    print(f"\n--- Final Test Set Evaluation for Tuned {best_model_name} ---")
    
    try:
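        print("Refitting the best tuned pipeline on the entire training set (X_train_sorted, y_train_sorted)...")
        # RandomizedSearchCV (refit=True by default) has already refit best_estimator_ on this data;
        # the explicit fit is kept as a safeguard. y_train_sorted is a dummy target: the actual
        # target is engineered inside the pipeline.
        tuned_pipeline.fit(cv_config['X_train_sorted'], cv_config['y_train_sorted'])
        print("Refit complete.")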

        print("Making predictions on the test set (X_test)...")
        y_pred_test = tuned_pipeline.predict(X_test)

        # To get the actual y_test_engineered, we need to pass X_test through the STKDE transformer part
        # of the *fitted* tuned_pipeline.
        print("Engineering true labels for the test set using the fitted STKDE transformer...")
        if hasattr(tuned_pipeline, 'stkde_transformer') and hasattr(tuned_pipeline.stkde_transformer, 'transform'):
            # Ensure X_test is copied to avoid in-place modifications if any
            X_test_stkde_aug = tuned_pipeline.stkde_transformer.transform(X_test.copy())
            y_test_engineered = X_test_stkde_aug[tuned_pipeline.stkde_transformer.label_col_name]
            print(f"Engineered y_test shape: {y_test_engineered.shape}")
        else:
            raise AttributeError("Tuned pipeline does not have a 'stkde_transformer' with a 'transform' method.")

        # y_pred_test and y_test_engineered must describe the same rows. A length mismatch means
        # STKDEAndRiskLabelTransformer.transform changed the number of rows in X_test, so the
        # metrics below could not be computed reliably; fail fast instead of continuing.
        if len(y_pred_test) != len(y_test_engineered):
            raise ValueError(
                f"Length mismatch between y_pred_test ({len(y_pred_test)}) and "
                f"y_test_engineered ({len(y_test_engineered)}); cannot evaluate on the test set."
            )
        
        y_proba_test_pos = None
        if hasattr(tuned_pipeline, "predict_proba"):
            print("Getting probability predictions from the tuned pipeline...")
            y_proba_test_full = tuned_pipeline.predict_proba(X_test) # Probabilities for all classes
            if y_proba_test_full.ndim == 2 and y_proba_test_full.shape[1] >= 2:
                y_proba_test_pos = y_proba_test_full[:, 1]  # Probability of the positive class (class 1)
            elif y_proba_test_full.ndim == 1: # Case for some classifiers or if only one class prob is returned
                y_proba_test_pos = y_proba_test_full
            else:
                print(f"Warning: Unexpected shape for probability predictions: {y_proba_test_full.shape}")
        else:
            print("Probability predictions (predict_proba) not available for the tuned pipeline.")

        current_test_metrics = {
            'accuracy': accuracy_score(y_test_engineered, y_pred_test),
            'f1': f1_score(y_test_engineered, y_pred_test, average='binary', zero_division=0),
            'precision': precision_score(y_test_engineered, y_pred_test, average='binary', zero_division=0),
            'recall': recall_score(y_test_engineered, y_pred_test, average='binary', zero_division=0),
            'mcc': matthews_corrcoef(y_test_engineered, y_pred_test)
        }
        if y_proba_test_pos is not None:
            try:
                # Ensure y_proba_test_pos has the same length as y_test_engineered
                if len(y_proba_test_pos) == len(y_test_engineered):
                    current_test_metrics['roc_auc'] = roc_auc_score(y_test_engineered, y_proba_test_pos)
                    current_test_metrics['pr_auc'] = average_precision_score(y_test_engineered, y_proba_test_pos)
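                    # log_loss needs probabilities for all classes. The value is negated so that it
                    # matches the 'neg_log_loss' key and scikit-learn's scorer convention (higher is better).
                    current_test_metrics['neg_log_loss'] = -log_loss(y_test_engineered, y_proba_test_full)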
                else:
                    print(f"Warning: Length mismatch for probability scores. y_proba_test_pos: {len(y_proba_test_pos)}, y_test_engineered: {len(y_test_engineered)}")
            except ValueError as e_proba_metrics:
                print(f"Could not calculate probability-based metrics on test set: {e_proba_metrics}")
        
        final_test_results.update(current_test_metrics)

        print(f"\nTest Set Performance for Tuned {best_model_name}:")
        for metric, value in final_test_results.items():
            if isinstance(value, float):
                print(f"  {metric.capitalize()}: {value:.4f}")
            else:
                print(f"  {metric.capitalize()}: {value}")


        test_results_file_path = os.path.join(modeling_results_dir, f"{best_model_name}_final_test_results.json")
        with open(test_results_file_path, 'w') as f:
            # Ensure all values are native Python types for JSON
            serializable_final_test_results = {
                k_ftr: (v_ftr.item() if hasattr(v_ftr, 'item') else v_ftr) 
                for k_ftr, v_ftr in final_test_results.items()
            }
            json.dump(serializable_final_test_results, f, indent=2)
        print(f"✓ Final test results saved to: {test_results_file_path}")

        # Save predictions with original features and engineered target/predictions
        # Ensure X_test_stkde_aug (which might have different row count/index than original X_test) is used for joining
        # y_test_engineered is already aligned with X_test_stkde_aug
        # y_pred_test needs to be aligned if it's not already a Series with the same index
        
        # Create a base DataFrame from X_test that matches the rows of y_test_engineered
        # This assumes y_test_engineered (from X_test_stkde_aug) has an index that can be used to loc original X_test features
        if isinstance(y_test_engineered, pd.Series):
            aligned_X_test_features = X_test.loc[y_test_engineered.index].copy()
        else: # If y_test_engineered is not a Series (e.g. numpy array), this alignment is harder
            print("Warning: y_test_engineered is not a pandas Series. Cannot reliably align original X_test features for predictions export.")
            aligned_X_test_features = X_test.copy() # Fallback, might not be aligned

        predictions_output_df = pd.DataFrame(index=aligned_X_test_features.index)
        predictions_output_df['true_risk_level_engineered'] = y_test_engineered
        
        if isinstance(y_pred_test, pd.Series) and y_pred_test.index.equals(predictions_output_df.index):
            predictions_output_df['predicted_risk_level'] = y_pred_test
        elif isinstance(y_pred_test, np.ndarray) and len(y_pred_test) == len(predictions_output_df):
            predictions_output_df['predicted_risk_level'] = y_pred_test
        else:
            print("Warning: Could not align y_pred_test with the output DataFrame for predictions.")

        if y_proba_test_pos is not None:
            if isinstance(y_proba_test_pos, pd.Series) and y_proba_test_pos.index.equals(predictions_output_df.index):
                predictions_output_df['predicted_proba_high_risk'] = y_proba_test_pos
            elif isinstance(y_proba_test_pos, np.ndarray) and len(y_proba_test_pos) == len(predictions_output_df):
                predictions_output_df['predicted_proba_high_risk'] = y_proba_test_pos
            else:
                print("Warning: Could not align y_proba_test_pos with the output DataFrame for predictions.")

        # Add original features
        final_predictions_df_with_features = aligned_X_test_features.join(predictions_output_df)

        final_predictions_file_path = os.path.join(modeling_results_dir, f"{best_model_name}_final_test_predictions_with_features.csv")
        final_predictions_df_with_features.to_csv(final_predictions_file_path, index=True) # Keep index if it's meaningful (e.g. original IDs)
        print(f"✓ Final test predictions with features saved to: {final_predictions_file_path}")

    except Exception as e_final_eval:
        print(f"❌ Error during final test set evaluation for {best_model_name}: {e_final_eval}")
        import traceback
        traceback.print_exc()
        # final_test_results might be partially populated or empty
        if not final_test_results: # Ensure it's a dict even on error
            final_test_results = {'error': str(e_final_eval), 'model_name': best_model_name}


else:
    # final_test_results keeps its initial empty-dict value in this branch
    print("\n⚠ Cannot perform final test set evaluation: tuned pipeline or best model name is not available.")


# 7. Generate Final Summary Report
print("\n--- Generating Final Summary Report ---")
# All three inputs are always defined earlier in this cell, so no locals() checks are needed
create_final_summary_report(
    model_results_df=model_results_df,
    final_test_results=final_test_results,
    best_model_name=best_model_name if best_model_name is not None else 'Unknown',
    results_dir=modeling_results_dir
)

print("\n=== Modeling Workflow Completed ===")
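
The key-remapping step that runs before `RandomizedSearchCV` can be tried in isolation. The following is a minimal sketch, not the notebook's implementation: `remap_param_keys`, `PREFIX_MAP`, and the example grid are hypothetical names and values introduced here for illustration, and the prefix pairs simply mirror the ones used in the tuning block above.

```python
from typing import Any, Dict, Sequence, Tuple

# Hypothetical prefix pairs, mirroring the ones used in the tuning block above.
PREFIX_MAP: Sequence[Tuple[str, str]] = (
    ("feature_pipeline__classifier__", "classifier__"),
    ("feature_pipeline__preprocessor__", "feature_processor__preprocessor__"),
)


def remap_param_keys(
    param_grid: Dict[str, Any],
    prefix_map: Sequence[Tuple[str, str]] = PREFIX_MAP,
) -> Dict[str, Any]:
    """Return a copy of param_grid with old key prefixes swapped for new ones.

    Only a leading prefix is rewritten; keys matching no prefix are kept as-is.
    """
    remapped: Dict[str, Any] = {}
    for key, value in param_grid.items():
        new_key = key
        for old_prefix, new_prefix in prefix_map:
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        remapped[new_key] = value
    return remapped


# Illustrative grid (made-up values, not the notebook's actual search space).
example_grid = {
    "feature_pipeline__classifier__n_estimators": [100, 200],
    "feature_pipeline__preprocessor__num__scaler__with_mean": [True, False],
    "classifier__max_depth": [3, 5],
}
remapped_grid = remap_param_keys(example_grid)
print(sorted(remapped_grid))
```

Slicing off the matched prefix, rather than calling `str.replace`, rewrites only the start of the key and leaves any later occurrence of the same substring untouched.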
