# Fraud Detection - Model Training and Evaluation (5 Scenarios)

This notebook implements a **systematic comparison of 5 scenarios** for the IEEE-CIS Fraud Detection task.

## Key Design Decisions

**No SMOTE**: All scenarios use **class weighting only** (no synthetic oversampling).

**Fixed Evaluation Protocol**:
- StratifiedKFold: n_splits=3, shuffle=True, random_state=42
- Metrics: CV ROC-AUC, Test ROC-AUC, Test PR-AUC (AP), Youden threshold, Precision/Recall/F1
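
As an illustration of the Youden threshold listed above, here is a minimal sketch on made-up scores. The toy arrays and the helper name `youden_threshold` are illustrative only and are not part of the notebook. Youden's J is `TPR - FPR`, and the reported threshold is the ROC operating point that maximizes it.

```python
import numpy as np
from sklearn.metrics import roc_curve


def youden_threshold(y_true, y_prob):
    """Return (threshold, J) at the ROC point that maximizes J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j_scores = tpr - fpr
    best = int(np.argmax(j_scores))
    return thresholds[best], j_scores[best]


# Toy labels and scores (hypothetical, chosen so the optimum is easy to check by hand)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

thr, j = youden_threshold(y_true, y_prob)
print(f"threshold={thr:.2f}, J={j:.2f}")

# Binary predictions at the chosen threshold
y_pred = (y_prob >= thr).astype(int)
```

At a threshold of 0.40 all three positives are caught and one of five negatives is flagged. Note that a threshold picked on the same labels it is scored on gives optimistic precision, recall, and F1; choosing it on out-of-fold predictions is one possible alternative.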

## 5 Comparison Scenarios

| # | Scenario | Missing Strategy | Feature Set | Purpose |
|---|----------|-----------------|-------------|---------|
| 1 | Baseline | Sentinel (-999) | Full (175) | Control experiment |
| 2 | NaN Strategy | NaN (native) | Full (175) | Missing value handling comparison |
| 3 | No Interactions | Sentinel (-999) | No inter_* features | Feature ablation |
| 4 | Strong Only | Sentinel (-999) | strong_features + strong_cat | Aggressive reduction |
| 5 | Strong + Moderate | Sentinel (-999) | strong + moderate features | Moderate reduction (resolves to all 175 features in this run) |

## Models Evaluated (per scenario)

**Base Models** (4):
1. RandomForest - Bagging ensemble with class_weight='balanced'
2. XGBoost - Gradient boosting with scale_pos_weight
3. LightGBM - Leaf-wise growth with is_unbalance=True
4. CatBoost - Ordered boosting with auto_class_weights='Balanced'

**Stacking Variants** (3):
5. Stacking_Weighted - CV-AUC-based weighted average
6. Stacking_Logistic - LogisticRegression meta-learner (OOF-trained)
7. Stacking_Ridge - LogisticRegression meta-learner with a stronger L2 penalty (C=0.1 instead of C=1.0; OOF-trained)
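
To make the class-weighting choice used by the base models concrete, here is a small sketch on a toy label vector (the 95/5 split is made up). It relies on scikit-learn's documented `'balanced'` rule, `n_samples / (n_classes * np.bincount(y))`, and shows that the resulting positive-to-negative weight ratio equals the negative-to-positive count ratio, which is the quantity this notebook passes to XGBoost as `scale_pos_weight`. How LightGBM's `is_unbalance` and CatBoost's `auto_class_weights` derive their weights internally is not covered here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 95 legitimate, 5 fraud (hypothetical numbers)
y = np.array([0] * 95 + [1] * 5)

# scikit-learn 'balanced' weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

# Ratio used for XGBoost's scale_pos_weight in this notebook: negatives / positives
neg_pos_ratio = (y == 0).sum() / (y == 1).sum()

print("balanced weights:", weights)
print("weight ratio (pos/neg):", weights[1] / weights[0])
print("count ratio (neg/pos):", neg_pos_ratio)
```

Both ratios agree, so `class_weight='balanced'` and `scale_pos_weight = n_neg / n_pos` put the same relative emphasis on the fraud class.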

In [1]:
# =============================================================================
# Import Libraries
# =============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import time
import pickle
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, roc_curve, auc, average_precision_score,
    precision_score, recall_score, f1_score
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

# Boosting models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Custom functions
import sys
ROOT = Path.cwd().parent
sys.path.append(str(ROOT / "functions"))

print("Libraries imported successfully!")
print("Note: SMOTE/imblearn NOT imported - using class weights only")

Libraries imported successfully!
Note: SMOTE/imblearn NOT imported - using class weights only


## 1. Load Data & Configuration

In [2]:
# =============================================================================
# Load Preprocessed Data from EDA.ipynb
# =============================================================================
import pickle

DATA = ROOT / "data"

# Load DataFrames
train_df = pd.read_parquet(DATA / "train_preprocessed.parquet")
test_df = pd.read_parquet(DATA / "test_preprocessed.parquet")

# Load feature lists
with open(DATA / "feature_lists.pkl", 'rb') as f:
    feature_lists = pickle.load(f)

filtered_features = feature_lists['filtered_features']
categorical_features = feature_lists['categorical_for_model']
strong_features = feature_lists['strong_features']
moderate_features = feature_lists['moderate_features']

print(f"✅ Data loaded successfully!")
print(f"   Train shape: {train_df.shape}")
print(f"   Test shape: {test_df.shape}")
print(f"   Features for modeling: {len(filtered_features)}")

✅ Data loaded successfully!
   Train shape: (472432, 195)
   Test shape: (118108, 195)
   Features for modeling: 175


In [3]:
# =============================================================================
# CONFIGURATION - Fixed evaluation protocol for all scenarios
# =============================================================================

# Cross-validation settings (fixed for all scenarios)
N_FOLDS = 3
CV_RANDOM_STATE = 42

# Examine the keys available in the already-loaded feature lists
print("📋 Available feature list keys:")
for key in feature_lists.keys():
    if isinstance(feature_lists[key], list):
        print(f"   {key}: {len(feature_lists[key])} features")
    else:
        print(f"   {key}: {type(feature_lists[key])}")

# Check for categorical feature lists
strong_cat = feature_lists.get('strong_cat', [])
moderate_cat = feature_lists.get('moderate_cat', [])

print(f"\n✅ Configuration loaded:")
print(f"   CV: StratifiedKFold, n_splits={N_FOLDS}, shuffle=True, random_state={CV_RANDOM_STATE}")
print(f"   Strong categorical features: {len(strong_cat)}")
print(f"   Moderate categorical features: {len(moderate_cat)}")

📋 Available feature list keys:
   filtered_features: 175 features
   strong_features: 51 features
   moderate_features: 55 features
   weak_features: 13 features
   strong_cat: 67 features
   moderate_cat: 2 features
   weak_cat: 3 features
   categorical_for_model: 69 features
   numerical_for_model: 106 features

✅ Configuration loaded:
   CV: StratifiedKFold, n_splits=3, shuffle=True, random_state=42
   Strong categorical features: 67
   Moderate categorical features: 2


In [4]:
# =============================================================================
# Prepare Base Features and Target (Original Data - Scenarios will transform)
# =============================================================================

# Load original preprocessed data (with NaN values intact for scenario flexibility)
X_train_raw = train_df[filtered_features].copy()
y_train = train_df['isFraud'].copy()
X_test_raw = test_df[filtered_features].copy()
y_test = test_df['isFraud'].copy()

# =============================================================================
# Handle Infinity Values Only (NaN handling is scenario-specific)
# =============================================================================
inf_cols = X_train_raw.columns[X_train_raw.isin([np.inf, -np.inf]).any()].tolist()
if inf_cols:
    print(f"⚠️ Found infinity in {len(inf_cols)} columns: {inf_cols}")

X_train_raw = X_train_raw.replace([np.inf, -np.inf], np.nan)
X_test_raw = X_test_raw.replace([np.inf, -np.inf], np.nan)

# Calculate class imbalance ratio for class weighting
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

print(f"\n📊 Base Data prepared:")
print(f"   X_train_raw shape: {X_train_raw.shape}")
print(f"   X_test_raw shape: {X_test_raw.shape}")
print(f"   Train fraud rate: {y_train.mean()*100:.2f}%")
print(f"   Test fraud rate: {y_test.mean()*100:.2f}%")
print(f"   Class imbalance ratio (scale_pos_weight): {scale_pos_weight:.2f}")


📊 Base Data prepared:
   X_train_raw shape: (472432, 175)
   X_test_raw shape: (118108, 175)
   Train fraud rate: 3.51%
   Test fraud rate: 3.44%
   Class imbalance ratio (scale_pos_weight): 27.46


### Temporal Split Justification

**Why Temporal vs Random Split:**

In fraud detection, using a **temporal (time-based) split** is critical because:

1. **Concept Drift**: Fraudsters continuously adapt their strategies over time, making historical patterns less predictive of future fraud.
2. **Data Leakage Prevention**: Random splits would allow "future" information to leak into training, artificially inflating performance metrics.
3. **Realistic Evaluation**: Models trained on past data must predict future fraud patterns, mirroring real-world deployment.

**Academic Justification:**

- *Krawczyk et al. (2017), "Ensemble learning for data stream analysis: A survey", Information Fusion*: Highlights the importance of temporal ordering in streaming data classification tasks.
- *Dal Pozzolo et al. (2014), "Learned lessons in credit card fraud detection from a practitioner perspective", Expert Systems with Applications*: Demonstrates that temporal splits provide more realistic performance estimates in fraud detection.

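**Implication**: CV scores are computed with shuffled stratified folds inside the training period, so each validation fold covers the same time span as its training folds. The test set instead represents a **future time period** with potentially different fraud patterns, so test performance is expected to be lower than CV performance. In the runs below, base-model CV ROC-AUC is roughly 0.92 to 0.94, while test ROC-AUC is roughly 0.88 to 0.91. This gap is expected behavior and reflects real-world deployment conditions.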
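
For reference, a time-based 80/20 split can be sketched as follows. This is only an illustration under assumptions: the split used here was produced upstream in EDA.ipynb and is not shown in this notebook, `TransactionDT` is assumed to be the time column, and `temporal_split` is a hypothetical helper name.

```python
import pandas as pd


def temporal_split(df, time_col, train_frac=0.8):
    """Sort by time and cut once: earliest rows go to train, latest to test."""
    ordered = df.sort_values(time_col, kind="mergesort")  # stable sort
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy transactions with shuffled timestamps (hypothetical values)
toy = pd.DataFrame({
    "TransactionDT": [50, 10, 40, 20, 30, 60, 80, 70, 100, 90],
    "isFraud":       [0,  0,  1,  0,  0,  0,  1,  0,  0,   1],
})

past, future = temporal_split(toy, "TransactionDT")
print(len(past), len(future))
print(past["TransactionDT"].max(), "<", future["TransactionDT"].min())
```

Every training row precedes every test row in time, which is the property that rules out leakage from the future.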

In [5]:
# # =============================================================================
# # Visualize Train/Test Split (Temporal)
# # =============================================================================
# fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# # 1. Dataset Size Comparison
# ax1 = axes[0]
# sizes = [len(X_train), len(X_test)]
# labels = ['Train (80%)', 'Test (20%)']
# colors = ['#3498db', '#e74c3c']
# bars = ax1.bar(labels, sizes, color=colors, edgecolor='black', linewidth=1.2)
# ax1.set_ylabel('Number of Samples', fontsize=11)
# ax1.set_title('Dataset Split', fontsize=12, fontweight='bold')
# for bar, size in zip(bars, sizes):
#     ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1000, 
#              f'{size:,}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# # 2. Fraud Rate Comparison
# ax2 = axes[1]
# fraud_rates = [y_train.mean()*100, y_test.mean()*100]
# bars2 = ax2.bar(labels, fraud_rates, color=colors, edgecolor='black', linewidth=1.2)
# ax2.set_ylabel('Fraud Rate (%)', fontsize=11)
# ax2.set_title('Fraud Rate by Split', fontsize=12, fontweight='bold')
# ax2.axhline(y=y_train.mean()*100, color='gray', linestyle='--', alpha=0.5, label='Train Rate')
# for bar, rate in zip(bars2, fraud_rates):
#     ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
#              f'{rate:.2f}%', ha='center', va='bottom', fontsize=10, fontweight='bold')

# # 3. Class Distribution (Stacked)
# ax3 = axes[2]
# train_counts = [len(y_train) - y_train.sum(), y_train.sum()]
# test_counts = [len(y_test) - y_test.sum(), y_test.sum()]
# x_pos = np.arange(2)
# width = 0.35

# bars_normal = ax3.bar(x_pos - width/2, [train_counts[0], test_counts[0]], width, 
#                        label='Normal', color='#2ecc71', edgecolor='black')
# bars_fraud = ax3.bar(x_pos + width/2, [train_counts[1], test_counts[1]], width, 
#                       label='Fraud', color='#e74c3c', edgecolor='black')
# ax3.set_xticks(x_pos)
# ax3.set_xticklabels(['Train', 'Test'])
# ax3.set_ylabel('Count', fontsize=11)
# ax3.set_title('Class Distribution', fontsize=12, fontweight='bold')
# ax3.legend(loc='upper right')
# ax3.set_yscale('log')  # Log scale due to imbalance

# plt.tight_layout()
# plt.show()

# # Summary table
# print("\n" + "="*60)
# print("DATA SPLIT SUMMARY (Temporal Split: Train=Past, Test=Future)")
# print("="*60)
# print(f"{'Dataset':<12} {'Samples':>12} {'Normal':>12} {'Fraud':>10} {'Fraud %':>10}")
# print("-"*60)
# print(f"{'Train':<12} {len(X_train):>12,} {int(len(y_train)-y_train.sum()):>12,} {int(y_train.sum()):>10,} {y_train.mean()*100:>9.2f}%")
# print(f"{'Test':<12} {len(X_test):>12,} {int(len(y_test)-y_test.sum()):>12,} {int(y_test.sum()):>10,} {y_test.mean()*100:>9.2f}%")
# print("="*60)

## 2. Model Definitions (Class-Weighted Only)

In [6]:
# =============================================================================
# Model Factory Functions (Class-weighted only - No SMOTE)
# =============================================================================

def create_random_forest(use_imputer: bool = False):
    """
    Create RandomForest classifier with optional imputation for NaN handling.
    
    Args:
        use_imputer: If True, wrap RF in pipeline with median imputer (for NaN strategy)
    """
    rf = RandomForestClassifier(
        n_estimators=200,
        max_depth=15,
        min_samples_split=10,
        min_samples_leaf=5,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    )
    
    if use_imputer:
        return Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('classifier', rf)
        ])
    return rf


def create_xgboost(scale_pos_weight: float):
    """Create XGBoost classifier with class weighting."""
    return XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        scale_pos_weight=scale_pos_weight,
        eval_metric='auc',
        random_state=42,
        n_jobs=-1
    )


def create_lightgbm():
    """Create LightGBM classifier with is_unbalance flag."""
    return LGBMClassifier(
        n_estimators=300,
        max_depth=8,
        learning_rate=0.05,
        num_leaves=31,
        subsample=0.8,
        colsample_bytree=0.8,
        is_unbalance=True,
        random_state=42,
        n_jobs=-1,
        verbose=-1
    )


def create_catboost():
    """Create CatBoost classifier with auto_class_weights."""
    return CatBoostClassifier(
        iterations=300,
        depth=6,
        learning_rate=0.05,
        auto_class_weights='Balanced',
        eval_metric='AUC',
        random_seed=42,
        verbose=0
    )


def get_base_models(missing_strategy: str, scale_pos_weight: float):
    """
    Get dictionary of base models configured for the given missing strategy.
    
    Args:
        missing_strategy: 'sentinel' or 'nan'
        scale_pos_weight: Class imbalance ratio for XGBoost
    
    Returns:
        Dict of model_name -> model instance
    """
    use_imputer = (missing_strategy == 'nan')  # RF needs imputer for NaN
    
    return {
        'RandomForest': create_random_forest(use_imputer=use_imputer),
        'XGBoost': create_xgboost(scale_pos_weight),
        'LightGBM': create_lightgbm(),
        'CatBoost': create_catboost()
    }


print("✅ Model factory functions defined (class-weighted only, no SMOTE)")
print(f"   Class imbalance ratio for XGBoost: {scale_pos_weight:.2f}")

✅ Model factory functions defined (class-weighted only, no SMOTE)
   Class imbalance ratio for XGBoost: 27.46


## Scenario Definitions

We define **5 comparison scenarios** to systematically evaluate different feature engineering and missing value strategies.

| Scenario | Missing Strategy | Feature Set | Description |
|----------|-----------------|-------------|-------------|
| 1 | Sentinel (-999) | Full (175 features) | Baseline control |
| 2 | NaN (native) | Full (175 features) | Compare missing value handling |
| 3 | Sentinel (-999) | No interaction features | Feature ablation study |
| 4 | Sentinel (-999) | Strong features only | Aggressive feature reduction |
| 5 | Sentinel (-999) | Strong + Moderate features | Moderate feature reduction (resolves to all 175 features in this run) |

**All scenarios use:**
- Class weighting mechanisms (no SMOTE)
- StratifiedKFold: n_splits=3, shuffle=True, random_state=42
- Metrics: CV ROC-AUC, Test ROC-AUC, Test PR-AUC, Youden threshold, Precision/Recall/F1
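
The two missing-value strategies can be illustrated on a toy frame. This is a sketch only: the column names and values are made up, and `apply_missing_strategy` is a hypothetical helper that mirrors the idea rather than the notebook's exact function. Under the `nan` strategy the boosting models receive NaN directly, while RandomForest is wrapped with a median imputer, which is shown at the end.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def apply_missing_strategy(X, strategy, sentinel=-999):
    """Return a copy of X with missing values handled per scenario."""
    if strategy == "sentinel":
        return X.fillna(sentinel)   # explicit out-of-range flag value
    if strategy == "nan":
        return X.copy()             # leave NaN for the model or an imputer
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy features (hypothetical)
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 7.0]})

X_sentinel = apply_missing_strategy(X, "sentinel")
X_nan = apply_missing_strategy(X, "nan")

# Median imputation, as used for RandomForest under the 'nan' strategy
X_imputed = SimpleImputer(strategy="median").fit_transform(X_nan)

print(X_sentinel)
print(X_imputed)
```

The sentinel keeps "was missing" visible to tree splits as a distinct value, whereas median imputation hides it; that difference is what Scenario 2 is meant to probe.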

In [7]:
# =============================================================================
# Define All 5 Scenarios
# =============================================================================

# Get available feature subsets
interaction_features = [f for f in filtered_features if f.startswith('inter_')]
non_interaction_features = [f for f in filtered_features if not f.startswith('inter_')]

# Strong features (numerical + categorical)
strong_num = feature_lists.get('strong_features', [])
strong_cat = feature_lists.get('strong_cat', [])
strong_all = strong_num + strong_cat
strong_all = [f for f in strong_all if f in X_train_raw.columns]

# Moderate features (numerical + categorical if available)
moderate_num = feature_lists.get('moderate_features', [])
moderate_cat = feature_lists.get('moderate_cat', [])

# Strong + Moderate combined
strong_moderate_all = strong_num + moderate_num + strong_cat + moderate_cat
strong_moderate_all = list(dict.fromkeys(strong_moderate_all))  # Remove duplicates, keep a stable column order
strong_moderate_all = [f for f in strong_moderate_all if f in X_train_raw.columns]

print("📊 Feature subset sizes:")
print(f"   Full features: {len(filtered_features)}")
print(f"   Interaction features (inter_*): {len(interaction_features)}")
print(f"   Non-interaction features: {len(non_interaction_features)}")
print(f"   Strong (num + cat): {len(strong_all)}")
print(f"   Strong + Moderate: {len(strong_moderate_all)}")
if not moderate_cat:
    print("   ⚠️ Note: moderate_cat not found in feature_lists.pkl, using empty list")

# =============================================================================
# Define Scenario Configurations
# =============================================================================
SCENARIOS = {
    1: {
        'name': 'Baseline (Control)',
        'missing_strategy': 'sentinel',
        'features': filtered_features.copy(),
        'description': 'Missing=-999, full 175 features'
    },
    2: {
        'name': 'NaN Strategy',
        'missing_strategy': 'nan',
        'features': filtered_features.copy(),
        'description': 'Missing=NaN, full 175 features'
    },
    3: {
        'name': 'No Interaction Features',
        'missing_strategy': 'sentinel',
        'features': non_interaction_features.copy(),
        'description': 'Missing=-999, no inter_* features'
    },
    4: {
        'name': 'Strong Features Only',
        'missing_strategy': 'sentinel',
        'features': strong_all.copy(),
        'description': 'Missing=-999, strong_features + strong_cat'
    },
    5: {
        'name': 'Strong + Moderate Features',
        'missing_strategy': 'sentinel',
        'features': strong_moderate_all.copy(),
        'description': 'Missing=-999, strong + moderate features'
    }
}

print("\n✅ Scenarios defined:")
for sid, cfg in SCENARIOS.items():
    print(f"   Scenario {sid}: {cfg['name']} ({len(cfg['features'])} features, missing={cfg['missing_strategy']})")

📊 Feature subset sizes:
   Full features: 175
   Interaction features (inter_*): 30
   Non-interaction features: 145
   Strong (num + cat): 118
   Strong + Moderate: 175

✅ Scenarios defined:
   Scenario 1: Baseline (Control) (175 features, missing=sentinel)
   Scenario 2: NaN Strategy (175 features, missing=nan)
   Scenario 3: No Interaction Features (145 features, missing=sentinel)
   Scenario 4: Strong Features Only (118 features, missing=sentinel)
   Scenario 5: Strong + Moderate Features (175 features, missing=sentinel)


## 3. Scenario Runner - Model Training and Evaluation

**For each scenario, the runner:**
1. Prepares features according to scenario configuration
2. Trains 4 base models (RF, XGB, LGBM, CatBoost) with 3-fold CV
3. Collects out-of-fold (OOF) predictions for stacking
4. Builds 3 stacking variants:
   - **Stacking_Weighted**: CV-AUC-based weighted average
   - **Stacking_Logistic**: LogisticRegression meta-learner on OOF
   - **Stacking_Ridge**: LogisticRegression meta-learner on OOF with a stronger L2 penalty (C=0.1 instead of C=1.0)
5. Evaluates all 7 approaches on the test set
6. Records metrics: CV_AUC, Test_AUC, Test_AP, Youden threshold, Precision/Recall/F1
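
The steps above can be sketched on synthetic data. This is a compact illustration, not the notebook's runner: it uses two small scikit-learn models in place of the four base models, `make_classification` data in place of the fraud data, and computes each weight from the AUC of the pooled OOF predictions rather than the mean of per-fold AUCs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (stand-in for the real features)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

base = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "tree": DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=0),
}
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# OOF predictions: every training row is scored by a model that never saw it
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=skf, method="predict_proba")[:, 1]
    for m in base.values()
])
# Test predictions from models refit on the full training set
test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base.values()
])

# Variant 1: AUC-proportional weighted average of base-model probabilities
cv_auc = np.array([roc_auc_score(y_tr, oof[:, i]) for i in range(oof.shape[1])])
weights = cv_auc / cv_auc.sum()
weighted_probs = test @ weights

# Variant 2: meta-learner trained on OOF predictions only
meta = LogisticRegression(class_weight="balanced", max_iter=1000).fit(oof, y_tr)
meta_probs = meta.predict_proba(test)[:, 1]

print("weights:", weights.round(3))
print("weighted AUC:", round(roc_auc_score(y_te, weighted_probs), 4))
print("meta AUC:", round(roc_auc_score(y_te, meta_probs), 4))
```

Training the meta-learner on OOF rather than in-sample predictions matters because in-sample base-model outputs are overconfident and would teach the meta-learner to trust them too much.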

In [8]:
# =============================================================================
# Scenario Runner - Execute All Scenarios End-to-End
# =============================================================================

def prepare_data_for_scenario(X_train_raw, X_test_raw, features, missing_strategy):
    """Prepare train/test data for a specific scenario."""
    X_train = X_train_raw[features].copy()
    X_test = X_test_raw[features].copy()
    
    if missing_strategy == 'sentinel':
        MISSING_FLAG = -999
        X_train = X_train.fillna(MISSING_FLAG)
        X_test = X_test.fillna(MISSING_FLAG)
    # For 'nan', keep NaN values (handled by models or imputers)
    
    return X_train, X_test


def evaluate_model_on_test(y_true, y_prob):
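    """Calculate all test metrics including Youden-optimal threshold.

    Note: the threshold is selected on the same labels it is scored on, so
    Precision_opt/Recall_opt/F1_opt are optimistic. Test_AUC and Test_AP are
    threshold-free and unaffected.
    """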
    test_auc = roc_auc_score(y_true, y_prob)
    test_ap = average_precision_score(y_true, y_prob)
    
    # Optimal threshold using Youden's J statistic
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j_scores = tpr - fpr
    optimal_idx = np.argmax(j_scores)
    optimal_thresh = thresholds[optimal_idx]
    
    # Metrics at optimal threshold
    y_pred_binary = (y_prob >= optimal_thresh).astype(int)
    precision_opt = precision_score(y_true, y_pred_binary, zero_division=0)
    recall_opt = recall_score(y_true, y_pred_binary, zero_division=0)
    f1_opt = f1_score(y_true, y_pred_binary, zero_division=0)
    
    return {
        'Test_AUC': test_auc,
        'Test_AP': test_ap,
        'Optimal_Threshold': optimal_thresh,
        'Precision_opt': precision_opt,
        'Recall_opt': recall_opt,
        'F1_opt': f1_opt
    }


def run_scenario(scenario_id, config, X_train_raw, X_test_raw, y_train, y_test, 
                 scale_pos_weight, n_folds=3, random_state=42):
    """
    Run a complete scenario: train base models with CV, build stackings, evaluate on test.
    
    Returns:
        DataFrame with results for all 7 approaches (4 base + 3 stacking)
    """
    print(f"\n{'='*80}")
    print(f"SCENARIO {scenario_id}: {config['name']}")
    print(f"Description: {config['description']}")
    print(f"{'='*80}")
    
    # Prepare data for this scenario
    X_train, X_test = prepare_data_for_scenario(
        X_train_raw, X_test_raw, config['features'], config['missing_strategy']
    )
    print(f"Data: {X_train.shape[1]} features, missing={config['missing_strategy']}")
    
    # Get models for this scenario
    base_models = get_base_models(config['missing_strategy'], scale_pos_weight)
    
    # Initialize storage
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    results = []
    oof_predictions = {}
    test_predictions = {}
    cv_scores_dict = {}
    
    # Train each base model with CV + OOF collection
    for model_name, model in base_models.items():
        start_time = time.time()
        
        oof_probs = np.zeros(len(X_train))
        cv_scores = []
        
        # K-Fold CV with OOF prediction collection
        for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
            X_fold_train = X_train.iloc[train_idx]
            y_fold_train = y_train.iloc[train_idx]
            X_fold_val = X_train.iloc[val_idx]
            y_fold_val = y_train.iloc[val_idx]
            
            fold_model = clone(model)
            fold_model.fit(X_fold_train, y_fold_train)
            
            fold_probs = fold_model.predict_proba(X_fold_val)[:, 1]
            oof_probs[val_idx] = fold_probs
            
            fold_auc = roc_auc_score(y_fold_val, fold_probs)
            cv_scores.append(fold_auc)
        
        cv_scores = np.array(cv_scores)
        oof_predictions[model_name] = oof_probs
        cv_scores_dict[model_name] = cv_scores.mean()
        
        # Train final model on full training data
        final_model = clone(model)
        final_model.fit(X_train, y_train)
        
        # Test predictions
        test_probs = final_model.predict_proba(X_test)[:, 1]
        test_predictions[model_name] = test_probs
        
        # Evaluate on test
        test_metrics = evaluate_model_on_test(y_test, test_probs)
        
        elapsed = time.time() - start_time
        
        results.append({
            'Scenario': scenario_id,
            'Approach': model_name,
            'CV_AUC_mean': cv_scores.mean(),
            'CV_AUC_std': cv_scores.std(),
            **test_metrics
        })
        
        print(f"  {model_name}: CV AUC {cv_scores.mean():.4f}±{cv_scores.std():.4f} | "
              f"Test AUC {test_metrics['Test_AUC']:.4f} | {elapsed:.1f}s")
    
    # ==========================================================================
    # Build Stacking Variants
    # ==========================================================================
    model_names = list(base_models.keys())
    oof_matrix = np.column_stack([oof_predictions[name] for name in model_names])
    test_matrix = np.column_stack([test_predictions[name] for name in model_names])
    
    # Scale for meta-learners
    meta_scaler = StandardScaler()
    oof_matrix_scaled = meta_scaler.fit_transform(oof_matrix)
    test_matrix_scaled = meta_scaler.transform(test_matrix)
    
    # --- Stacking_Weighted: CV-AUC-based weights ---
    total_auc = sum(cv_scores_dict.values())
    weights = np.array([cv_scores_dict[name] / total_auc for name in model_names])
    
    weighted_test_probs = np.dot(test_matrix, weights)
    test_predictions['Stacking_Weighted'] = weighted_test_probs
    test_metrics = evaluate_model_on_test(y_test, weighted_test_probs)
    
    results.append({
        'Scenario': scenario_id,
        'Approach': 'Stacking_Weighted',
        'CV_AUC_mean': np.nan,
        'CV_AUC_std': np.nan,
        **test_metrics
    })
    print(f"  Stacking_Weighted: Test AUC {test_metrics['Test_AUC']:.4f}")
    
    # --- Stacking_Logistic: OOF-trained LogisticRegression ---
    meta_logistic = LogisticRegression(
        C=1.0, class_weight='balanced', solver='lbfgs', max_iter=1000, random_state=42
    )
    meta_logistic.fit(oof_matrix_scaled, y_train)
    
    logistic_test_probs = meta_logistic.predict_proba(test_matrix_scaled)[:, 1]
    test_predictions['Stacking_Logistic'] = logistic_test_probs
    test_metrics = evaluate_model_on_test(y_test, logistic_test_probs)
    
    results.append({
        'Scenario': scenario_id,
        'Approach': 'Stacking_Logistic',
        'CV_AUC_mean': np.nan,
        'CV_AUC_std': np.nan,
        **test_metrics
    })
    print(f"  Stacking_Logistic: Test AUC {test_metrics['Test_AUC']:.4f}")
    
    # --- Stacking_Ridge: LogisticRegression with a stronger L2 penalty (C=0.1, Ridge-like) ---
    meta_ridge = LogisticRegression(
        C=0.1, penalty='l2', class_weight='balanced', solver='lbfgs', 
        max_iter=1000, random_state=42
    )
    meta_ridge.fit(oof_matrix_scaled, y_train)
    
    ridge_test_probs = meta_ridge.predict_proba(test_matrix_scaled)[:, 1]
    test_predictions['Stacking_Ridge'] = ridge_test_probs
    test_metrics = evaluate_model_on_test(y_test, ridge_test_probs)
    
    results.append({
        'Scenario': scenario_id,
        'Approach': 'Stacking_Ridge',
        'CV_AUC_mean': np.nan,
        'CV_AUC_std': np.nan,
        **test_metrics
    })
    print(f"  Stacking_Ridge: Test AUC {test_metrics['Test_AUC']:.4f}")
    
    return pd.DataFrame(results)


# =============================================================================
# Execute All Scenarios
# =============================================================================
print("🚀 Starting Scenario Runner - All 5 Scenarios")
print(f"   CV: StratifiedKFold, n_splits={N_FOLDS}, shuffle=True, random_state={CV_RANDOM_STATE}")
print(f"   Models: RandomForest, XGBoost, LightGBM, CatBoost + 3 Stacking variants")

all_results = []

for scenario_id, config in SCENARIOS.items():
    scenario_results = run_scenario(
        scenario_id=scenario_id,
        config=config,
        X_train_raw=X_train_raw,
        X_test_raw=X_test_raw,
        y_train=y_train,
        y_test=y_test,
        scale_pos_weight=scale_pos_weight,
        n_folds=N_FOLDS,
        random_state=CV_RANDOM_STATE
    )
    all_results.append(scenario_results)

# Combine all results into master DataFrame
master_results = pd.concat(all_results, ignore_index=True)
print(f"\n{'='*80}")
print(f"✅ All scenarios complete! Total results: {len(master_results)} rows")

🚀 Starting Scenario Runner - All 5 Scenarios
   CV: StratifiedKFold, n_splits=3, shuffle=True, random_state=42
   Models: RandomForest, XGBoost, LightGBM, CatBoost + 3 Stacking variants

SCENARIO 1: Baseline (Control)
Description: Missing=-999, full 175 features
Data: 175 features, missing=sentinel
  RandomForest: CV AUC 0.9345±0.0006 | Test AUC 0.8845 | 260.2s
  XGBoost: CV AUC 0.9404±0.0003 | Test AUC 0.8978 | 443.4s
  LightGBM: CV AUC 0.9381±0.0008 | Test AUC 0.9003 | 41.3s
  CatBoost: CV AUC 0.9182±0.0008 | Test AUC 0.8933 | 113.8s
  Stacking_Weighted: Test AUC 0.9027
  Stacking_Logistic: Test AUC 0.8732
  Stacking_Ridge: Test AUC 0.8734

SCENARIO 2: NaN Strategy
Description: Missing=NaN, full 175 features
Data: 175 features, missing=nan
  RandomForest: CV AUC 0.9336±0.0004 | Test AUC 0.8889 | 221.5s
  XGBoost: CV AUC 0.9410±0.0001 | Test AUC 0.9047 | 447.9s
  LightGBM: CV AUC 0.9396±0.0015 | Test AUC 0.9075 | 36.1s
  CatBoost: CV AUC 0.9179±0.0015 | Test AUC 0.8883 | 11

## 4. Results Output and Export

The scenario runner has completed execution. Below we:
1. Display the **master results table** with all scenarios and approaches
2. Display **per-scenario tables**, each sorted by Test AUC
3. Export results (in Section 5) to:
   - `results_master.csv` - All results in flat CSV format
   - `results_by_scenario.xlsx` - Excel workbook with a Master sheet plus one sheet per scenario

In [9]:
# =============================================================================
# Format and Display Results
# =============================================================================

# Round numeric columns for display
numeric_cols = ['CV_AUC_mean', 'CV_AUC_std', 'Test_AUC', 'Test_AP', 
                'Optimal_Threshold', 'Precision_opt', 'Recall_opt', 'F1_opt']

results_display = master_results.copy()
for col in numeric_cols:
    results_display[col] = results_display[col].round(4)

# =============================================================================
# 1. MASTER TABLE (All Scenarios × All Approaches)
# =============================================================================
print("\n" + "="*120)
print("MASTER RESULTS TABLE - All Scenarios and Approaches")
print("="*120)
print(results_display.to_string(index=False))

# Best overall approach
best_row = results_display.loc[results_display['Test_AUC'].idxmax()]
print(f"\n🏆 Best Overall: Scenario {int(best_row['Scenario'])} - {best_row['Approach']} "
      f"(Test AUC: {best_row['Test_AUC']:.4f})")

# =============================================================================
# 2. PER-SCENARIO TABLES
# =============================================================================
print("\n" + "="*120)
print("RESULTS BY SCENARIO")
print("="*120)

scenario_results_dict = {}

for scenario_id in sorted(master_results['Scenario'].unique()):
    scenario_name = SCENARIOS[scenario_id]['name']
    scenario_desc = SCENARIOS[scenario_id]['description']
    
    scenario_df = results_display[results_display['Scenario'] == scenario_id].copy()
    scenario_df = scenario_df.drop(columns=['Scenario'])  # Remove redundant column
    scenario_df = scenario_df.sort_values('Test_AUC', ascending=False).reset_index(drop=True)
    
    scenario_results_dict[scenario_id] = scenario_df
    
    print(f"\n{'='*80}")
    print(f"Scenario {scenario_id} Results: {scenario_name}")
    print(f"({scenario_desc})")
    print("="*80)
    print(scenario_df.to_string(index=False))
    
    # Best in scenario
    best_in_scenario = scenario_df.iloc[0]
    print(f"\n  Best: {best_in_scenario['Approach']} (Test AUC: {best_in_scenario['Test_AUC']:.4f})")


MASTER RESULTS TABLE - All Scenarios and Approaches
 Scenario          Approach  CV_AUC_mean  CV_AUC_std  Test_AUC  Test_AP  Optimal_Threshold  Precision_opt  Recall_opt  F1_opt
        1      RandomForest       0.9345      0.0006    0.8845   0.4414             0.2560         0.1268      0.8246  0.2198
        1           XGBoost       0.9404      0.0003    0.8978   0.4889             0.3293         0.1613      0.7894  0.2679
        1          LightGBM       0.9381      0.0008    0.9003   0.5042             0.4406         0.1703      0.7704  0.2789
        1          CatBoost       0.9182      0.0008    0.8933   0.4800             0.4826         0.1500      0.7793  0.2515
        1 Stacking_Weighted          NaN         NaN    0.9027   0.4994             0.3824         0.1631      0.7889  0.2703
        1 Stacking_Logistic          NaN         NaN    0.8732   0.4625             0.2303         0.1394      0.7958  0.2373
        1    Stacking_Ridge          NaN         NaN    0.8734   
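
The `Optimal_Threshold`, `Precision_opt`, `Recall_opt`, and `F1_opt` columns depend on how the cut-off is chosen. The notebook's evaluation protocol names the Youden threshold, but the code that computes it is not part of this section, so the following is only a minimal sketch of Youden's J statistic (J = TPR - FPR) written with NumPy. The function name `youden_threshold` and the toy labels and scores are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    best_thr, best_j = None, -np.inf
    # Each distinct score is a candidate cut: predict positive when score >= thr.
    for thr in np.unique(y_score):
        pred = y_score >= thr
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        j = tpr - fpr
        if j > best_j:
            best_thr, best_j = thr, j
    return float(best_thr), float(best_j)

# Toy labels and scores, invented for illustration only.
y = [0, 0, 0, 0, 1, 0, 1, 1]
p = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
thr, j = youden_threshold(y, p)
print(f"Youden threshold={thr:.2f}, J={j:.2f}")
```

Because J weights the true-positive and false-positive rates equally regardless of class balance, a J-optimal cut on a rare-positive problem can give high recall with low precision, which matches the pattern in the table (recall near 0.8, precision below 0.2).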

## 5. Export Results to Files

Export results in two formats:
1. **CSV**: Flat file with all results (`results_master.csv`)
2. **Excel**: Multi-sheet workbook with one sheet per scenario (`results_by_scenario.xlsx`)

In [10]:
# =============================================================================
# Export Results to Files
# =============================================================================

# Create results directory if needed
results_dir = ROOT / "results"
results_dir.mkdir(exist_ok=True)

# =============================================================================
# 1. Save Master CSV
# =============================================================================
csv_path = results_dir / "results_master.csv"
results_display.to_csv(csv_path, index=False)
print(f"✅ Saved: {csv_path}")

# =============================================================================
# 2. Save Excel with Per-Scenario Sheets
# =============================================================================
excel_path = results_dir / "results_by_scenario.xlsx"

with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
    # Master sheet
    results_display.to_excel(writer, sheet_name='Master', index=False)
    
    # Per-scenario sheets
    for scenario_id, scenario_df in scenario_results_dict.items():
        sheet_name = f"Scenario_{scenario_id}"
        scenario_df.to_excel(writer, sheet_name=sheet_name, index=False)

print(f"✅ Saved: {excel_path}")
print(f"\n📁 Output files saved to: {results_dir}")
print(f"   - results_master.csv (all results)")
print(f"   - results_by_scenario.xlsx (Excel with {len(SCENARIOS)} scenario sheets + Master)")

✅ Saved: c:\Users\Abdulkadir\Desktop\Uygulama çalışmaları\Fraud_Detection\Fraud_Detection\results\results_master.csv
✅ Saved: c:\Users\Abdulkadir\Desktop\Uygulama çalışmaları\Fraud_Detection\Fraud_Detection\results\results_by_scenario.xlsx

📁 Output files saved to: c:\Users\Abdulkadir\Desktop\Uygulama çalışmaları\Fraud_Detection\Fraud_Detection\results
   - results_master.csv (all results)
   - results_by_scenario.xlsx (Excel with 5 scenario sheets + Master)
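
A quick way to check an export like this is to read the file back and compare it with the in-memory frame. The sketch below does that for the CSV format only, using a temporary directory and a small stand-in frame: three Scenario 1 rows copied from the master table, with only two metric columns kept. The name `toy` and the reduced column set are assumptions for illustration; the real `results_display` frame has more columns and rows.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three Scenario 1 rows copied from the master table; only two metric columns kept.
toy = pd.DataFrame({
    "Scenario": [1, 1, 1],
    "Approach": ["XGBoost", "LightGBM", "Stacking_Weighted"],
    "Test_AUC": [0.8978, 0.9003, 0.9027],
    "Test_AP": [0.4889, 0.5042, 0.4994],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "results_master.csv"
    toy.to_csv(csv_path, index=False)   # same call pattern as the export cell
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(f"Round trip OK: {len(reloaded)} rows, columns={list(reloaded.columns)}")
```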


## 6. Summary Analysis

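This section reports the best approach in each scenario (by Test AUC), the best scenario overall, and whether the best stacking variant beats the best base model.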

In [11]:
# =============================================================================
# Summary Analysis: Best Approaches by Scenario
# =============================================================================

# Best approach per scenario
print("📊 Best Approach per Scenario (by Test AUC):")
print("-" * 60)

summary_rows = []
for scenario_id in sorted(SCENARIOS.keys()):
    scenario_df = master_results[master_results['Scenario'] == scenario_id]
    best_row = scenario_df.loc[scenario_df['Test_AUC'].idxmax()]
    
    summary_rows.append({
        'Scenario': scenario_id,
        'Name': SCENARIOS[scenario_id]['name'],
        'Features': len(SCENARIOS[scenario_id]['features']),
        'Best_Approach': best_row['Approach'],
        'Test_AUC': best_row['Test_AUC'],
        'Test_AP': best_row['Test_AP']
    })
    
    print(f"  Scenario {scenario_id} ({SCENARIOS[scenario_id]['name']}): "
          f"{best_row['Approach']} (AUC={best_row['Test_AUC']:.4f})")

summary_df = pd.DataFrame(summary_rows)

# Best overall scenario
best_scenario_row = summary_df.loc[summary_df['Test_AUC'].idxmax()]
print(f"\n🏆 Best Overall Scenario: {int(best_scenario_row['Scenario'])} - {best_scenario_row['Name']}")
print(f"   Best Approach: {best_scenario_row['Best_Approach']}")
print(f"   Test AUC: {best_scenario_row['Test_AUC']:.4f}")
print(f"   Test AP (PR-AUC): {best_scenario_row['Test_AP']:.4f}")

# Comparison: Base Models vs Stacking
print("\n📈 Stacking vs Base Models Comparison:")
print("-" * 60)

base_approaches = ['RandomForest', 'XGBoost', 'LightGBM', 'CatBoost']
stacking_approaches = ['Stacking_Weighted', 'Stacking_Logistic', 'Stacking_Ridge']

for scenario_id in sorted(SCENARIOS.keys()):
    scenario_df = master_results[master_results['Scenario'] == scenario_id]
    
    best_base_auc = scenario_df[scenario_df['Approach'].isin(base_approaches)]['Test_AUC'].max()
    best_stack_auc = scenario_df[scenario_df['Approach'].isin(stacking_approaches)]['Test_AUC'].max()
    
    improvement = best_stack_auc - best_base_auc
    status = "✅" if improvement > 0 else "⚠️"
    
    print(f"  Scenario {scenario_id}: Best Base={best_base_auc:.4f}, Best Stack={best_stack_auc:.4f}, "
          f"Δ={improvement:+.4f} {status}")

📊 Best Approach per Scenario (by Test AUC):
------------------------------------------------------------
  Scenario 1 (Baseline (Control)): Stacking_Weighted (AUC=0.9027)
  Scenario 2 (NaN Strategy): LightGBM (AUC=0.9075)
  Scenario 3 (No Interaction Features): Stacking_Weighted (AUC=0.9018)
  Scenario 4 (Strong Features Only): LightGBM (AUC=0.8998)
  Scenario 5 (Strong + Moderate Features): Stacking_Weighted (AUC=0.9023)

🏆 Best Overall Scenario: 2 - NaN Strategy
   Best Approach: LightGBM
   Test AUC: 0.9075
   Test AP (PR-AUC): 0.5128

📈 Stacking vs Base Models Comparison:
------------------------------------------------------------
  Scenario 1: Best Base=0.9003, Best Stack=0.9027, Δ=+0.0024 ✅
  Scenario 2: Best Base=0.9075, Best Stack=0.9053, Δ=-0.0022 ⚠️
  Scenario 3: Best Base=0.9008, Best Stack=0.9018, Δ=+0.0010 ✅
  Scenario 4: Best Base=0.8998, Best Stack=0.8987, Δ=-0.0010 ⚠️
  Scenario 5: Best Base=0.9014, Best Stack=0.9023, Δ=+0.0009 ✅
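
To put the per-scenario deltas in perspective, the printed values can be aggregated into a win count and a mean difference. The sketch below re-enters the rounded AUCs from the output above into a small frame; the names `comparison`, `stack_wins`, and `mean_delta` are assumptions for illustration and do not exist in the notebook.

```python
import pandas as pd

# Values copied from the printed comparison above (rounded to 4 decimals).
comparison = pd.DataFrame({
    "Scenario": [1, 2, 3, 4, 5],
    "Best_Base": [0.9003, 0.9075, 0.9008, 0.8998, 0.9014],
    "Best_Stack": [0.9027, 0.9053, 0.9018, 0.8987, 0.9023],
})
comparison["Delta"] = (comparison["Best_Stack"] - comparison["Best_Base"]).round(4)

# Count scenarios where stacking beats the best base model, and average the gap.
stack_wins = int((comparison["Delta"] > 0).sum())
mean_delta = comparison["Delta"].mean()

print(comparison.to_string(index=False))
print(f"Stacking wins: {stack_wins}/{len(comparison)}, mean delta: {mean_delta:+.4f}")
```

On these rounded values stacking wins in 3 of 5 scenarios, with a mean difference of about +0.0002 AUC. The Scenario 4 delta comes out as -0.0011 here rather than the printed -0.0010, presumably because the notebook subtracts the unrounded AUCs before formatting.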


## 7. Execution Complete

All 5 scenarios have been evaluated. Results are saved to:
- `results/results_master.csv`
- `results/results_by_scenario.xlsx`

In [12]:
# =============================================================================
# Final Summary Table
# =============================================================================

print("\n" + "="*120)
print("FINAL SUMMARY - Scenario Comparison")
print("="*120)

# Create summary comparison table
final_summary = summary_df[['Scenario', 'Name', 'Features', 'Best_Approach', 'Test_AUC', 'Test_AP']].copy()
final_summary.columns = ['Scenario', 'Name', 'N_Features', 'Best_Approach', 'Test_AUC', 'Test_AP']
final_summary['Test_AUC'] = final_summary['Test_AUC'].round(4)
final_summary['Test_AP'] = final_summary['Test_AP'].round(4)

print(final_summary.to_string(index=False))

print("\n✅ Notebook execution complete!")
print(f"   Total scenarios: {len(SCENARIOS)}")
print(f"   Approaches per scenario: 7 (4 base + 3 stacking)")
print(f"   Total results: {len(master_results)} rows")
print(f"\n📁 Files saved:")
print(f"   - {results_dir / 'results_master.csv'}")
print(f"   - {results_dir / 'results_by_scenario.xlsx'}")


FINAL SUMMARY - Scenario Comparison
 Scenario                       Name  N_Features     Best_Approach  Test_AUC  Test_AP
        1         Baseline (Control)         175 Stacking_Weighted    0.9027   0.4994
        2               NaN Strategy         175          LightGBM    0.9075   0.5128
        3    No Interaction Features         145 Stacking_Weighted    0.9018   0.5022
        4       Strong Features Only         118          LightGBM    0.8998   0.4907
        5 Strong + Moderate Features         175 Stacking_Weighted    0.9023   0.4994

✅ Notebook execution complete!
   Total scenarios: 5
   Approaches per scenario: 7 (4 base + 3 stacking)
   Total results: 35 rows

📁 Files saved:
   - c:\Users\Abdulkadir\Desktop\Uygulama çalışmaları\Fraud_Detection\Fraud_Detection\results\results_master.csv
   - c:\Users\Abdulkadir\Desktop\Uygulama çalışmaları\Fraud_Detection\Fraud_Detection\results\results_by_scenario.xlsx
