# Machine Learning Model Training & Evaluation

## Objectives
1. Train five ML models on the preprocessed fraud detection data
2. Use cross-validation to assess model robustness
3. Evaluate with metrics appropriate for imbalanced data
4. Compare models using F1-Score, Recall, ROC-AUC, and PR-AUC
5. Identify the best-performing model
6. Analyze and interpret the results

In [1]:
"""
IMPORT LIBRARIES AND SETUP
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings
from sklearn.model_selection import cross_validate
import joblib

warnings.filterwarnings('ignore')

# Add src to path
sys.path.insert(0, '../src')
from models import ModelFactory
from evaluation import ModelEvaluator

# Configuration
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Visualization
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print('✓ Libraries imported successfully')

✓ Libraries imported successfully


## Section 1: Load Preprocessed Data

Load the transformed data from the preprocessing notebook.

In [2]:
# Load preprocessed data
X_train = np.load('../data/processed/X_train_transformed.npy')
X_val = np.load('../data/processed/X_val_transformed.npy')
X_test = np.load('../data/processed/X_test_transformed.npy')
y_train = np.load('../data/processed/y_train.npy')
y_val = np.load('../data/processed/y_val.npy')
y_test = np.load('../data/processed/y_test.npy')

print(f'✓ Data loaded successfully')
print(f'  Training:   X={X_train.shape}, y={y_train.shape}')
print(f'  Validation: X={X_val.shape}, y={y_val.shape}')
print(f'  Test:       X={X_test.shape}, y={y_test.shape}')

✓ Data loaded successfully
  Training:   X=(170883, 30), y=(170883,)
  Validation: X=(28481, 30), y=(28481,)
  Test:       X=(85443, 30), y=(85443,)


## Section 2: Metrics Selection for Imbalanced Data

For fraud detection with severe class imbalance (~0.17% fraud), we use:

| Metric | Purpose | Why It Matters |
|--------|---------|----------------|
| **Recall (Sensitivity)** | Share of fraud cases detected | A missed fraud is the costliest error |
| **Precision** | Share of alerts that are actually fraud | False alarms carry an operational cost |
| **F1-Score** | Harmonic mean of Precision and Recall | High only when both Precision and Recall are high |
| **ROC-AUC** | Area under the ROC curve | Threshold-independent ranking quality |
| **PR-AUC** | Area under the Precision-Recall curve | More informative than ROC-AUC when positives are rare |
| **MCC** | Matthews Correlation Coefficient | Uses all four confusion-matrix cells, so it stays informative under imbalance |

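**⚠️ We do not rely on Accuracy:** a model that predicts every transaction as normal already scores about 99.8%. Accuracy is still reported in the tables for reference only.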
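
To make the accuracy problem concrete, here is a minimal sketch on synthetic labels (not the project data) with roughly the same 0.17% fraud rate. The array names and the sample size are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, recall_score

# Synthetic labels with about 0.17% fraud (assumed values, not the real dataset)
n_samples = 100_000
n_fraud = 170
y_true_demo = np.zeros(n_samples, dtype=int)
y_true_demo[:n_fraud] = 1

# A trivial "model" that labels every transaction as normal
y_all_normal = np.zeros(n_samples, dtype=int)

acc = accuracy_score(y_true_demo, y_all_normal)
rec = recall_score(y_true_demo, y_all_normal)
mcc = matthews_corrcoef(y_true_demo, y_all_normal)

print(f'Accuracy: {acc:.4f}')  # very high despite catching no fraud
print(f'Recall:   {rec:.4f}')
print(f'MCC:      {mcc:.4f}')
```

Accuracy looks excellent here even though not a single fraud case is caught, while Recall and MCC both expose the failure.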

In [3]:
# Define evaluation metrics for cross-validation
scoring_metrics = {
    'accuracy': 'accuracy',  # reported for reference only; misleading under class imbalance
    'precision': 'precision',
    'recall': 'recall',  # Most important for fraud detection
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}

# Create all models
print('='*70)
print('INITIALIZING ML MODELS')
print('='*70)

models = ModelFactory.get_all_models(random_state=RANDOM_SEED)

print(f'\nModels to train:')
for i, (name, model) in enumerate(models.items(), 1):
    print(f'  {i}. {name}')
    print(f'     → {model.__class__.__name__}')

print(f'\nCross-validation folds: 5-fold stratified')
print(f'Evaluation metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC')

INITIALIZING ML MODELS

Models to train:
  1. Logistic Regression
     → LogisticRegression
  2. KNN (k=5)
     → KNeighborsClassifier
  3. Decision Tree
     → DecisionTreeClassifier
  4. Random Forest
     → RandomForestClassifier
  5. SVM (RBF)
     → SVC

Cross-validation folds: 5-fold stratified
Evaluation metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC


## Section 3: Cross-Validation on Training Data

Evaluate models using 5-fold stratified cross-validation to assess stability.
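
Passing an integer `cv` with a classifier makes scikit-learn use `StratifiedKFold` without shuffling. As a hedged sketch of the same idea with the splitter spelled out, the example below runs on synthetic data; `X_demo`, `y_demo`, the 2% positive rate, and the added `average_precision` scorer (a PR-AUC estimate) are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the fraud set (assumption)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# Explicit stratified splitter: every fold keeps roughly the same class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

demo_scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight='balanced'),
    X_demo, y_demo,
    cv=skf,
    scoring={'recall': 'recall', 'f1': 'f1', 'pr_auc': 'average_precision'},
)

# Report mean and spread across folds, as the notebook does for its models
for key in ('test_recall', 'test_f1', 'test_pr_auc'):
    print(f'{key}: {demo_scores[key].mean():.4f} (+/- {demo_scores[key].std():.4f})')
```

The standard deviation across folds is the stability signal: a large spread suggests the score depends heavily on which few fraud cases land in each fold.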

In [4]:
print('\n' + '='*70)
print('CROSS-VALIDATION RESULTS (5-Fold Stratified)')
print('='*70)

cv_results = {}

for model_name, model in models.items():
    print(f'\nTraining {model_name}...')
    
    # Perform cross-validation (an integer cv with a classifier uses StratifiedKFold, unshuffled)
    cv_scores = cross_validate(
        model, X_train, y_train,
        cv=5,
        scoring=scoring_metrics,
        n_jobs=-1
    )
    
    # Extract and store results
    cv_results[model_name] = {
        'Accuracy': cv_scores['test_accuracy'].mean(),
        'Precision': cv_scores['test_precision'].mean(),
        'Recall': cv_scores['test_recall'].mean(),
        'F1-Score': cv_scores['test_f1'].mean(),
        'ROC-AUC': cv_scores['test_roc_auc'].mean(),
    }
    
    print(f'  ✓ Recall: {cv_scores["test_recall"].mean():.4f} (+/- {cv_scores["test_recall"].std():.4f})')
    print(f'  ✓ F1-Score: {cv_scores["test_f1"].mean():.4f} (+/- {cv_scores["test_f1"].std():.4f})')
    print(f'  ✓ ROC-AUC: {cv_scores["test_roc_auc"].mean():.4f} (+/- {cv_scores["test_roc_auc"].std():.4f})')


CROSS-VALIDATION RESULTS (5-Fold Stratified)

Training Logistic Regression...
  ✓ Recall: 0.9085 (+/- 0.0437)
  ✓ F1-Score: 0.1228 (+/- 0.0051)
  ✓ ROC-AUC: 0.9754 (+/- 0.0216)

Training KNN (k=5)...
  ✓ Recall: 0.7797 (+/- 0.0503)
  ✓ F1-Score: 0.8341 (+/- 0.0368)
  ✓ ROC-AUC: 0.9185 (+/- 0.0292)

Training Decision Tree...
  ✓ Recall: 0.8169 (+/- 0.0561)
  ✓ F1-Score: 0.4026 (+/- 0.1552)
  ✓ ROC-AUC: 0.9068 (+/- 0.0284)

Training Random Forest...
  ✓ Recall: 0.7661 (+/- 0.0507)
  ✓ F1-Score: 0.8302 (+/- 0.0328)
  ✓ ROC-AUC: 0.9535 (+/- 0.0188)

Training SVM (RBF)...
  ✓ Recall: 0.7424 (+/- 0.0561)
  ✓ F1-Score: 0.4906 (+/- 0.0337)
  ✓ ROC-AUC: 0.9696 (+/- 0.0143)


In [None]:
# Create CV results table
cv_df = pd.DataFrame(cv_results).T
cv_df = cv_df.round(4)
cv_df = cv_df.sort_values('F1-Score', ascending=False)

print('\n' + '='*70)
print('CROSS-VALIDATION SUMMARY TABLE')
print('='*70)
print(cv_df.to_string())
print('='*70)

# Visualize CV results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, metric in enumerate(['Recall', 'F1-Score', 'ROC-AUC']):
    cv_df[metric].plot(kind='barh', ax=axes[idx], color='steelblue', edgecolor='black')
    axes[idx].set_title(f'Cross-Validation {metric}', fontweight='bold', fontsize=12)
    axes[idx].set_xlabel(metric)
    axes[idx].grid(alpha=0.3, axis='x')
    
    # Add value labels
    for i, v in enumerate(cv_df[metric].values):
        axes[idx].text(v, i, f'{v:.4f}', ha='left', va='center', fontweight='bold')

plt.tight_layout()
plt.savefig('../results/figures/09_cross_validation_results.png', dpi=300, bbox_inches='tight')
print('✓ Figure saved: 09_cross_validation_results.png')
plt.show()

## Section 4: Train Final Models and Evaluate on Test Set

Train each model on the full training set and evaluate it on the test set for final performance.
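
The evaluation loop calls `predict_proba` on every model. In scikit-learn, `SVC` only exposes that method when built with `probability=True`, and whether `ModelFactory` does so is not shown here. The sketch below uses a hypothetical helper, `positive_class_scores`, that falls back to `decision_function`; the demo data and model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC


def positive_class_scores(model, X):
    """Return a ranking score for the positive class (hypothetical helper)."""
    if hasattr(model, 'predict_proba'):
        return model.predict_proba(X)[:, 1]
    # Signed distances rank samples correctly for ROC-AUC / PR-AUC,
    # but they are not calibrated probabilities.
    return model.decision_function(X)


# Synthetic stand-in data, split into fit and evaluation parts
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_fit, X_eval = X_demo[:400], X_demo[400:]
y_fit, y_eval = y_demo[:400], y_demo[400:]

demo_models = {'logreg': LogisticRegression(max_iter=1000), 'svm': SVC(kernel='rbf')}
demo_aucs = {}
for name, clf in demo_models.items():
    clf.fit(X_fit, y_fit)
    demo_aucs[name] = roc_auc_score(y_eval, positive_class_scores(clf, X_eval))
    print(f'{name}: ROC-AUC = {demo_aucs[name]:.4f}')
```

Ranking metrics accept either kind of score, so the fallback is enough for ROC-AUC and PR-AUC; anything that needs real probabilities would still require `probability=True` or a calibration step.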

In [None]:
print('\n' + '='*70)
print('TRAINING FINAL MODELS ON FULL TRAINING SET')
print('='*70)

# Train models on full training set
trained_models = {}
test_results = {}

for model_name, model in models.items():
    print(f'\nTraining {model_name}...')
    
    # Train on full training set
    model.fit(X_train, y_train)
    trained_models[model_name] = model
    
    # Evaluate on test set (predict_proba must be available; for SVC this needs probability=True)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Compute metrics
    test_results[model_name] = ModelEvaluator.compute_metrics(
        y_test, y_pred, y_pred_proba
    )
    
    print(f'  ✓ Test Recall: {test_results[model_name]["Recall"]:.4f}')
    print(f'  ✓ Test F1-Score: {test_results[model_name]["F1-Score"]:.4f}')
    print(f'  ✓ Test ROC-AUC: {test_results[model_name]["ROC-AUC"]:.4f}')

In [None]:
# Create test results dataframe
test_df = pd.DataFrame(test_results).T
test_df = test_df.round(4)
test_df = test_df.sort_values('F1-Score', ascending=False)

print('\n' + '='*70)
print('TEST SET PERFORMANCE COMPARISON')
print('='*70)
print(test_df.to_string())
print('='*70)

# Save results
test_df.to_csv('../results/metrics/model_comparison_test_set.csv')
print('✓ Results saved: model_comparison_test_set.csv')

In [None]:
# Visualize test set results
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'PR-AUC']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes.flatten()[idx]
    
    if metric in test_df.columns:
        values = test_df[metric].values
        colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(values)))
        bars = ax.barh(test_df.index, values, color=colors, edgecolor='black', linewidth=1.5)
        
        ax.set_title(f'{metric}', fontweight='bold', fontsize=12)
        ax.set_xlabel(metric)
        ax.set_xlim([0, 1.0])
        ax.grid(alpha=0.3, axis='x')
        
        # Add value labels
        for i, v in enumerate(values):
            ax.text(v + 0.02, i, f'{v:.4f}', va='center', fontweight='bold', fontsize=9)
    else:
        ax.text(0.5, 0.5, f'{metric}\n(not computed)', 
               ha='center', va='center', fontsize=11, color='gray')
        ax.set_xlim([0, 1])
        ax.axis('off')

plt.tight_layout()
plt.savefig('../results/figures/10_test_set_performance.png', dpi=300, bbox_inches='tight')
print('✓ Figure saved: 10_test_set_performance.png')
plt.show()

## Section 5: Detailed Analysis of Best Model

In [None]:
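# Identify best model by test-set F1-Score.
# Caveat: selecting on the test set biases the reported test score upward;
# the loaded but unused validation split (X_val, y_val) is the cleaner basis for selection.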
best_model_name, best_f1 = ModelEvaluator.get_best_model(test_results, metric='F1-Score')
best_model = trained_models[best_model_name]

print('\n' + '='*70)
print('BEST MODEL ANALYSIS')
print('='*70)
print(f'\nBest Model (by F1-Score): {best_model_name}')
print(f'F1-Score: {test_results[best_model_name]["F1-Score"]:.4f}')
print(f'\nFull test set metrics:')
for metric, value in test_results[best_model_name].items():
    print(f'  {metric:15s}: {value:.4f}')

In [None]:
# Get predictions from best model
y_pred_best = best_model.predict(X_test)
y_pred_proba_best = best_model.predict_proba(X_test)[:, 1]

# Confusion matrix for best model
cm = ModelEvaluator.get_confusion_matrix(y_test, y_pred_best)
cm_dict = ModelEvaluator.get_confusion_matrix_dict(y_test, y_pred_best)

print('\n' + '='*70)
print('CONFUSION MATRIX INTERPRETATION')
print('='*70)
print(f'\n{best_model_name}:')
print(f'\n             Predicted')
print(f'           Normal  Fraud')
print(f'Actual Normal  {cm[0,0]:6d}  {cm[0,1]:6d}')
print(f'       Fraud   {cm[1,0]:6d}  {cm[1,1]:6d}')
print(f'\nMetrics:')
print(f'  True Positives (TP):  {cm_dict["TP"]:6d} - Fraud correctly detected')
print(f'  False Positives (FP): {cm_dict["FP"]:6d} - Normal flagged as fraud')
print(f'  False Negatives (FN): {cm_dict["FN"]:6d} - Fraud missed (worst case!)')
print(f'  True Negatives (TN):  {cm_dict["TN"]:6d} - Normal correctly identified')

# Calculate percentages
print(f'\nOperational Metrics:')
if cm_dict["TP"] + cm_dict["FN"] > 0:
    fraud_detected = 100 * cm_dict["TP"] / (cm_dict["TP"] + cm_dict["FN"])
    print(f'  Fraud Detection Rate (Recall): {fraud_detected:.2f}%')
if cm_dict["FP"] + cm_dict["TN"] > 0:
    false_alarm_rate = 100 * cm_dict["FP"] / (cm_dict["FP"] + cm_dict["TN"])
    print(f'  False Alarm Rate: {false_alarm_rate:.2f}%')
if cm_dict["TP"] + cm_dict["FP"] > 0:
    precision_pct = 100 * cm_dict["TP"] / (cm_dict["TP"] + cm_dict["FP"])
    print(f'  Precision: {precision_pct:.2f}% of fraud alerts are true fraud')

print('='*70)

In [None]:
# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap (seaborn is already imported as sns in the setup cell)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=['Normal', 'Fraud'],
           yticklabels=['Normal', 'Fraud'],
           cbar_kws={'label': 'Count'})
axes[0].set_title(f'Confusion Matrix - {best_model_name}', fontweight='bold', fontsize=12)
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Normalized confusion matrix (each row, i.e. each true class, sums to 100%)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='RdYlGn', ax=axes[1],
           xticklabels=['Normal', 'Fraud'],
           yticklabels=['Normal', 'Fraud'],
           cbar_kws={'label': 'Percentage'},
           vmin=0, vmax=1)
axes[1].set_title(f'Normalized Confusion Matrix - {best_model_name}', fontweight='bold', fontsize=12)
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.savefig('../results/figures/11_best_model_confusion_matrix.png', dpi=300, bbox_inches='tight')
print('✓ Figure saved: 11_best_model_confusion_matrix.png')
plt.show()

In [None]:
# Classification report
print('\n' + '='*70)
print('DETAILED CLASSIFICATION REPORT')
print('='*70)
print(f'\n{best_model_name}:')
print(ModelEvaluator.get_classification_report(y_test, y_pred_best))
print('='*70)

## Section 6: ROC and Precision-Recall Curves
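
The curves in this section come from the project's `ModelEvaluator` helpers, whose internals are not shown here. As a hedged sketch of what such helpers presumably compute, the same quantities can be obtained directly from scikit-learn. The labels and scores below are a tiny hand-made example, and `average_precision_score` is used as the PR-AUC estimate by assumption.

```python
import numpy as np
from sklearn.metrics import (
    auc,
    average_precision_score,
    precision_recall_curve,
    roc_curve,
)

# Hand-made example: 1 = fraud, scores are predicted fraud probabilities
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

# ROC curve: true positive rate against false positive rate over all thresholds
fpr, tpr, _ = roc_curve(y_demo, scores_demo)
roc_auc = auc(fpr, tpr)

# Precision-Recall curve and its summary as average precision
precision, recall, _ = precision_recall_curve(y_demo, scores_demo)
pr_auc = average_precision_score(y_demo, scores_demo)

print(f'ROC-AUC: {roc_auc:.4f}')
print(f'PR-AUC (average precision): {pr_auc:.4f}')
```

One negative scored above one positive costs ROC-AUC a single misordered pair out of 15, but it lowers precision at the third fraud case to 3/4, which is why the PR summary reacts more strongly when positives are rare.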

In [None]:
# ROC and Precision-Recall curves for all models
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
for model_name, model in trained_models.items():
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, roc_auc = ModelEvaluator.get_roc_curve(y_test, y_pred_proba)
    
    label = f'{model_name} (AUC = {roc_auc:.4f})'
    if model_name == best_model_name:
        axes[0].plot(fpr, tpr, linewidth=2.5, label=label, alpha=0.8)
    else:
        axes[0].plot(fpr, tpr, linewidth=1.5, label=label, alpha=0.6)

axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate', fontsize=11)
axes[0].set_ylabel('True Positive Rate', fontsize=11)
axes[0].set_title('ROC Curves - All Models', fontweight='bold', fontsize=12)
axes[0].legend(loc='lower right', fontsize=9)
axes[0].grid(alpha=0.3)

# Precision-Recall Curve
for model_name, model in trained_models.items():
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    precision, recall, pr_auc = ModelEvaluator.get_precision_recall_curve(y_test, y_pred_proba)
    
    label = f'{model_name} (AUC = {pr_auc:.4f})'
    if model_name == best_model_name:
        axes[1].plot(recall, precision, linewidth=2.5, label=label, alpha=0.8)
    else:
        axes[1].plot(recall, precision, linewidth=1.5, label=label, alpha=0.6)

axes[1].set_xlabel('Recall', fontsize=11)
axes[1].set_ylabel('Precision', fontsize=11)
axes[1].set_title('Precision-Recall Curves - All Models', fontweight='bold', fontsize=12)
axes[1].legend(loc='upper right', fontsize=9)
axes[1].grid(alpha=0.3)
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.savefig('../results/figures/12_roc_pr_curves.png', dpi=300, bbox_inches='tight')
print('✓ Figure saved: 12_roc_pr_curves.png')
plt.show()

## Section 7: Model Comparison Summary

In [None]:
print('\n' + '='*90)
print('MODEL TRAINING & EVALUATION SUMMARY')
print('='*90)

print('\n1. CROSS-VALIDATION RESULTS (5-Fold):')
print('-' * 90)
print(cv_df.to_string())

print('\n\n2. TEST SET RESULTS:')
print('-' * 90)
print(test_df.to_string())

print('\n\n3. BEST MODEL PERFORMANCE:')
print('-' * 90)
print(f'Model: {best_model_name}')
print(f'\nTest Set Metrics:')
for metric, value in sorted(test_results[best_model_name].items()):
    print(f'  {metric:15s}: {value:.4f}')

print(f'\nConfusion Matrix Analysis:')
print(f'  Fraud Detection Rate (Recall): {100 * cm_dict["TP"] / (cm_dict["TP"] + cm_dict["FN"]):.2f}%')
print(f'  False Alarm Rate: {100 * cm_dict["FP"] / (cm_dict["FP"] + cm_dict["TN"]):.2f}%')
print(f'  Fraud Caught: {cm_dict["TP"]} out of {cm_dict["TP"] + cm_dict["FN"]} fraudulent transactions')

print('\n\n4. KEY INSIGHTS:')
print('-' * 90)
print('✓ Cross-validation standard deviations show how stable each model is across folds')
print('✓ Recall metric is most important for fraud detection (minimize missed fraud)')
print('✓ Test set performance estimates generalization to unseen data')
print('✓ ROC-AUC and PR-AUC measure threshold-independent performance')
print('✓ Choose model based on business requirements (recall vs. false alarms)')

print('\n' + '='*90)

## Section 8: Save Best Model
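
Before relying on a saved file, it can be worth checking that the model survives a save and reload. The sketch below is a minimal round trip with `joblib` on a small stand-in model in a temporary directory; `demo_model` and the file name are assumptions, and the notebook itself would persist `best_model` instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded estimator should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_model.predict(X_demo), reloaded.predict(X_demo))
print(f'Predictions identical after reload: {same_predictions}')
```

Pickled estimators are tied to the library version that wrote them, so recording the scikit-learn version next to the file is a sensible habit.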

In [None]:
# Save best model and results (file name sanitized the same way as in the loop below)
best_model_file = best_model_name.lower().replace(" ", "_").replace("(", "").replace(")", "") + '_model.pkl'
joblib.dump(best_model, f'../results/{best_model_file}')
print(f'✓ Best model saved: {best_model_file}')

# Save all models
for model_name, model in trained_models.items():
    joblib.dump(model, f'../results/{model_name.lower().replace(" ", "_").replace("(", "").replace(")", "")}_model.pkl')

# Save results as JSON
import json
results_summary = {
    'best_model': best_model_name,
    'best_f1_score': float(test_results[best_model_name]['F1-Score']),
    'all_test_results': {k: {m: float(v) for m, v in test_results[k].items()} 
                        for k in test_results}
}

with open('../results/metrics/model_results_summary.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

print('✓ Results saved: model_results_summary.json')