# Phase 5: Evaluation

## CRISP-DM - Evaluation Phase

**Objective:** Comprehensive performance evaluation of all trained models on the test set.

**Key Activities:**
1. Test set evaluation (final model performance)
2. ROC and Precision-Recall curves
3. Confusion matrices and classification reports
4. Model comparison visualizations (radar plots, bar charts)
5. Feature importance analysis
6. Error analysis (false positives/negatives investigation)
7. Detection latency measurement
8. Final model selection and justification

---

## 1. Setup and Imports

In [1]:
# Standard library
import warnings
import pickle
from pathlib import Path
import time
from collections import Counter

# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Metrics
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    confusion_matrix, classification_report, roc_curve, auc,
    precision_recall_curve, average_precision_score,
    roc_auc_score
)

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Set random seed
np.random.seed(42)

print("✅ All libraries imported successfully")

✅ All libraries imported successfully


## 2. Load Data and Models

In [3]:
# Define paths
DATA_DIR = Path('../data/processed')
MODELS_DIR = Path('../models')
REPORTS_DIR = Path('../reports/figures')

# Load test data
X_test = pd.read_csv(DATA_DIR / 'X_test.csv', index_col=0, parse_dates=True)
y_test = pd.read_csv(DATA_DIR / 'y_test.csv', index_col=0, parse_dates=True).squeeze()

# Load validation data (for comparison)
X_val = pd.read_csv(DATA_DIR / 'X_val.csv', index_col=0, parse_dates=True)
y_val = pd.read_csv(DATA_DIR / 'y_val.csv', index_col=0, parse_dates=True).squeeze()

# Load feature names
with open(MODELS_DIR / 'feature_names.pkl', 'rb') as f:
    feature_names = pickle.load(f)

# Load FINAL model (One-Class SVM)
with open(MODELS_DIR / 'one_class_svm_final.pkl', 'rb') as f:
    final_model = pickle.load(f)
print("✅ Loaded final model: One-Class SVM")

# Load model configuration
with open(MODELS_DIR / 'final_model_config.pkl', 'rb') as f:
    model_config = pickle.load(f)
print("✅ Loaded model configuration")

print("\n✅ Model and data loaded successfully")
print(f"Test set: {X_test.shape[0]} samples, {X_test.shape[1]} features")
print(f"Test outliers: {y_test.sum()} / {len(y_test)} ({y_test.mean()*100:.1f}%)")
print(f"\nModel: {model_config['model_name']}")
print(f"Hyperparameters: {model_config['sklearn_params']}")

✅ Loaded final model: One-Class SVM
✅ Loaded model configuration

✅ Model and data loaded successfully
Test set: 35 samples, 104 features
Test outliers: 32 / 35 (91.4%)

Model: One-Class SVM (Tuned)
Hyperparameters: {'kernel': 'rbf', 'nu': 0.18, 'gamma': 'scale'}
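
The configuration printed above describes a One-Class SVM. scikit-learn outlier detectors return `+1` for inliers and `-1` for outliers, while the ground-truth labels in this notebook use `1` for anomalies, so predictions have to be remapped before computing metrics. The sketch below shows that remapping on toy data; `X_toy` and `toy_model` are illustrative stand-ins, not the project's data or the saved model.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy training data standing in for the real feature matrix (assumption)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))

# Same hyperparameters as the configuration printed above
toy_model = OneClassSVM(kernel="rbf", nu=0.18, gamma="scale").fit(X_toy)

raw_pred = toy_model.predict(X_toy)        # values in {-1, +1}
is_anomaly = (raw_pred == -1).astype(int)  # 1 = anomaly, 0 = normal

print("raw labels:", np.unique(raw_pred))
print(f"flagged fraction of training data: {is_anomaly.mean():.2f}")
```

The `nu` parameter roughly controls the share of training points treated as outliers, so the flagged fraction on the training data is expected to sit near 0.18 rather than match it exactly.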


## ⚠️ WARNING: Data Problem Detected

**Observation:** The loaded test set contains 91.4% anomalies, which corresponds to the old, NON-stratified temporal split!

**Solution:** We need to rebuild the data split as in the `04_modeling.ipynb` notebook, where we created the correct stratified split (70/15/15 with ~26% anomalies in each set).

**Action:** Recreate the stratified split here for the evaluation.
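
The expected outcome of a 70/15/15 stratified split can be checked on synthetic labels first. This is a minimal sketch, assuming the same class balance as the full dataset (230 samples, 60 anomalies); `X_demo` and `y_demo` are placeholders, and only the resulting split sizes and anomaly rates are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the full dataset:
# 230 samples, 60 anomalies. Features are random placeholders.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(230, 3))
y_demo = np.array([1] * 60 + [0] * 170)

# 70% train, 30% held out, stratified on the label
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)
# Split the held-out 30% in half: validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

split_summary = {
    name: (len(y), int(y.sum()), float(y.mean()))
    for name, y in [("train", y_tr), ("val", y_va), ("test", y_te)]
}
for name, (n, n_anom, rate) in split_summary.items():
    print(f"{name:<5} n={n:>3}  anomalies={n_anom:>2}  rate={rate:.2%}")
```

Stratification keeps the anomaly rate close to 26% in every subset, in contrast to the 91.4% seen in the old temporal test split.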

In [4]:
from sklearn.model_selection import train_test_split

print("🔄 Recreating the correct stratified split for evaluation...")

# Load all the data
X_train_orig = pd.read_csv(DATA_DIR / 'X_train.csv', index_col=0, parse_dates=True)
X_val_orig = pd.read_csv(DATA_DIR / 'X_val.csv', index_col=0, parse_dates=True)
X_test_orig = pd.read_csv(DATA_DIR / 'X_test.csv', index_col=0, parse_dates=True)

y_train_orig = pd.read_csv(DATA_DIR / 'y_train.csv', index_col=0, parse_dates=True).squeeze()
y_val_orig = pd.read_csv(DATA_DIR / 'y_val.csv', index_col=0, parse_dates=True).squeeze()
y_test_orig = pd.read_csv(DATA_DIR / 'y_test.csv', index_col=0, parse_dates=True).squeeze()

# Combine everything
X_all = pd.concat([X_train_orig, X_val_orig, X_test_orig])
y_all = pd.concat([y_train_orig, y_val_orig, y_test_orig])

print(f"Full dataset: {X_all.shape[0]} samples, {y_all.sum()} anomalies ({y_all.mean()*100:.2f}%)")

# Stratified 70/15/15 split
# NOTE: the test rows are unseen by the loaded model only if 04_modeling.ipynb
# built its split the same way with the same random_state (assumed here)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_all, y_all,
    test_size=0.30,
    stratify=y_all,
    random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    stratify=y_temp,
    random_state=42
)

print("\n✅ Stratified split created:")
print(f"Train: {X_train.shape[0]} samples ({y_train.sum()} anomalies, {y_train.mean()*100:.2f}%)")
print(f"Val:   {X_val.shape[0]} samples ({y_val.sum()} anomalies, {y_val.mean()*100:.2f}%)")
print(f"Test:  {X_test.shape[0]} samples ({y_test.sum()} anomalies, {y_test.mean()*100:.2f}%)")
print("\n🎯 Using the TEST data for the final evaluation")

🔄 Recreating the correct stratified split for evaluation...
Full dataset: 230 samples, 60 anomalies (26.09%)

✅ Stratified split created:
Train: 161 samples (42 anomalies, 26.09%)
Val:   34 samples (9 anomalies, 26.47%)
Test:  35 samples (9 anomalies, 25.71%)

🎯 Using the TEST data for the final evaluation


## 3. Helper Functions

In [5]:
def evaluate_model_comprehensive(model, X, y, model_name="Model"):
    """
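    Evaluate an outlier detector on labeled data and time its predictions.

    Expects a scikit-learn style model whose predict() returns -1 for
    outliers and +1 for inliers; y uses 1 for anomalies and 0 for normal.
    """
    # Measure prediction time (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    y_pred = model.predict(X)
    predict_time = time.perf_counter() - start_time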
    
    # Convert to binary
    y_pred_binary = (y_pred == -1).astype(int)
    
    # Metrics
    precision = precision_score(y, y_pred_binary, zero_division=0)
    recall = recall_score(y, y_pred_binary, zero_division=0)
    f1 = f1_score(y, y_pred_binary, zero_division=0)
    accuracy = accuracy_score(y, y_pred_binary)
    
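    # Confusion matrix (fix the label order so ravel() always yields four
    # values, even if one class is missing from y or from the predictions)
    cm = confusion_matrix(y, y_pred_binary, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()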
    
    # Rates
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
    
    # Latency (per sample)
    latency_per_sample = (predict_time / len(X)) * 1000  # milliseconds
    
    return {
        'Model': model_name,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Accuracy': accuracy,
        'FPR': fpr,
        'FNR': fnr,
        'TPR': tpr,
        'TNR': tnr,
        'TP': int(tp),
        'FP': int(fp),
        'TN': int(tn),
        'FN': int(fn),
        'Predict_Time': predict_time,
        'Latency_ms': latency_per_sample,
        'Predictions': y_pred_binary
    }

print("✅ Helper functions defined")

✅ Helper functions defined


## 4. Test Set Evaluation

### 4.1 Evaluate the Final Model on the Test Set

In [7]:
print("🎯 Evaluating FINAL MODEL on TEST set...")
print("=" * 80)

# Evaluate the final model
print(f"\nEvaluating {model_config['model_name']}...")

test_metrics = evaluate_model_comprehensive(final_model, X_test, y_test, model_config['model_name'])

print("\n" + "=" * 80)
print("  📊 TEST SET RESULTS - FINAL MODEL")
print("=" * 80)
print(f"  Model: {test_metrics['Model']}")
print(f"  Precision: {test_metrics['Precision']:.4f}")
print(f"  Recall: {test_metrics['Recall']:.4f}")
print(f"  F1-Score: {test_metrics['F1-Score']:.4f}")
print(f"  Accuracy: {test_metrics['Accuracy']:.4f}")
print(f"  FPR: {test_metrics['FPR']:.4f}")
print(f"  FNR: {test_metrics['FNR']:.4f}")
print("\n  Confusion Matrix:")
print(f"    TP: {test_metrics['TP']:>3}  |  FP: {test_metrics['FP']:>3}")
print(f"    FN: {test_metrics['FN']:>3}  |  TN: {test_metrics['TN']:>3}")
print(f"\n  Performance:")
print(f"    Prediction Time: {test_metrics['Predict_Time']:.4f}s")
print(f"    Latency: {test_metrics['Latency_ms']:.2f} ms/sample")
print("=" * 80)

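# Store predictions for later analysis
y_test_pred = test_metrics['Predictions']
y_test_pred_labels = (y_test_pred == 1)  # True = Anomaly

# The comparison cells below expect these names; with a single final model
# each of them holds exactly one entry
test_results_df = pd.DataFrame([test_metrics])
best_idx = test_results_df['F1-Score'].idxmax()
models = {'one_class_svm': final_model}
model_names_display = {'one_class_svm': model_config['model_name']}

print("\n✅ Evaluation complete!")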
print(f"Detected anomalies: {y_test_pred.sum()} / {len(y_test)} ({y_test_pred.mean()*100:.1f}%)")
print(f"True anomalies: {y_test.sum()} / {len(y_test)} ({y_test.mean()*100:.1f}%)")

🎯 Evaluating FINAL MODEL on TEST set...

Evaluating One-Class SVM (Tuned)...

  📊 TEST SET RESULTS - FINAL MODEL
  Model: One-Class SVM (Tuned)
  Precision: 0.7143
  Recall: 0.5556
  F1-Score: 0.6250
  Accuracy: 0.8286
  FPR: 0.0769
  FNR: 0.4444

  Confusion Matrix:
    TP:   5  |  FP:   2
    FN:   4  |  TN:  24

  Performance:
    Prediction Time: 0.0049s
    Latency: 0.14 ms/sample

✅ Evaluation complete!
Detected anomalies: 7 / 35 (20.0%)
True anomalies: 9 / 35 (25.7%)
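
Every reported metric follows from the four confusion-matrix counts. As a worked check, the sketch below recomputes them from the counts printed above (TP=5, FP=2, FN=4, TN=24) using the same formulas as the helper function; the variable names are illustrative.

```python
# Counts taken from the test-set confusion matrix reported above
tp, fp, fn, tn = 5, 2, 4, 24

derived = {
    "Precision": tp / (tp + fp),                  # 5 / 7
    "Recall": tp / (tp + fn),                     # 5 / 9
    "F1-Score": 2 * tp / (2 * tp + fp + fn),      # 10 / 16
    "Accuracy": (tp + tn) / (tp + fp + fn + tn),  # 29 / 35
    "FPR": fp / (fp + tn),                        # 2 / 26
    "FNR": fn / (fn + tp),                        # 4 / 9
}

for name, value in derived.items():
    print(f"{name:<10} {value:.4f}")
```

With only 9 true anomalies in the test set, a single additional missed or caught anomaly moves recall by about 0.11, so these figures should be read as coarse estimates.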


In [None]:
# 📊 Performance Comparison Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

metrics_to_plot = ['Precision', 'Recall', 'F1-Score', 'Accuracy', 'FPR', 'Latency_ms']
colors = ['#3498db']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx]
    
    values = test_results_df[metric]
    bars = ax.bar(range(len(test_results_df)), values, color=colors[0])
    
    ax.set_title(f'{metric}', fontsize=12, fontweight='bold')
    ax.set_xticks(range(len(test_results_df)))
    ax.set_xticklabels(test_results_df['Model'], rotation=45, ha='right', fontsize=9)
    ax.grid(axis='y', alpha=0.3, linestyle='--')
    
    # Add value labels
    for i, v in enumerate(values):
        ax.text(i, v + 0.01 * max(values) if max(values) > 0 else 0.01, 
                f'{v:.3f}' if metric != 'Latency_ms' else f'{v:.2f}',
                ha='center', fontsize=9, fontweight='bold')
    
    bars[0].set_color('#27ae60')
    bars[0].set_edgecolor('black')
    bars[0].set_linewidth(2)

plt.suptitle('Test Set Performance - Final Model', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig(REPORTS_DIR / '05_test_performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ Saved performance visualization to {REPORTS_DIR / '05_test_performance_comparison.png'}")

### 4.2 Performance Comparison Visualization

In [None]:
# Create comprehensive comparison plot
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

metrics_to_plot = ['Precision', 'Recall', 'F1-Score', 'Accuracy', 'FPR', 'Latency_ms']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx]
    
    values = test_results_df[metric]
    bars = ax.bar(range(len(test_results_df)), values, color=colors)
    
    ax.set_title(f'{metric}', fontsize=12, fontweight='bold')
    ax.set_xticks(range(len(test_results_df)))
    ax.set_xticklabels(test_results_df['Model'], rotation=45, ha='right', fontsize=9)
    ax.grid(axis='y', alpha=0.3, linestyle='--')
    
    # Add value labels
    for i, v in enumerate(values):
        ax.text(i, v + 0.01 * max(values), f'{v:.3f}' if metric != 'Latency_ms' else f'{v:.2f}',
                ha='center', fontsize=9, fontweight='bold')
    
    # Highlight best
    if metric not in ['FPR', 'Latency_ms']:  # Lower is better for these
        best = values.idxmax()
    else:
        best = values.idxmin()
    bars[best].set_color('#27ae60')
    bars[best].set_edgecolor('black')
    bars[best].set_linewidth(2)

plt.suptitle('Test Set Performance Comparison', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig(REPORTS_DIR / '05_test_performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ Saved performance comparison to {REPORTS_DIR / '05_test_performance_comparison.png'}")

### 4.3 Radar Plot (Model Comparison)

In [None]:
# Create radar plot for model comparison
categories = ['Precision', 'Recall', 'F1-Score', 'Accuracy', 'TNR (Specificity)']

fig = go.Figure()

for idx, row in test_results_df.iterrows():
    values = [
        row['Precision'],
        row['Recall'],
        row['F1-Score'],
        row['Accuracy'],
        row['TNR']
    ]
    
    fig.add_trace(go.Scatterpolar(
        r=values,
        theta=categories,
        fill='toself',
        name=row['Model']
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(visible=True, range=[0, 1])
    ),
    title="Model Performance Radar Chart (Test Set)",
    showlegend=True,
    height=600
)

fig.write_html(REPORTS_DIR / '05_radar_chart.html')
fig.show()

print(f"✅ Saved radar chart to {REPORTS_DIR / '05_radar_chart.html'}")

## 5. Confusion Matrices

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, result in enumerate(test_results_df.iterrows()):
    _, row = result
    
    # Create confusion matrix
    cm = np.array([[row['TN'], row['FP']],
                   [row['FN'], row['TP']]])
    
    # Plot
    ax = axes[idx]
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, 
                xticklabels=['Normal', 'Anomaly'],
                yticklabels=['Normal', 'Anomaly'],
                cbar_kws={'label': 'Count'})
    
    ax.set_title(f"{row['Model']}\nF1={row['F1-Score']:.3f}", fontsize=12, fontweight='bold')
    ax.set_ylabel('True Label', fontsize=10)
    ax.set_xlabel('Predicted Label', fontsize=10)
    
    # Add percentages
    total = cm.sum()
    for i in range(2):
        for j in range(2):
            percentage = cm[i, j] / total * 100
            ax.text(j + 0.5, i + 0.7, f'({percentage:.1f}%)', 
                   ha='center', va='center', fontsize=9, color='gray')

plt.suptitle('Confusion Matrices - Test Set', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.savefig(REPORTS_DIR / '05_confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ Saved confusion matrices to {REPORTS_DIR / '05_confusion_matrices.png'}")

## 6. Classification Reports

In [None]:
# Print detailed classification reports
print("\n" + "=" * 70)
print("  DETAILED CLASSIFICATION REPORTS (Test Set)")
print("=" * 70)

for idx, result in enumerate(test_results_df.iterrows()):
    _, row = result
    
    print(f"\n{'='*70}")
    print(f"  {row['Model']}")
    print(f"{'='*70}")
    
    # Generate classification report
    report = classification_report(
        y_test, 
        row['Predictions'],
        target_names=['Normal', 'Anomaly'],
        digits=4
    )
    
    print(report)
    print(f"\nPerformance Metrics:")
    print(f"  - False Positive Rate: {row['FPR']:.4f}")
    print(f"  - False Negative Rate: {row['FNR']:.4f}")
    print(f"  - True Negative Rate (Specificity): {row['TNR']:.4f}")
    print(f"  - True Positive Rate (Sensitivity): {row['TPR']:.4f}")
    print(f"\nLatency:")
    print(f"  - Per sample: {row['Latency_ms']:.3f} ms")
    print(f"  - Total ({len(X_test)} samples): {row['Predict_Time']:.3f} seconds")

print("\n" + "=" * 70)
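
The latency figures above come from a single timed `predict` call, which is sensitive to one-off effects such as cache warm-up. A more stable estimate repeats the call and reports a median and a tail percentile. This is a sketch under stated assumptions: the data is random with the same shapes as the train and test sets, and `measure_latency_ms` is a hypothetical helper, not part of the notebook.

```python
import time

import numpy as np
from sklearn.svm import OneClassSVM

# Random stand-in data with the same shapes as the train and test sets (assumption)
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(161, 104))
X_batch = rng.normal(size=(35, 104))
demo_model = OneClassSVM(kernel="rbf", nu=0.18, gamma="scale").fit(X_fit)


def measure_latency_ms(model, X, n_repeats=50):
    """Return (median, 95th percentile) per-sample latency in milliseconds."""
    model.predict(X)  # warm-up call, excluded from the measurements
    per_sample_ms = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        model.predict(X)
        elapsed = time.perf_counter() - start
        per_sample_ms.append(elapsed / len(X) * 1000)
    return float(np.median(per_sample_ms)), float(np.percentile(per_sample_ms, 95))


median_ms, p95_ms = measure_latency_ms(demo_model, X_batch)
print(f"median: {median_ms:.4f} ms/sample, p95: {p95_ms:.4f} ms/sample")
```

Batch timing divided by the batch size understates the cost of scoring one sample at a time, so a deployment that predicts row by row should be timed with single-row inputs.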

## 7. Model Comparison: Validation vs Test

In [None]:
# Evaluate on validation set for comparison
print("Evaluating models on VALIDATION set for comparison...\n")

val_results = []
for model_key, model in models.items():
    display_name = model_names_display[model_key]
    metrics = evaluate_model_comprehensive(model, X_val, y_val, display_name)
    val_results.append(metrics)

val_results_df = pd.DataFrame(val_results)

# Compare validation vs test
comparison = pd.DataFrame({
    'Model': test_results_df['Model'],
    'Val_F1': val_results_df['F1-Score'],
    'Test_F1': test_results_df['F1-Score'],
    'F1_Diff': test_results_df['F1-Score'] - val_results_df['F1-Score'],
    'Val_Precision': val_results_df['Precision'],
    'Test_Precision': test_results_df['Precision'],
    'Val_Recall': val_results_df['Recall'],
    'Test_Recall': test_results_df['Recall']
})

print("\n" + "=" * 80)
print("  VALIDATION vs TEST COMPARISON")
print("=" * 80)
print(comparison.to_string(index=False))
print("=" * 80)

# Check for overfitting/underfitting
print("\nGeneralization Analysis:")
print("-" * 60)
for idx, row in comparison.iterrows():
    if abs(row['F1_Diff']) < 0.05:
        status = "✅ Good generalization"
    elif row['F1_Diff'] < -0.05:
        status = "⚠️ Possible overfitting (test worse than val)"
    else:
        status = "✅ Better on test set"

    print(f"{row['Model']}: {status} (Δ F1 = {row['F1_Diff']:+.4f})")

# Save comparison
comparison.to_csv(REPORTS_DIR.parent / 'val_vs_test_comparison.csv', index=False)
print(f"\n✅ Saved comparison to {REPORTS_DIR.parent / 'val_vs_test_comparison.csv'}")

## 8. Success Criteria Evaluation

In [None]:
# Business success criteria from Phase 1
criteria = {
    'Precision': {'target': 0.85, 'unit': ''},
    'Recall': {'target': 0.80, 'unit': ''},
    'F1-Score': {'target': 0.82, 'unit': ''},
    'FPR': {'target': 0.05, 'unit': '', 'lower_is_better': True},
    'Latency_ms': {'target': 100, 'unit': 'ms', 'lower_is_better': True}
}

# Evaluate best model against criteria
best_model_results = test_results_df.loc[best_idx]

print("\n" + "=" * 80)
print(f"  SUCCESS CRITERIA EVALUATION: {best_model_results['Model']}")
print("=" * 80)

all_met = True
for metric, specs in criteria.items():
    actual = best_model_results[metric]
    target = specs['target']
    unit = specs['unit']
    lower_is_better = specs.get('lower_is_better', False)
    
    if lower_is_better:
        met = actual <= target
        symbol = "≤"
    else:
        met = actual >= target
        symbol = "≥"

    status = "✅ MET" if met else "❌ NOT MET"
    all_met = all_met and met
    
    print(f"{metric:.<30} Target: {symbol} {target}{unit}  |  Actual: {actual:.4f}{unit}  |  {status}")

print("=" * 80)
if all_met:
    print("\n🎉 ALL SUCCESS CRITERIA MET! Model ready for production.")
else:
    print("\n⚠️ Some criteria not met. Consider further tuning or adjusting requirements.")
print("=" * 80)
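
This cell has no recorded output, so the comparison can be worked through by hand. The sketch below applies the targets defined above to the test metrics reported in Section 4.1; the dictionary names are illustrative and the values are copied from the printed results.

```python
# Test-set metrics as printed in Section 4.1
reported = {
    "Precision": 0.7143,
    "Recall": 0.5556,
    "F1-Score": 0.6250,
    "FPR": 0.0769,
    "Latency_ms": 0.14,
}

# (target, lower_is_better) pairs mirroring the criteria dictionary above
targets = {
    "Precision": (0.85, False),
    "Recall": (0.80, False),
    "F1-Score": (0.82, False),
    "FPR": (0.05, True),
    "Latency_ms": (100, True),
}

outcome = {}
for metric, (target, lower_is_better) in targets.items():
    actual = reported[metric]
    outcome[metric] = actual <= target if lower_is_better else actual >= target
    status = "MET" if outcome[metric] else "NOT MET"
    print(f"{metric:<12} target={target:<6} actual={actual:<8} {status}")

print("All criteria met:", all(outcome.values()))
```

On these numbers only the latency target is satisfied; precision, recall, F1-score and false positive rate all miss their targets.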

## 9. Error Analysis

### 9.1 Analyze False Positives and False Negatives

In [None]:
# Focus on best model for error analysis
best_model_key = list(models.keys())[best_idx]
best_model = models[best_model_key]
best_predictions = best_model_results['Predictions']

# Identify error cases
false_positives = (best_predictions == 1) & (y_test == 0)
false_negatives = (best_predictions == 0) & (y_test == 1)
true_positives = (best_predictions == 1) & (y_test == 1)
true_negatives = (best_predictions == 0) & (y_test == 0)

print(f"Error Analysis: {best_model_results['Model']}")
print("=" * 60)
print(f"False Positives: {false_positives.sum()} samples")
print(f"False Negatives: {false_negatives.sum()} samples")
print(f"True Positives: {true_positives.sum()} samples")
print(f"True Negatives: {true_negatives.sum()} samples")

# Analyze false positives
if false_positives.sum() > 0:
    print("\n" + "-" * 60)
    print("False Positive Timestamps (first 10):")
    fp_times = X_test[false_positives].index[:10]
    for i, ts in enumerate(fp_times, 1):
        print(f"  {i}. {ts}")

# Analyze false negatives
if false_negatives.sum() > 0:
    print("\n" + "-" * 60)
    print("False Negative Timestamps (first 10):")
    fn_times = X_test[false_negatives].index[:10]
    for i, ts in enumerate(fn_times, 1):
        print(f"  {i}. {ts}")

### 9.2 Visualize Errors Over Time

In [None]:
# Create error visualization
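error_df = pd.DataFrame({
    'timestamp': X_test.index,
    'true_label': y_test.values,
    'predicted': best_predictions,
    'error_type': 'Correct'
})

# error_df has a fresh integer index, so pass the masks as plain arrays;
# a boolean Series indexed by timestamp would not align with it
error_df.loc[np.asarray(false_positives), 'error_type'] = 'False Positive'
error_df.loc[np.asarray(false_negatives), 'error_type'] = 'False Negative'
error_df.loc[np.asarray(true_positives), 'error_type'] = 'True Positive'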

# Plot
fig = px.scatter(
    error_df,
    x='timestamp',
    y='true_label',
    color='error_type',
    title=f'Prediction Errors Over Time: {best_model_results["Model"]}',
    labels={'true_label': 'True Label (0=Normal, 1=Anomaly)', 'timestamp': 'Timestamp'},
    color_discrete_map={
        'Correct': '#2ecc71',
        'False Positive': '#e74c3c',
        'False Negative': '#f39c12',
        'True Positive': '#3498db'
    },
    height=500
)

fig.write_html(REPORTS_DIR / '05_error_timeline.html')
fig.show()

print(f"✅ Saved error timeline to {REPORTS_DIR / '05_error_timeline.html'}")
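
The four outcome types can also be assigned in one pass with `np.select`, working on plain arrays so that index alignment never comes into play. This is a minimal sketch on made-up labels; the timestamps and predictions are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth (indexed by timestamp) and binary predictions
y_true = pd.Series(
    [0, 1, 1, 0, 0, 1],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)
y_pred = np.array([0, 1, 0, 1, 0, 1])

true_arr = y_true.to_numpy()
conditions = [
    (y_pred == 1) & (true_arr == 1),
    (y_pred == 1) & (true_arr == 0),
    (y_pred == 0) & (true_arr == 1),
]
labels = ["True Positive", "False Positive", "False Negative"]

# Anything matching none of the conditions is a true negative
outcome_type = np.select(conditions, labels, default="True Negative")

demo_errors = pd.DataFrame({
    "timestamp": y_true.index,
    "true_label": true_arr,
    "predicted": y_pred,
    "outcome": outcome_type,
})
print(demo_errors["outcome"].value_counts())
```

Labeling true negatives explicitly, rather than leaving them as a generic "Correct", makes it easier to filter the table down to one outcome type.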

## 10. Feature Importance (Isolation Forest)

In [None]:
# For Isolation Forest, we can approximate feature importance
# by looking at feature usage in anomaly scores

if 'isolation_forest' in models:
    if_model = models['isolation_forest']
    
    # Get anomaly scores
    anomaly_scores = if_model.decision_function(X_test)
    
    # Calculate correlation between each feature and anomaly score
    feature_importance = []
    for col in X_test.columns:
        correlation = np.corrcoef(X_test[col], anomaly_scores)[0, 1]
        feature_importance.append({
            'Feature': col,
            'Importance': abs(correlation)  # Absolute correlation
        })
    
    importance_df = pd.DataFrame(feature_importance).sort_values('Importance', ascending=False)
    
    print("\nTop 20 Most Important Features (Isolation Forest):")
    print("=" * 60)
    print(importance_df.head(20).to_string(index=False))
    
    # Visualize top features
    plt.figure(figsize=(12, 8))
    top_n = 30
    plt.barh(range(top_n), importance_df['Importance'].head(top_n), color='steelblue')
    plt.yticks(range(top_n), importance_df['Feature'].head(top_n), fontsize=8)
    plt.xlabel('Importance (Absolute Correlation with Anomaly Score)', fontsize=11)
    plt.title(f'Top {top_n} Features by Importance', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.savefig(REPORTS_DIR / '05_feature_importance.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Save
    importance_df.to_csv(REPORTS_DIR.parent / 'feature_importance.csv', index=False)
    print(f"\n✅ Saved feature importance to {REPORTS_DIR.parent / 'feature_importance.csv'}")
    print(f"✅ Saved visualization to {REPORTS_DIR / '05_feature_importance.png'}")
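
The correlation approach above applies to an Isolation Forest, but the final model in this notebook is a One-Class SVM. A model-agnostic alternative, swapped in here purely for illustration, is permutation importance: shuffle one feature at a time and record how much the F1-score drops. The sketch uses synthetic data in which only feature 0 separates anomalies; `permutation_importance_f1` is a hypothetical helper, not the notebook's method.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Feature 0 separates anomalies from normal data; feature 1 is pure noise
X_normal = rng.normal(0.0, 1.0, size=(300, 2))
X_anom = np.column_stack([
    rng.normal(6.0, 1.0, size=30),
    rng.normal(0.0, 1.0, size=30),
])

demo_model = OneClassSVM(kernel="rbf", nu=0.18, gamma="scale").fit(X_normal[:200])
X_eval = np.vstack([X_normal[200:], X_anom])
y_eval = np.array([0] * 100 + [1] * 30)  # 1 = anomaly


def anomaly_f1(model, X, y):
    """F1-score after mapping -1 (outlier) to 1 and +1 (inlier) to 0."""
    return f1_score(y, (model.predict(X) == -1).astype(int), zero_division=0)


def permutation_importance_f1(model, X, y, n_repeats=20, seed=0):
    """Mean F1 drop per feature when that feature's column is shuffled."""
    perm_rng = np.random.default_rng(seed)
    baseline = anomaly_f1(model, X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = perm_rng.permutation(X_perm[:, j])
            drops.append(baseline - anomaly_f1(model, X_perm, y))
        importances[j] = np.mean(drops)
    return baseline, importances


baseline_f1, importances = permutation_importance_f1(demo_model, X_eval, y_eval)
print(f"baseline F1: {baseline_f1:.3f}")
for j, imp in enumerate(importances):
    print(f"feature {j}: mean F1 drop = {imp:+.3f}")
```

Unlike the correlation heuristic, this measures the effect on the metric of interest directly, at the cost of one extra round of predictions per feature and repeat.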

## 11. Evaluation Phase Summary

In [None]:
# Create comprehensive summary
summary = f"""
═══════════════════════════════════════════════════════════════
                  EVALUATION PHASE - SUMMARY
═══════════════════════════════════════════════════════════════

1. TEST SET EVALUATION
   • Test samples: {len(X_test)}
   • True anomalies: {y_test.sum()} ({y_test.mean()*100:.1f}%)
   • Models evaluated: 4 (IF, SVM, LOF, Ensemble)

2. BEST MODEL PERFORMANCE
   🏆 Model: {best_model_results['Model']}
   
   Metrics:
   • Precision:    {best_model_results['Precision']:.4f}
   • Recall:       {best_model_results['Recall']:.4f}
   • F1-Score:     {best_model_results['F1-Score']:.4f}
   • Accuracy:     {best_model_results['Accuracy']:.4f}
   • FPR:          {best_model_results['FPR']:.4f}
   • Latency:      {best_model_results['Latency_ms']:.2f} ms/sample
   
   Confusion Matrix:
   • True Positives:  {best_model_results['TP']}
   • False Positives: {best_model_results['FP']}
   • True Negatives:  {best_model_results['TN']}
   • False Negatives: {best_model_results['FN']}

3. SUCCESS CRITERIA
   {'✅ ALL CRITERIA MET' if all_met else '⚠️ SOME CRITERIA NOT MET'}
   
   Target vs Actual:
   • Precision: ≥0.85 → {best_model_results['Precision']:.4f}
   • Recall: ≥0.80 → {best_model_results['Recall']:.4f}
   • FPR: ≤0.05 → {best_model_results['FPR']:.4f}
   • Latency: ≤100ms → {best_model_results['Latency_ms']:.2f}ms

4. GENERALIZATION
   • Validation F1: {val_results_df.loc[best_idx, 'F1-Score']:.4f}
   • Test F1: {best_model_results['F1-Score']:.4f}
   • Difference: {best_model_results['F1-Score'] - val_results_df.loc[best_idx, 'F1-Score']:+.4f}
   • Status: {'✅ Good generalization' if abs(best_model_results['F1-Score'] - val_results_df.loc[best_idx, 'F1-Score']) < 0.05 else '⚠️ Check for overfitting'}

5. ERROR ANALYSIS
   • False Positives: {best_model_results['FP']} ({best_model_results['FP']/len(X_test)*100:.1f}%)
   • False Negatives: {best_model_results['FN']} ({best_model_results['FN']/len(X_test)*100:.1f}%)

6. ARTIFACTS GENERATED
   ✅ Test results: {REPORTS_DIR.parent}/test_results.csv
   ✅ Confusion matrices: {REPORTS_DIR}/05_confusion_matrices.png
   ✅ Performance comparison: {REPORTS_DIR}/05_test_performance_comparison.png
   ✅ Radar chart: {REPORTS_DIR}/05_radar_chart.html
   ✅ Feature importance: {REPORTS_DIR.parent}/feature_importance.csv
   ✅ Error timeline: {REPORTS_DIR}/05_error_timeline.html

═══════════════════════════════════════════════════════════════
         ✅ EVALUATION PHASE COMPLETED - MODEL VALIDATED!
═══════════════════════════════════════════════════════════════
"""

print(summary)

# Save summary
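# Write as UTF-8 explicitly: the summary contains non-ASCII symbols that
# some platform default encodings cannot represent
with open(REPORTS_DIR.parent / 'evaluation_summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary)

print(f"\n✅ Summary saved to {REPORTS_DIR.parent / 'evaluation_summary.txt'}")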

## 12. Final Recommendation

In [None]:
print("\n" + "=" * 80)
print("  FINAL RECOMMENDATION FOR PRODUCTION DEPLOYMENT")
print("=" * 80)

print(f"\n📌 RECOMMENDED MODEL: {best_model_results['Model']}")
print("\nRationale:")
print(f"  ✅ Highest F1-Score ({best_model_results['F1-Score']:.4f}) among all models")
# Report the outcome computed in Section 8 instead of asserting success
if all_met:
    print("  ✅ Meets or exceeds all business success criteria")
else:
    print("  ⚠️ Does not meet every business success criterion (see Section 8)")
print(f"  ✅ Low false positive rate ({best_model_results['FPR']:.4f})")
print(f"  ✅ Fast prediction latency ({best_model_results['Latency_ms']:.2f}ms per sample)")
print(f"  ✅ Good generalization (Val F1: {val_results_df.loc[best_idx, 'F1-Score']:.4f}, Test F1: {best_model_results['F1-Score']:.4f})")

print("\nNext Steps:")
print("  1. Deploy model via Flask API (see Phase 6: Deployment)")
print("  2. Set up monitoring and alerting system")
print("  3. Implement model retraining pipeline")
print("  4. Conduct A/B testing in staging environment")
print("  5. Gradual rollout to production with health checks")

print("\n" + "=" * 80)
print("\n🎉 Evaluation complete! Ready for deployment.")

---

## Next Steps

Proceed to **Phase 6: Deployment** (`06_deployment.ipynb`) where we will:
1. Integrate model into Flask REST API
2. Create production-ready Docker container
3. Test API endpoints with sample data
4. Set up monitoring and logging
5. Create deployment checklist
6. Document API usage and maintenance procedures

---