# Model Evaluation Tutorial for AG News Classification

## Overview

This notebook demonstrates comprehensive model evaluation techniques following methodologies from:
- Sokolova & Lapalme (2009): "A systematic analysis of performance measures for classification tasks"
- Demšar (2006): "Statistical Comparisons of Classifiers over Multiple Data Sets"
- Raschka (2018): "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning"

### Tutorial Objectives
1. Load trained models and test data
2. Calculate comprehensive evaluation metrics
3. Perform error analysis and visualization
4. Conduct statistical significance testing
5. Compare multiple model performances
6. Generate evaluation reports

Author: Võ Hải Dũng  
Email: vohaidung.work@gmail.com  
Date: 2025

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import os
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
import warnings
import json

# Data and ML imports
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel

# Metrics and evaluation
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, classification_report,
    roc_auc_score, roc_curve, auc,
    matthews_corrcoef, cohen_kappa_score
)
from scipy import stats

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import plotly.graph_objects as go
import plotly.express as px

# Project imports
PROJECT_ROOT = Path("../..").resolve()
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.datasets.ag_news import AGNewsDataset, AGNewsConfig
from src.data.loaders.dataloader import create_dataloaders
from src.models.transformers.deberta.deberta_v3 import DeBERTaV3Classifier
from src.evaluation.metrics.classification_metrics import ClassificationMetrics
from src.evaluation.analysis.error_analysis import ErrorAnalyzer
from src.evaluation.analysis.confusion_analysis import ConfusionAnalyzer
from src.evaluation.statistical.significance_tests import SignificanceTests
from src.evaluation.visualization.performance_plots import PerformancePlotter
from src.utils.io_utils import safe_load, safe_save, ensure_dir
from src.utils.logging_config import setup_logging
from src.utils.reproducibility import set_seed
from configs.config_loader import ConfigLoader
from configs.constants import (
    AG_NEWS_CLASSES,
    AG_NEWS_NUM_CLASSES,
    ID_TO_LABEL,
    LABEL_TO_ID,
    MODEL_DIR,
    OUTPUT_DIR,
    DATA_DIR
)

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
set_seed(42)
logger = setup_logging('evaluation_tutorial')

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Model Evaluation Tutorial")
print("="*50)
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Load Configuration

In [None]:
# Load evaluation configuration
config_loader = ConfigLoader()

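# Load the model config. Note: this file describes the xlarge variant and is not
# used in the cells below; the tutorial evaluates eval_config['model_name'] instead.
model_config = config_loader.load_config('models/single/deberta_v3_xlarge.yaml')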
eval_config = {
    'batch_size': 16,
    'max_samples': 1000,  # Limit for tutorial
    'model_name': 'microsoft/deberta-v3-base',
    'max_length': 256,
    'metrics': ['accuracy', 'precision', 'recall', 'f1', 'auc', 'mcc', 'kappa'],
    'confidence_level': 0.95,
    'num_bootstrap': 1000
}

print("Evaluation Configuration:")
print("="*50)
for key, value in eval_config.items():
    if isinstance(value, list):
        print(f"  {key}: {', '.join(value)}")
    else:
        print(f"  {key}: {value}")

## 3. Load Test Data

In [None]:
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(eval_config['model_name'])

# Load test dataset
data_config = AGNewsConfig(
    data_dir=DATA_DIR / "processed",
    max_samples=eval_config['max_samples'],
    tokenizer=tokenizer,
    max_length=eval_config['max_length']
)

print("Loading test dataset...")
test_dataset = AGNewsDataset(data_config, split='test')

# Create DataLoader
test_dataloader = DataLoader(
    test_dataset,
    batch_size=eval_config['batch_size'],
    shuffle=False,
    num_workers=2,
    pin_memory=torch.cuda.is_available()
)

print("\nTest dataset loaded:")
print(f"  Total samples: {len(test_dataset)}")
print(f"  Number of batches: {len(test_dataloader)}")
print(f"  Classes: {AG_NEWS_CLASSES}")

# Display sample distribution
label_counts = pd.Series(test_dataset.labels).value_counts().sort_index()
print("\nLabel distribution:")
for label_id, count in label_counts.items():
    label_name = ID_TO_LABEL[label_id]
    print(f"  {label_name}: {count} ({count/len(test_dataset)*100:.1f}%)")

## 4. Load Trained Model

In [None]:
# Initialize model
def load_trained_model(checkpoint_path: Optional[Path] = None) -> nn.Module:
    """
    Load trained model for evaluation.
    
    Following model loading best practices from:
        PyTorch Documentation: "Saving and Loading Models"
    """
    # Initialize model architecture
    model = DeBERTaV3Classifier(
        model_name=eval_config['model_name'],
        num_labels=AG_NEWS_NUM_CLASSES,
        dropout_rate=0.1
    )
    
    # Load checkpoint if provided
    if checkpoint_path and checkpoint_path.exists():
        print(f"Loading checkpoint from: {checkpoint_path}")
        checkpoint = torch.load(checkpoint_path, map_location=device)
        
        if 'model_state_dict' in checkpoint:
            model.load_state_dict(checkpoint['model_state_dict'])
            print(f"  Loaded model from epoch: {checkpoint.get('epoch', 'unknown')}")
            print(f"  Best accuracy: {checkpoint.get('best_accuracy', 'unknown')}")
        else:
            model.load_state_dict(checkpoint)
    else:
        print("Checkpoint not found; classifier weights are not fine-tuned, "
              "so the metrics below are for demonstration only")
    
    return model.to(device)

# Load model
checkpoint_path = MODEL_DIR / "tutorial" / "deberta_v3_trained" / "checkpoint.pt"
model = load_trained_model(checkpoint_path)
model.eval()

# Model summary
total_params = sum(p.numel() for p in model.parameters())
print("\nModel loaded:")
print(f"  Total parameters: {total_params:,}")
print(f"  Model in eval mode: {not model.training}")

## 5. Generate Predictions

In [None]:
def generate_predictions(model: nn.Module,
                        dataloader: DataLoader,
                        device: torch.device) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Generate predictions on test set.
    
    Following inference best practices from:
        Krishnan et al. (2022): "Efficient Deep Learning Inference"
    """
    model.eval()
    
    all_predictions = []
    all_labels = []
    all_probabilities = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Generating predictions"):
            # Move tensor inputs to device (skip labels and non-tensor fields such as raw text)
            inputs = {k: v.to(device) for k, v in batch.items()
                      if k != 'labels' and torch.is_tensor(v)}
            labels = batch['labels'].to(device)
            
            # Forward pass
            outputs = model(**inputs)
            logits = outputs.logits if hasattr(outputs, 'logits') else outputs
            
            # Calculate probabilities and predictions
            probabilities = torch.softmax(logits, dim=-1)
            predictions = torch.argmax(logits, dim=-1)
            
            # Store results
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            all_probabilities.extend(probabilities.cpu().numpy())
    
    return (np.array(all_predictions),
            np.array(all_labels),
            np.array(all_probabilities))

# Generate predictions
print("Generating predictions on test set...")
predictions, true_labels, probabilities = generate_predictions(model, test_dataloader, device)

print("\nPredictions generated:")
print(f"  Predictions shape: {predictions.shape}")
print(f"  True labels shape: {true_labels.shape}")
print(f"  Probabilities shape: {probabilities.shape}")

# Quick accuracy check
accuracy = accuracy_score(true_labels, predictions)
print("\nQuick metrics:")
print(f"  Accuracy: {accuracy:.4f}")

## 6. Calculate Comprehensive Metrics

In [None]:
# Initialize metrics calculator
metrics_calculator = ClassificationMetrics(num_classes=AG_NEWS_NUM_CLASSES)

# Calculate all metrics
metrics = metrics_calculator.compute_metrics(
    predictions=predictions,
    labels=true_labels,
    probabilities=probabilities
)

print("Comprehensive Evaluation Metrics:")
print("="*60)

# Overall metrics
print("\nOverall Performance:")
print(f"  Accuracy: {metrics['accuracy']:.4f}")
print(f"  Macro F1: {metrics['macro_f1']:.4f}")
print(f"  Weighted F1: {metrics['weighted_f1']:.4f}")
print(f"  Macro Precision: {metrics['macro_precision']:.4f}")
print(f"  Macro Recall: {metrics['macro_recall']:.4f}")

# Additional metrics
mcc = matthews_corrcoef(true_labels, predictions)
kappa = cohen_kappa_score(true_labels, predictions)

print("\nAdditional Metrics:")
print(f"  Matthews Correlation Coefficient: {mcc:.4f}")
print(f"  Cohen's Kappa: {kappa:.4f}")

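# Per-class metrics (fix the label set so every array has one entry per class,
# even if a class never appears in this sample)
precision, recall, f1, support = precision_recall_fscore_support(
    true_labels, predictions, average=None,
    labels=list(range(AG_NEWS_NUM_CLASSES)), zero_division=0
)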

print("\nPer-Class Performance:")
print("-"*60)
print(f"{'Class':<15} {'Precision':<10} {'Recall':<10} {'F1':<10} {'Support':<10}")
print("-"*60)

for i in range(AG_NEWS_NUM_CLASSES):
    class_name = ID_TO_LABEL[i]
    print(f"{class_name:<15} {precision[i]:<10.4f} {recall[i]:<10.4f} "
          f"{f1[i]:<10.4f} {support[i]:<10}")

# Classification report
print("\nDetailed Classification Report:")
print("="*60)
print(classification_report(
    true_labels,
    predictions,
    labels=list(range(AG_NEWS_NUM_CLASSES)),
    target_names=AG_NEWS_CLASSES,
    digits=4,
    zero_division=0
))

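The `metrics` list in `eval_config` names `auc`, while the cell above prints only threshold-based scores. As a minimal sketch (not necessarily how `ClassificationMetrics` computes it), a macro-averaged one-vs-rest ROC AUC can be derived from the softmax probabilities with scikit-learn's `roc_auc_score`. The helper name `macro_ovr_auc` and the toy arrays are illustrative assumptions, not part of the project code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_ovr_auc(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Macro-averaged one-vs-rest ROC AUC for a multiclass problem.

    y_prob must have shape (n_samples, n_classes) with rows summing to 1,
    and every class must occur at least once in y_true.
    """
    labels = np.arange(y_prob.shape[1])
    return float(roc_auc_score(y_true, y_prob, multi_class="ovr",
                               average="macro", labels=labels))


# Toy data standing in for `true_labels` and `probabilities` (hypothetical values)
rng = np.random.default_rng(0)
toy_labels = np.repeat(np.arange(4), 25)
toy_logits = rng.normal(size=(100, 4))
toy_logits[np.arange(100), toy_labels] += 2.0  # make the true class more likely
toy_probs = np.exp(toy_logits) / np.exp(toy_logits).sum(axis=1, keepdims=True)

print(f"Macro OvR AUC: {macro_ovr_auc(toy_labels, toy_probs):.4f}")
```

In the notebook, the equivalent call would be `macro_ovr_auc(true_labels, probabilities)`, provided all four classes are present in the evaluated sample.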
## 7. Confusion Matrix Analysis

In [None]:
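# Calculate confusion matrix (fix the label order so rows/columns match AG_NEWS_CLASSES)
cm = confusion_matrix(true_labels, predictions,
                      labels=list(range(AG_NEWS_NUM_CLASSES)))

# Row-normalize so each row sums to 1; guard against classes with no samples
row_sums = cm.sum(axis=1, keepdims=True)
cm_normalized = cm.astype(float) / np.maximum(row_sums, 1)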

# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Raw counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=AG_NEWS_CLASSES,
            yticklabels=AG_NEWS_CLASSES,
            ax=axes[0])
axes[0].set_title('Confusion Matrix (Counts)')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')

# Normalized
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues',
            xticklabels=AG_NEWS_CLASSES,
            yticklabels=AG_NEWS_CLASSES,
            ax=axes[1])
axes[1].set_title('Normalized Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')

plt.suptitle('Confusion Matrix Analysis', fontsize=14)
plt.tight_layout()
plt.show()

# Analyze confusion patterns
print("Confusion Analysis:")
print("="*50)

# Find most confused pairs
confusion_pairs = []
for i in range(AG_NEWS_NUM_CLASSES):
    for j in range(AG_NEWS_NUM_CLASSES):
        if i != j and cm[i, j] > 0:
            confusion_pairs.append({
                'true': ID_TO_LABEL[i],
                'predicted': ID_TO_LABEL[j],
                'count': cm[i, j],
                'percentage': cm_normalized[i, j] * 100
            })

# Sort by count
confusion_pairs.sort(key=lambda x: x['count'], reverse=True)

print("\nTop 5 Confusion Pairs:")
for pair in confusion_pairs[:5]:
    print(f"  {pair['true']} -> {pair['predicted']}: "
          f"{pair['count']} ({pair['percentage']:.1f}%)")

## 8. Error Analysis

In [None]:
def analyze_errors(predictions: np.ndarray,
                  true_labels: np.ndarray,
                  probabilities: np.ndarray,
                  texts: List[str]) -> pd.DataFrame:
    """
    Perform detailed error analysis.
    
    Following error analysis methodology from:
        Wu et al. (2019): "Errudite: Scalable, Reproducible, and Testable Error Analysis"
    """
    errors = []
    
    for i in range(len(predictions)):
        if predictions[i] != true_labels[i]:
            # Calculate confidence
            confidence = probabilities[i, predictions[i]]
            true_prob = probabilities[i, true_labels[i]]
            
            errors.append({
                'index': i,
                'text': texts[i][:100] + '...' if len(texts[i]) > 100 else texts[i],
                'true_label': ID_TO_LABEL[true_labels[i]],
                'predicted_label': ID_TO_LABEL[predictions[i]],
                'confidence': confidence,
                'true_prob': true_prob,
                'confidence_gap': confidence - true_prob,
                'text_length': len(texts[i].split())
            })
    
    return pd.DataFrame(errors)

# Perform error analysis
error_df = analyze_errors(
    predictions,
    true_labels,
    probabilities,
    test_dataset.texts
)

print("Error Analysis Results:")
print("="*50)
print(f"Total errors: {len(error_df)} ({len(error_df)/len(predictions)*100:.1f}%)")

if len(error_df) > 0:
    # Error distribution by class
    print("\nErrors by True Class:")
    error_by_class = error_df['true_label'].value_counts()
    for label, count in error_by_class.items():
        total = int(np.sum(true_labels == LABEL_TO_ID[label]))
        print(f"  {label}: {count}/{total} ({count/total*100:.1f}%)")
    
    # High confidence errors
    high_conf_errors = error_df[error_df['confidence'] > 0.8]
    print(f"\nHigh confidence errors (>0.8): {len(high_conf_errors)}")
    
    if len(high_conf_errors) > 0:
        print("\nExample high confidence errors:")
        for _, row in high_conf_errors.head(3).iterrows():
            print(f"\n  Text: {row['text']}")
            print(f"  True: {row['true_label']}, Predicted: {row['predicted_label']}")
            print(f"  Confidence: {row['confidence']:.3f}")
    
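    # Error rate by text length: bin all test samples into length tertiles,
    # then compute the share of misclassified samples in each bin
    print("\nError rate by text length:")
    all_lengths = pd.Series([len(t.split()) for t in test_dataset.texts])
    # Ranking first breaks ties so qcut never sees duplicate bin edges
    length_bins = pd.qcut(all_lengths.rank(method='first'), q=3,
                          labels=['Short', 'Medium', 'Long'])
    is_error = pd.Series(predictions != true_labels)
    error_rate_by_length = is_error.groupby(length_bins, observed=True).mean()
    for length_bin, rate in error_rate_by_length.items():
        print(f"  {length_bin}: {rate*100:.1f}% of samples misclassified")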

## 9. Statistical Significance Testing

In [None]:
def bootstrap_confidence_interval(y_true: np.ndarray,
                                 y_pred: np.ndarray,
                                 metric_func: callable,
                                 n_bootstrap: int = 1000,
                                 confidence_level: float = 0.95) -> Tuple[float, float, float]:
    """
    Calculate bootstrap confidence intervals.
    
    Following bootstrap methodology from:
        Efron & Tibshirani (1993): "An Introduction to the Bootstrap"
    """
    scores = []
    n_samples = len(y_true)
    
    for _ in range(n_bootstrap):
        # Sample with replacement
        indices = np.random.choice(n_samples, n_samples, replace=True)
        score = metric_func(y_true[indices], y_pred[indices])
        scores.append(score)
    
    scores = np.array(scores)
    alpha = 1 - confidence_level
    
    lower = np.percentile(scores, (alpha/2) * 100)
    upper = np.percentile(scores, (1 - alpha/2) * 100)
    mean = np.mean(scores)
    
    return mean, lower, upper

# Calculate confidence intervals for key metrics
print("Bootstrap Confidence Intervals:")
print("="*50)
print(f"Number of bootstrap samples: {eval_config['num_bootstrap']}")
print(f"Confidence level: {eval_config['confidence_level']*100:.0f}%")
print()

metrics_to_test = [
    ('Accuracy', accuracy_score),
    ('Macro F1', lambda y_t, y_p: precision_recall_fscore_support(y_t, y_p, average='macro')[2]),
    ('MCC', matthews_corrcoef),
    ('Cohen Kappa', cohen_kappa_score)
]

confidence_intervals = {}
for metric_name, metric_func in metrics_to_test:
    mean, lower, upper = bootstrap_confidence_interval(
        true_labels,
        predictions,
        metric_func,
        n_bootstrap=eval_config['num_bootstrap'],
        confidence_level=eval_config['confidence_level']
    )
    
    confidence_intervals[metric_name] = (mean, lower, upper)
    print(f"{metric_name:12}: {mean:.4f} [{lower:.4f}, {upper:.4f}]")

# McNemar's test (for comparing two models)
def mcnemar_test(y_true: np.ndarray,
                 pred1: np.ndarray,
                 pred2: np.ndarray) -> Tuple[float, float]:
    """
    Perform McNemar's test.
    
    Following methodology from:
        Dietterich (1998): "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms"
    """
    # Create contingency table
    correct1 = (pred1 == y_true)
    correct2 = (pred2 == y_true)
    
    n00 = np.sum(~correct1 & ~correct2)  # Both wrong
    n01 = np.sum(~correct1 & correct2)   # 1 wrong, 2 correct
    n10 = np.sum(correct1 & ~correct2)   # 1 correct, 2 wrong
    n11 = np.sum(correct1 & correct2)    # Both correct
    
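    # No discordant pairs: the two models are indistinguishable on this data
    if n01 + n10 == 0:
        return 0.0, 1.0

    # McNemar's chi-squared statistic with continuity correction
    statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # The survival function is more accurate than 1 - cdf for small p-values
    p_value = stats.chi2.sf(statistic, df=1)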
    
    return statistic, p_value

# Simulate second model predictions for demonstration
print("\nStatistical Model Comparison (Simulated):")
print("="*50)

# Perturb the first model's predictions to simulate a second model
pred2 = predictions.copy()
n_changes = int(len(pred2) * 0.05)  # Resample 5% of predictions
change_indices = np.random.choice(len(pred2), n_changes, replace=False)
# A resampled label can equal the original, so fewer than 5% may actually change
pred2[change_indices] = np.random.randint(0, AG_NEWS_NUM_CLASSES, n_changes)

statistic, p_value = mcnemar_test(true_labels, predictions, pred2)
print(f"McNemar's test statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'} (α=0.05)")

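The chi-squared form of McNemar's test used above is an approximation that becomes unreliable when the number of discordant pairs (`n01 + n10`) is small. As a hedged alternative, the sketch below computes a two-sided exact p-value from the binomial distribution. The function name `mcnemar_exact` and the toy arrays are illustrative assumptions and not part of the project's `SignificanceTests` module.

```python
from typing import Tuple

import numpy as np
from scipy import stats


def mcnemar_exact(y_true: np.ndarray,
                  pred1: np.ndarray,
                  pred2: np.ndarray) -> Tuple[int, int, float]:
    """Two-sided exact (binomial) McNemar test on the discordant pairs."""
    correct1 = (pred1 == y_true)
    correct2 = (pred2 == y_true)
    n01 = int(np.sum(~correct1 & correct2))  # model 1 wrong, model 2 correct
    n10 = int(np.sum(correct1 & ~correct2))  # model 1 correct, model 2 wrong
    n = n01 + n10
    if n == 0:
        return n01, n10, 1.0  # the models never disagree on correctness
    # Under H0, n01 ~ Binomial(n, 0.5); double the smaller tail and cap at 1
    p_value = min(1.0, 2.0 * stats.binom.cdf(min(n01, n10), n, 0.5))
    return n01, n10, float(p_value)


# Hypothetical toy example: 1 vs 9 discordant pairs out of 20 samples
y = np.zeros(20, dtype=int)
p1 = y.copy()
p1[:1] = 1      # model 1 is wrong on sample 0 only
p2 = y.copy()
p2[1:10] = 1    # model 2 is wrong on samples 1..9
n01, n10, p = mcnemar_exact(y, p1, p2)
print(f"n01={n01}, n10={n10}, exact p-value={p:.4f}")
```

With the notebook's variables, the call would be `mcnemar_exact(true_labels, predictions, pred2)`. The exact form is most useful when the two models disagree on only a few dozen samples.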
## 10. Performance Visualization

In [None]:
# Create comprehensive performance visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Per-class F1 scores
ax = axes[0, 0]
ax.bar(AG_NEWS_CLASSES, f1)
ax.set_xlabel('Class')
ax.set_ylabel('F1 Score')
ax.set_title('F1 Score by Class')
ax.set_ylim([0, 1])
for i, v in enumerate(f1):
    ax.text(i, v + 0.01, f'{v:.3f}', ha='center')

# 2. Precision-Recall comparison
ax = axes[0, 1]
x = np.arange(len(AG_NEWS_CLASSES))
width = 0.35
ax.bar(x - width/2, precision, width, label='Precision')
ax.bar(x + width/2, recall, width, label='Recall')
ax.set_xlabel('Class')
ax.set_ylabel('Score')
ax.set_title('Precision vs Recall')
ax.set_xticks(x)
ax.set_xticklabels(AG_NEWS_CLASSES)
ax.legend()
ax.set_ylim([0, 1])

# 3. ROC curves
ax = axes[0, 2]
for i in range(AG_NEWS_NUM_CLASSES):
    y_true_binary = (true_labels == i).astype(int)
    y_score = probabilities[:, i]
    fpr, tpr, _ = roc_curve(y_true_binary, y_score)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f'{ID_TO_LABEL[i]} (AUC={roc_auc:.3f})')

ax.plot([0, 1], [0, 1], 'k--', label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

# 4. Confidence distribution
ax = axes[1, 0]
max_probs = np.max(probabilities, axis=1)
ax.hist(max_probs, bins=30, edgecolor='black')
ax.set_xlabel('Prediction Confidence')
ax.set_ylabel('Count')
ax.set_title('Prediction Confidence Distribution')
ax.axvline(x=0.5, color='r', linestyle='--', label='0.5 reference')
ax.legend()

# 5. Error rate by confidence
ax = axes[1, 1]
confidence_bins = np.linspace(0, 1, 11)
confidence_errors = []
confidence_counts = []

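n_bins = len(confidence_bins) - 1
for i in range(n_bins):
    lo_edge, hi_edge = confidence_bins[i], confidence_bins[i+1]
    # Include the upper edge in the last bin so confidence == 1.0 is counted
    if i == n_bins - 1:
        mask = (max_probs >= lo_edge) & (max_probs <= hi_edge)
    else:
        mask = (max_probs >= lo_edge) & (max_probs < hi_edge)
    if mask.sum() > 0:
        error_rate = (predictions[mask] != true_labels[mask]).mean()
        confidence_errors.append(error_rate)
        confidence_counts.append(mask.sum())
    else:
        confidence_errors.append(0)
        confidence_counts.append(0)

# align='edge' draws each bar inside its bin instead of centering it on the left edge
ax.bar(confidence_bins[:-1], confidence_errors, width=0.08, align='edge')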
ax.set_xlabel('Confidence Range')
ax.set_ylabel('Error Rate')
ax.set_title('Error Rate by Confidence')
ax.set_xlim([0, 1])

# 6. Sample distribution
ax = axes[1, 2]
ax.pie(support, labels=AG_NEWS_CLASSES, autopct='%1.1f%%', startangle=90)
ax.set_title('Test Set Class Distribution')

plt.suptitle('Model Performance Analysis', fontsize=14)
plt.tight_layout()
plt.show()

print("Performance visualizations completed")

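The "Error Rate by Confidence" panel can be condensed into a single number, the expected calibration error (ECE): the sample-weighted average gap between accuracy and mean confidence across confidence bins. The following is a minimal sketch with equal-width bins; the function name `expected_calibration_error` and the toy arrays are illustrative assumptions.

```python
import numpy as np


def expected_calibration_error(y_true, y_pred, confidences, n_bins: int = 10) -> float:
    """ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    confidences = np.asarray(confidences, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin; right-closed bins keep confidence == 1.0
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    correct = (y_pred == y_true).astype(float)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


# Hypothetical toy values standing in for the notebook's arrays
toy_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
toy_pred = np.array([0, 1, 2, 3, 1, 1, 2, 0])
toy_conf = np.array([0.9, 0.8, 0.95, 0.7, 0.6, 0.85, 0.9, 0.55])
print(f"ECE: {expected_calibration_error(toy_true, toy_pred, toy_conf):.4f}")
```

In the notebook, the inputs would be `true_labels`, `predictions`, and `probabilities.max(axis=1)`. A value near 0 suggests that the reported confidences track the observed accuracy.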
## 11. Generate Evaluation Report

In [None]:
def generate_evaluation_report(metrics: Dict[str, Any],
                              confidence_intervals: Dict[str, Tuple[float, float, float]],
                              error_analysis: pd.DataFrame,
                              output_path: Path) -> Path:
    """
    Generate comprehensive evaluation report.
    
    Following reporting standards from:
        Moreira et al. (2018): "Standardized Evaluation of Machine Learning Methods"
    """
    report = {
        'metadata': {
            'model': eval_config['model_name'],
            'dataset': 'AG News',
            'num_classes': AG_NEWS_NUM_CLASSES,
            'test_samples': len(true_labels),
            'evaluation_date': pd.Timestamp.now().isoformat()
        },
        'overall_metrics': {
            'accuracy': float(metrics['accuracy']),
            'macro_f1': float(metrics['macro_f1']),
            'weighted_f1': float(metrics['weighted_f1']),
            'macro_precision': float(metrics['macro_precision']),
            'macro_recall': float(metrics['macro_recall'])
        },
        'confidence_intervals': {
            metric: {
                'mean': float(values[0]),
                'lower': float(values[1]),
                'upper': float(values[2])
            }
            for metric, values in confidence_intervals.items()
        },
        'per_class_metrics': [
            {
                'class': ID_TO_LABEL[i],
                'precision': float(precision[i]),
                'recall': float(recall[i]),
                'f1': float(f1[i]),
                'support': int(support[i])
            }
            for i in range(AG_NEWS_NUM_CLASSES)
        ],
        'confusion_matrix': cm.tolist(),
        'error_summary': {
            'total_errors': len(error_analysis),
            'error_rate': len(error_analysis) / len(true_labels),
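            # Guard against an empty DataFrame (no errors), which has no 'confidence' column
            'high_confidence_errors': (
                int((error_analysis['confidence'] > 0.8).sum())
                if len(error_analysis) > 0 else 0
            )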
        }
    }
    
    # Save report
    ensure_dir(output_path.parent)
    safe_save(report, output_path)
    
    return output_path

# Generate and save report
report_path = OUTPUT_DIR / "tutorial" / "evaluation" / "evaluation_report.json"
saved_report = generate_evaluation_report(
    metrics=metrics,
    confidence_intervals=confidence_intervals,
    error_analysis=error_df,
    output_path=report_path
)

print("Evaluation Report Generated:")
print("="*50)
print(f"Report saved to: {saved_report}")
print(f"File size: {saved_report.stat().st_size / 1024:.2f} KB")

# Display summary
print("\nReport Summary:")
report_data = safe_load(saved_report)
print(f"  Model: {report_data['metadata']['model']}")
print(f"  Test samples: {report_data['metadata']['test_samples']}")
print(f"  Accuracy: {report_data['overall_metrics']['accuracy']:.4f}")
print(f"  Macro F1: {report_data['overall_metrics']['macro_f1']:.4f}")
print(f"  Error rate: {report_data['error_summary']['error_rate']:.4f}")

## 12. Conclusions and Next Steps

### Evaluation Summary

This tutorial demonstrated comprehensive model evaluation techniques:

1. **Test Data Loading**: Prepared test dataset with proper preprocessing
2. **Model Loading**: Loaded trained DeBERTa-v3 model for evaluation
3. **Prediction Generation**: Generated predictions with confidence scores
4. **Metrics Calculation**: Computed comprehensive classification metrics
5. **Confusion Analysis**: Analyzed confusion patterns between classes
6. **Error Analysis**: Identified and characterized prediction errors
7. **Statistical Testing**: Performed bootstrap confidence intervals and significance tests
8. **Visualization**: Created comprehensive performance visualizations
9. **Report Generation**: Generated structured evaluation report

### Key Takeaways

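1. **Multiple Metrics**: Accuracy, macro F1, MCC, and Cohen's kappa each capture a different aspect of performance; report several rather than relying on one
2. **Confidence Intervals**: Bootstrap resampling of the test set quantifies how much a metric could vary with a different sample
3. **Error Patterns**: Confusion pairs and high-confidence errors point to specific model weaknesses
4. **Statistical Significance**: McNemar's test checks whether the difference between two models exceeds what chance disagreement would produce
5. **Visualization Importance**: Visual analysis complements numerical metrics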

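To illustrate the first takeaway, the sketch below shows how accuracy, macro F1, and MCC can tell different stories on the same predictions. The full AG News test split is class-balanced, so the effect is small there; the example instead uses a hypothetical imbalanced binary task, where the gap is easy to see. All data in the block is made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Hypothetical imbalanced binary task: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A classifier that finds only one of the ten positives
y_pred = np.zeros(100, dtype=int)
y_pred[90] = 1

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Accuracy: {acc:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
print(f"MCC:      {mcc:.3f}")
```

Accuracy stays high because the majority class dominates, while macro F1 and MCC drop because the minority class is almost entirely missed. The same pattern can appear when evaluating on a skewed subset or on a per-topic slice of the test data.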
### Next Steps

1. **Advanced Evaluation**:
   - Implement cross-validation evaluation
   - Perform ablation studies
   - Conduct robustness testing

2. **Model Comparison**:
   - Compare multiple model architectures
   - Evaluate ensemble methods
   - Benchmark against baselines

3. **Error Mitigation**:
   - Implement targeted data augmentation
   - Apply class-specific optimization
   - Use ensemble to reduce errors

4. **Production Monitoring**:
   - Set up continuous evaluation pipeline
   - Monitor model drift
   - Implement A/B testing framework

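As a starting point for the cross-validation item above, the following sketch runs stratified 5-fold evaluation and reports the mean and spread of macro F1. It uses synthetic features and a `LogisticRegression` stand-in purely so the block runs on its own; with the transformer model, each fold would instead require fine-tuning on the training indices, which is far more expensive. The data and the classifier choice are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic 4-class data as a placeholder for real features and labels
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)  # stand-in for the real classifier
    clf.fit(X[train_idx], y[train_idx])
    val_pred = clf.predict(X[val_idx])
    fold_scores.append(f1_score(y[val_idx], val_pred, average="macro"))

fold_scores = np.array(fold_scores)
print(f"Macro F1 per fold: {np.round(fold_scores, 3)}")
print(f"Mean ± std: {fold_scores.mean():.3f} ± {fold_scores.std(ddof=1):.3f}")
```

Stratification keeps the class proportions roughly equal across folds, so per-fold scores are comparable and their standard deviation gives a rough sense of run-to-run variability.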
### References

For deeper understanding, consult:
- Evaluation documentation: `docs/user_guide/evaluation.md`
- Advanced metrics: `src/evaluation/metrics/`
- Statistical testing: `notebooks/experiments/statistical_analysis.ipynb`
- Visualization tools: `src/evaluation/visualization/`