# Training and Evaluation: Individual Models

================================================================================
PURPOSE: Train and evaluate classifiers on features from each model separately
================================================================================

This notebook trains multiple classifiers on Context Tree features extracted
from individual transformer models. Each model (`bert`, `bert_political`,
`bert_ambiguity`, `roberta`, `deberta`, `xlnet`) is evaluated separately to
assess its individual performance on the clarity and evasion classification tasks.

**Workflow:**
1. Load features from Google Drive (saved by 02_feature_extraction_separate.ipynb)
2. Train multiple classifiers on each model's features
3. Evaluate on Dev set (model selection and hyperparameter tuning)
4. Save predictions and probabilities for further analysis
5. Generate comprehensive results tables and evaluation plots

**Classifiers:**
- Logistic Regression
- Linear Support Vector Classifier (LinearSVC)
- Random Forest
- XGBoost
- LightGBM

**Output:** 
- Predictions (hard labels) and probabilities saved to Google Drive
- Results tables comparing classifiers for each model/task combination
- Evaluation plots (confusion matrices, PR curves, ROC curves)
- Results metadata saved for final summary generation

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory:
  - `src.storage.manager` (StorageManager)
  - `src.models.trainer` (training and evaluation functions)
  - `src.models.classifiers` (classifier definitions)
  - `src.evaluation.tables` (final summary table functions)

**From Google Drive:**
- Dataset splits: `splits/dataset_splits.pkl`
  - Train split (for label extraction)
  - Dev split (for label extraction)
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (bert, bert_political, bert_ambiguity, roberta, deberta, xlnet)
  - For each task (clarity, evasion)
  - For Train and Dev splits
  - Loaded via `storage.load_features(model, task, split)`
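
The file-naming pattern above can be sketched as a small path helper. This is only an illustration: `feature_path` is a hypothetical name, and the real lookup happens inside `storage.load_features`, whose implementation is not shown in this notebook.

```python
from pathlib import Path


def feature_path(data_root, split, model, task):
    """Build a path following features/raw/X_{split}_{model}_{task}.npy.

    Hypothetical helper for illustration; not part of StorageManager.
    """
    return Path(data_root) / "features" / "raw" / f"X_{split}_{model}_{task}.npy"


# Example: where the Dev features for BERT on the clarity task would be expected
example = feature_path("/content/drive/MyDrive/semeval_data", "dev", "bert", "clarity")
print(example)
```

If a file is missing, printing the expected path this way is a quick check that the earlier feature-extraction notebook was run for that model, task, and split.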

**From HuggingFace Hub:**
- Nothing (all features already extracted)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Predictions: `predictions/pred_{split}_{model}_{classifier}_{task}.npy`
  - Hard label predictions for Dev set
  - For each model/classifier/task combination
- Probabilities: `features/probabilities/probs_{split}_{model}_{classifier}_{task}.npy`
  - Probability distributions for Dev set
  - For each model/classifier/task combination
- Evaluation plots: `plots/{model}_{task}_{classifier}/`
  - Confusion matrices
  - Precision-Recall curves
  - ROC curves
- Final summary tables: `results/tables/`
  - `final_summary_all_models_classifiers_tasks.{csv,html,png}`
  - `final_summary_model_wise.{csv,html,png}`
  - `final_summary_classifier_wise.{csv,html,png}`
- Complete results dictionary: `results/all_results_dev.pkl`

**To GitHub:**
- Results metadata: `results/{model}_{task}_separate.json`
  - Metrics for each classifier
  - Train/Dev sample counts
  - Timestamp information
- Results dictionary (JSON): `results/all_results_dev.json`

**Evaluation Metrics Computed and Printed:**
- Accuracy
- Macro Precision, Recall, F1
- Weighted Precision, Recall, F1
- Per-class metrics (precision, recall, F1, support)
- Cohen's Kappa
- Matthews Correlation Coefficient
- Hamming Loss
- Jaccard Score (IoU)
- Confusion Matrix
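
To make two of the listed metrics concrete, the sketch below computes macro F1 and Cohen's kappa from first principles using only the standard library. It is an illustration of the definitions, not the code used by `train_and_evaluate`; the function names are hypothetical, and in practice `sklearn.metrics.f1_score(average='macro')` and `sklearn.metrics.cohen_kappa_score` cover the same ground.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (a single class everywhere); convention chosen for this sketch
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only (not results from this project)
y_true = ["Clear Reply", "Clear Reply", "Ambivalent", "Ambivalent"]
y_pred = ["Clear Reply", "Ambivalent", "Ambivalent", "Ambivalent"]
print(round(macro_f1(y_true, y_pred), 4), round(cohen_kappa(y_true, y_pred), 4))
```

Macro F1 weights every class equally, which matters for the 9-class evasion task where rare classes would otherwise be dominated by frequent ones.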

**What gets passed to next notebook:**
- All predictions and probabilities saved to persistent storage
- Final summary tables (CSV, HTML, PNG formats)
- Results metadata for comparison with fusion approaches


In [None]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Verifies that the required source modules import correctly
# Data splits and features are loaded later, per task, in the training loop

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import numpy as np

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False
    
    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            else:
                if attempt < max_retries - 1:
                    time.sleep(3)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(3)
    
    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}")

# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception:
    pass  # Already mounted

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.models.trainer import train_and_evaluate
    from src.models.classifiers import get_classifier_dict
except ImportError as e:
    raise ImportError(
        f"Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    )

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Data splits will be loaded per-task in the training loop
# Clarity and Evasion have different splits (Evasion uses majority voting)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print("\nNOTE: Data splits will be loaded per-task (task-specific splits)")
print("      Clarity and Evasion have different splits due to majority voting")


In [None]:
# ============================================================================
# CONFIGURE MODELS, TASKS, AND CLASSIFIERS
# ============================================================================
# Defines the models to evaluate, tasks to perform, and classifiers to train
# Label mappings are defined for clarity (3-class) and evasion (9-class) tasks

MODELS = ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
TASKS = ['clarity', 'evasion']

# Label mappings for each task
CLARITY_LABELS = ['Ambivalent', 'Clear Non-Reply', 'Clear Reply']
EVASION_LABELS = ['Claims ignorance', 'Clarification', 'Declining to answer', 
                  'Deflection', 'Dodging', 'Explicit', 
                  'General', 'Implicit', 'Partial/half-answer']

# Initialize classifiers with fixed random seed for reproducibility
classifiers = get_classifier_dict(random_state=42)

print("Configuration:")
print(f"  Models: {MODELS}")
print(f"  Tasks: {TASKS}")
print(f"  Classifiers: {list(classifiers.keys())}")
print(f"  Clarity labels: {len(CLARITY_LABELS)} classes")
print(f"  Evasion labels: {len(EVASION_LABELS)} classes")


In [None]:
# ============================================================================
# TRAIN AND EVALUATE CLASSIFIERS FOR EACH MODEL AND TASK
# ============================================================================
# Iterates through each model and task, trains all classifiers, and evaluates
# on the Dev set. Results are saved for later analysis and final summary generation.

all_results = {}

for model in MODELS:
    all_results[model] = {}
    
    for task in TASKS:
        # Select appropriate label list and dataset key for this task
        if task == 'clarity':
            label_list = CLARITY_LABELS
            label_key = 'clarity_label'
        else:  # evasion
            label_list = EVASION_LABELS
            label_key = 'evasion_label'
        
        # Load task-specific splits (Clarity and Evasion have different splits)
        # Evasion splits are filtered by majority voting
        train_ds = storage.load_split('train', task=task)
        dev_ds = storage.load_split('dev', task=task)
        
        # Load features from persistent storage
        X_train = storage.load_features(model, task, 'train')
        X_dev = storage.load_features(model, task, 'dev')
        
        # Extract labels from dataset splits
        y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
        y_dev = np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])
        
        # Train all classifiers and evaluate on Dev set
        # This function handles training, prediction, metric computation, and visualization
        results = train_and_evaluate(
            X_train, y_train, X_dev, y_dev,
            label_list=label_list,
            task_name=f"{model}_{task}",
            classifiers=classifiers,
            random_state=42,
            print_report=False,     # Don't print classification reports
            print_table=True,       # Print results comparison table only
            create_plots=True,      # Generate confusion matrices and PR/ROC curves (silently)
            save_plots_dir=str(DATA_PATH / 'plots')
        )
        
        # Save predictions and probabilities to persistent storage
        # These will be used for further analysis and final summary generation
        for classifier_name, result in results.items():
            # Save hard label predictions
            storage.save_predictions(
                result['dev_pred'],
                model, classifier_name, task, 'dev'
            )
            
            # Save probability distributions (if classifier supports it)
            if result['dev_proba'] is not None:
                storage.save_probabilities(
                    result['dev_proba'],
                    model, classifier_name, task, 'dev'
                )
        
        all_results[model][task] = results
        
        # Save results summary to metadata for final summary generation
        experiment_id = f"{model}_{task}_separate"
        storage.save_results({
            'model': model,
            'task': task,
            'results': {
                name: {
                    'metrics': res['metrics'],
                    'n_train': len(y_train),
                    'n_dev': len(y_dev)
                }
                for name, res in results.items()
            }
        }, experiment_id)


# ============================================================================
# FINAL SUMMARY: PIVOT TABLES AND STYLED TABLES
# ============================================================================
# This cell creates comprehensive final summary tables from all training results:
# 1. Final Summary Pivot Table: ALL MODELS × CLASSIFIERS × TASKS
# 2. Model-wise Summary: Classifier × Tasks (grouped by Model)
# 3. Classifier-wise Summary: Model × Tasks (grouped by Classifier)
# 
# All tables are styled with color gradients and saved in multiple formats
# (CSV, HTML, PNG) for easy sharing and publication.
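
The pivot tables are built from the nested `all_results[model][task][classifier]['metrics']` dictionary. The sketch below shows one way such a structure could be flattened and pivoted with plain Python. `flatten_results` and `pivot_by_task` are hypothetical names for illustration; the functions in `src.evaluation.tables` may work differently.

```python
def flatten_results(all_results, metric="macro_f1"):
    """Turn all_results[model][task][classifier] into a flat list of row dicts."""
    rows = []
    for model, tasks in all_results.items():
        for task, clf_results in tasks.items():
            for clf_name, result in clf_results.items():
                value = result.get("metrics", {}).get(metric)
                if value is not None:  # skip entries that lack the requested metric
                    rows.append({"model": model, "classifier": clf_name,
                                 "task": task, metric: value})
    return rows


def pivot_by_task(rows, metric="macro_f1"):
    """Group rows by (model, classifier) with one column per task."""
    table = {}
    for row in rows:
        table.setdefault((row["model"], row["classifier"]), {})[row["task"]] = row[metric]
    return table


# Toy values for illustration only (not results from this project)
toy_results = {
    "bert": {
        "clarity": {"LogisticRegression": {"metrics": {"macro_f1": 0.5}}},
        "evasion": {"LogisticRegression": {"metrics": {"macro_f1": 0.25}}},
    }
}
print(pivot_by_task(flatten_results(toy_results)))
```

The flat row form also loads directly into `pandas.DataFrame(rows)` if a styled table is wanted.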


In [None]:
# ============================================================================
# FINAL SUMMARY GENERATION
# ============================================================================
# Generate comprehensive summary tables and save all results

import pandas as pd
from src.evaluation.tables import (
    create_final_summary_pivot,
    create_model_wise_summary_pivot,
    create_classifier_wise_summary_pivot,
    style_table
)

# Save all_results dictionary to persistent storage
storage.save_all_results_dict(all_results, filename='all_results_dev.pkl')

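# Extract classifier names from results
# NOTE: 'hierarchical_evasion_to_clarity' is only present in all_results after
# the hierarchical evaluation cell further below has been run. On a first
# top-to-bottom run it is skipped by the membership check in the loop.
classifier_names = set()
all_tasks = ['clarity', 'evasion', 'hierarchical_evasion_to_clarity']  # 3 tasks: Clarity, Evasion, Hierarchical

for model in MODELS:
    if model in all_results:
        for task in all_tasks:
            if task in all_results[model]:
                classifier_names.update(all_results[model][task].keys())
classifier_names = sorted(classifier_names)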

# ============================================================================
# FINAL SUMMARY: ALL MODELS × CLASSIFIERS × TASKS (3 Tasks Side-by-Side)
# ============================================================================
# Shows Clarity, Evasion, and Hierarchical results side-by-side for comparison
# Color coding helps identify which task performs better for each model/classifier

df_final_pivot = create_final_summary_pivot(
    all_results, MODELS, classifier_names, all_tasks, metric='macro_f1'
)

if not df_final_pivot.empty:
    # Display styled pivot table with color coding
    styled_final = style_table(df_final_pivot, precision=4)
    display(styled_final)
    
    # Save table in multiple formats
    storage.save_table(
        styled_final,
        table_name='final_summary_all_models_classifiers_tasks',
        formats=['csv', 'html', 'png']
    )
else:
    print("WARNING: No results available for final summary pivot table")


# ============================================================================
# HIERARCHICAL EVASION → CLARITY APPROACH
# ============================================================================
# This cell implements the hierarchical approach where evasion predictions
# are mapped to clarity predictions using a predefined mapping function.
# This approach leverages the hierarchical relationship between evasion
# (fine-grained) and clarity (coarse-grained) labels.
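
The idea of mapping fine-grained evasion labels onto coarse clarity labels can be sketched as a plain dictionary lookup. The grouping below is an assumption made for illustration; the mapping actually applied is defined in `src.models.hierarchical` and may differ. `EVASION_TO_CLARITY` and `map_evasion_to_clarity` are hypothetical names.

```python
# ASSUMPTION: illustrative grouping only; the real mapping lives in
# src.models.hierarchical and is not shown in this notebook.
EVASION_TO_CLARITY = {
    "Explicit": "Clear Reply",
    "Implicit": "Ambivalent",
    "Dodging": "Ambivalent",
    "General": "Ambivalent",
    "Deflection": "Ambivalent",
    "Partial/half-answer": "Ambivalent",
    "Declining to answer": "Clear Non-Reply",
    "Claims ignorance": "Clear Non-Reply",
    "Clarification": "Clear Non-Reply",
}


def map_evasion_to_clarity(evasion_preds, mapping=EVASION_TO_CLARITY):
    """Translate each predicted evasion label into its parent clarity label."""
    return [mapping[label] for label in evasion_preds]


print(map_evasion_to_clarity(["Explicit", "Dodging", "Clarification"]))
```

Because the lookup is deterministic, the hierarchical clarity score depends entirely on the quality of the evasion predictions, which is why the next cell selects the evasion classifier with the best Macro F1.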


In [None]:
# ============================================================================
# HIERARCHICAL EVALUATION: EVASION PREDICTIONS → CLARITY PREDICTIONS
# ============================================================================
# For each model, uses evasion predictions to generate clarity predictions
# via hierarchical mapping, then evaluates against true clarity labels
# This is treated as a 3rd task alongside Clarity and Evasion

from src.models.hierarchical import evaluate_hierarchical_approach
import numpy as np

hierarchical_results = {}

# Load dev split for clarity task (to get true labels)
dev_ds_clarity = storage.load_split('dev', task='clarity')

for model in MODELS:
    # Check if we have evasion predictions for this model
    if 'evasion' not in all_results.get(model, {}):
        continue
    
    # Get evasion predictions and true labels
    evasion_results = all_results[model]['evasion']
    
    # Use predictions from best classifier (by Macro F1)
    best_classifier = None
    best_f1 = -1
    
    for clf_name, clf_result in evasion_results.items():
        if 'metrics' in clf_result:
            f1 = clf_result['metrics'].get('macro_f1', 0.0)
            if f1 > best_f1:
                best_f1 = f1
                best_classifier = clf_name
    
    if best_classifier is None:
        continue
    
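    # Get predictions (already string labels, as produced by the training cell above)
    y_evasion_pred = evasion_results[best_classifier]['dev_pred']
    
    # Get true clarity labels
    y_clarity_true = np.array([dev_ds_clarity[i]['clarity_label'] for i in range(len(dev_ds_clarity))])
    
    # The evasion Dev split is filtered by majority voting, so it may not line up
    # with the clarity Dev split. Fail early rather than score misaligned rows.
    if len(y_evasion_pred) != len(y_clarity_true):
        raise ValueError(
            f"{model}: {len(y_evasion_pred)} evasion predictions vs "
            f"{len(y_clarity_true)} clarity labels; Dev splits are not aligned"
        )
    
    # Encode clarity labels for evaluation (hierarchical function expects encoded)
    from sklearn.preprocessing import LabelEncoder
    
    le_clarity = LabelEncoder()
    y_clarity_true_encoded = le_clarity.fit_transform(y_clarity_true)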
    
    # Evaluate hierarchical approach
    # y_evasion_pred is already string labels, y_clarity_true_encoded is encoded
    # We pass dummy encoded evasion_true (not used in mapping, only for consistency)
    y_evasion_true_dummy = np.zeros(len(y_evasion_pred), dtype=int)  # Dummy, not used
    
    hierarchical_metrics = evaluate_hierarchical_approach(
        y_evasion_true_dummy,  # Not used in mapping, only for function signature
        y_evasion_pred,  # String labels - function will handle both string and int
        y_clarity_true_encoded,  # Encoded integers
        EVASION_LABELS,
        CLARITY_LABELS
    )
    
    hierarchical_results[model] = {
        'classifier': best_classifier,
        'metrics': hierarchical_metrics,
        'evasion_f1': best_f1
    }

# Add hierarchical results to all_results for final summary (as 3rd task)
for model in hierarchical_results:
    if model not in all_results:
        all_results[model] = {}
    if 'hierarchical_evasion_to_clarity' not in all_results[model]:
        all_results[model]['hierarchical_evasion_to_clarity'] = {}
    
    # Store hierarchical results in the same format as the other tasks,
    # keyed by the classifier whose evasion predictions were mapped
    best_clf_name = hierarchical_results[model]['classifier']
    all_results[model]['hierarchical_evasion_to_clarity'][best_clf_name] = {
        'metrics': hierarchical_results[model]['metrics'],
        'dev_pred': hierarchical_results[model]['metrics']['predictions'],
        'dev_proba': None  # Hierarchical approach doesn't produce probabilities
    }
