# Early Fusion: Concatenated Multi-Model Features

================================================================================
PURPOSE: Perform early fusion by concatenating features from all models
================================================================================

This notebook implements early fusion at the feature level by concatenating
Context Tree features extracted from multiple transformer models. This approach
combines complementary information from different models to potentially improve
classification performance.

**Fusion Strategy:**
- **Early Fusion (Feature-level)**: Concatenate features from all models
- Fused feature vector = [bert | bert_political | bert_ambiguity | roberta | deberta | xlnet], one feature block per entry in `MODELS`
- Total feature dimension = sum of individual model feature dimensions
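
As a minimal sketch of what feature-level concatenation means, the snippet below stacks small placeholder matrices side by side with NumPy. The helper name `concat_features` and the toy shapes are illustrative assumptions; the notebook itself delegates this step to `fuse_attention_features` from `src.features.fusion`, whose internals are not shown here.

```python
import numpy as np

# Hypothetical per-model feature matrices: 4 samples each, different widths.
features = {
    "bert": np.ones((4, 3)),
    "roberta": np.full((4, 2), 2.0),
    "deberta": np.full((4, 5), 3.0),
}


def concat_features(feature_dict, order):
    """Horizontally concatenate per-model matrices in a fixed model order."""
    blocks = [feature_dict[name] for name in order]
    # All blocks must describe the same samples, so row counts have to match.
    row_counts = {block.shape[0] for block in blocks}
    if len(row_counts) != 1:
        raise ValueError(f"Row counts differ across models: {sorted(row_counts)}")
    return np.hstack(blocks)


fused = concat_features(features, ["bert", "roberta", "deberta"])
print(fused.shape)  # (4, 10): 3 + 2 + 5 columns
```

Keeping the model order fixed matters because the same column layout has to be used for the Train and Dev matrices.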

**Workflow:**
1. Load features from all models (saved by 02_feature_extraction_separate.ipynb)
2. Concatenate features horizontally to create fused feature vectors
3. Train classifiers on fused features
4. Evaluate on Dev set and compare with individual model performance
5. Save fused features, predictions, and results

**Output:**
- Fused feature matrices saved to Google Drive
- Predictions and probabilities for fused features
- Results tables and evaluation plots
- Comparison with individual model results

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory:
  - `src.storage.manager` (StorageManager)
  - `src.features.fusion` (feature fusion functions)
  - `src.models.trainer` (training and evaluation functions)
  - `src.models.classifiers` (classifier definitions)

**From Google Drive:**
- Dataset splits: `splits/dataset_splits.pkl`
  - Train split (for label extraction)
  - Dev split (for label extraction)
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For all models (bert, bert_political, bert_ambiguity, roberta, deberta, xlnet)
  - For each task (clarity, evasion)
  - For Train and Dev splits
  - Loaded via `storage.load_features(model, task, split)`
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
  - For feature name extraction
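
The naming pattern for raw feature matrices can be illustrated with a small round trip through a temporary directory. This is only a sketch: `raw_feature_path` is a hypothetical helper, and the actual lookup is handled by `storage.load_features(model, task, split)`, which may resolve paths differently.

```python
import tempfile
from pathlib import Path

import numpy as np


def raw_feature_path(data_root, split, model, task):
    """Build a path following the features/raw/X_{split}_{model}_{task}.npy pattern."""
    return Path(data_root) / "features" / "raw" / f"X_{split}_{model}_{task}.npy"


with tempfile.TemporaryDirectory() as tmp:
    path = raw_feature_path(tmp, "dev", "bert", "clarity")
    path.parent.mkdir(parents=True)
    # Placeholder matrix standing in for real extracted features.
    np.save(path, np.zeros((5, 7)))
    X_dev = np.load(path)

print(path.name, X_dev.shape)
```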

**From HuggingFace Hub:**
- Nothing (all features already extracted)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Fused features: `features/fused/X_{split}_fused_{models}_{task}.npy`
  - Concatenated features from all models
  - For Train and Dev splits
  - Shape: (N_samples, sum_of_all_model_features)
- Predictions: `predictions/pred_{split}_fused_{classifier}_{task}.npy`
  - Hard label predictions for Dev set
  - For each classifier/task combination
- Probabilities: `features/probabilities/probs_{split}_fused_{classifier}_{task}.npy`
  - Probability distributions for Dev set
  - For each classifier/task combination
- Evaluation plots: `plots/early_fusion/{task}_{classifier}/`
  - Confusion matrices
  - Precision-Recall curves
  - ROC curves
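
Predictions and probabilities are stored as separate arrays. For many classifiers the hard labels are simply the argmax of each probability row, which the sketch below shows on made-up values. This relationship is an assumption rather than a guarantee: some models (for example SVMs with calibrated probabilities) can return a `predict` output that disagrees with the argmax of `predict_proba`.

```python
import numpy as np

# Hypothetical probabilities for 3 Dev samples over the 3 clarity classes.
dev_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Each row is a distribution over classes, so it should sum to 1.
assert np.allclose(dev_proba.sum(axis=1), 1.0)

# Hard labels: index of the most probable class for each sample.
dev_pred = dev_proba.argmax(axis=1)
print(dev_pred)  # [0 2 1]
```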

**To GitHub:**
- Fused feature metadata: `metadata/fused_{split}_{models}_{task}.json`
  - Fused feature names
  - Component model information
  - Timestamp and data paths
- Results metadata: `results/early_fusion_{task}.json`
  - Metrics for each classifier
  - Fusion method information
  - Train/Dev sample counts
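
To make the fused feature metadata more concrete, here is a rough sketch of a record holding fused feature names, component model information, and a timestamp. The field names, the `{model}_{feature}` naming scheme, and the helper `build_fused_metadata` are all assumptions for illustration; the real file is written by `storage.save_fused_features` and may be organised differently.

```python
import json
from datetime import datetime, timezone

# Hypothetical per-model feature names; the real ones come from the
# per-model metadata files.
model_feature_names = {
    "bert": ["f0", "f1"],
    "roberta": ["f0", "f1", "f2"],
}


def build_fused_metadata(names_by_model, task, split):
    """Assemble an illustrative metadata record for a fused feature matrix."""
    # Prefix each name with its model so identical names do not collide.
    fused_names = [
        f"{model}_{name}"
        for model, names in names_by_model.items()
        for name in names
    ]
    return {
        "task": task,
        "split": split,
        "models": list(names_by_model),
        "n_features_per_model": {m: len(n) for m, n in names_by_model.items()},
        "feature_names": fused_names,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


meta = build_fused_metadata(model_feature_names, "clarity", "train")
print(json.dumps(meta, indent=2))
```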

**Evaluation Metrics Computed and Printed:**
- Same comprehensive metrics as 03_train_evaluate.ipynb
- Comparison with individual model results from 03_train_evaluate.ipynb

**What gets passed to next notebook:**
- Fused features saved to persistent storage
- Predictions and probabilities for comparison
- Results metadata for final evaluation comparison


In [None]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Loads data splits and features created in previous notebooks

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import numpy as np

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False
    
    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
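            else:
                print(f"  git clone attempt {attempt + 1}/{max_retries} failed: {result.stderr.strip()}")
        except Exception as e:
            print(f"  git clone attempt {attempt + 1}/{max_retries} raised {type(e).__name__}: {e}")
        # Remove any partial clone so a retry or the ZIP fallback starts from a clean directory
        shutil.rmtree(repo_dir, ignore_errors=True)
        if attempt < max_retries - 1:
            time.sleep(3)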
    
    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}")

# Mount Google Drive (if not already mounted)
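try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    # Report the problem instead of hiding it; later loads from DATA_PATH fail if Drive is unavailable
    print(f"Warning: Google Drive mount raised {type(e).__name__}: {e}")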

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.features.fusion import fuse_attention_features
    from src.models.trainer import train_and_evaluate
    from src.models.classifiers import get_classifier_dict
except ImportError as e:
    raise ImportError(
        "Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    ) from e

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Load data splits for label extraction
train_ds = storage.load_split('train')
dev_ds = storage.load_split('dev')

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print(f"  Train samples: {len(train_ds)}")
print(f"  Dev samples: {len(dev_ds)}")


In [None]:
# ============================================================================
# CONFIGURE MODELS, TASKS, AND CLASSIFIERS
# ============================================================================
# Defines the models to fuse, tasks to perform, and classifiers to train
# Label mappings are defined for clarity (3-class) and evasion (9-class) tasks

MODELS = ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
TASKS = ['clarity', 'evasion']

# Label mappings for each task
CLARITY_LABELS = ['Clear Reply', 'Ambiguous', 'Clear Non-Reply']
EVASION_LABELS = ['Direct Answer', 'Partial Answer', 'Implicit Answer', 
                  'Uncertainty', 'Refusal', 'Clarification', 
                  'Question', 'Topic Shift', 'Other']

# Initialize classifiers with fixed random seed for reproducibility
classifiers = get_classifier_dict(random_state=42)

print("Configuration:")
print(f"  Models to fuse: {MODELS}")
print(f"  Tasks: {TASKS}")
print(f"  Classifiers: {list(classifiers.keys())}")
print("  Fusion method: Early fusion (feature concatenation)")


In [None]:
# ============================================================================
# PERFORM EARLY FUSION AND TRAIN CLASSIFIERS
# ============================================================================
# For each task, loads features from all models, concatenates them horizontally,
# trains classifiers on fused features, and evaluates on Dev set

for task in TASKS:
    print(f"\n{'='*80}")
    print(f"TASK: {task.upper()} - EARLY FUSION")
    print(f"{'='*80}")
    
    # Select appropriate label list and dataset key for this task
    if task == 'clarity':
        label_list = CLARITY_LABELS
        label_key = 'clarity_label'
    else:  # evasion
        label_list = EVASION_LABELS
        label_key = 'evasion_label'
    
    # Load features from all models
    print("Loading features from all models...")
    model_features = {}
    model_feature_names = {}
    
    for model in MODELS:
        X_train = storage.load_features(model, task, 'train')
        X_dev = storage.load_features(model, task, 'dev')
        
        # Get feature names from metadata for proper labeling
        meta = storage.load_metadata(model, task, 'train')
        feature_names = meta['feature_names']
        
        model_features[model] = {
            'train': X_train,
            'dev': X_dev
        }
        model_feature_names[model] = feature_names
        
        print(f"  {model}: {X_train.shape[1]} features")
    
    # Fuse features by horizontal concatenation
    # Fused feature vector = [bert | bert_political | bert_ambiguity | roberta | deberta | xlnet]
    print("\nFusing features (concatenation)...")
    X_train_fused, fused_feature_names = fuse_attention_features(
        {model: model_features[model]['train'] for model in MODELS},
        model_feature_names
    )
    X_dev_fused, _ = fuse_attention_features(
        {model: model_features[model]['dev'] for model in MODELS},
        model_feature_names
    )
    
    print(f"  Fused features: {X_train_fused.shape[1]} features (sum of all models)")
    print(f"  Train: {X_train_fused.shape[0]} samples")
    print(f"  Dev: {X_dev_fused.shape[0]} samples")
    
    # Save fused features to persistent storage
    storage.save_fused_features(
        X_train_fused, MODELS, task, 'train',
        fused_feature_names, fusion_method='concat'
    )
    storage.save_fused_features(
        X_dev_fused, MODELS, task, 'dev',
        fused_feature_names, fusion_method='concat'
    )
    
    # Extract labels from dataset splits
    y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
    y_dev = np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])
    
    # Train all classifiers on fused features and evaluate on Dev set
    print("\nTraining classifiers on fused features...")
    results = train_and_evaluate(
        X_train_fused, y_train, X_dev_fused, y_dev,
        label_list=label_list,
        task_name=f"early_fusion_{task}",
        classifiers=classifiers,
        random_state=42,
        print_report=True,      # Print detailed classification report
        print_table=True,       # Print results comparison table
        create_plots=True,      # Generate confusion matrices and PR/ROC curves
        save_plots_dir=str(DATA_PATH / 'plots' / 'early_fusion')
    )
    
    # Save predictions and probabilities to persistent storage
    for classifier_name, result in results.items():
        # Save hard label predictions
        storage.save_predictions(
            result['dev_pred'],
            'fused', classifier_name, task, 'dev'
        )
        
        # Save probability distributions (if classifier supports it)
        if result['dev_proba'] is not None:
            storage.save_probabilities(
                result['dev_proba'],
                'fused', classifier_name, task, 'dev'
            )
    
    # Save results summary to metadata for comparison with individual models
    experiment_id = f"early_fusion_{task}"
    storage.save_results({
        'fusion_method': 'early_concat',
        'models': MODELS,
        'task': task,
        'results': {
            name: {
                'metrics': res['metrics'],
                'n_train': len(y_train),
                'n_dev': len(y_dev)
            }
            for name, res in results.items()
        }
    }, experiment_id)

print(f"\n{'='*80}")
print("Early fusion complete for all tasks")
print(f"{'='*80}")
print("\nSummary:")
print("  - Features from all models concatenated")
print("  - Classifiers trained and evaluated on fused features")
print("  - Predictions and probabilities saved to Google Drive")
print("  - Results tables and plots generated")
print("  - Compare with individual model results from 03_train_evaluate.ipynb")
