# Ablation Study: Feature Selection and Ranking

================================================================================
PURPOSE: Comprehensive feature ablation study to identify optimal feature subsets
================================================================================

This notebook runs ablation studies on the 19 Context Tree features to identify
which features matter most for each model×task combination.
The goal is to select the feature subsets that maximize Macro F1.
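
Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. The following is a minimal pure-Python sketch of that definition for illustration only; the notebook itself imports `f1_score` from scikit-learn, where the equivalent call is `f1_score(y_true, y_pred, average="macro")`. The integer labels below are placeholders, not the task's real label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Placeholder labels: class 0 has F1 = 2/3, class 1 has F1 = 4/5, class 2 has F1 = 1
score = macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(score, 4))  # 0.8222
```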

**Workflow:**
1. Load features and labels from persistent storage (saved by 03_train_evaluate.ipynb)
2. Single-Feature Ablation: Evaluate each feature individually across all models and classifiers
3. Global Feature Ranking: Aggregate the single-feature results across models and classifiers to rank features by importance
4. Top-K Seed Features: Take the K highest-ranked features as seeds for greedy selection
5. Greedy Forward Selection: At each step, add the feature that most improves Macro F1
6. Save selected feature sets for use in subsequent notebooks
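
Steps 3 and 4 can be sketched as follows. This assumes the ranking is the mean single-feature Macro F1 across all model×classifier runs, which is one plausible aggregation and not necessarily the one used later in this notebook. The row keys (`feature`, `model`, `classifier`, `macro_f1`), the feature names, and the scores are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

def rank_features(rows):
    """Rank features by mean Macro F1 over all runs (ties broken by name)."""
    by_feature = defaultdict(list)
    for row in rows:
        by_feature[row["feature"]].append(row["macro_f1"])
    ranking = [(name, mean(scores)) for name, scores in by_feature.items()]
    ranking.sort(key=lambda item: (-item[1], item[0]))
    return ranking

def top_k(ranking, k):
    """Names of the k best-ranked features, used as seeds for greedy selection."""
    return [name for name, _ in ranking[:k]]

# Placeholder single-feature results (illustrative values, not real measurements)
rows = [
    {"feature": "f_a", "model": "bert", "classifier": "clf_1", "macro_f1": 0.40},
    {"feature": "f_a", "model": "roberta", "classifier": "clf_1", "macro_f1": 0.50},
    {"feature": "f_b", "model": "bert", "classifier": "clf_1", "macro_f1": 0.60},
    {"feature": "f_b", "model": "roberta", "classifier": "clf_1", "macro_f1": 0.50},
    {"feature": "f_c", "model": "bert", "classifier": "clf_1", "macro_f1": 0.30},
    {"feature": "f_c", "model": "roberta", "classifier": "clf_1", "macro_f1": 0.30},
]
ranking = rank_features(rows)
seeds = top_k(ranking, 2)
print(seeds)  # ['f_b', 'f_a']
```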

**Methods:**
- Single-Feature Ablation: Test each of 19 features individually
- Global Feature Ranking: Aggregate across all model×classifier combinations
- Greedy Forward Selection: Iteratively add features that maximize Macro F1
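
A minimal sketch of greedy forward selection, assuming a scoring callable that trains on a feature subset and returns the dev-set Macro F1. Here `score_fn`, `seeds`, and `min_gain` are illustrative names, and the toy scorer stands in for real training; the notebook's own implementation may stop or break ties differently.

```python
def greedy_forward_selection(n_features, score_fn, seeds=(), min_gain=0.0):
    """Add one feature per round, keeping the one that raises the score most."""
    selected = list(seeds)
    best_score = score_fn(selected) if selected else float("-inf")
    trajectory = []
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        # Score every one-feature extension of the current subset
        scored = [(score_fn(selected + [j]), j) for j in candidates]
        # Highest score wins; ties go to the lower feature index
        top_score, top_j = max(scored, key=lambda pair: (pair[0], -pair[1]))
        if top_score - best_score <= min_gain:
            break  # no remaining feature improves the score enough
        selected.append(top_j)
        best_score = top_score
        trajectory.append((tuple(selected), best_score))
    return selected, trajectory

# Toy stand-in for "train on this subset, return dev Macro F1" (not real results)
TOY_WEIGHTS = {0: 0.10, 1: 0.30, 2: 0.20, 3: 0.00}

def toy_score(subset):
    # Each feature contributes its weight; a small per-feature penalty makes
    # the uninformative feature 3 a net loss.
    return sum(TOY_WEIGHTS[j] for j in subset) - 0.05 * len(subset)

selected, trajectory = greedy_forward_selection(4, toy_score)
print(selected)  # [1, 2, 0]
```

The recorded `trajectory` holds the subset and score after each accepted step, which corresponds to what a per-step trajectory file would store.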

**Output:**
- Feature rankings and ablation results saved to Google Drive
- Selected feature sets saved for each model×task combination
- Summary tables and visualizations

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From Google Drive:**
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (bert, roberta, deberta, xlnet, bert_political, bert_ambiguity)
  - For each task (clarity, evasion)
  - For Train and Dev splits
- Dataset splits: `splits/dataset_splits_{task}.pkl`
  - For label extraction

**From GitHub:**
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
  - Contains feature names (19 features)
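
The input naming pattern above implies one feature matrix per split×model×task combination. The sketch below enumerates the expected file paths; it assumes the matrices live under the Drive data directory used in the setup cell and that split names appear in lowercase in file names, neither of which is confirmed here. `feature_matrix_path` is a hypothetical helper, not part of the repository.

```python
from itertools import product
from pathlib import PurePosixPath

# Assumption: matrices sit under the Drive data directory used by this notebook
DATA_ROOT = PurePosixPath("/content/drive/MyDrive/semeval_data")

MODELS = ["bert", "roberta", "deberta", "xlnet", "bert_political", "bert_ambiguity"]
TASKS = ["clarity", "evasion"]
SPLITS = ["train", "dev"]  # assumption: lowercase split names in file names

def feature_matrix_path(split, model, task):
    """Build the path features/raw/X_{split}_{model}_{task}.npy under DATA_ROOT."""
    return DATA_ROOT / "features" / "raw" / f"X_{split}_{model}_{task}.npy"

expected_files = [
    feature_matrix_path(s, m, t) for s, m, t in product(SPLITS, MODELS, TASKS)
]
print(len(expected_files))  # 2 splits x 6 models x 2 tasks = 24
```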

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Ablation results: `results/ablation/single_feature_{model}_{task}.csv`
- Feature rankings: `results/ablation/feature_ranking_{model}_{task}.csv`
- Selected features: `results/ablation/selected_features_{model}_{task}.json`
- Greedy trajectories: `results/ablation/greedy_trajectory_{model}_{task}.csv`

**To GitHub:**
- Ablation metadata: `results/ablation_metadata_{model}_{task}.json`

**What gets passed to the next notebook:**
- Selected feature indices for each model×task combination
- Feature rankings for analysis
- Optimal feature subsets for Early Fusion experiments
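
As an illustration of the hand-off, selected feature indices could be written to and read back from a JSON file named after the pattern above. This is only a sketch: the payload keys (`selected_indices`, `selected_names`), the helper functions, and the example values are hypothetical, and the notebook may store this data through `StorageManager` with a different schema.

```python
import json
import tempfile
from pathlib import Path

def save_selected_features(out_dir, model, task, indices, names):
    """Write selected_features_{model}_{task}.json and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"selected_features_{model}_{task}.json"
    payload = {
        "model": model,
        "task": task,
        "selected_indices": list(indices),  # hypothetical key
        "selected_names": list(names),      # hypothetical key
    }
    path.write_text(json.dumps(payload, indent=2))
    return path

def load_selected_indices(path):
    """Read back the column indices a later notebook would slice with."""
    return json.loads(Path(path).read_text())["selected_indices"]

# Round trip in a temporary directory with placeholder values
with tempfile.TemporaryDirectory() as tmp:
    saved = save_selected_features(tmp, "bert", "clarity", [4, 0, 7], ["f4", "f0", "f7"])
    saved_name = saved.name
    restored = load_selected_indices(saved)

print(saved_name, restored)
```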


# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================


In [None]:
import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import numpy as np
import pandas as pd

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False
    
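    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            # Report why git failed instead of retrying silently
            print(f"git clone attempt {attempt + 1} failed: {result.stderr.strip()}")
        except Exception as e:
            print(f"git clone attempt {attempt + 1} raised: {e}")
        # Remove any partial checkout so a retry or the ZIP fallback starts clean
        shutil.rmtree(repo_dir, ignore_errors=True)
        if attempt < max_retries - 1:
            time.sleep(3)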
    
    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
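            extracted_dir = '/content/semeval-context-tree-modular-main'
            # Fail loudly if the archive did not unpack to the expected folder
            if not os.path.exists(extracted_dir):
                raise FileNotFoundError(f"Extracted directory not found: {extracted_dir}")
            os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}") from e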

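# Mount Google Drive (drive.mount only prints a notice if already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    print(f"Warning: drive.mount raised: {e}")

# Fail early rather than silently writing results to ephemeral local storage
if not os.path.isdir('/content/drive/MyDrive'):
    raise RuntimeError("Google Drive is not mounted at /content/drive")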

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.models.classifiers import get_classifier_dict
    from sklearn.metrics import f1_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.base import clone
except ImportError as e:
    # Chain the original error so the failing module stays visible in the traceback
    raise ImportError(
        "Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    ) from e

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Create ablation results directory
ablation_dir = DATA_PATH / 'results' / 'ablation'
ablation_dir.mkdir(parents=True, exist_ok=True)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print(f"  Ablation results: {ablation_dir}")
