<div style="font-size: 10px;">

# Ablation Study: Feature Selection and Ranking

================================================================================
PURPOSE: Comprehensive feature ablation study to identify optimal feature subsets
================================================================================

This notebook performs comprehensive ablation studies on Context Tree features
to identify the most important features for each model and task combination.
The goal is to maximize Macro F1-score by selecting optimal feature subsets.

**Scope:**
- **3 Tasks**: Clarity, Evasion, and Hierarchical Evasion → Clarity
- **6 Models**: bert, bert_political, bert_ambiguity, roberta, deberta, xlnet
- **6 Classifiers**: LogisticRegression, LinearSVC, RandomForest, XGBoost, LightGBM, MLP
- **Total Combinations**: 6 models × 6 classifiers = 36 combinations per task

**Workflow:**
1. Load features and labels from persistent storage (saved by 03_train_evaluate.ipynb)
2. Single-Feature Ablation: Evaluate each feature individually across all 36 model×classifier combinations
3. Statistical Aggregation: Calculate min, median, mean, std, best (max), and runs (count) for each feature
4. Weighted Score Calculation: Compute weighted score combining multiple statistics
5. Feature Ranking: Rank features by weighted score (separately for each task)
6. Top-K Feature Selection: Identify top-performing features for greedy selection
7. Greedy Forward Selection: Iteratively add best features to maximize Macro F1
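
Step 7 can be sketched as a generic loop. This is a minimal sketch under stated assumptions, not the notebook's implementation: `greedy_forward_selection`, `score_fn`, `min_gain`, and the toy scorer are hypothetical names introduced for illustration. In practice `score_fn` would train a pipeline on the candidate subset and return its dev-set Macro F1.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_forward_selection(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Add one feature per round, keeping the addition that scores best."""
    selected: List[str] = []
    trajectory: List[Tuple[str, float]] = []
    best_score = float("-inf")
    remaining = list(candidates)

    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(score_fn(selected + [name]), name) for name in remaining]
        round_score, round_feature = max(scored, key=lambda pair: pair[0])
        # Stop once the best extension no longer improves the score.
        if round_score - best_score <= min_gain:
            break
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score
        trajectory.append((round_feature, round_score))
    return selected, trajectory


# Toy scorer (hypothetical): each feature has a fixed value and every added
# feature costs 0.05, so weak features are not worth including.
TOY_VALUE = {"a": 0.30, "b": 0.20, "c": 0.02}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_VALUE[name] for name in subset) - 0.05 * len(subset)


selected, trajectory = greedy_forward_selection(["a", "b", "c"], toy_score)
print(selected)
print(trajectory)
```

The trajectory records the score after each accepted addition, which is the kind of information a per-model greedy trajectory file would hold.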

**Statistical Metrics Computed:**
- **min_f1**: Minimum Macro F1 across all 36 combinations (worst-case performance)
- **median_f1**: Median Macro F1 across all 36 combinations (typical performance)
- **mean_f1**: Mean Macro F1 across all 36 combinations (average performance)
- **std_f1**: Standard deviation of Macro F1 across all 36 combinations (consistency measure)
- **best_f1**: Maximum Macro F1 across all 36 combinations (best-case performance)
- **runs**: Number of evaluations (should be 36 for complete data)
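
The statistics above can be sketched with the standard library alone. The rows below are made-up numbers in the shape of the ablation records (task, feature, macro_f1), and `summarize` is a hypothetical helper. The sketch assumes the sample standard deviation; the notebook may use a different convention.

```python
import statistics
from collections import defaultdict

# Illustrative rows only; real runs would have 36 rows per (task, feature).
rows = [
    {"task": "clarity", "feature": "answer_word_count", "macro_f1": 0.30},
    {"task": "clarity", "feature": "answer_word_count", "macro_f1": 0.40},
    {"task": "clarity", "feature": "answer_word_count", "macro_f1": 0.50},
    {"task": "clarity", "feature": "answer_hedge_ratio", "macro_f1": 0.20},
    {"task": "clarity", "feature": "answer_hedge_ratio", "macro_f1": 0.20},
]


def summarize(rows):
    """Group Macro F1 scores by (task, feature) and compute summary statistics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["task"], row["feature"])].append(row["macro_f1"])

    summary = {}
    for key, scores in grouped.items():
        summary[key] = {
            "min_f1": min(scores),
            "median_f1": statistics.median(scores),
            "mean_f1": statistics.fmean(scores),
            # Sample standard deviation needs at least two runs.
            "std_f1": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "best_f1": max(scores),
            "runs": len(scores),
        }
    return summary


summary = summarize(rows)
for key, stats in summary.items():
    print(key, stats)
```

A `runs` value below 36 for any feature would indicate that some model × classifier evaluations were skipped or failed.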

**Weighted Score Formula:**
The weighted score combines multiple statistics to balance average performance, consistency, and peak performance:
```
weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
```
where normalized_std = std_f1 / (mean_f1 + epsilon) to account for scale differences.

Features are ranked by weighted_score in descending order.
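
As a worked illustration of the formula, the sketch below scores two made-up features with the same mean but different spread. The value of `EPSILON` is an assumption (the text only names "epsilon"), and the feature names and statistics are hypothetical.

```python
EPSILON = 1e-8  # assumed value; only the name "epsilon" is given above


def weighted_score(mean_f1, best_f1, std_f1, epsilon=EPSILON):
    """0.5*mean + 0.3*best + 0.2*(1 - std/(mean + epsilon))."""
    normalized_std = std_f1 / (mean_f1 + epsilon)
    return 0.5 * mean_f1 + 0.3 * best_f1 + 0.2 * (1.0 - normalized_std)


# (mean_f1, best_f1, std_f1) for two hypothetical features.
features = {
    "steady": (0.40, 0.45, 0.02),
    "spiky": (0.40, 0.55, 0.20),
}

scores = {name: weighted_score(*stats) for name, stats in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

Here the consistent feature ranks first even though its best score is lower, because the 0.2 weight on `1 - normalized_std` penalizes the large spread of the other feature.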

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From Google Drive:**
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (6 models)
  - For each task (clarity, evasion)
  - For Train and Dev splits
- Predictions: `predictions/pred_dev_{model}_{classifier}_hierarchical_evasion_to_clarity.npy`
  - For hierarchical task evaluation
- Dataset splits: `splits/dataset_splits_{task}.pkl`
  - For label extraction
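
To make the feature-matrix naming pattern concrete, the sketch below enumerates the files it implies. It assumes the `features/raw` directory sits under the Drive data root used in the setup cell and that split names are lowercase in file names; `expected_feature_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import PurePosixPath

# Data root as configured in the setup cell (DATA_PATH).
DATA_ROOT = PurePosixPath("/content/drive/MyDrive/semeval_data")

MODELS = ["bert", "bert_political", "bert_ambiguity", "roberta", "deberta", "xlnet"]
FEATURE_TASKS = ["clarity", "evasion"]
SPLITS = ["train", "dev"]  # assumption: lowercase split names in file names


def expected_feature_files(root=DATA_ROOT):
    """List every X_{split}_{model}_{task}.npy path implied by the pattern."""
    return [
        root / "features" / "raw" / f"X_{split}_{model}_{task}.npy"
        for model in MODELS
        for task in FEATURE_TASKS
        for split in SPLITS
    ]


paths = expected_feature_files()
print(len(paths))  # 6 models x 2 tasks x 2 splits
print(paths[0])
```

Checking this list against the files actually present on Drive is a quick way to spot missing inputs before starting a long run.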

**From GitHub:**
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
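  - Contains feature names (19 features); the single-feature ablation cell reads the names from `get_feature_names()` instead, so it does not require these files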

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Ablation results: `results/ablation/single_feature_{task}.csv`
- Feature rankings: `results/ablation/feature_ranking_{task}.csv`
  - Includes all statistics: min, median, mean, std, best, runs, weighted_score
- Selected features: `results/ablation/selected_features_{task}.json`
- Greedy trajectories: `results/ablation/greedy_trajectory_{model}_{task}.csv`

**To GitHub:**
- Ablation metadata: `results/ablation_metadata_{task}.json`

**What gets passed to next notebook:**
- Feature rankings for each task (clarity, evasion, hierarchical)
- Selected feature sets for greedy selection
- Comprehensive statistical analysis of feature importance

</div>

<div style="font-size: 10px;">

# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
#
# This cell performs the initial setup required for the notebook to run:
# 1. Clones the repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Creates necessary directories for ablation results
#
# The StorageManager handles loading features, predictions, and dataset splits
# from Google Drive, and saving results back to Drive and GitHub.
#
# No modifications are needed in this cell unless you change repository URLs
# or data storage paths.

</div>

In [6]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import numpy as np
import pandas as pd

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False

    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            else:
                if attempt < max_retries - 1:
                    time.sleep(3)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(3)

    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}")

# Mount Google Drive (if not already mounted)
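try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    # drive.mount() only prints a notice when Drive is already mounted, so an
    # exception here signals a real problem; report it instead of hiding it.
    print(f"Warning: could not mount Google Drive: {e}")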

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.models.classifiers import get_classifier_dict
    from src.features.extraction import get_feature_names
    from sklearn.metrics import f1_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.base import clone
except ImportError as e:
    raise ImportError(
        f"Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    )

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Create ablation results directory
ablation_dir = DATA_PATH / 'results' / 'ablation'
ablation_dir.mkdir(parents=True, exist_ok=True)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print(f"  Ablation results: {ablation_dir}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Setup complete
  Repository: /content/semeval-context-tree-modular
  Data storage: /content/drive/MyDrive/semeval_data
  Ablation results: /content/drive/MyDrive/semeval_data/results/ablation


In [7]:
# ============================================================================
# CONFIGURE MODELS, TASKS, AND CLASSIFIERS
# ============================================================================
# This cell defines the models, tasks, and classifiers to be used in the ablation study.
# All combinations (6 models × 6 classifiers = 36) will be evaluated for each task.
# Three tasks are included: Clarity, Evasion, and Hierarchical Evasion → Clarity.

# Check if get_classifier_dict is imported (from Cell 1 - Setup)
if 'get_classifier_dict' not in globals():
    raise NameError(
        "get_classifier_dict not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 imports get_classifier_dict from src.models.classifiers."
    )

MODELS = ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
TASKS = ['clarity', 'evasion', 'hierarchical_evasion_to_clarity']  # 3 tasks

# Label mappings for each task
CLARITY_LABELS = ['Ambivalent', 'Clear Non-Reply', 'Clear Reply']
EVASION_LABELS = ['Claims ignorance', 'Clarification', 'Declining to answer',
                  'Deflection', 'Dodging', 'Explicit',
                  'General', 'Implicit', 'Partial/half-answer']

# Initialize classifiers with fixed random seed for reproducibility
# The returned dict includes an MLP (Multi-Layer Perceptron) classifier
classifiers = get_classifier_dict(random_state=42)

print("="*80)
print("CONFIGURATION")
print("="*80)
print(f"  Models: {len(MODELS)} models")
print(f"    {MODELS}")
print(f"  Tasks: {len(TASKS)} tasks")
print(f"    {TASKS}")
print(f"  Classifiers: {len(classifiers)} classifiers")
print(f"    {list(classifiers.keys())}")
print(f"  Total combinations per task: {len(MODELS)} × {len(classifiers)} = {len(MODELS) * len(classifiers)}")
print(f"  Evaluation set: Dev set (not test)")
print("="*80)


CONFIGURATION
  Models: 6 models
    ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
  Tasks: 3 tasks
    ['clarity', 'evasion', 'hierarchical_evasion_to_clarity']
  Classifiers: 6 classifiers
    ['LogisticRegression', 'LinearSVC', 'RandomForest', 'MLP', 'XGBoost', 'LightGBM']
  Total combinations per task: 6 × 6 = 36
  Evaluation set: Dev set (not test)
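
The 36 combinations per task are the Cartesian product of the model and classifier lists. The sketch below enumerates them and derives the total number of single-feature evaluations; the lists are copied from the configuration output above, and `N_FEATURES` reflects the 19 Context Tree features.

```python
from itertools import product

MODELS = ["bert", "bert_political", "bert_ambiguity", "roberta", "deberta", "xlnet"]
CLASSIFIERS = ["LogisticRegression", "LinearSVC", "RandomForest",
               "MLP", "XGBoost", "LightGBM"]
TASKS = ["clarity", "evasion", "hierarchical_evasion_to_clarity"]
N_FEATURES = 19

# Every (model, classifier) pair evaluated for a single task.
combinations = list(product(MODELS, CLASSIFIERS))

# One evaluation per task, per combination, per feature.
total_evaluations = len(TASKS) * len(combinations) * N_FEATURES

print(len(combinations))
print(total_evaluations)
```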



<div style="font-size: 10px;">

# ============================================================================
# SINGLE-FEATURE ABLATION STUDY
# ============================================================================
#
# This section performs single-feature ablation: evaluates each of the 19 Context Tree
# features individually across all model×classifier combinations.
#
# **What this cell does:**
# - For each task (clarity, evasion, hierarchical_evasion_to_clarity)
# - For each model (6 models)
# - For each classifier (6 classifiers, including MLP)
# - For each feature (19 features)
# - Trains a classifier using only that single feature
# - Evaluates Macro F1 on the dev set
#
# **Total evaluations:** 3 tasks × 6 models × 6 classifiers × 19 features = 2,052 evaluations
#
# **Expected runtime:** 15-30 minutes depending on hardware
#
# **Output:** List of ablation results with model, task, classifier, feature, and macro_f1

</div>
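
Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The notebook computes it with scikit-learn's `f1_score(..., average='macro')`; the pure-Python sketch below only illustrates the definition, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Tiny made-up example using two of the clarity labels.
y_true = ["Clear Reply", "Clear Reply", "Ambivalent", "Ambivalent"]
y_pred = ["Clear Reply", "Ambivalent", "Ambivalent", "Ambivalent"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this example "Clear Reply" has F1 = 2/3 and "Ambivalent" has F1 = 4/5, so the macro average is 11/15.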

In [8]:
# ============================================================================
# SINGLE-FEATURE ABLATION STUDY
# ============================================================================
# This cell performs single-feature ablation: evaluates each of the 19 Context Tree
# features individually across all model×classifier combinations (36 combinations per task).
#
# For each feature, we train a classifier using only that feature and evaluate on the
# dev set. This helps identify which features are most informative for each task.
#
# Process:
# 1. For each task (clarity, evasion, hierarchical_evasion_to_clarity)
# 2. For each model (6 models)
# 3. For each classifier (6 classifiers, including MLP)
# 4. For each feature (19 features)
# 5. Train classifier on single feature and evaluate Macro F1 on dev set
#
# Total evaluations: 3 tasks × 6 models × 6 classifiers × 19 features = 2,052 evaluations

def eval_single_feature(X_train, X_dev, y_train, y_dev, feature_idx, clf):
    """
    Evaluate a single feature using a classifier.

    This function trains a classifier using only one feature and evaluates its
    performance on the dev set. StandardScaler is applied to normalize the
    single feature before classification.

    Args:
        X_train: Training feature matrix (N, F) where F is total number of features
        X_dev: Dev feature matrix (M, F)
        y_train: Training labels (N,)
        y_dev: Dev labels (M,)
        feature_idx: Index of the feature to evaluate (0 to F-1)
        clf: Classifier instance (will be cloned to avoid state issues)

    Returns:
        Macro F1 score on dev set (float)
    """
    # Encode labels to numeric (required for MLP, XGBoost, LightGBM)
    # This matches the approach in siparismaili01 notebook
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)
    y_dev_encoded = label_encoder.transform(y_dev)

    # Select only the specified feature (single column)
    X_train_f = X_train[:, [feature_idx]]
    X_dev_f = X_dev[:, [feature_idx]]

    # Pipeline with scaling (critical for single features to work properly)
    # StandardScaler normalizes the feature to have zero mean and unit variance
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", clone(clf))  # Clone to avoid modifying the original classifier
    ])

    # Train on single feature and evaluate on dev set
    pipe.fit(X_train_f, y_train_encoded)
    pred = pipe.predict(X_dev_f)
    macro_f1 = f1_score(y_dev_encoded, pred, average='macro')

    return macro_f1

# Check if required variables are defined (from Cell 2 - Configuration)
if 'TASKS' not in globals() or 'MODELS' not in globals() or 'classifiers' not in globals():
    raise NameError(
        "Required variables not defined. Please run Cell 2 (Configuration) first.\n"
        "Cell 2 defines: TASKS, MODELS, CLARITY_LABELS, EVASION_LABELS, and classifiers."
    )

# Check if storage is defined (from Cell 1 - Setup)
if 'storage' not in globals():
    raise NameError(
        "storage not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 initializes StorageManager as 'storage'."
    )

print("="*80)
print("SINGLE-FEATURE ABLATION STUDY")
print("="*80)
print("Evaluating each feature individually across all model×task×classifier combinations")
print(f"Total evaluations: {len(TASKS)} tasks × {len(MODELS)} models × {len(classifiers)} classifiers × 19 features")
print("This may take 15-30 minutes depending on your hardware...\n")

# Store all ablation results
# Each entry contains: model, task, classifier, feature, feature_idx, macro_f1
ablation_results = []

for task in TASKS:
    print(f"\n{'='*80}")
    print(f"TASK: {task.upper()}")
    print(f"{'='*80}")

    # Select appropriate label list and dataset key based on task
    if task == 'clarity':
        label_list = CLARITY_LABELS
        label_key = 'clarity_label'
        task_for_split = 'clarity'
    elif task == 'evasion':
        label_list = EVASION_LABELS
        label_key = 'evasion_label'
        task_for_split = 'evasion'
    else:  # hierarchical_evasion_to_clarity
        # For hierarchical task, we need to load evasion dev set to get clarity labels
        # (hierarchical uses evasion predictions mapped to clarity labels)
        label_list = CLARITY_LABELS
        label_key = 'clarity_label'
        # We'll load from evasion dev set (same filtered samples)
        task_for_split = 'evasion'



    # Load task-specific splits
    # For hierarchical task, we load evasion split (which has clarity labels)
    train_ds = storage.load_split('train', task=task_for_split)
    dev_ds = storage.load_split('dev', task=task_for_split)

    # Extract labels
    y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
    y_dev = np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])

    print(f"  Train: {len(y_train)} samples")
    print(f"  Dev: {len(y_dev)} samples")

    # Get feature names directly from extraction module (same for all models)
    # This avoids dependency on metadata files in GitHub
    # Feature names are the same across all models (19 Context Tree features)
    feature_names = get_feature_names()
    n_features = len(feature_names)

    print(f"  Features: {n_features} features")
    print(f"  Feature names: {feature_names}\n")

    # For hierarchical task, we need to use evasion features
    # (hierarchical approach uses evasion predictions, so we evaluate on evasion features)
    feature_task = 'evasion' if task == 'hierarchical_evasion_to_clarity' else task

    # For each model
    for model in MODELS:
        print(f"  Model: {model}")

        # Load features
        # For hierarchical task, use evasion features (since we're evaluating
        # how well evasion features predict clarity via hierarchical mapping)
        try:
            X_train = storage.load_features(model, feature_task, 'train')
            X_dev = storage.load_features(model, feature_task, 'dev')
        except FileNotFoundError:
            print(f"    ⚠ Features not found for {model} × {feature_task}, skipping...")
            continue

        # Verify feature count matches
        if X_train.shape[1] != n_features:
            print(f"    ⚠ Feature count mismatch: expected {n_features}, got {X_train.shape[1]}, skipping...")
            continue

        # For each classifier
        for clf_name, clf in classifiers.items():
            print(f"    Classifier: {clf_name}")

            # Evaluate each feature individually
            for feature_idx, feature_name in enumerate(feature_names):
                try:
                    macro_f1 = eval_single_feature(
                        X_train, X_dev,
                        y_train, y_dev,
                        feature_idx, clf
                    )

                    ablation_results.append({
                        'model': model,
                        'task': task,
                        'classifier': clf_name,
                        'feature': feature_name,
                        'feature_idx': feature_idx,
                        'macro_f1': float(macro_f1)
                    })
                except Exception as e:
                    print(f"      ⚠ Error evaluating feature {feature_name}: {e}")
                    continue

print(f"\n{'='*80}")
print("SINGLE-FEATURE ABLATION COMPLETE")
print(f"{'='*80}")
print(f"Total evaluations completed: {len(ablation_results)}")
print(f"Expected: {len(TASKS)} tasks × {len(MODELS)} models × {len(classifiers)} classifiers × {n_features} features = {len(TASKS) * len(MODELS) * len(classifiers) * n_features}")


SINGLE-FEATURE ABLATION STUDY
Evaluating each feature individually across all model×task×classifier combinations
Total evaluations: 3 tasks × 6 models × 6 classifiers × 19 features
This may take 15-30 minutes depending on your hardware...


TASK: CLARITY



  Train: 2758 samples
  Dev: 690 samples
  Features: 19 features
  Feature names: ['question_model_token_count', 'answer_model_token_count', 'attention_mass_q_to_a_per_qtoken', 'attention_mass_a_to_q_per_atoken', 'focus_token_to_answer_strength', 'answer_token_to_focus_strength', 'focus_token_coverage_ratio', 'tfidf_cosine_similarity_q_a', 'content_word_jaccard_q_a', 'question_content_coverage_in_answer', 'answer_content_word_ratio', 'answer_digit_groups_per_word', 'refusal_pattern_match_count', 'clarification_pattern_match_count', 'answer_question_mark_count', 'answer_word_count', 'answer_is_short_question', 'answer_negation_ratio', 'answer_hedge_ratio']

  Model: bert
    Classifier: LogisticRegression
    Classifier: LinearSVC
    Classifier: RandomForest
    Classifier: MLP
    Classifier: XGBoost
    Classifier: LightGBM
  Model: bert_political
    Classifier: LogisticRegression
    Classifier: LinearSVC
    Classifier: RandomForest
    Classifier: MLP
    Classifier: XGBoost
    

<div style="font-size: 10px;">

# ============================================================================
# FEATURE RANKING AND STATISTICAL ANALYSIS
# ============================================================================
#
# This section performs comprehensive statistical analysis of the ablation results
# and ranks features by a weighted score. This is a GLOBAL analysis that aggregates
# results across all 36 model×classifier combinations to identify features that work
# well across different models and classifiers.
#
# **What this cell does:**
# 1. Aggregates results across all 36 model×classifier combinations for each feature
# 2. Computes statistics: min, median, mean, std, best (max), and runs (count)
# 3. Calculates normalized_std = std_f1 / (mean_f1 + epsilon) (scale-normalized consistency)
# 4. Calculates weighted_score = 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)
# 5. Ranks features by weighted_score (separately for each task)
# 6. Displays top 15 features with all statistics
# 7. Saves complete rankings to CSV files
# 8. Selects top-K features for Early Fusion (to be used across all models)
#
# **Statistics computed:**
# - min_f1: Worst-case performance across 36 combinations
# - median_f1: Typical performance (robust to outliers)
# - mean_f1: Average performance
# - std_f1: Consistency measure (lower is better)
# - best_f1: Best-case performance (peak potential)
# - runs: Number of evaluations (should be 36 for complete data)
#
# **Weighted Score Formula:**
# weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
#
# This formula balances average performance (50%), peak performance (30%), and
# consistency (20%). Features with high mean, high best, and low std score highest.
#
# **Why Global Analysis?**
# We use global ranking (not model-specific) to avoid overfitting and noise accumulation.
# The selected features will be used across all models in Early Fusion, ensuring
# consistency and better generalization.
#
# **Output:**
# - Feature rankings saved separately for clarity, evasion, and hierarchical tasks
# - Top-K features selected for Early Fusion (saved to JSON)

</div>
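
Ranking per task and taking the top K can be sketched as follows. The scores are made up, `k` is a hypothetical choice (no value of K is stated here), and `top_k_per_task` is an illustrative helper rather than the notebook's code.

```python
import json
from collections import defaultdict


def top_k_per_task(scores, k):
    """Rank features within each task by score and keep the first k names."""
    by_task = defaultdict(list)
    for (task, feature), score in scores.items():
        by_task[task].append((feature, score))

    selected = {}
    for task, pairs in by_task.items():
        # Sort by score descending; break ties by feature name for determinism.
        ranked = sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
        selected[task] = [feature for feature, _ in ranked[:k]]
    return selected


# Hypothetical weighted scores keyed by (task, feature).
scores = {
    ("clarity", "answer_word_count"): 0.52,
    ("clarity", "answer_hedge_ratio"): 0.47,
    ("clarity", "answer_negation_ratio"): 0.49,
    ("evasion", "answer_word_count"): 0.31,
    ("evasion", "refusal_pattern_match_count"): 0.35,
}

selected = top_k_per_task(scores, k=2)
print(json.dumps(selected, indent=2))
```

Keeping the ranking separate per task means a feature that helps evasion but not clarity is not forced into both selections.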

In [9]:
# ============================================================================
# FEATURE RANKING AND STATISTICAL ANALYSIS
# ============================================================================
# This cell performs comprehensive statistical analysis and ranking of features
# based on the results produced by the Single-Feature Ablation cell.
#
# For each task, we aggregate results across all 36 model×classifier combinations
# and compute the following statistics for each feature:
# - min_f1: Minimum Macro F1 (worst-case performance)
# - median_f1: Median Macro F1 (typical performance)
# - mean_f1: Mean Macro F1 (average performance)
# - std_f1: Standard deviation (consistency measure - lower is better)
# - best_f1: Maximum Macro F1 (best-case performance)
# - runs: Number of evaluations (should be 36 for complete data)
#
# Features are then ranked by a weighted score that combines multiple statistics:
# weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
#
# This ranking is performed separately for each of the 3 tasks.

# Check if ablation_results exists (from Cell 4 - Single-Feature Ablation)
if 'ablation_results' not in globals():
    raise NameError(
        "ablation_results not found. Please run Cell 4 (Single-Feature Ablation) first.\n"
        "Cell 4 performs the ablation study and creates ablation_results list."
    )

# Check if storage and ablation_dir are defined (from Cell 1 - Setup)
if 'storage' not in globals():
    raise NameError(
        "storage not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 initializes StorageManager as 'storage'."
    )
if 'ablation_dir' not in globals():
    raise NameError(
        "ablation_dir not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 creates ablation_dir directory."
    )

df_ablation = pd.DataFrame(ablation_results)

if len(df_ablation) == 0:
    print("⚠ No ablation results found. Make sure Cell 4 completed successfully.")
    print("  You need to run Cell 4 (Single-Feature Ablation) first.")
else:
    print("="*80)
    print("FEATURE RANKING AND STATISTICAL ANALYSIS")
    print("="*80)
    print(f"Total ablation results: {len(df_ablation)} evaluations")
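    n_features = df_ablation['feature'].nunique()  # derive the feature count instead of hard-coding it
    print(f"Expected per task: {len(MODELS)} models × {len(classifiers)} classifiers × {n_features} features = {len(MODELS) * len(classifiers) * n_features}")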

    # Save raw ablation results for each task
    print(f"\n{'='*80}")
    print("SAVING RAW ABLATION RESULTS")
    print(f"{'='*80}")

    for task in TASKS:
        df_task = df_ablation[df_ablation['task'] == task]
        if len(df_task) > 0:
            csv_path = ablation_dir / f'single_feature_{task}.csv'
            df_task.to_csv(csv_path, index=False)
            print(f"  Saved {task}: {len(df_task)} evaluations → {csv_path}")

    # ========================================================================
    # STATISTICAL AGGREGATION AND WEIGHTED SCORE CALCULATION
    # ========================================================================
    # Aggregate results across all 36 model×classifier combinations for each feature
    # Compute comprehensive statistics and calculate weighted score for ranking

    print(f"\n{'='*80}")
    print("STATISTICAL AGGREGATION AND FEATURE RANKING")
    print(f"{'='*80}")
    print("Computing statistics across all 36 model×classifier combinations...")

    # Calculate comprehensive statistics for each feature×task combination
    # Using 'median' in addition to mean/std/min/max to get more robust statistics
    df_stats = df_ablation.groupby(['task', 'feature'])['macro_f1'].agg([
        'min',      # Minimum F1 (worst-case)
        'median',   # Median F1 (typical performance)
        'mean',     # Mean F1 (average performance)
        'std',      # Standard deviation (consistency)
        'max',      # Maximum F1 (best-case, same as best_f1)
        'count'     # Number of evaluations (should be 36)
    ]).reset_index()

    # Rename columns for clarity
    df_stats.columns = ['task', 'feature', 'min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs']

    # Calculate weighted score
    # Formula: weighted_score = 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)
    # This balances:
    # - Average performance (50% weight)
    # - Peak performance (30% weight)
    # - Consistency (20% weight, lower std = higher score)
    #
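    # Normalize std by mean to account for scale differences (coefficient of variation):
    # normalized_std = std_f1 / (mean_f1 + epsilon)
    # where epsilon prevents division by zero. When std_f1 exceeds mean_f1,
    # normalized_std is above 1 and the consistency term becomes negative.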
    EPSILON = 1e-6
    df_stats['normalized_std'] = df_stats['std_f1'] / (df_stats['mean_f1'] + EPSILON)

    # Calculate weighted score
    # Higher is better: we want high mean, high best, and low std (high 1-normalized_std)
    df_stats['weighted_score'] = (
        0.5 * df_stats['mean_f1'] +
        0.3 * df_stats['best_f1'] +
        0.2 * (1 - df_stats['normalized_std'])
    )

    # Sort by weighted_score (descending) for ranking
    # Secondary sort by mean_f1 for tie-breaking
    df_stats = df_stats.sort_values(['weighted_score', 'mean_f1'], ascending=False)

    # ========================================================================
    # DISPLAY AND SAVE RANKINGS FOR EACH TASK
    # ========================================================================

    for task in TASKS:
        print(f"\n{'='*80}")
        print(f"TASK: {task.upper()} - FEATURE RANKING")
        print(f"{'='*80}")

        df_task = df_stats[df_stats['task'] == task].copy()

        if len(df_task) == 0:
            print(f"  ⚠ No results found for task: {task}")
            continue

        # Round all numeric columns for display
        numeric_cols = ['min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs', 'normalized_std', 'weighted_score']
        df_task[numeric_cols] = df_task[numeric_cols].round(4)

        # Display top 15 features with all statistics
        print(f"\nTop 15 Features (ranked by weighted_score):")
        print("Weighted Score Formula: 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)")
        print("\nColumns:")
        print("  - min_f1: Minimum Macro F1 across 36 combinations (worst-case)")
        print("  - median_f1: Median Macro F1 (typical performance)")
        print("  - mean_f1: Mean Macro F1 (average performance)")
        print("  - std_f1: Standard deviation (lower = more consistent)")
        print("  - best_f1: Maximum Macro F1 (best-case)")
        print("  - runs: Number of evaluations (should be 36)")
        print("  - normalized_std: std_f1 / mean_f1 (scale-normalized consistency)")
        print("  - weighted_score: Combined score for ranking\n")

        display(df_task[['feature', 'min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs', 'normalized_std', 'weighted_score']].head(15))

        # Save complete ranking to CSV
        csv_path = ablation_dir / f'feature_ranking_{task}.csv'
        df_task.to_csv(csv_path, index=False)
        print(f"\n  ✓ Saved complete ranking: {csv_path}")
        print(f"    Total features ranked: {len(df_task)}")
        print(f"    Expected runs per feature: 36 (6 models × 6 classifiers)")

        # Verify data completeness
        incomplete = df_task[df_task['runs'] < 36]
        if len(incomplete) > 0:
            print(f"    ⚠ Warning: {len(incomplete)} features have incomplete data (< 36 runs)")

    # ========================================================================
    # TOP-K FEATURE SELECTION FOR EARLY FUSION
    # ========================================================================
    # Select top-K features for each task to use in Early Fusion
    # These features will be used across all models in Early Fusion

    print(f"\n{'='*80}")
    print("TOP-K FEATURE SELECTION FOR EARLY FUSION")
    print(f"{'='*80}")
    print("Selecting top-K features for each task (to be used in Early Fusion)")

    TOP_K_FEATURES = 10  # Number of top features to select for Early Fusion

    selected_features_for_fusion = {}

    for task in TASKS:
        df_task = df_stats[df_stats['task'] == task].copy()

        if len(df_task) == 0:
            print(f"  ⚠ No ranking data found for task: {task}")
            continue

        # Select top-K features by weighted_score
        top_k_features = df_task.head(TOP_K_FEATURES)['feature'].tolist()

        selected_features_for_fusion[task] = {
            'top_k': TOP_K_FEATURES,
            'features': top_k_features,
            'ranking': df_task.head(TOP_K_FEATURES)[['feature', 'weighted_score', 'mean_f1', 'best_f1', 'std_f1']].to_dict('records')
        }

        print(f"\n  {task.upper()} - Top {TOP_K_FEATURES} Features:")
        for i, feat in enumerate(top_k_features, 1):
            row = df_task[df_task['feature'] == feat].iloc[0]
            print(f"    {i:2d}. {feat}")
            print(f"        weighted_score={row['weighted_score']:.4f}, mean_f1={row['mean_f1']:.4f}, best_f1={row['best_f1']:.4f}")

    # Save selected features for Early Fusion
    import json
    fusion_features_path = ablation_dir / 'selected_features_for_early_fusion.json'
    with open(fusion_features_path, 'w') as f:
        json.dump(selected_features_for_fusion, f, indent=2)

    print(f"\n{'='*80}")
    print("FEATURE RANKING COMPLETE")
    print(f"{'='*80}")
    print("Rankings saved separately for each task:")
    for task in TASKS:
        print(f"  - {task}: {ablation_dir / f'feature_ranking_{task}.csv'}")
    print(f"\nTop-K features for Early Fusion saved:")
    print(f"  - {fusion_features_path}")
    print(f"  - Top {TOP_K_FEATURES} features per task (to be used across all models in Early Fusion)")


FEATURE RANKING AND STATISTICAL ANALYSIS
Total ablation results: 2052 evaluations
Expected per task: 6 models × 6 classifiers × 19 features = 684

SAVING RAW ABLATION RESULTS
  Saved clarity: 684 evaluations → /content/drive/MyDrive/semeval_data/results/ablation/single_feature_clarity.csv
  Saved evasion: 684 evaluations → /content/drive/MyDrive/semeval_data/results/ablation/single_feature_evasion.csv
  Saved hierarchical_evasion_to_clarity: 684 evaluations → /content/drive/MyDrive/semeval_data/results/ablation/single_feature_hierarchical_evasion_to_clarity.csv

STATISTICAL AGGREGATION AND FEATURE RANKING
Computing statistics across all 36 model×classifier combinations...

TASK: CLARITY - FEATURE RANKING

Top 15 Features (ranked by weighted_score):
Weighted Score Formula: 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)

Columns:
  - min_f1: Minimum Macro F1 across 36 combinations (worst-case)
  - median_f1: Median Macro F1 (typical performance)
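  - mean_f1: Mean Macro F1 (average performance)
  - std_f1: Standard deviation (lower = more consistent)
  - best_f1: Maximum Macro F1 (best-case)
  - runs: Number of evaluations (should be 36)
  - normalized_std: std_f1 / mean_f1 (scale-normalized consistency)
  - weighted_score: Combined score for ranking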

| index | feature | min_f1 | median_f1 | mean_f1 | std_f1 | best_f1 | runs | normalized_std | weighted_score |
|---|---|---|---|---|---|---|---|---|---|
| 10 | attention_mass_q_to_a_per_qtoken | 0.2489 | 0.3576 | 0.3524 | 0.0648 | 0.4895 | 36 | 0.1838 | 0.4863 |
| 14 | focus_token_to_answer_strength | 0.2591 | 0.3416 | 0.3488 | 0.0598 | 0.4695 | 36 | 0.1714 | 0.481 |
| 7 | answer_token_to_focus_strength | 0.2492 | 0.343 | 0.3488 | 0.0583 | 0.4641 | 36 | 0.1671 | 0.4802 |
| 4 | answer_model_token_count | 0.2496 | 0.3666 | 0.358 | 0.0547 | 0.4322 | 36 | 0.1529 | 0.4781 |
| 8 | answer_word_count | 0.2496 | 0.3516 | 0.344 | 0.047 | 0.4021 | 36 | 0.1367 | 0.4653 |
| 9 | attention_mass_a_to_q_per_atoken | 0.2496 | 0.3189 | 0.3298 | 0.0688 | 0.4699 | 36 | 0.2085 | 0.4642 |
| 5 | answer_negation_ratio | 0.2799 | 0.3361 | 0.3294 | 0.0403 | 0.395 | 36 | 0.1224 | 0.4587 |
| 15 | question_content_coverage_in_answer | 0.2661 | 0.3132 | 0.3243 | 0.0522 | 0.4246 | 36 | 0.1611 | 0.4573 |
| 0 | answer_content_word_ratio | 0.2687 | 0.3048 | 0.3205 | 0.0579 | 0.4315 | 36 | 0.1807 | 0.4536 |
| 2 | answer_hedge_ratio | 0.253 | 0.3042 | 0.3093 | 0.0523 | 0.4013 | 36 | 0.1691 | 0.4412 |



  ✓ Saved complete ranking: /content/drive/MyDrive/semeval_data/results/ablation/feature_ranking_clarity.csv
    Total features ranked: 19
    Expected runs per feature: 36 (6 models × 6 classifiers)

TASK: EVASION - FEATURE RANKING

Top 15 Features (ranked by weighted_score):
Weighted Score Formula: 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)

Columns:
  - min_f1: Minimum Macro F1 across 36 combinations (worst-case)
  - median_f1: Median Macro F1 (typical performance)
  - mean_f1: Mean Macro F1 (average performance)
  - std_f1: Standard deviation (lower = more consistent)
  - best_f1: Maximum Macro F1 (best-case)
  - runs: Number of evaluations (should be 36)
  - normalized_std: std_f1 / mean_f1 (scale-normalized consistency)
  - weighted_score: Combined score for ranking



| index | feature | min_f1 | median_f1 | mean_f1 | std_f1 | best_f1 | runs | normalized_std | weighted_score |
|---|---|---|---|---|---|---|---|---|---|
| 26 | answer_token_to_focus_strength | 0.0517 | 0.139 | 0.2507 | 0.2541 | 0.8075 | 36 | 1.0135 | 0.3649 |
| 29 | attention_mass_q_to_a_per_qtoken | 0.0842 | 0.1509 | 0.2492 | 0.252 | 0.8 | 36 | 1.0113 | 0.3623 |
| 28 | attention_mass_a_to_q_per_atoken | 0.0524 | 0.1444 | 0.2453 | 0.2528 | 0.7967 | 36 | 1.0308 | 0.3555 |
| 33 | focus_token_to_answer_strength | 0.0503 | 0.1383 | 0.2428 | 0.2593 | 0.8083 | 36 | 1.0679 | 0.3503 |
| 19 | answer_content_word_ratio | 0.06 | 0.1458 | 0.1901 | 0.1534 | 0.481 | 36 | 0.8069 | 0.278 |
| 23 | answer_model_token_count | 0.0595 | 0.1312 | 0.1649 | 0.1109 | 0.3955 | 36 | 0.6722 | 0.2667 |
| 24 | answer_negation_ratio | 0.06 | 0.1463 | 0.1777 | 0.134 | 0.425 | 36 | 0.754 | 0.2656 |
| 37 | tfidf_cosine_similarity_q_a | 0.0434 | 0.1288 | 0.1706 | 0.147 | 0.4644 | 36 | 0.8614 | 0.2523 |
| 31 | content_word_jaccard_q_a | 0.06 | 0.1301 | 0.156 | 0.1055 | 0.3465 | 36 | 0.6763 | 0.2467 |
| 21 | answer_hedge_ratio | 0.035 | 0.1343 | 0.1636 | 0.1253 | 0.3917 | 36 | 0.7659 | 0.2461 |



  ✓ Saved complete ranking: /content/drive/MyDrive/semeval_data/results/ablation/feature_ranking_evasion.csv
    Total features ranked: 19
    Expected runs per feature: 36 (6 models × 6 classifiers)

TASK: HIERARCHICAL_EVASION_TO_CLARITY - FEATURE RANKING

Top 15 Features (ranked by weighted_score):
Weighted Score Formula: 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)

Columns:
  - min_f1: Minimum Macro F1 across 36 combinations (worst-case)
  - median_f1: Median Macro F1 (typical performance)
  - mean_f1: Mean Macro F1 (average performance)
  - std_f1: Standard deviation (lower = more consistent)
  - best_f1: Maximum Macro F1 (best-case)
  - runs: Number of evaluations (should be 36)
  - normalized_std: std_f1 / mean_f1 (scale-normalized consistency)
  - weighted_score: Combined score for ranking



| index | feature | min_f1 | median_f1 | mean_f1 | std_f1 | best_f1 | runs | normalized_std | weighted_score |
|---|---|---|---|---|---|---|---|---|---|
| 45 | answer_token_to_focus_strength | 0.2269 | 0.3483 | 0.4151 | 0.2051 | 0.863 | 36 | 0.494 | 0.5676 |
| 52 | focus_token_to_answer_strength | 0.2269 | 0.3424 | 0.4159 | 0.2082 | 0.8639 | 36 | 0.5006 | 0.567 |
| 48 | attention_mass_q_to_a_per_qtoken | 0.2269 | 0.3361 | 0.4132 | 0.2052 | 0.8564 | 36 | 0.4967 | 0.5642 |
| 47 | attention_mass_a_to_q_per_atoken | 0.2269 | 0.3289 | 0.3953 | 0.2114 | 0.8565 | 36 | 0.5349 | 0.5476 |
| 42 | answer_model_token_count | 0.2269 | 0.3366 | 0.3606 | 0.1079 | 0.5564 | 36 | 0.2993 | 0.4874 |
| 43 | answer_negation_ratio | 0.2446 | 0.3514 | 0.3592 | 0.1057 | 0.5551 | 36 | 0.2943 | 0.4873 |
| 38 | answer_content_word_ratio | 0.2269 | 0.3219 | 0.3495 | 0.1313 | 0.6038 | 36 | 0.3755 | 0.4808 |
| 46 | answer_word_count | 0.2269 | 0.3211 | 0.3371 | 0.1076 | 0.5169 | 36 | 0.3191 | 0.4598 |
| 40 | answer_hedge_ratio | 0.2438 | 0.2853 | 0.318 | 0.0945 | 0.5091 | 36 | 0.297 | 0.4523 |
| 53 | question_content_coverage_in_answer | 0.2381 | 0.2874 | 0.3038 | 0.0576 | 0.4225 | 36 | 0.1897 | 0.4407 |



  ✓ Saved complete ranking: /content/drive/MyDrive/semeval_data/results/ablation/feature_ranking_hierarchical_evasion_to_clarity.csv
    Total features ranked: 19
    Expected runs per feature: 36 (6 models × 6 classifiers)

TOP-K FEATURE SELECTION FOR EARLY FUSION
Selecting top-K features for each task (to be used in Early Fusion)

  CLARITY - Top 10 Features:
     1. attention_mass_q_to_a_per_qtoken
        weighted_score=0.4863, mean_f1=0.3524, best_f1=0.4895
     2. focus_token_to_answer_strength
        weighted_score=0.4810, mean_f1=0.3488, best_f1=0.4695
     3. answer_token_to_focus_strength
        weighted_score=0.4802, mean_f1=0.3488, best_f1=0.4641
     4. answer_model_token_count
        weighted_score=0.4781, mean_f1=0.3580, best_f1=0.4322
     5. answer_word_count
        weighted_score=0.4653, mean_f1=0.3440, best_f1=0.4021
     6. attention_mass_a_to_q_per_atoken
        weighted_score=0.4642, mean_f1=0.3298, best_f1=0.4699
     7. answer_negation_ratio
        weight
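
The reported scores can be cross-checked by hand. The sketch below recomputes `weighted_score` for the top-ranked feature of each task from the rounded `mean_f1`, `best_f1` and `normalized_std` values printed in the ranking tables above. The helper name `score_from_stats` and the tolerance are assumptions made for this illustration. Because the inputs are rounded to four decimals, the recomputed values are only expected to agree to roughly that precision.

```python
def score_from_stats(mean_f1, best_f1, normalized_std):
    """Recompute the weighted score from already aggregated statistics."""
    return 0.5 * mean_f1 + 0.3 * best_f1 + 0.2 * (1 - normalized_std)


# (mean_f1, best_f1, normalized_std, reported weighted_score), copied from
# the top row of each task's ranking table
top_rows = {
    "clarity": (0.3524, 0.4895, 0.1838, 0.4863),
    "evasion": (0.2507, 0.8075, 1.0135, 0.3649),
    "hierarchical_evasion_to_clarity": (0.4151, 0.863, 0.494, 0.5676),
}

TOLERANCE = 5e-4  # assumed slack for rounding in the printed values

recomputed = {}
for task, (mean_f1, best_f1, norm_std, reported) in top_rows.items():
    recomputed[task] = score_from_stats(mean_f1, best_f1, norm_std)
    consistency_term = 0.2 * (1 - norm_std)
    print(f"{task}: recomputed={recomputed[task]:.4f}, reported={reported:.4f}, "
          f"consistency_term={consistency_term:+.4f}")
```

For the evasion row, `normalized_std` is above 1, so the consistency term is slightly negative and lowers the score instead of contributing to it.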

<div style="font-size: 10px;">

# ============================================================================
# GREEDY FORWARD SELECTION (OPTIONAL)
# ============================================================================
#
# This section performs greedy forward selection: iteratively adds features that
# maximize Macro F1 on the dev set.
#
# **What this cell does:**
# 1. Starts with top-K features (by weighted_score from Cell 6, Feature Ranking)
# 2. For each iteration, tries adding each remaining feature
# 3. Selects the feature that gives the highest Macro F1 improvement
# 4. Continues until no improvement or max_features reached
# 5. Saves selected feature sets and trajectories
#
# **Process:**
# - For each task (clarity, evasion, hierarchical)
# - For each model (6 models)
# - Uses the best classifier for that model×task combination
#   (highest mean Macro F1 across all single-feature ablation runs)
# - Starts with top 5 features by weighted_score
# - Iteratively adds up to 15 features total
#
# **Output:**
# - Selected feature sets for each model×task combination
# - Trajectories showing feature count vs Macro F1 progression
# - All results saved to Google Drive
#
# **Note:** This cell requires Cell 6 (Feature Ranking) to complete first.

</div>
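
The procedure outlined above can be sketched independently of any classifier. In the minimal example below, `greedy_forward`, `toy_score` and the `GAIN` table are hypothetical names and values introduced for illustration: the toy score simply sums a fixed per-feature gain, standing in for "fit a classifier on these features and return dev Macro F1". It is a sketch of the general technique, not the notebook's implementation.

```python
def greedy_forward(candidates, seed, score_fn, max_total=None):
    """Greedy forward selection over feature names.

    Starts from `seed`, repeatedly adds the candidate that yields the
    highest score, and stops when nothing improves the score or when
    `max_total` selected features are reached.
    """
    selected = list(seed)
    remaining = [c for c in candidates if c not in selected]
    best_score = score_fn(selected)
    trajectory = [(len(selected), best_score)]

    while remaining and (max_total is None or len(selected) < max_total):
        # Score every remaining candidate when added to the current set
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
        trajectory.append((len(selected), best_score))

    return selected, trajectory


# Hypothetical additive gains standing in for a real dev-set evaluation
GAIN = {"a": 0.30, "b": 0.10, "c": 0.05, "d": -0.02}


def toy_score(features):
    return round(sum(GAIN[f] for f in features), 4)


selected, trajectory = greedy_forward(list(GAIN), ["a"], toy_score)
print(selected)
print(trajectory)
```

With these toy gains, feature `d` is never added because it would lower the score, which is the same stopping rule the notebook applies on dev Macro F1.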

In [10]:
# ============================================================================
# GREEDY FORWARD SELECTION (OPTIONAL - FOR TOP FEATURES)
# ============================================================================
# Iteratively adds features that maximize Macro F1 on dev set
# Starts with top-K features from single-feature ablation

import json
from tqdm import tqdm

def greedy_forward_selection(X_train, X_dev, y_train, y_dev, feature_names,
                            seed_features, clf, max_features=None):
    """
    Greedy forward selection: iteratively add best feature

    Args:
        X_train: Training features
        X_dev: Dev features
        y_train: Training labels
        y_dev: Dev labels
        feature_names: List of feature names
        seed_features: Initial feature set (list of feature names)
        clf: Classifier instance
        max_features: Maximum number of features to select (None = all)

    Returns:
        selected_features: List of selected feature names
        trajectory: List of (n_features, macro_f1) tuples
    """
    # Encode labels to numeric (required for MLP, XGBoost, LightGBM)
    # This matches the approach in siparismaili01 notebook
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)
    y_dev_encoded = label_encoder.transform(y_dev)

    selected_indices = [feature_names.index(f) for f in seed_features]
    available_indices = [i for i in range(len(feature_names)) if i not in selected_indices]

    trajectory = []

    # Evaluate initial set
    X_train_selected = X_train[:, selected_indices]
    X_dev_selected = X_dev[:, selected_indices]

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", clone(clf))
    ])
    pipe.fit(X_train_selected, y_train_encoded)
    pred = pipe.predict(X_dev_selected)
    current_f1 = f1_score(y_dev_encoded, pred, average='macro')
    trajectory.append((len(selected_indices), current_f1))

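    # Greedy selection
    # max_features caps the TOTAL number of selected features (seed included),
    # so only the remaining budget is iterated over
    if max_features is None:
        max_iter = len(available_indices)
    else:
        max_iter = max(0, min(max_features - len(selected_indices), len(available_indices)))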

    for iteration in tqdm(range(max_iter), desc="Greedy selection"):
        best_f1 = current_f1
        best_idx = None

        # Try each available feature
        for idx in available_indices:
            candidate_indices = selected_indices + [idx]
            X_train_candidate = X_train[:, candidate_indices]
            X_dev_candidate = X_dev[:, candidate_indices]

            pipe = Pipeline([
                ("scaler", StandardScaler()),
                ("clf", clone(clf))
            ])
            pipe.fit(X_train_candidate, y_train_encoded)
            pred = pipe.predict(X_dev_candidate)
            candidate_f1 = f1_score(y_dev_encoded, pred, average='macro')

            if candidate_f1 > best_f1:
                best_f1 = candidate_f1
                best_idx = idx

        # If no improvement, stop
        if best_idx is None:
            break

        # Add best feature
        selected_indices.append(best_idx)
        available_indices.remove(best_idx)
        current_f1 = best_f1
        trajectory.append((len(selected_indices), current_f1))

    selected_features = [feature_names[i] for i in selected_indices]
    return selected_features, trajectory

# Check if required variables are defined
if 'df_stats' not in globals():
    raise NameError(
        "df_stats not found. Please run Cell 6 (Feature Ranking) first.\n"
        "Cell 6 performs statistical analysis and creates df_stats DataFrame."
    )
if 'df_ablation' not in globals():
    raise NameError(
        "df_ablation not found. Please run Cell 6 (Feature Ranking) first.\n"
        "Cell 6 creates df_ablation DataFrame from ablation_results."
    )
if 'TASKS' not in globals() or 'MODELS' not in globals() or 'classifiers' not in globals():
    raise NameError(
        "Required variables not defined. Please run Cell 2 (Configuration) first."
    )
if 'storage' not in globals() or 'ablation_dir' not in globals():
    raise NameError(
        "storage or ablation_dir not found. Please run Cell 1 (Setup) first."
    )

print("="*80)
print("GREEDY FORWARD SELECTION")
print("="*80)
print("Starting with top-K features from weighted score ranking (Cell 6)\n")

TOP_K_SEED = 5  # Start with top 5 features
MAX_FEATURES = 15  # Maximum features to select

selected_features_dict = {}
greedy_trajectories = {}

for task in TASKS:
    print(f"\n{'='*80}")
    print(f"TASK: {task.upper()} - GREEDY FORWARD SELECTION")
    print(f"{'='*80}")

    # Get top-K features for this task (from Cell 6, ranked by weighted_score)
    df_task_stats = df_stats[df_stats['task'] == task].copy()

    if len(df_task_stats) == 0:
        print(f"  ⚠ No ranking data found for task: {task}")
        continue

    # Select top-K features by weighted_score
    top_k_features = df_task_stats.head(TOP_K_SEED)['feature'].tolist()

    print(f"  Top {TOP_K_SEED} seed features (by weighted_score):")
    for i, feat in enumerate(top_k_features, 1):
        row = df_task_stats[df_task_stats['feature'] == feat].iloc[0]
        print(f"    {i}. {feat}")
        print(f"       weighted_score={row['weighted_score']:.4f}, mean={row['mean_f1']:.4f}, best={row['best_f1']:.4f}, std={row['std_f1']:.4f}")

    # Determine which task to use for loading data
    if task == 'hierarchical_evasion_to_clarity':
        # Hierarchical task: use evasion features but evaluate against clarity labels
        # (hierarchical approach maps evasion predictions to clarity)
        data_task = 'evasion'  # Use evasion features and splits
        label_key = 'clarity_label'  # But evaluate against clarity labels
    elif task == 'clarity':
        data_task = 'clarity'
        label_key = 'clarity_label'
    else:  # evasion
        data_task = 'evasion'
        label_key = 'evasion_label'

    # Load task-specific splits
    train_ds = storage.load_split('train', task=data_task)
    dev_ds = storage.load_split('dev', task=data_task)

    # Extract labels
    y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
    y_dev = np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])

    # Get feature names directly from extraction module (same for all tasks and models)
    # This avoids dependency on metadata files in GitHub
    feature_names = get_feature_names()

    # For each model, use the best classifier for this model×task
    for model in MODELS:
        try:
            X_train = storage.load_features(model, data_task, 'train')
            X_dev = storage.load_features(model, data_task, 'dev')
        except FileNotFoundError:
            print(f"  ⚠ Features not found for {model} × {data_task}, skipping...")
            continue

        # Find best classifier for this model×task from ablation results
        df_model_task = df_ablation[
            (df_ablation['task'] == task) &
            (df_ablation['model'] == model)
        ]

        if len(df_model_task) == 0:
            continue

        # Get best classifier (by mean F1 across all features)
        best_clf = df_model_task.groupby('classifier')['macro_f1'].mean().idxmax()
        clf = classifiers[best_clf]

        print(f"\n  {model.upper()} × {best_clf}:")

        # Run greedy selection
        selected_features, trajectory = greedy_forward_selection(
            X_train, X_dev, y_train, y_dev,
            feature_names, top_k_features, clf,
            max_features=MAX_FEATURES
        )

        selected_features_dict[f"{model}_{task}"] = {
            'model': model,
            'task': task,
            'classifier': best_clf,
            'selected_features': selected_features,
            'n_features': len(selected_features)
        }

        greedy_trajectories[f"{model}_{task}"] = trajectory

        print(f"    Selected {len(selected_features)} features")
        print(f"    Final Macro F1: {trajectory[-1][1]:.4f}")

        # Save trajectory
        df_traj = pd.DataFrame(trajectory, columns=['n_features', 'macro_f1'])
        csv_path = ablation_dir / f'greedy_trajectory_{model}_{task}.csv'
        df_traj.to_csv(csv_path, index=False)
        print(f"    Saved trajectory: {csv_path}")

# Save selected features
if selected_features_dict:
    json_path = ablation_dir / 'selected_features_all.json'
    with open(json_path, 'w') as f:
        json.dump(selected_features_dict, f, indent=2)
    print(f"\n{'='*80}")
    print(f"Saved selected features: {json_path}")
    print(f"{'='*80}")

print("\n" + "="*80)
print("ABLATION STUDY COMPLETE")
print("="*80)
print("\nSummary:")
print("  ✓ Single-feature ablation completed")
print("  ✓ Feature rankings generated")
print("  ✓ Greedy forward selection completed")
print("  ✓ All results saved to Google Drive")


GREEDY FORWARD SELECTION
Starting with top-K features from weighted score ranking (Cell 6)


TASK: CLARITY - GREEDY FORWARD SELECTION
  Top 5 seed features (by weighted_score):
    1. attention_mass_q_to_a_per_qtoken
       weighted_score=0.4863, mean=0.3524, best=0.4895, std=0.0648
    2. focus_token_to_answer_strength
       weighted_score=0.4810, mean=0.3488, best=0.4695, std=0.0598
    3. answer_token_to_focus_strength
       weighted_score=0.4802, mean=0.3488, best=0.4641, std=0.0583
    4. answer_model_token_count
       weighted_score=0.4781, mean=0.3580, best=0.4322, std=0.0547
    5. answer_word_count
       weighted_score=0.4653, mean=0.3440, best=0.4021, std=0.0470

  BERT × RandomForest:


Greedy selection:  27%|██▋       | 4/15 [00:23<01:04,  5.85s/it]


    Selected 9 features
    Final Macro F1: 0.5294
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_clarity.csv

  BERT_POLITICAL × RandomForest:


Greedy selection:  27%|██▋       | 4/15 [00:23<01:04,  5.83s/it]


    Selected 9 features
    Final Macro F1: 0.5294
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_political_clarity.csv

  BERT_AMBIGUITY × RandomForest:


Greedy selection:  27%|██▋       | 4/15 [00:23<01:04,  5.86s/it]


    Selected 9 features
    Final Macro F1: 0.5294
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_ambiguity_clarity.csv

  ROBERTA × RandomForest:


Greedy selection:  20%|██        | 3/15 [00:19<01:17,  6.46s/it]


    Selected 8 features
    Final Macro F1: 0.5367
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_roberta_clarity.csv

  DEBERTA × RandomForest:


Greedy selection:  13%|█▎        | 2/15 [00:15<01:37,  7.53s/it]


    Selected 7 features
    Final Macro F1: 0.5166
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_deberta_clarity.csv

  XLNET × RandomForest:


Greedy selection:  27%|██▋       | 4/15 [00:23<01:04,  5.87s/it]


    Selected 9 features
    Final Macro F1: 0.5169
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_xlnet_clarity.csv

TASK: EVASION - GREEDY FORWARD SELECTION
  Top 5 seed features (by weighted_score):
    1. answer_token_to_focus_strength
       weighted_score=0.3649, mean=0.2507, best=0.8075, std=0.2541
    2. attention_mass_q_to_a_per_qtoken
       weighted_score=0.3623, mean=0.2492, best=0.8000, std=0.2520
    3. attention_mass_a_to_q_per_atoken
       weighted_score=0.3555, mean=0.2453, best=0.7967, std=0.2528
    4. focus_token_to_answer_strength
       weighted_score=0.3503, mean=0.2428, best=0.8083, std=0.2593
    5. answer_content_word_ratio
       weighted_score=0.2780, mean=0.1901, best=0.4810, std=0.1534

  BERT × RandomForest:


Greedy selection:  13%|█▎        | 2/15 [00:15<01:41,  7.80s/it]


    Selected 7 features
    Final Macro F1: 0.7961
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_evasion.csv

  BERT_POLITICAL × RandomForest:


Greedy selection:  13%|█▎        | 2/15 [00:15<01:41,  7.79s/it]


    Selected 7 features
    Final Macro F1: 0.7961
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_political_evasion.csv

  BERT_AMBIGUITY × RandomForest:


Greedy selection:  13%|█▎        | 2/15 [00:15<01:42,  7.91s/it]


    Selected 7 features
    Final Macro F1: 0.7961
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_ambiguity_evasion.csv

  ROBERTA × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:32, 10.91s/it]


    Selected 6 features
    Final Macro F1: 0.7988
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_roberta_evasion.csv

  DEBERTA × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:32, 10.91s/it]


    Selected 6 features
    Final Macro F1: 0.7921
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_deberta_evasion.csv

  XLNET × RandomForest:


Greedy selection:   0%|          | 0/15 [00:05<?, ?it/s]


    Selected 5 features
    Final Macro F1: 0.8095
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_xlnet_evasion.csv

TASK: HIERARCHICAL_EVASION_TO_CLARITY - GREEDY FORWARD SELECTION
  Top 5 seed features (by weighted_score):
    1. answer_token_to_focus_strength
       weighted_score=0.5676, mean=0.4151, best=0.8630, std=0.2051
    2. focus_token_to_answer_strength
       weighted_score=0.5670, mean=0.4159, best=0.8639, std=0.2082
    3. attention_mass_q_to_a_per_qtoken
       weighted_score=0.5642, mean=0.4132, best=0.8564, std=0.2052
    4. attention_mass_a_to_q_per_atoken
       weighted_score=0.5476, mean=0.3953, best=0.8565, std=0.2114
    5. answer_model_token_count
       weighted_score=0.4874, mean=0.3606, best=0.5564, std=0.1079

  BERT × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:26, 10.47s/it]


    Selected 6 features
    Final Macro F1: 0.8546
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_hierarchical_evasion_to_clarity.csv

  BERT_POLITICAL × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:26, 10.45s/it]


    Selected 6 features
    Final Macro F1: 0.8546
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_political_hierarchical_evasion_to_clarity.csv

  BERT_AMBIGUITY × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:26, 10.45s/it]


    Selected 6 features
    Final Macro F1: 0.8546
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_bert_ambiguity_hierarchical_evasion_to_clarity.csv

  ROBERTA × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:25, 10.41s/it]


    Selected 6 features
    Final Macro F1: 0.8546
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_roberta_hierarchical_evasion_to_clarity.csv

  DEBERTA × RandomForest:


Greedy selection:  13%|█▎        | 2/15 [00:15<01:38,  7.55s/it]


    Selected 7 features
    Final Macro F1: 0.8514
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_deberta_hierarchical_evasion_to_clarity.csv

  XLNET × RandomForest:


Greedy selection:   7%|▋         | 1/15 [00:10<02:25, 10.41s/it]


    Selected 6 features
    Final Macro F1: 0.8644
    Saved trajectory: /content/drive/MyDrive/semeval_data/results/ablation/greedy_trajectory_xlnet_hierarchical_evasion_to_clarity.csv

Saved selected features: /content/drive/MyDrive/semeval_data/results/ablation/selected_features_all.json

ABLATION STUDY COMPLETE

Summary:
  ✓ Single-feature ablation completed
  ✓ Feature rankings generated
  ✓ Greedy forward selection completed
  ✓ All results saved to Google Drive


In [11]:
# ============================================================================
# SAVE ABLATION RESULTS TO BOTH DRIVE AND GITHUB
# ============================================================================
# Save the in-memory results (ablation_results, df_ablation, df_stats) to
# Google Drive and to the local clone of the GitHub repo (under BASE_PATH)
# for persistence and version control. This cell only writes files; it does
# not commit or push them. Existing files are never overwritten.

print("\n" + "="*80)
print("SAVING ABLATION RESULTS TO PERSISTENT STORAGE")
print("="*80)
print("Note: Existing files will NOT be overwritten\n")

# Create the results/ablation directory in the GitHub repo clone (if it does not exist)
github_ablation_dir = BASE_PATH / 'results' / 'ablation'
github_ablation_dir.mkdir(parents=True, exist_ok=True)

# 1. Save ablation_results as JSON (both GitHub and Drive)
print("1. Saving ablation_results (raw data) to JSON...")
for task in TASKS:
    task_results = [r for r in ablation_results if r['task'] == task]
    if task_results:
        # Save to Drive (only if not exists)
        json_path_drive = ablation_dir / f'ablation_results_{task}.json'
        if not json_path_drive.exists():
            with open(json_path_drive, 'w') as f:
                json.dump(task_results, f, indent=2)
            print(f"  Saved to Drive: {json_path_drive} ({len(task_results)} evaluations)")
        else:
            print(f"  Skipped (Drive): {json_path_drive} (already exists)")

        # Save to GitHub (only if not exists)
        json_path_github = github_ablation_dir / f'ablation_results_{task}.json'
        if not json_path_github.exists():
            with open(json_path_github, 'w') as f:
                json.dump(task_results, f, indent=2)
            print(f"  Saved to GitHub: {json_path_github}")
        else:
            print(f"  Skipped (GitHub): {json_path_github} (already exists)")

# 2. Save df_ablation as CSV (both GitHub and Drive)
print("\n2. Saving df_ablation (all ablation results) to CSV...")
csv_path_drive = ablation_dir / 'ablation_results_all.csv'
if not csv_path_drive.exists():
    df_ablation.to_csv(csv_path_drive, index=False)
    print(f"  Saved to Drive: {csv_path_drive} ({len(df_ablation)} rows)")
else:
    print(f"  Skipped (Drive): {csv_path_drive} (already exists)")

csv_path_github = github_ablation_dir / 'ablation_results_all.csv'
if not csv_path_github.exists():
    df_ablation.to_csv(csv_path_github, index=False)
    print(f"  Saved to GitHub: {csv_path_github}")
else:
    print(f"  Skipped (GitHub): {csv_path_github} (already exists)")

# 3. Save df_stats as CSV (both GitHub and Drive)
print("\n3. Saving df_stats (feature statistics) to CSV...")
csv_path_drive = ablation_dir / 'feature_stats_all.csv'
if not csv_path_drive.exists():
    df_stats.to_csv(csv_path_drive, index=False)
    print(f"  Saved to Drive: {csv_path_drive} ({len(df_stats)} rows)")
else:
    print(f"  Skipped (Drive): {csv_path_drive} (already exists)")

csv_path_github = github_ablation_dir / 'feature_stats_all.csv'
if not csv_path_github.exists():
    df_stats.to_csv(csv_path_github, index=False)
    print(f"  Saved to GitHub: {csv_path_github}")
else:
    print(f"  Skipped (GitHub): {csv_path_github} (already exists)")

# 4. Copy selected_features_for_early_fusion.json from Drive to the GitHub
#    repo clone (only if it is not already there)
import shutil

print("\n4. Copying selected_features_for_early_fusion.json to GitHub...")
fusion_json_github = github_ablation_dir / 'selected_features_for_early_fusion.json'
fusion_json_drive = ablation_dir / 'selected_features_for_early_fusion.json'
if not fusion_json_drive.exists():
    # Nothing to copy: the source file was never written to Drive
    print(f"  Warning: Not found in Drive: {fusion_json_drive}")
elif fusion_json_github.exists():
    print(f"  Skipped (GitHub): {fusion_json_github} (already exists)")
else:
    shutil.copy(fusion_json_drive, fusion_json_github)
    print(f"  Saved to GitHub: {fusion_json_github} (copied from Drive)")

print("\n" + "="*80)
print("ABLATION RESULTS SAVED TO PERSISTENT STORAGE")
print("="*80)
print("\nSummary:")
print("  - New files saved to Drive + GitHub")
print("  - Existing files skipped (not overwritten)")
print("\nSaved files:")
print("  - ablation_results_{task}.json -> Drive + GitHub (if new)")
print("  - ablation_results_all.csv -> Drive + GitHub (if new)")
print("  - feature_stats_all.csv -> Drive + GitHub (if new)")
print("  - selected_features_for_early_fusion.json -> GitHub (if new)")
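
The cell above repeats one rule for every artifact: write the file only when it is not already present. A minimal sketch of that rule as a reusable helper follows. The names `save_if_missing` and `json_writer` are hypothetical and do not appear in the notebook, and the demo writes to a temporary directory rather than to Drive or the repo clone.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def save_if_missing(path: Path, writer: Callable[[Path], None], label: str = "") -> bool:
    """Call `writer(path)` only if `path` does not exist yet.

    Returns True if the file was written, False if it was skipped.
    """
    tag = f" ({label})" if label else ""
    if path.exists():
        print(f"  Skipped{tag}: {path} (already exists)")
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    writer(path)
    print(f"  Saved{tag}: {path}")
    return True


def json_writer(payload: Any) -> Callable[[Path], None]:
    """Build a writer that dumps `payload` as indented JSON."""
    def _write(path: Path) -> None:
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)
    return _write


# Demo in a throwaway directory: the second call must not overwrite the first.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ablation" / "example.json"
    first = save_if_missing(target, json_writer({"run": 1}), label="demo")
    second = save_if_missing(target, json_writer({"run": 2}), label="demo")
    kept = json.loads(target.read_text())
```

For a DataFrame, the writer could be a lambda such as `lambda p: df.to_csv(p, index=False)`, so each Drive/GitHub pair in the cell reduces to two calls with different target paths.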


SAVING ABLATION RESULTS TO PERSISTENT STORAGE
Note: Existing files will NOT be overwritten

1. Saving ablation_results (raw data) to JSON...
  Saved to Drive: /content/drive/MyDrive/semeval_data/results/ablation/ablation_results_clarity.json (684 evaluations)
  Saved to GitHub: /content/semeval-context-tree-modular/results/ablation/ablation_results_clarity.json
  Saved to Drive: /content/drive/MyDrive/semeval_data/results/ablation/ablation_results_evasion.json (684 evaluations)
  Saved to GitHub: /content/semeval-context-tree-modular/results/ablation/ablation_results_evasion.json
  Saved to Drive: /content/drive/MyDrive/semeval_data/results/ablation/ablation_results_hierarchical_evasion_to_clarity.json (684 evaluations)
  Saved to GitHub: /content/semeval-context-tree-modular/results/ablation/ablation_results_hierarchical_evasion_to_clarity.json

2. Saving df_ablation (all ablation results) to CSV...
  Saved to Drive: /content/drive/MyDrive/semeval_data/results/ablation/ablation_resul