# Ablation Study: Feature Selection and Ranking

================================================================================
PURPOSE: Comprehensive feature ablation study to identify optimal feature subsets
================================================================================

This notebook performs comprehensive ablation studies on Context Tree features
to identify the most important features for each model and task combination.
The goal is to maximize Macro F1-score by selecting optimal feature subsets.

**Scope:**
- **3 Tasks**: Clarity, Evasion, and Hierarchical Evasion → Clarity
- **6 Models**: bert, bert_political, bert_ambiguity, roberta, deberta, xlnet
- **6 Classifiers**: LogisticRegression, LinearSVC, RandomForest, XGBoost, LightGBM, MLP
- **Total Combinations**: 6 models × 6 classifiers = 36 combinations per task
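
As a quick sanity check on the count above, the evaluation grid can be enumerated directly. This is a minimal sketch: the names are copied from the Scope list, and the variable names are illustrative rather than taken from the repository.

```python
from itertools import product

# Names copied from the Scope list above.
MODELS = ["bert", "bert_political", "bert_ambiguity", "roberta", "deberta", "xlnet"]
CLASSIFIERS = ["LogisticRegression", "LinearSVC", "RandomForest",
               "XGBoost", "LightGBM", "MLP"]

# Every (model, classifier) pair that is evaluated for a single task.
combinations = list(product(MODELS, CLASSIFIERS))
print(len(combinations))  # 36
```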

**Workflow:**
1. Load features and labels from persistent storage (saved by 03_train_evaluate.ipynb)
2. Single-Feature Ablation: Evaluate each feature individually across all 36 model×classifier combinations
3. Statistical Aggregation: Calculate min, median, mean, std, best (max), and runs (count) for each feature
4. Weighted Score Calculation: Compute weighted score combining multiple statistics
5. Feature Ranking: Rank features by weighted score (separately for each task)
6. Top-K Feature Selection: Identify top-performing features for greedy selection
7. Greedy Forward Selection: Iteratively add best features to maximize Macro F1

**Statistical Metrics Computed:**
- **min_f1**: Minimum Macro F1 across all 36 combinations (worst-case performance)
- **median_f1**: Median Macro F1 across all 36 combinations (typical performance)
- **mean_f1**: Mean Macro F1 across all 36 combinations (average performance)
- **std_f1**: Standard deviation of Macro F1 across all 36 combinations (consistency measure)
- **best_f1**: Maximum Macro F1 across all 36 combinations (best-case performance)
- **runs**: Number of evaluations (should be 36 for complete data)

**Weighted Score Formula:**
The weighted score combines multiple statistics to balance average performance, consistency, and peak performance:
```
weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
```
where normalized_std = std_f1 / (mean_f1 + epsilon) to account for scale differences.

Features are ranked by weighted_score in descending order.
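
To make the trade-off concrete, the formula can be applied to two made-up score lists that share the same mean but differ in spread. This is a small sketch using only the standard library; the scores are hypothetical, and `statistics.stdev` is used because it computes the sample standard deviation, which is also the pandas default.

```python
from statistics import mean, stdev

EPSILON = 1e-6  # guards against division by zero when mean_f1 is 0


def weighted_score(f1_scores):
    """Combine mean, best, and consistency of a list of Macro F1 scores."""
    mean_f1 = mean(f1_scores)
    best_f1 = max(f1_scores)
    std_f1 = stdev(f1_scores)  # sample standard deviation (ddof=1)
    normalized_std = std_f1 / (mean_f1 + EPSILON)
    return 0.5 * mean_f1 + 0.3 * best_f1 + 0.2 * (1 - normalized_std)


# Hypothetical scores: same mean (0.40), very different spread.
steady = [0.40, 0.41, 0.39, 0.40]
erratic = [0.20, 0.60, 0.30, 0.50]

print(round(weighted_score(steady), 4))
print(round(weighted_score(erratic), 4))
```

With these numbers the steadier feature ranks higher: the consistency term outweighs the extra credit the erratic feature earns from its higher peak.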

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From Google Drive:**
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (6 models)
  - For each task (clarity, evasion)
  - For Train and Dev splits
- Predictions: `predictions/pred_dev_{model}_{classifier}_hierarchical_evasion_to_clarity.npy`
  - For hierarchical task evaluation
- Dataset splits: `splits/dataset_splits_{task}.pkl`
  - For label extraction

**From GitHub:**
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
  - Contains feature names (19 features)
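
The feature-matrix pattern above expands to one file per model, task, and split. The sketch below only expands the template to show what is expected on disk; the actual loading is done by `StorageManager`, and the lowercase split names and the helper `expected_feature_files` are assumptions made for illustration.

```python
from pathlib import PurePosixPath

MODELS = ["bert", "bert_political", "bert_ambiguity", "roberta", "deberta", "xlnet"]
FEATURE_TASKS = ["clarity", "evasion"]
SPLITS = ["train", "dev"]  # assumption: lowercase split names

FEATURE_TEMPLATE = "features/raw/X_{split}_{model}_{task}.npy"


def expected_feature_files(root):
    """Hypothetical helper: list every feature file the pattern implies."""
    root = PurePosixPath(root)
    return [
        root / FEATURE_TEMPLATE.format(split=split, model=model, task=task)
        for model in MODELS
        for task in FEATURE_TASKS
        for split in SPLITS
    ]


files = expected_feature_files("/content/drive/MyDrive/semeval_data")
print(len(files))  # 24 = 6 models x 2 tasks x 2 splits
print(files[0])
```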

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Ablation results: `results/ablation/single_feature_{task}.csv`
- Feature rankings: `results/ablation/feature_ranking_{task}.csv`
  - Includes all statistics: min, median, mean, std, best, runs, weighted_score
- Selected features: `results/ablation/selected_features_{task}.json`
- Greedy trajectories: `results/ablation/greedy_trajectory_{model}_{task}.csv`
- Top-K features for Early Fusion: `results/ablation/selected_features_for_early_fusion.json`

**To GitHub:**
- Ablation metadata: `results/ablation_metadata_{task}.json`

**What gets passed to next notebook:**
- Feature rankings for each task (clarity, evasion, hierarchical)
- Selected feature sets for greedy selection
- Comprehensive statistical analysis of feature importance


# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# 
# This cell performs the initial setup required for the notebook to run:
# 1. Clones the repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Creates necessary directories for ablation results
#
# The StorageManager handles loading features, predictions, and dataset splits
# from Google Drive, and saving results back to Drive and GitHub.
#
# No modifications are needed in this cell unless you change repository URLs
# or data storage paths.


In [None]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import numpy as np
import pandas as pd

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False

    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            else:
                if attempt < max_retries - 1:
                    time.sleep(3)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(3)

    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}")

# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception:
    pass  # Already mounted

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.models.classifiers import get_classifier_dict
    from src.features.extraction import get_feature_names
    from sklearn.metrics import f1_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.base import clone
except ImportError as e:
    raise ImportError(
        f"Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    )

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Create ablation results directory
ablation_dir = DATA_PATH / 'results' / 'ablation'
ablation_dir.mkdir(parents=True, exist_ok=True)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print(f"  Ablation results: {ablation_dir}")

In [None]:
# ============================================================================
# CONFIGURE MODELS, TASKS, AND CLASSIFIERS
# ============================================================================
# This cell defines the models, tasks, and classifiers to be used in the ablation study.
# All combinations (6 models × 6 classifiers = 36) will be evaluated for each task.
# Three tasks are included: Clarity, Evasion, and Hierarchical Evasion → Clarity.

# Check if get_classifier_dict is imported (from Cell 1 - Setup)
if 'get_classifier_dict' not in globals():
    raise NameError(
        "get_classifier_dict not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 imports get_classifier_dict from src.models.classifiers."
    )

MODELS = ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
TASKS = ['clarity', 'evasion', 'hierarchical_evasion_to_clarity']  # 3 tasks

# Label mappings for each task
CLARITY_LABELS = ['Ambivalent', 'Clear Non-Reply', 'Clear Reply']
EVASION_LABELS = ['Claims ignorance', 'Clarification', 'Declining to answer', 
                  'Deflection', 'Dodging', 'Explicit', 
                  'General', 'Implicit', 'Partial/half-answer']

# Initialize classifiers with a fixed random seed for reproducibility.
# The returned dictionary includes an MLP (Multi-Layer Perceptron) classifier.
classifiers = get_classifier_dict(random_state=42)

print("="*80)
print("CONFIGURATION")
print("="*80)
print(f"  Models: {len(MODELS)} models")
print(f"    {MODELS}")
print(f"  Tasks: {len(TASKS)} tasks")
print(f"    {TASKS}")
print(f"  Classifiers: {len(classifiers)} classifiers")
print(f"    {list(classifiers.keys())}")
print(f"  Total combinations per task: {len(MODELS)} × {len(classifiers)} = {len(MODELS) * len(classifiers)}")
print(f"  Evaluation set: Dev set (not test)")
print("="*80)


# ============================================================================
# SINGLE-FEATURE ABLATION STUDY
# ============================================================================
#
# This section performs single-feature ablation: evaluates each of the 19 Context Tree
# features individually across all model×classifier combinations.
#
# **What this cell does:**
# - For each task (clarity, evasion, hierarchical_evasion_to_clarity)
# - For each model (6 models)
# - For each classifier (6 classifiers, including MLP)
# - For each feature (19 features)
# - Trains a classifier using only that single feature
# - Evaluates Macro F1 on the dev set
#
# **Total evaluations:** 3 tasks × 6 models × 6 classifiers × 19 features = 2,052 evaluations
#
# **Expected runtime:** 15-30 minutes depending on hardware
#
# **Output:** List of ablation results with model, task, classifier, feature, and macro_f1
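
Macro F1 is the metric for every evaluation above, and it matters for single-feature runs in particular: a weak feature often leaves some classes never predicted, and each such class contributes an F1 of zero to the average. The notebook itself uses scikit-learn's `f1_score(..., average='macro')`; the pure-Python function below is only an illustrative stand-in intended to mirror that behaviour, with made-up labels.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives scores 0.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Made-up dev labels: "Clear Non-Reply" is never predicted.
y_true = ["Clear Reply", "Clear Reply", "Ambivalent", "Clear Non-Reply"]
y_pred = ["Clear Reply", "Ambivalent", "Ambivalent", "Clear Reply"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.3889
```

Here the per-class F1 values are 0.6667 (Ambivalent), 0.0 (Clear Non-Reply), and 0.5 (Clear Reply), so the missing class pulls the average down sharply.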


In [None]:
# ============================================================================
# SINGLE-FEATURE ABLATION STUDY
# ============================================================================
# This cell performs single-feature ablation: evaluates each of the 19 Context Tree
# features individually across all model×classifier combinations (36 combinations per task).
#
# For each feature, we train a classifier using only that feature and evaluate on the
# dev set. This helps identify which features are most informative for each task.
#
# Process:
# 1. For each task (clarity, evasion, hierarchical_evasion_to_clarity)
# 2. For each model (6 models)
# 3. For each classifier (6 classifiers, including MLP)
# 4. For each feature (19 features)
# 5. Train classifier on single feature and evaluate Macro F1 on dev set
#
# Total evaluations: 3 tasks × 6 models × 6 classifiers × 19 features = 2,052 evaluations

def eval_single_feature(X_train, X_dev, y_train, y_dev, feature_idx, clf):
    """
    Evaluate a single feature using a classifier.

    This function trains a classifier using only one feature and evaluates its
    performance on the dev set. StandardScaler is applied to normalize the
    single feature before classification.

    Args:
        X_train: Training feature matrix (N, F) where F is total number of features
        X_dev: Dev feature matrix (M, F)
        y_train: Training labels (N,)
        y_dev: Dev labels (M,)
        feature_idx: Index of the feature to evaluate (0 to F-1)
        clf: Classifier instance (will be cloned to avoid state issues)

    Returns:
        Macro F1 score on dev set (float)
    """
    # Select only the specified feature (single column)
    X_train_f = X_train[:, [feature_idx]]
    X_dev_f = X_dev[:, [feature_idx]]

    # Pipeline with scaling (critical for single features to work properly)
    # StandardScaler normalizes the feature to have zero mean and unit variance
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", clone(clf))  # Clone to avoid modifying the original classifier
    ])

    # Train on single feature and evaluate on dev set
    pipe.fit(X_train_f, y_train)
    pred = pipe.predict(X_dev_f)
    macro_f1 = f1_score(y_dev, pred, average='macro')

    return macro_f1

# Check if required variables are defined (from Cell 2 - Configuration)
if 'TASKS' not in globals() or 'MODELS' not in globals() or 'classifiers' not in globals():
    raise NameError(
        "Required variables not defined. Please run Cell 2 (Configuration) first.\n"
        "Cell 2 defines: TASKS, MODELS, CLARITY_LABELS, EVASION_LABELS, and classifiers."
    )

# Check if storage is defined (from Cell 1 - Setup)
if 'storage' not in globals():
    raise NameError(
        "storage not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 initializes StorageManager as 'storage'."
    )

print("="*80)
print("SINGLE-FEATURE ABLATION STUDY")
print("="*80)
print("Evaluating each feature individually across all model×task×classifier combinations")
print(f"Total evaluations: {len(TASKS)} tasks × {len(MODELS)} models × {len(classifiers)} classifiers × 19 features")
print("This may take 15-30 minutes depending on your hardware...\n")

# Store all ablation results
# Each entry contains: model, task, classifier, feature, feature_idx, macro_f1
ablation_results = []

for task in TASKS:
    print(f"\n{'='*80}")
    print(f"TASK: {task.upper()}")
    print(f"{'='*80}")

    # Select appropriate label list and dataset key based on task
    if task == 'clarity':
        label_list = CLARITY_LABELS
        label_key = 'clarity_label'
        task_for_split = 'clarity'
    elif task == 'evasion':
        label_list = EVASION_LABELS
        label_key = 'evasion_label'
        task_for_split = 'evasion'
    else:  # hierarchical_evasion_to_clarity
        # For hierarchical task, we need to load evasion dev set to get clarity labels
        # (hierarchical uses evasion predictions mapped to clarity labels)
        label_list = CLARITY_LABELS
        label_key = 'clarity_label'
        # We'll load from evasion dev set (same filtered samples)
        task_for_split = 'evasion'

    # Load task-specific splits
    # For hierarchical task, we load evasion split (which has clarity labels)
    train_ds = storage.load_split('train', task=task_for_split)
    dev_ds = storage.load_split('dev', task=task_for_split)

    # Extract labels
    y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
    y_dev = np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])

    print(f"  Train: {len(y_train)} samples")
    print(f"  Dev: {len(y_dev)} samples")

    # Get feature names directly from extraction module (same for all models)
    # This avoids dependency on metadata files in GitHub
    # Feature names are the same across all models (19 Context Tree features)
    feature_names = get_feature_names()
    n_features = len(feature_names)

    print(f"  Features: {n_features} features")
    print(f"  Feature names: {feature_names}\n")

    # For hierarchical task, we need to use evasion features
    # (hierarchical approach uses evasion predictions, so we evaluate on evasion features)
    feature_task = 'evasion' if task == 'hierarchical_evasion_to_clarity' else task

    # For each model
    for model in MODELS:
        print(f"  Model: {model}")

        # Load features
        # For hierarchical task, use evasion features (since we're evaluating
        # how well evasion features predict clarity via hierarchical mapping)
        try:
            X_train = storage.load_features(model, feature_task, 'train')
            X_dev = storage.load_features(model, feature_task, 'dev')
        except FileNotFoundError:
            print(f"    ⚠ Features not found for {model} × {feature_task}, skipping...")
            continue

        # Verify feature count matches
        if X_train.shape[1] != n_features:
            print(f"    ⚠ Feature count mismatch: expected {n_features}, got {X_train.shape[1]}, skipping...")
            continue

        # For each classifier
        for clf_name, clf in classifiers.items():
            print(f"    Classifier: {clf_name}")

            # Evaluate each feature individually
            for feature_idx, feature_name in enumerate(feature_names):
                try:
                    macro_f1 = eval_single_feature(
                        X_train, X_dev,
                        y_train, y_dev,
                        feature_idx, clf
                    )

                    ablation_results.append({
                        'model': model,
                        'task': task,
                        'classifier': clf_name,
                        'feature': feature_name,
                        'feature_idx': feature_idx,
                        'macro_f1': float(macro_f1)
                    })
                except Exception as e:
                    print(f"      ⚠ Error evaluating feature {feature_name}: {e}")
                    continue

print(f"\n{'='*80}")
print("SINGLE-FEATURE ABLATION COMPLETE")
print(f"{'='*80}")
print(f"Total evaluations completed: {len(ablation_results)}")
print(f"Expected: {len(TASKS)} tasks × {len(MODELS)} models × {len(classifiers)} classifiers × {n_features} features = {len(TASKS) * len(MODELS) * len(classifiers) * n_features}")

# ============================================================================
# FEATURE RANKING AND STATISTICAL ANALYSIS
# ============================================================================
#
# This section performs comprehensive statistical analysis of the ablation results
# and ranks features by a weighted score. This is a GLOBAL analysis that aggregates
# results across all 36 model×classifier combinations to identify features that work
# well across different models and classifiers.
#
# **What this cell does:**
# 1. Aggregates results across all 36 model×classifier combinations for each feature
# 2. Computes statistics: min, median, mean, std, best (max), and runs (count)
# 3. Calculates normalized_std = std_f1 / (mean_f1 + epsilon) (scale-normalized consistency)
# 4. Calculates weighted_score = 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)
# 5. Ranks features by weighted_score (separately for each task)
# 6. Displays top 15 features with all statistics
# 7. Saves complete rankings to CSV files
# 8. Selects top-K features for Early Fusion (to be used across all models)
#
# **Statistics computed:**
# - min_f1: Worst-case performance across 36 combinations
# - median_f1: Typical performance (robust to outliers)
# - mean_f1: Average performance
# - std_f1: Consistency measure (lower is better)
# - best_f1: Best-case performance (peak potential)
# - runs: Number of evaluations (should be 36 for complete data)
#
# **Weighted Score Formula:**
# weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
#
# This formula balances average performance (50%), peak performance (30%), and
# consistency (20%). Features with high mean, high best, and low std score highest.
#
# **Why Global Analysis?**
# We use global ranking (not model-specific) to avoid overfitting and noise accumulation.
# The selected features will be used across all models in Early Fusion, ensuring
# consistency and better generalization.
#
# **Output:**
# - Feature rankings saved separately for clarity, evasion, and hierarchical tasks
# - Top-K features selected for Early Fusion (saved to JSON)
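
The aggregation and per-task ranking described above can be sketched without pandas. The records, feature names (`feat_a`, `feat_b`), and the helper `rank_features` are hypothetical. One deliberate difference: a feature with a single run gets a standard deviation of 0.0 here, whereas pandas would report NaN.

```python
from collections import defaultdict
from statistics import mean, stdev

EPSILON = 1e-6

# Hypothetical ablation records (two runs per feature instead of 36).
records = [
    {"task": "clarity", "feature": "feat_a", "macro_f1": 0.42},
    {"task": "clarity", "feature": "feat_a", "macro_f1": 0.44},
    {"task": "clarity", "feature": "feat_b", "macro_f1": 0.30},
    {"task": "clarity", "feature": "feat_b", "macro_f1": 0.50},
    {"task": "evasion", "feature": "feat_a", "macro_f1": 0.10},
    {"task": "evasion", "feature": "feat_a", "macro_f1": 0.12},
    {"task": "evasion", "feature": "feat_b", "macro_f1": 0.20},
    {"task": "evasion", "feature": "feat_b", "macro_f1": 0.22},
]


def rank_features(records):
    """Group by (task, feature), score each group, and rank within each task."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["task"], rec["feature"])].append(rec["macro_f1"])

    rankings = defaultdict(list)
    for (task, feature), scores in grouped.items():
        mean_f1 = mean(scores)
        std_f1 = stdev(scores) if len(scores) > 1 else 0.0
        normalized_std = std_f1 / (mean_f1 + EPSILON)
        score = 0.5 * mean_f1 + 0.3 * max(scores) + 0.2 * (1 - normalized_std)
        rankings[task].append({
            "feature": feature,
            "mean_f1": mean_f1,
            "runs": len(scores),
            "weighted_score": score,
        })

    # Highest weighted_score first; mean_f1 breaks ties.
    for rows in rankings.values():
        rows.sort(key=lambda r: (r["weighted_score"], r["mean_f1"]), reverse=True)
    return dict(rankings)


rankings = rank_features(records)
for task, rows in rankings.items():
    print(task, [row["feature"] for row in rows])
```

In this toy data `feat_a` leads for clarity while `feat_b` leads for evasion, which is why the rankings are kept separate per task.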


In [None]:
# ============================================================================
# FEATURE RANKING AND STATISTICAL ANALYSIS
# ============================================================================
# This cell performs comprehensive statistical analysis and ranking of features
# based on the single-feature ablation results from Cell 3.
#
# For each task, we aggregate results across all 36 model×classifier combinations
# and compute the following statistics for each feature:
# - min_f1: Minimum Macro F1 (worst-case performance)
# - median_f1: Median Macro F1 (typical performance)
# - mean_f1: Mean Macro F1 (average performance)
# - std_f1: Standard deviation (consistency measure - lower is better)
# - best_f1: Maximum Macro F1 (best-case performance)
# - runs: Number of evaluations (should be 36 for complete data)
#
# Features are then ranked by a weighted score that combines multiple statistics:
# weighted_score = 0.5*mean_f1 + 0.3*best_f1 + 0.2*(1 - normalized_std)
#
# This ranking is performed separately for each of the 3 tasks.

# Check if ablation_results exists (from Cell 3 - Single-Feature Ablation)
if 'ablation_results' not in globals():
    raise NameError(
        "ablation_results not found. Please run Cell 3 (Single-Feature Ablation) first.\n"
        "Cell 3 performs the ablation study and creates ablation_results list."
    )

# Check if storage and ablation_dir are defined (from Cell 1 - Setup)
if 'storage' not in globals():
    raise NameError(
        "storage not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 initializes StorageManager as 'storage'."
    )
if 'ablation_dir' not in globals():
    raise NameError(
        "ablation_dir not found. Please run Cell 1 (Setup) first.\n"
        "Cell 1 creates ablation_dir directory."
    )

df_ablation = pd.DataFrame(ablation_results)

if len(df_ablation) == 0:
    print("⚠ No ablation results found. Make sure Cell 3 completed successfully.")
    print("  You need to run Cell 3 (Single-Feature Ablation) first.")
else:
    print("="*80)
    print("FEATURE RANKING AND STATISTICAL ANALYSIS")
    print("="*80)
    print(f"Total ablation results: {len(df_ablation)} evaluations")
    print(f"Expected per task: {len(MODELS)} models × {len(classifiers)} classifiers × 19 features = {len(MODELS) * len(classifiers) * 19}")
    
    # Save raw ablation results for each task
    print(f"\n{'='*80}")
    print("SAVING RAW ABLATION RESULTS")
    print(f"{'='*80}")
    
    for task in TASKS:
        df_task = df_ablation[df_ablation['task'] == task]
        if len(df_task) > 0:
            csv_path = ablation_dir / f'single_feature_{task}.csv'
            df_task.to_csv(csv_path, index=False)
            print(f"  Saved {task}: {len(df_task)} evaluations → {csv_path}")
    
    # ========================================================================
    # STATISTICAL AGGREGATION AND WEIGHTED SCORE CALCULATION
    # ========================================================================
    # Aggregate results across all 36 model×classifier combinations for each feature
    # Compute comprehensive statistics and calculate weighted score for ranking
    
    print(f"\n{'='*80}")
    print("STATISTICAL AGGREGATION AND FEATURE RANKING")
    print(f"{'='*80}")
    print("Computing statistics across all 36 model×classifier combinations...")
    
    # Calculate comprehensive statistics for each feature×task combination
    # Using 'median' in addition to mean/std/min/max to get more robust statistics
    df_stats = df_ablation.groupby(['task', 'feature'])['macro_f1'].agg([
        'min',      # Minimum F1 (worst-case)
        'median',   # Median F1 (typical performance)
        'mean',     # Mean F1 (average performance)
        'std',      # Standard deviation (consistency)
        'max',      # Maximum F1 (best-case, same as best_f1)
        'count'     # Number of evaluations (should be 36)
    ]).reset_index()
    
    # Rename columns for clarity
    df_stats.columns = ['task', 'feature', 'min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs']
    
    # Calculate weighted score
    # Formula: weighted_score = 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)
    # This balances:
    # - Average performance (50% weight)
    # - Peak performance (30% weight)
    # - Consistency (20% weight, lower std = higher score)
    #
    # Normalize std by mean to account for scale differences:
    # normalized_std = std_f1 / (mean_f1 + epsilon)
    # where epsilon prevents division by zero
    EPSILON = 1e-6
    df_stats['normalized_std'] = df_stats['std_f1'] / (df_stats['mean_f1'] + EPSILON)
    
    # Calculate weighted score
    # Higher is better: we want high mean, high best, and low std (high 1-normalized_std)
    df_stats['weighted_score'] = (
        0.5 * df_stats['mean_f1'] +
        0.3 * df_stats['best_f1'] +
        0.2 * (1 - df_stats['normalized_std'])
    )
    
    # Sort by weighted_score (descending) for ranking
    # Secondary sort by mean_f1 for tie-breaking
    df_stats = df_stats.sort_values(['weighted_score', 'mean_f1'], ascending=False)
    
    # ========================================================================
    # DISPLAY AND SAVE RANKINGS FOR EACH TASK
    # ========================================================================
    
    for task in TASKS:
        print(f"\n{'='*80}")
        print(f"TASK: {task.upper()} - FEATURE RANKING")
        print(f"{'='*80}")
        
        df_task = df_stats[df_stats['task'] == task].copy()
        
        if len(df_task) == 0:
            print(f"  ⚠ No results found for task: {task}")
            continue
        
        # Round all numeric columns for display
        numeric_cols = ['min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs', 'normalized_std', 'weighted_score']
        df_task[numeric_cols] = df_task[numeric_cols].round(4)
        
        # Display top 15 features with all statistics
        print(f"\nTop 15 Features (ranked by weighted_score):")
        print("Weighted Score Formula: 0.5*mean + 0.3*best + 0.2*(1 - normalized_std)")
        print("\nColumns:")
        print("  - min_f1: Minimum Macro F1 across 36 combinations (worst-case)")
        print("  - median_f1: Median Macro F1 (typical performance)")
        print("  - mean_f1: Mean Macro F1 (average performance)")
        print("  - std_f1: Standard deviation (lower = more consistent)")
        print("  - best_f1: Maximum Macro F1 (best-case)")
        print("  - runs: Number of evaluations (should be 36)")
        print("  - normalized_std: std_f1 / mean_f1 (scale-normalized consistency)")
        print("  - weighted_score: Combined score for ranking\n")
        
        display(df_task[['feature', 'min_f1', 'median_f1', 'mean_f1', 'std_f1', 'best_f1', 'runs', 'normalized_std', 'weighted_score']].head(15))
        
        # Save complete ranking to CSV
        csv_path = ablation_dir / f'feature_ranking_{task}.csv'
        df_task.to_csv(csv_path, index=False)
        print(f"\n  ✓ Saved complete ranking: {csv_path}")
        print(f"    Total features ranked: {len(df_task)}")
        print(f"    Expected runs per feature: 36 (6 models × 6 classifiers)")
        
        # Verify data completeness
        incomplete = df_task[df_task['runs'] < 36]
        if len(incomplete) > 0:
            print(f"    ⚠ Warning: {len(incomplete)} features have incomplete data (< 36 runs)")
    
    # ========================================================================
    # TOP-K FEATURE SELECTION FOR EARLY FUSION
    # ========================================================================
    # Select top-K features for each task to use in Early Fusion
    # These features will be used across all models in Early Fusion
    
    print(f"\n{'='*80}")
    print("TOP-K FEATURE SELECTION FOR EARLY FUSION")
    print(f"{'='*80}")
    print("Selecting top-K features for each task (to be used in Early Fusion)")
    
    TOP_K_FEATURES = 10  # Number of top features to select for Early Fusion
    
    selected_features_for_fusion = {}
    
    for task in TASKS:
        df_task = df_stats[df_stats['task'] == task].copy()
        
        if len(df_task) == 0:
            print(f"  ⚠ No ranking data found for task: {task}")
            continue
        
        # Select top-K features by weighted_score
        top_k_features = df_task.head(TOP_K_FEATURES)['feature'].tolist()
        
        selected_features_for_fusion[task] = {
            'top_k': TOP_K_FEATURES,
            'features': top_k_features,
            'ranking': df_task.head(TOP_K_FEATURES)[['feature', 'weighted_score', 'mean_f1', 'best_f1', 'std_f1']].to_dict('records')
        }
        
        print(f"\n  {task.upper()} - Top {TOP_K_FEATURES} Features:")
        for i, feat in enumerate(top_k_features, 1):
            row = df_task[df_task['feature'] == feat].iloc[0]
            print(f"    {i:2d}. {feat}")
            print(f"        weighted_score={row['weighted_score']:.4f}, mean_f1={row['mean_f1']:.4f}, best_f1={row['best_f1']:.4f}")
    
    # Save selected features for Early Fusion
    import json
    fusion_features_path = ablation_dir / 'selected_features_for_early_fusion.json'
    with open(fusion_features_path, 'w') as f:
        json.dump(selected_features_for_fusion, f, indent=2)
    
    print(f"\n{'='*80}")
    print("FEATURE RANKING COMPLETE")
    print(f"{'='*80}")
    print("Rankings saved separately for each task:")
    for task in TASKS:
        print(f"  - {task}: {ablation_dir / f'feature_ranking_{task}.csv'}")
    print(f"\nTop-K features for Early Fusion saved:")
    print(f"  - {fusion_features_path}")
    print(f"  - Top {TOP_K_FEATURES} features per task (to be used across all models in Early Fusion)")


# ============================================================================
# GREEDY FORWARD SELECTION (OPTIONAL)
# ============================================================================
#
# This section performs greedy forward selection: iteratively adds features that
# maximize Macro F1 on the dev set.
#
# **What this cell does:**
# 1. Starts with top-K features (by weighted_score from Cell 4)
# 2. For each iteration, tries adding each remaining feature
# 3. Selects the feature that gives the highest Macro F1 improvement
# 4. Continues until no improvement or max_features reached
# 5. Saves selected feature sets and trajectories
#
# **Process:**
# - For each task (clarity, evasion, hierarchical)
# - For each model (6 models)
# - Uses the best classifier for that model×task combination
# - Starts with top 5 features by weighted_score
# - Iteratively adds up to 15 features total
#
# **Output:**
# - Selected feature sets for each model×task combination
# - Trajectories showing feature count vs Macro F1 progression
# - All results saved to Google Drive
#
# **Note:** This cell requires Cell 4 (Feature Ranking) to complete first.
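
The loop described above can be shown in isolation with a toy scoring function standing in for "fit the pipeline and return dev Macro F1". Everything here is illustrative: `greedy_forward`, `toy_score`, and the `GAINS` table are hypothetical, and in this sketch `max_features` caps the total size of the selected set, seed features included.

```python
def greedy_forward(candidates, seed, score_fn, max_features=None):
    """Add the single best remaining candidate until nothing improves the score."""
    selected = list(seed)
    remaining = [c for c in candidates if c not in selected]
    current = score_fn(selected)
    trajectory = [(len(selected), current)]

    while remaining and (max_features is None or len(selected) < max_features):
        best_candidate, best_score = None, current
        for cand in remaining:
            score = score_fn(selected + [cand])
            if score > best_score:  # strict improvement required
                best_candidate, best_score = cand, score
        if best_candidate is None:
            break  # no candidate improves the score
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        current = best_score
        trajectory.append((len(selected), current))

    return selected, trajectory


# Toy stand-in for a real evaluation: each feature adds a fixed gain.
GAINS = {"a": 0.30, "b": 0.10, "c": 0.05, "d": -0.02}


def toy_score(features):
    return round(sum(GAINS[f] for f in features), 2)


selected, trajectory = greedy_forward(list(GAINS), seed=["a"], score_fn=toy_score)
print(selected)    # ['a', 'b', 'c']
print(trajectory)  # [(1, 0.3), (2, 0.4), (3, 0.45)]
```

Feature `d` is never added because it lowers the score, so the search stops on its own before the candidate pool is exhausted.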


In [None]:
# ============================================================================
# GREEDY FORWARD SELECTION (OPTIONAL - FOR TOP FEATURES)
# ============================================================================
# Iteratively adds features that maximize Macro F1 on dev set
# Starts with top-K features from single-feature ablation

import json
from tqdm import tqdm

def greedy_forward_selection(X_train, X_dev, y_train, y_dev, feature_names,
                             seed_features, clf, max_features=None):
    """
    Greedy forward selection: iteratively add best feature

    Args:
        X_train: Training features
        X_dev: Dev features
        y_train: Training labels
        y_dev: Dev labels
        feature_names: List of feature names
        seed_features: Initial feature set (list of feature names)
        clf: Classifier instance
        max_features: Maximum total number of features to select,
            seed features included (None = all)

    Returns:
        selected_features: List of selected feature names
        trajectory: List of (n_features, macro_f1) tuples
    """
    selected_indices = [feature_names.index(f) for f in seed_features]
    available_indices = [i for i in range(len(feature_names)) if i not in selected_indices]

    trajectory = []

    # Evaluate initial set
    X_train_selected = X_train[:, selected_indices]
    X_dev_selected = X_dev[:, selected_indices]

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", clone(clf))
    ])
    pipe.fit(X_train_selected, y_train)
    pred = pipe.predict(X_dev_selected)
    current_f1 = f1_score(y_dev, pred, average='macro')
    trajectory.append((len(selected_indices), current_f1))

    # Greedy selection. max_features caps the total size of the selected set
    # (seed features included), so only the remaining budget is iterated.
    if max_features:
        max_iter = max(0, max_features - len(selected_indices))
    else:
        max_iter = len(available_indices)

    for iteration in tqdm(range(max_iter), desc="Greedy selection"):
        best_f1 = current_f1
        best_idx = None

        # Try each available feature
        for idx in available_indices:
            candidate_indices = selected_indices + [idx]
            X_train_candidate = X_train[:, candidate_indices]
            X_dev_candidate = X_dev[:, candidate_indices]

            pipe = Pipeline([
                ("scaler", StandardScaler()),
                ("clf", clone(clf))
            ])
            pipe.fit(X_train_candidate, y_train)
            pred = pipe.predict(X_dev_candidate)
            candidate_f1 = f1_score(y_dev, pred, average='macro')

            if candidate_f1 > best_f1:
                best_f1 = candidate_f1
                best_idx = idx

        # If no improvement, stop
        if best_idx is None:
            break

        # Add best feature
        selected_indices.append(best_idx)
        available_indices.remove(best_idx)
        current_f1 = best_f1
        trajectory.append((len(selected_indices), current_f1))

    selected_features = [feature_names[i] for i in selected_indices]
    return selected_features, trajectory

# Check if required variables are defined
if 'df_stats' not in globals():
    raise NameError(
        "df_stats not found. Please run Cell 4 (Feature Ranking) first.\n"
        "Cell 4 performs statistical analysis and creates df_stats DataFrame."
    )
if 'df_ablation' not in globals():
    raise NameError(
        "df_ablation not found. Please run Cell 4 (Feature Ranking) first.\n"
        "Cell 4 creates df_ablation DataFrame from ablation_results."
    )
if 'TASKS' not in globals() or 'MODELS' not in globals() or 'classifiers' not in globals():
    raise NameError(
        "Required variables not defined. Please run Cell 2 (Configuration) first."
    )
if 'storage' not in globals() or 'ablation_dir' not in globals():
    raise NameError(
        "storage or ablation_dir not found. Please run Cell 1 (Setup) first."
    )

print("="*80)
print("GREEDY FORWARD SELECTION")
print("="*80)
print("Starting with top-K features from weighted score ranking (Cell 4)\n")

TOP_K_SEED = 5  # Start with top 5 features
MAX_FEATURES = 15  # Maximum total features to select (seed features included)

selected_features_dict = {}
greedy_trajectories = {}

for task in TASKS:
    print(f"\n{'='*80}")
    print(f"TASK: {task.upper()} - GREEDY FORWARD SELECTION")
    print(f"{'='*80}")

    # Get top-K features for this task (from Cell 4, ranked by weighted_score)
    df_task_stats = df_stats[df_stats['task'] == task].copy()

    if len(df_task_stats) == 0:
        print(f"  ⚠ No ranking data found for task: {task}")
        continue

    # Select top-K features by weighted_score
    top_k_features = df_task_stats.head(TOP_K_SEED)['feature'].tolist()

    print(f"  Top {TOP_K_SEED} seed features (by weighted_score):")
    for i, feat in enumerate(top_k_features, 1):
        row = df_task_stats[df_task_stats['feature'] == feat].iloc[0]
        print(f"    {i}. {feat}")
        print(f"       weighted_score={row['weighted_score']:.4f}, mean={row['mean_f1']:.4f}, best={row['best_f1']:.4f}, std={row['std_f1']:.4f}")

    # Determine which task to use for loading data
    if task == 'hierarchical_evasion_to_clarity':
        # Hierarchical task: use evasion features but evaluate against clarity labels
        # (hierarchical approach maps evasion predictions to clarity)
        data_task = 'evasion'  # Use evasion features and splits
        label_key = 'clarity_label'  # But evaluate against clarity labels
    elif task == 'clarity':
        data_task = 'clarity'
        label_key = 'clarity_label'
    else:  # evasion
        data_task = 'evasion'
        label_key = 'evasion_label'

    # Load task-specific splits
    train_ds = storage.load_split('train', task=data_task)
    dev_ds = storage.load_split('dev', task=data_task)

    # Extract labels
    y_train = np.array([train_ds[i][label_key] for i in range(len(train_ds))])
    y_dev = 
np.array([dev_ds[i][label_key] for i in range(len(dev_ds))])        # Get feature names directly from extraction module (same for all tasks and models)    # This avoids dependency on metadata files in GitHub    feature_names = get_feature_names()        # For each model, use the best classifier for this model×task    for model in MODELS:        try:            X_train = storage.load_features(model, data_task, 'train')            X_dev = storage.load_features(model, data_task, 'dev')        except FileNotFoundError:            print(f"  ⚠ Features not found for {model} × {data_task}, skipping...")            continue                # Find best classifier for this model×task from ablation results        df_model_task = df_ablation[            (df_ablation['task'] == task) &             (df_ablation['model'] == model)        ]                if len(df_model_task) == 0:            continue                # Get best classifier (by mean F1 across all features)        best_clf = df_model_task.groupby('classifier')['macro_f1'].mean().idxmax()        clf = classifiers[best_clf]                print(f"\n  {model.upper()} × {best_clf}:")                # Run greedy selection        selected_features, trajectory = greedy_forward_selection(            X_train, X_dev, y_train, y_dev,            feature_names, top_k_features, clf,            max_features=MAX_FEATURES        )                selected_features_dict[f"{model}_{task}"] = {            'model': model,            'task': task,            'classifier': best_clf,            'selected_features': selected_features,            'n_features': len(selected_features)        }                greedy_trajectories[f"{model}_{task}"] = trajectory                print(f"    Selected {len(selected_features)} features")        print(f"    Final Macro F1: {trajectory[-1][1]:.4f}")                # Save trajectory        df_traj = pd.DataFrame(trajectory, columns=['n_features', 'macro_f1'])        csv_path = ablation_dir / 
f'greedy_trajectory_{model}_{task}.csv'        df_traj.to_csv(csv_path, index=False)        print(f"    Saved trajectory: {csv_path}")# Save selected featuresif selected_features_dict:    json_path = ablation_dir / 'selected_features_all.json'    with open(json_path, 'w') as f:        json.dump(selected_features_dict, f, indent=2)    print(f"\n{'='*80}")    print(f"Saved selected features: {json_path}")    print(f"{'='*80}")print("\n" + "="*80)print("ABLATION STUDY COMPLETE")print("="*80)print("\nSummary:")print("  ✓ Single-feature ablation completed")print("  ✓ Feature rankings generated")print("  ✓ Greedy forward selection completed")print("  ✓ All results saved to Google Drive")
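
Each trajectory is a list of `(n_features, macro_f1)` tuples, and greedy selection only accepts a feature when dev Macro F1 strictly improves, so the last step always has the highest score. Late additions can still be marginal. The following is a minimal sketch of one possible way to read a trajectory, assuming you want the smallest subset whose score is within a tolerance of the peak. The helper name `smallest_near_best`, the `tol` parameter, and the example values are all hypothetical and not part of the notebook.

```python
def smallest_near_best(trajectory, tol=0.005):
    """Return the (n_features, macro_f1) step with the fewest features
    whose macro F1 is within `tol` of the best score in the trajectory."""
    if not trajectory:
        raise ValueError("trajectory is empty")
    best_f1 = max(f1 for _, f1 in trajectory)
    # Keep every step that is close enough to the best score
    candidates = [(n, f1) for n, f1 in trajectory if f1 >= best_f1 - tol]
    # Among those, prefer the step with the fewest features
    return min(candidates, key=lambda step: step[0])


# Illustrative values only, not results from the study.
example_trajectory = [(5, 0.612), (6, 0.640), (7, 0.655), (8, 0.657)]
chosen = smallest_near_best(example_trajectory, tol=0.005)
print(chosen)
```

With `tol=0.0` the helper returns the final step of the trajectory. A larger tolerance trades a small amount of dev Macro F1 for a shorter feature list, and the right value would depend on how noisy the dev scores are.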