<div style="font-size: 8px;">

# Feature Extraction: Context Tree Features for Individual Models

================================================================================
PURPOSE: Extract 25 Context Tree features for each transformer model separately
================================================================================

This notebook extracts Context Tree features from question-answer pairs using
six transformer checkpoints. The features capture attention patterns,
lexical properties, and semantic relationships between questions and answers.

**Models:**
- BERT (bert-base-uncased)
- BERT-Political (bucketresearch/politicalBiasBERT)
- BERT-Ambiguity (Slomb/Ambig_Question)
- RoBERTa (roberta-base)
- DeBERTa (microsoft/deberta-v3-base)
- XLNet (xlnet-base-cased)

**Tasks:**
- Clarity: 3-class classification (Clear Reply, Ambiguous, Clear Non-Reply)
- Evasion: 9-class classification (Direct Answer, Partial Answer, etc.)

**Feature Categories:**
1. Attention-based features (attention mass, focus strength)
2. Pattern-based features (TF-IDF similarity, content word ratios)
3. Lexicon-based features (answer lexicon ratios, negation ratios)
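
As a rough illustration of the lexicon-based and pattern-based categories, the sketch below computes a negation ratio and a content word ratio for one answer. The word lists and the function names `negation_ratio` and `content_word_ratio` are hypothetical and chosen for this example only; the actual definitions live in `src.features.extraction` and may differ.

```python
import re

# Hypothetical word lists for illustration; the repository's lexicons may differ.
NEGATION_WORDS = {"no", "not", "never", "none", "nothing", "neither", "nor", "cannot"}
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "and",
             "in", "on", "that", "this", "it", "i", "we", "you", "do", "did"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def negation_ratio(text):
    """Share of tokens that are negation words or end in "n't"."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in NEGATION_WORDS or t.endswith("n't"))
    return hits / len(tokens)


def content_word_ratio(text):
    """Share of tokens that are not in the stopword list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in STOPWORDS) / len(tokens)


answer = "I did not say that, and we never discussed it."
print(negation_ratio(answer), content_word_ratio(answer))
```

Both ratios are normalized by token count, so they stay comparable between short and long answers.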

**Output:** Feature matrices saved to Google Drive for each model/task/split
combination. Features are extracted for Train and Dev splits only. Test split
features will be extracted in the final evaluation notebook.

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory:
  - `src.storage.manager` (StorageManager)
  - `src.features.extraction` (feature extraction functions)

**From HuggingFace Hub:**
- Transformer models (loaded on-the-fly):
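  - BERT (including the BERT-Political and BERT-Ambiguity variants), RoBERTa, DeBERTa, XLNet tokenizers and models
  - Sentiment model `cardiffnlp/twitter-roberta-base-sentiment-latest` (for sentiment features)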

**From Google Drive:**
- Dataset splits: `splits/dataset_splits.pkl`
  - Train split (loaded from 01_data_split.ipynb output)
  - Dev split (loaded from 01_data_split.ipynb output)
  - Test split (loaded but not used for feature extraction)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (bert, bert_political, bert_ambiguity, roberta, deberta, xlnet)
  - For each task (clarity, evasion)
  - For each split (train, dev)
  - Shape: (N_samples, 25_features)

**To GitHub:**
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
  - Feature names (25 features)
  - Feature dimensions
  - Timestamp and data paths
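
The naming scheme above can be checked with a short sketch that enumerates the expected feature and metadata files for every model/task/split combination. The helper name `expected_artifacts` is hypothetical; the root directories are the ones configured in the setup cell, and the model keys are those defined in the configuration cell.

```python
from itertools import product
from pathlib import Path


def expected_artifacts(data_root, repo_root, models, tasks, splits=("train", "dev")):
    """Return (feature_path, metadata_path) pairs for every combination."""
    data_root, repo_root = Path(data_root), Path(repo_root)
    pairs = []
    for model, task, split in product(models, tasks, splits):
        stem = f"{split}_{model}_{task}"
        pairs.append((
            data_root / "features" / "raw" / f"X_{stem}.npy",
            repo_root / "metadata" / f"features_{stem}.json",
        ))
    return pairs


pairs = expected_artifacts(
    "/content/drive/MyDrive/semeval_data",
    "/content/semeval-context-tree-modular",
    models=["bert", "bert_political", "bert_ambiguity", "roberta", "deberta", "xlnet"],
    tasks=["clarity", "evasion"],
)
# Collect every expected file that is not on disk yet.
missing = [str(p) for pair in pairs for p in pair if not p.exists()]
print(f"{len(pairs)} combinations, {len(missing)} files missing")
```

Running this after extraction gives a quick completeness check before the next notebook tries to load the features.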

**What gets passed to next notebook:**
- Feature matrices for Train and Dev splits
- Feature metadata for all model/task/split combinations
- These features are loaded by subsequent notebooks via `storage.load_features()`

</div>


In [1]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Verifies that the required modules import correctly
#    (data splits from 01_data_split.ipynb are loaded per task in later cells)

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import torch
from transformers import AutoTokenizer, AutoModel

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False

    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            else:
                if attempt < max_retries - 1:
                    time.sleep(3)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(3)

    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}")

# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception:
    pass  # Already mounted

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.features.extraction import (
        featurize_hf_dataset_in_batches_v2,
        featurize_model_independent_features
    )
    from transformers import pipeline
except ImportError as e:
    raise ImportError(
        f"Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    )

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Data splits will be loaded per-task in the feature extraction loop
# Clarity and Evasion have different splits (Evasion uses majority voting)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print(f"\nNOTE: Data splits will be loaded per-task (task-specific splits)")
print(f"      Clarity and Evasion have different splits due to majority voting")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Setup complete
  Repository: /content/semeval-context-tree-modular
  Data storage: /content/drive/MyDrive/semeval_data

NOTE: Data splits will be loaded per-task (task-specific splits)
      Clarity and Evasion have different splits due to majority voting


In [2]:
# ============================================================================
# REPRODUCIBILITY SETUP: Set Random Seeds for All Libraries
# ============================================================================
# This cell sets random seeds for Python, NumPy, PyTorch, and HuggingFace
# to ensure reproducible results across all runs.
#
# IMPORTANT: Run this cell right after the setup cell (it imports from src/)
# and before any other code that uses randomness.
# Seed value: 42 (same as used in all other parts of the pipeline)

from src.utils.reproducibility import set_all_seeds

# Set all random seeds to 42 for full reproducibility
# deterministic=True ensures PyTorch operations are deterministic (slower but fully reproducible)
set_all_seeds(seed=42, deterministic=True)

print("✓ Reproducibility configured: All random seeds set to 42")
print("✓ PyTorch deterministic mode enabled")
print("\nNOTE: If you encounter performance issues or non-deterministic behavior,")
print("      you can set deterministic=False in set_all_seeds() call above.")


✓ Reproducibility seeds set to 42
✓ PyTorch deterministic mode enabled (may be slower)
✓ Reproducibility configured: All random seeds set to 42
✓ PyTorch deterministic mode enabled

NOTE: If you encounter performance issues or non-deterministic behavior,
      you can set deterministic=False in set_all_seeds() call above.
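
For readers without access to `src.utils.reproducibility`, the sketch below shows what a helper of this kind typically does. It is an assumption about the general technique, not the repository's implementation: the name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only when they are installed.

```python
import os
import random


def seed_everything(seed=42, deterministic=True):
    """Hypothetical stand-in for set_all_seeds: seed every library that is installed."""
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        if deterministic:
            # Trade speed for repeatable cuDNN kernel selection.
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Reseeding with the same value restarts the generator's sequence, which is why the seeding cell has to run before any code that draws random numbers.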


In [None]:
# ============================================================================
# CONFIGURE MODELS AND TASKS
# ============================================================================
# Defines the transformer models and tasks for feature extraction
# Each model will be loaded from HuggingFace Hub and used to extract features

MODELS = {
    'bert': {
        'name': 'bert-base-uncased',
        'display': 'BERT'
    },
    'bert_political': {
        'name': 'bucketresearch/politicalBiasBERT',
        'display': 'BERT-Political'
    },
    'bert_ambiguity': {
        'name': 'Slomb/Ambig_Question',
        'display': 'BERT-Ambiguity'
    },
    'roberta': {
        'name': 'roberta-base',
        'display': 'RoBERTa'
    },
    'deberta': {
        'name': 'microsoft/deberta-v3-base',
        'display': 'DeBERTa'
    },
    'xlnet': {
        'name': 'xlnet-base-cased',
        'display': 'XLNet'
    }
}

# Explicit max sequence length for each model (to avoid tokenizer issues)
# These values are model-specific and must be set correctly to prevent OverflowError
MODEL_MAX_LENGTHS = {
    'bert': 512,
    'bert_political': 512,
    'bert_ambiguity': 512,
    'roberta': 512,
    'deberta': 512,
    'xlnet': 1024  # XLNet supports 1024 tokens
}

TASKS = ['clarity', 'evasion']

# Configure device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"Models to process: {list(MODELS.keys())}")
print(f"Tasks: {TASKS}")


Using device: cuda
Models to process: ['bert', 'bert_political', 'bert_ambiguity', 'roberta', 'deberta', 'xlnet']
Tasks: ['clarity', 'evasion']
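
The explicit `MODEL_MAX_LENGTHS` table exists because a tokenizer with no configured limit can report a very large sentinel value as `model_max_length`, which is unusable as a truncation length. The sketch below isolates the resolution order applied in the extraction cell (explicit override, then tokenizer value, then config value, then a default). The function name `resolve_max_seq_len` and its parameters are hypothetical; the `1e10` cut-off mirrors the check used in that cell.

```python
def resolve_max_seq_len(model_key, overrides, tokenizer_max=None, config_max=None,
                        default=512, sentinel=1e10):
    """Pick a usable max sequence length, preferring explicit overrides."""
    if model_key in overrides:
        length = overrides[model_key]
    elif tokenizer_max is not None and 0 < tokenizer_max < sentinel:
        length = int(tokenizer_max)
    elif config_max is not None and config_max > 0:
        length = int(config_max)
    else:
        length = default
    if length <= 0:
        raise ValueError(f"Invalid max_seq_len for {model_key}: {length}")
    return length


overrides = {"bert": 512, "xlnet": 1024}
print(resolve_max_seq_len("xlnet", overrides))
# Unknown key, tokenizer reports a sentinel: fall through to the config value.
print(resolve_max_seq_len("other", overrides, tokenizer_max=int(1e30), config_max=768))
```

Keeping the logic in one function makes each fallback branch easy to test without loading a model.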


In [4]:
# ============================================================================
# EXTRACT MODEL-INDEPENDENT FEATURES (ONCE FOR ALL MODELS)
# ============================================================================
# Model-independent features (TF-IDF, sentiment, structural, metadata) are
# extracted once and reused for all models. This is more efficient than
# extracting them separately for each model.
#
# These features are text-based and don't depend on the transformer model.

print("="*80)
print("STEP 1: EXTRACT MODEL-INDEPENDENT FEATURES")
print("="*80)
print("Extracting model-independent features (TF-IDF, sentiment, structural, metadata)")
print("These will be reused for all models to improve efficiency.\n")

# Load sentiment pipeline (for sentiment features)
print("Loading sentiment analysis pipeline...")
try:
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        device=0 if torch.cuda.is_available() else -1,
        return_all_scores=True
    )
    print("  ✓ Sentiment pipeline loaded")
except Exception as e:
    print(f"  ✗ Could not load sentiment pipeline: {e}")
    print("  Continuing without sentiment features...")
    sentiment_pipeline = None

# Metadata keys for QEvasion dataset
metadata_keys = {
    'inaudible': 'inaudible',
    'multiple_questions': 'multiple_questions',
    'affirmative_questions': 'affirmative_questions'
}

# Extract model-independent features for each task
# CHECKPOINT: Try to load from Drive first, extract only if not exists
model_independent_features = {}

for task in TASKS:
    print(f"\n{'='*60}")
    print(f"Task: {task.upper()} - Model-Independent Features")
    print(f"{'='*60}")

    # Load task-specific splits
    train_ds = storage.load_split('train', task=task)
    dev_ds = storage.load_split('dev', task=task)

    print(f"  Train: {len(train_ds)} samples")
    print(f"  Dev: {len(dev_ds)} samples")

    # CHECKPOINT: Try to load train model-independent features from Drive
    try:
        X_train_indep = storage.load_model_independent_features('train', task=task)
        print(f"  ✓ Loaded train model-independent features from Drive (task: {task})")
        # Get feature names from metadata
        import json
        meta_path = storage.github_path / f'metadata/features_independent_train_{task}.json'
        if meta_path.exists():
            with open(meta_path, 'r') as f:
                metadata = json.load(f)
                feature_names_indep = metadata.get('feature_names', [])
        else:
            # Fallback: extract to get feature names
            _, feature_names_indep = featurize_model_independent_features(
                train_ds, question_key='interview_question', answer_key='interview_answer',
                batch_size=32, show_progress=False, sentiment_pipeline=sentiment_pipeline,
                metadata_keys=metadata_keys
            )
    except FileNotFoundError:
        # Extract if not found
        print(f"\n  Extracting train model-independent features (task: {task})...")
        X_train_indep, feature_names_indep = featurize_model_independent_features(
            train_ds,
            question_key='interview_question',
            answer_key='interview_answer',
            batch_size=32,  # Larger batch for model-independent features
            show_progress=True,
            sentiment_pipeline=sentiment_pipeline,
            metadata_keys=metadata_keys
        )
        # Save to Drive for future use
        storage.save_model_independent_features(
            X_train_indep, 'train', feature_names_indep, task=task, question_key='interview_question'
        )
        print(f"  ✓ Saved train model-independent features to Drive (task: {task})")

    # CHECKPOINT: Try to load dev model-independent features from Drive
    try:
        X_dev_indep = storage.load_model_independent_features('dev', task=task)
        print(f"  ✓ Loaded dev model-independent features from Drive (task: {task})")
    except FileNotFoundError:
        # Extract if not found
        print(f"\n  Extracting dev model-independent features (task: {task})...")
        X_dev_indep, _ = featurize_model_independent_features(
            dev_ds,
            question_key='interview_question',
            answer_key='interview_answer',
            batch_size=32,
            show_progress=True,
            sentiment_pipeline=sentiment_pipeline,
            metadata_keys=metadata_keys
        )
        # Save to Drive for future use
        storage.save_model_independent_features(
            X_dev_indep, 'dev', feature_names_indep, task=task, question_key='interview_question'
        )
        print(f"  ✓ Saved dev model-independent features to Drive (task: {task})")

    # Store for reuse across all models
    model_independent_features[task] = {
        'train': X_train_indep,
        'dev': X_dev_indep,
        'feature_names': feature_names_indep
    }

    print(f"  ✓ Train: {X_train_indep.shape[0]} samples, {X_train_indep.shape[1]} features")
    print(f"  ✓ Dev: {X_dev_indep.shape[0]} samples, {X_dev_indep.shape[1]} features")

print(f"\n{'='*80}")
print("Model-independent features extracted for all tasks")
print("These will be reused for all models (efficiency mode)")
print(f"{'='*80}\n")


STEP 1: EXTRACT MODEL-INDEPENDENT FEATURES
Extracting model-independent features (TF-IDF, sentiment, structural, metadata)
These will be reused for all models to improve efficiency.

Loading sentiment analysis pipeline...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you e

  ✓ Sentiment pipeline loaded

Task: CLARITY - Model-Independent Features
  Train: 2758 samples
  Dev: 690 samples
  ✓ Loaded train model-independent features from Drive (task: clarity)
  ✓ Loaded dev model-independent features from Drive (task: clarity)
  ✓ Train: 2758 samples, 18 features
  ✓ Dev: 690 samples, 18 features

Task: EVASION - Model-Independent Features
  Train: 2758 samples
  Dev: 690 samples
  ✓ Loaded train model-independent features from Drive (task: evasion)
  ✓ Loaded dev model-independent features from Drive (task: evasion)
  ✓ Train: 2758 samples, 18 features
  ✓ Dev: 690 samples, 18 features

Model-independent features extracted for all tasks
These will be reused for all models (efficiency mode)
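
The checkpoint logic in the cell above follows a load-or-compute pattern: try the cached artifact first, and extract and save only when loading raises `FileNotFoundError`. A minimal sketch of that pattern, using an in-memory dictionary in place of Google Drive, is shown below. The names `load_or_compute`, `load`, `compute`, and `save` are hypothetical and are not part of `StorageManager`.

```python
def load_or_compute(load_fn, compute_fn, save_fn):
    """Return (value, cache_hit); compute and save only when loading fails."""
    try:
        return load_fn(), True
    except FileNotFoundError:
        value = compute_fn()
        save_fn(value)
        return value, False


# In-memory stand-in for persistent storage.
cache = {}
compute_calls = []


def load():
    if "features" not in cache:
        raise FileNotFoundError("features")
    return cache["features"]


def compute():
    compute_calls.append(1)  # track how often the expensive step runs
    return [[0.1, 0.2], [0.3, 0.4]]


def save(value):
    cache["features"] = value


first, first_hit = load_or_compute(load, compute, save)
second, second_hit = load_or_compute(load, compute, save)
print(first_hit, second_hit, len(compute_calls))
```

Catching only `FileNotFoundError` keeps other failures, such as a corrupted file, visible instead of silently triggering a re-extraction.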



In [5]:
# ============================================================================
# EXTRACT FEATURES FOR EACH MODEL AND TASK
# ============================================================================
# Iterates through each transformer model and extracts Context Tree features
# Features are extracted for Train and Dev splits only
# Test split features will be extracted in the final evaluation notebook

for model_key, model_info in MODELS.items():
    print(f"\n{'='*80}")
    print(f"Processing {model_info['display']} ({model_info['name']})")
    print(f"{'='*80}")

    # Clear GPU cache before loading new model (prevent CUDA errors)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Load tokenizer and model from HuggingFace Hub
    print(f"Loading {model_info['display']} model and tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_info['name'])
    model = AutoModel.from_pretrained(model_info['name'])
    model.to(device)
    model.eval()

    # Get model-specific max sequence length
    # Priority: 1) Explicit MODEL_MAX_LENGTHS dict, 2) tokenizer.model_max_length, 3) model.config.max_position_embeddings
    # This ensures each model gets the correct max_length and prevents OverflowError from negative values
    if model_key in MODEL_MAX_LENGTHS:
        max_seq_len = MODEL_MAX_LENGTHS[model_key]
    elif (
        getattr(tokenizer, 'model_max_length', None) is not None
        and 0 < tokenizer.model_max_length < 1e10
    ):
        max_seq_len = int(tokenizer.model_max_length)
    elif (
        getattr(model.config, 'max_position_embeddings', None) is not None
        and model.config.max_position_embeddings > 0
    ):
        max_seq_len = int(model.config.max_position_embeddings)
    else:
        # Final fallback: use default based on model type
        if 'xlnet' in model_info['name'].lower():
            max_seq_len = 1024  # XLNet typically supports 1024
        else:
            max_seq_len = 512   # BERT, RoBERTa, DeBERTa typically 512

    # Ensure max_seq_len is positive (prevent OverflowError)
    if max_seq_len <= 0:
        raise ValueError(f"Invalid max_seq_len for {model_key}: {max_seq_len}. Must be positive.")

    print(f"Model loaded and moved to {device}")
    print(f"Max sequence length for {model_info['display']}: {max_seq_len}")

    for task in TASKS:
        print(f"\n{'='*60}")
        print(f"Task: {task.upper()}")
        print(f"{'='*60}")

        # Load task-specific splits (Clarity and Evasion have different splits)
        # Evasion splits are filtered by majority voting
        train_ds = storage.load_split('train', task=task)
        dev_ds = storage.load_split('dev', task=task)

        print(f"  Loaded splits for {task}:")
        print(f"    Train: {len(train_ds)} samples")
        print(f"    Dev: {len(dev_ds)} samples")

        # Check if features already exist (skip if already extracted)
        try:
            X_train_existing = storage.load_features(model_key, task, 'train')
            X_dev_existing = storage.load_features(model_key, task, 'dev')
            print(f"  Features already exist for {model_key} × {task}")
            print(f"    Train: {X_train_existing.shape[0]} samples, {X_train_existing.shape[1]} features")
            print(f"    Dev: {X_dev_existing.shape[0]} samples, {X_dev_existing.shape[1]} features")
            print(f"  SKIPPING feature extraction (already done)")
            continue
        except FileNotFoundError:
            # Features don't exist, proceed with extraction
            pass

        # EFFICIENCY MODE: Use pre-extracted model-independent features
        # Only extract model-dependent features (attention-based, tokenizer-specific)
        print(f"\nExtracting train features (model-dependent only, using pre-extracted model-independent)...")
        X_train, feature_names, _ = featurize_hf_dataset_in_batches_v2(
            train_ds,
            tokenizer,
            model,
            device,
            batch_size=8,              # Batch size for feature extraction
            max_sequence_length=max_seq_len,  # Model-specific max sequence length
            question_key='interview_question',  # Key for question text in dataset (original question, NOT 'question' which is paraphrased)
            answer_key='interview_answer',  # Key for answer text in dataset (QEvasion uses 'interview_answer')
            show_progress=True,         # Show progress bar
            model_independent_features=model_independent_features[task]['train']  # Reuse pre-extracted features
        )

        # Save train features
        storage.save_features(
            X_train, model_key, task, 'train', feature_names
        )
        print(f"  Saved train: {X_train.shape[0]} samples, {X_train.shape[1]} features")

        # Extract features for Dev split (efficiency mode)
        print(f"\nExtracting dev features (model-dependent only, using pre-extracted model-independent)...")
        X_dev, _, _ = featurize_hf_dataset_in_batches_v2(
            dev_ds,
            tokenizer,
            model,
            device,
            batch_size=8,              # Batch size for feature extraction
            max_sequence_length=max_seq_len,  # Model-specific max sequence length
            question_key='interview_question',  # Key for question text in dataset (original question, NOT 'question' which is paraphrased)
            answer_key='interview_answer',  # Key for answer text in dataset (QEvasion uses 'interview_answer')
            show_progress=True,         # Show progress bar
            model_independent_features=model_independent_features[task]['dev']  # Reuse pre-extracted features
        )

        # Save dev features
        storage.save_features(
            X_dev, model_key, task, 'dev', feature_names
        )
        print(f"  Saved dev: {X_dev.shape[0]} samples, {X_dev.shape[1]} features")

    # Free up GPU memory after processing each model
    del model, tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print(f"\nMemory cleared after processing {model_info['display']}")

print(f"\n{'='*80}")
print("Feature extraction complete for all models and tasks")
print(f"{'='*80}")
print("\nSummary:")
print("  - Features extracted for Train and Dev splits")
print("  - Features saved to Google Drive for each model/task/split combination")
print("  - Test split features will be extracted in final evaluation notebook")



Processing BERT (bert-base-uncased)
Loading BERT model and tokenizer...
Model loaded and moved to cuda
Max sequence length for BERT: 512

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:26<00:00, 13.12it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.71it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.41it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.56it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing BERT

Processing BERT-Political (bert-base-uncased)
Loading BERT-Political model and tokenizer...
Model loaded and moved to cuda
Max sequence length for BERT-Political: 512

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.31it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_political_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_political_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.56it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_political_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_political_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.34it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_political_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_political_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.64it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_political_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_political_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing BERT-Political

Processing BERT-Ambiguity (bert-base-uncased)
Loading BERT-Ambiguity model and tokenizer...
Model loaded and moved to cuda
Max sequence length for BERT-Ambiguity: 512

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.36it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_ambiguity_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_ambiguity_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.64it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_ambiguity_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_ambiguity_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.39it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_bert_ambiguity_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_bert_ambiguity_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.61it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_bert_ambiguity_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_bert_ambiguity_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing BERT-Ambiguity

Processing RoBERTa (roberta-base)
Loading RoBERTa model and tokenizer...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded and moved to cuda
Max sequence length for RoBERTa: 512

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.59it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_roberta_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_roberta_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.87it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_roberta_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_roberta_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:25<00:00, 13.67it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_roberta_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_roberta_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:06<00:00, 13.91it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_roberta_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_roberta_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing RoBERTa

Processing DeBERTa (microsoft/deberta-v3-base)
Loading DeBERTa model and tokenizer...


Model loaded and moved to cuda
Max sequence length for DeBERTa: 512

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:34<00:00,  9.87it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_deberta_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_deberta_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:08<00:00, 10.31it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_deberta_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_deberta_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [00:34<00:00, 10.14it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_deberta_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_deberta_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:08<00:00, 10.30it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_deberta_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_deberta_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing DeBERTa

Processing XLNet (xlnet-base-cased)
Loading XLNet model and tokenizer...


Model loaded and moved to cuda
Max sequence length for XLNet: 1024

Task: CLARITY
  Loaded splits for clarity:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [01:25<00:00,  4.02it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_xlnet_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_xlnet_clarity.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:20<00:00,  4.16it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_xlnet_clarity.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_xlnet_clarity.json
  Saved dev: 690 samples, 25 features

Task: EVASION
  Loaded splits for evasion:
    Train: 2758 samples
    Dev: 690 samples

Extracting train features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 345/345 [01:24<00:00,  4.09it/s]


Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_train_xlnet_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_train_xlnet_evasion.json
  Saved train: 2758 samples, 25 features

Extracting dev features (model-dependent only, using pre-extracted model-independent)...


Extracting model-dependent features: 100%|██████████| 87/87 [00:20<00:00,  4.15it/s]

Saved features: /content/drive/MyDrive/semeval_data/features/raw/X_dev_xlnet_evasion.npy
Saved metadata: /content/semeval-context-tree-modular/metadata/features_dev_xlnet_evasion.json
  Saved dev: 690 samples, 25 features

Memory cleared after processing XLNet

Feature extraction complete for all models and tasks

Summary:
  - Features extracted for Train and Dev splits
  - Features saved to Google Drive for each model/task/split combination
  - Test split features will be extracted in final evaluation notebook
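
A quick sanity check on the saved files can be sketched as follows. This is a minimal example, not part of the repository: `load_features` and `check_shapes` are hypothetical helper names, and the `X_{split}_{model}_{task}.npy` filename pattern is inferred from the "Saved features" lines in the log above. The directory, sample counts, and feature count passed in should be taken from your own run.

```python
from pathlib import Path

import numpy as np


def load_features(features_dir, split, model, task):
    """Load one saved feature matrix, e.g. X_train_roberta_clarity.npy.

    Hypothetical helper: the filename pattern is an assumption inferred
    from the log output, not an API of the repository.
    """
    path = Path(features_dir) / f"X_{split}_{model}_{task}.npy"
    if not path.exists():
        raise FileNotFoundError(path)
    return np.load(path)


def check_shapes(features_dir, models, tasks, expected_rows, n_features=25):
    """Return (file stem, shape) pairs whose shape differs from what is expected.

    expected_rows maps a split name to its sample count,
    e.g. {"train": 2758, "dev": 690} for the run logged above.
    """
    mismatches = []
    for model in models:
        for task in tasks:
            for split, rows in expected_rows.items():
                X = load_features(features_dir, split, model, task)
                if X.shape != (rows, n_features):
                    mismatches.append((f"X_{split}_{model}_{task}", X.shape))
    return mismatches


# Example call (paths and counts copied from the log; adjust to your setup):
# check_shapes(
#     "/content/drive/MyDrive/semeval_data/features/raw",
#     models=["roberta", "deberta", "xlnet"],
#     tasks=["clarity", "evasion"],
#     expected_rows={"train": 2758, "dev": 690},
# )
```

An empty list from `check_shapes` means every matrix has the expected number of rows and columns; a missing file raises `FileNotFoundError` with the offending path.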



