# Feature Extraction: Context Tree Features for Individual Models

================================================================================
PURPOSE: Extract 19 Context Tree features for each transformer model separately
================================================================================

This notebook extracts Context Tree features from question-answer pairs using
four different transformer models. The features capture attention patterns,
lexical properties, and semantic relationships between questions and answers.

**Models:**
- BERT (bert-base-uncased)
- RoBERTa (roberta-base)
- DeBERTa (microsoft/deberta-v3-base)
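- XLNet (xlnet-base-cased)

The configuration cell also defines two placeholder keys, `bert_political` and
`bert_ambiguity`. Both currently load `bert-base-uncased` until dedicated
checkpoints are substituted.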

**Tasks:**
- Clarity: 3-class classification (Clear Reply, Ambiguous, Clear Non-Reply)
- Evasion: 9-class classification (Direct Answer, Partial Answer, etc.)

**Feature Categories:**
1. Attention-based features (attention mass, focus strength)
2. Pattern-based features (TF-IDF similarity, content word ratios)
3. Lexicon-based features (answer lexicon ratios, negation ratios)
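
As a rough illustration of the first and third categories, the sketch below
computes one possible attention mass (the share of attention that answer tokens
place on question tokens) and a negation ratio. The definitions, the function
names, and the small negation lexicon are assumptions made for illustration;
the actual 19 features are defined in `src.features.extraction`.

```python
# Illustrative only: hypothetical definitions, not the repository's implementation.

NEGATION_WORDS = {"no", "not", "never", "none", "nothing", "n't"}  # assumed toy lexicon


def attention_mass(attn, answer_idx, question_idx):
    """Share of the answer tokens' attention that lands on question tokens.

    attn[i][j] is the attention weight from token i to token j.
    """
    total = sum(sum(attn[i]) for i in answer_idx)
    on_question = sum(attn[i][j] for i in answer_idx for j in question_idx)
    return on_question / total if total else 0.0


def negation_ratio(tokens):
    """Fraction of tokens that appear in the negation lexicon."""
    if not tokens:
        return 0.0
    return sum(t.lower() in NEGATION_WORDS for t in tokens) / len(tokens)


# Toy 4-token sequence: tokens 0-1 are the question, tokens 2-3 the answer.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.1, 0.1, 0.4, 0.4],
]
mass = attention_mass(attn, answer_idx=[2, 3], question_idx=[0, 1])
ratio = negation_ratio(["I", "did", "not", "say", "that"])
print(f"attention mass: {mass:.2f}, negation ratio: {ratio:.2f}")
```

In a real model the attention matrix would typically be averaged over heads
and layers before a ratio like this is taken; that aggregation step is left
out here.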

**Output:** Feature matrices saved to Google Drive for each model/task/split
combination. Features are extracted for Train and Dev splits only. Test split
features will be extracted in the final evaluation notebook.

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory:
  - `src.storage.manager` (StorageManager)
  - `src.features.extraction` (feature extraction functions)

**From HuggingFace Hub:**
- Transformer models (loaded on-the-fly):
  - BERT, RoBERTa, DeBERTa, XLNet tokenizers and models

**From Google Drive:**
- Dataset splits: `splits/dataset_splits.pkl`
  - Train split (loaded from 01_data_split.ipynb output)
  - Dev split (loaded from 01_data_split.ipynb output)
  - Test split (loaded but not used for feature extraction)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Feature matrices: `features/raw/X_{split}_{model}_{task}.npy`
  - For each model (bert, roberta, deberta, xlnet)
  - For each task (clarity, evasion)
  - For each split (train, dev)
  - Shape: (N_samples, 19_features)
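
The naming pattern above implies one `.npy` file per model/task/split
combination. A minimal sketch that enumerates the expected relative paths for
the four listed models follows. The helper name `feature_path` is hypothetical;
the real paths are built inside `StorageManager`, and any extra model keys in
the configuration cell would add further files.

```python
from itertools import product

MODEL_KEYS = ["bert", "roberta", "deberta", "xlnet"]
TASKS = ["clarity", "evasion"]
SPLITS = ["train", "dev"]


def feature_path(split, model, task):
    """Relative path of one feature matrix, following the documented pattern."""
    return f"features/raw/X_{split}_{model}_{task}.npy"


expected_files = [
    feature_path(split, model, task)
    for model, task, split in product(MODEL_KEYS, TASKS, SPLITS)
]
print(len(expected_files))  # 4 models x 2 tasks x 2 splits
print(expected_files[0])
```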

**To GitHub:**
- Feature metadata: `metadata/features_{split}_{model}_{task}.json`
  - Feature names (19 features)
  - Feature dimensions
  - Timestamp and data paths
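
For orientation, a metadata record with the fields listed above might look like
the sketch below. The key names, the placeholder feature names, the sample
count, and the helper `build_feature_metadata` are all assumptions; the actual
schema is whatever `storage.save_features` writes.

```python
import json
from datetime import datetime, timezone


def build_feature_metadata(split, model, task, feature_names, n_samples):
    """Assemble a metadata record (hypothetical schema) for one feature matrix."""
    return {
        "split": split,
        "model": model,
        "task": task,
        "feature_names": list(feature_names),
        "feature_dim": len(feature_names),
        "n_samples": n_samples,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_path": f"features/raw/X_{split}_{model}_{task}.npy",
    }


# Placeholder names stand in for the real 19 feature names;
# n_samples=100 is an arbitrary example value.
names = [f"feature_{i}" for i in range(19)]
meta = build_feature_metadata("train", "bert", "clarity", names, n_samples=100)
print(json.dumps(meta, indent=2))
```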

**What gets passed to the next notebook:**
- Feature matrices for Train and Dev splits
- Feature metadata for all model/task/split combinations
- These features are loaded by subsequent notebooks via `storage.load_features()`


In [None]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager
# 4. Loads data splits created in 01_data_split.ipynb

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive
import torch
from transformers import AutoTokenizer, AutoModel

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False
    
    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
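            else:
                # Report why the clone failed instead of failing silently
                print(f"  git clone failed (attempt {attempt + 1}/{max_retries}): {result.stderr.strip()}")
        except Exception as e:
            print(f"  git clone raised an error (attempt {attempt + 1}/{max_retries}): {e}")
        # Reached only on failure: remove any partial checkout so the retry
        # (or the ZIP fallback) starts from a clean directory
        shutil.rmtree(repo_dir, ignore_errors=True)
        if attempt < max_retries - 1:
            time.sleep(3)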
    
    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}") from e

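# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    # Do not abort here, but make the failure visible: if the drive is not
    # available, loading the data splits below will fail
    print(f"Warning: could not mount Google Drive: {e}")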

# Configure paths
BASE_PATH = Path(repo_dir)
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify imports work
try:
    from src.storage.manager import StorageManager
    from src.features.extraction import featurize_hf_dataset_in_batches_v2
except ImportError as e:
    raise ImportError(
        f"Failed to import required modules. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    ) from e

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

# Load data splits (created in 01_data_split.ipynb)
train_ds = storage.load_split('train')
dev_ds = storage.load_split('dev')
test_ds = storage.load_split('test')  # Will be used only in final evaluation

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print("\nLoaded data splits:")
print(f"  Train: {len(train_ds)} samples")
print(f"  Dev: {len(dev_ds)} samples")
print(f"  Test: {len(test_ds)} samples (reserved for final evaluation)")


In [None]:
# ============================================================================
# CONFIGURE MODELS AND TASKS
# ============================================================================
# Defines the transformer models and tasks for feature extraction
# Each model will be loaded from HuggingFace Hub and used to extract features

MODELS = {
    'bert': {
        'name': 'bert-base-uncased',
        'display': 'BERT'
    },
    'bert_political': {
        # Placeholder: currently loads the same checkpoint as 'bert'
        # TODO: Replace with actual political discourse BERT model from HuggingFace
        'name': 'bert-base-uncased',
        'display': 'BERT-Political'
    },
    'bert_ambiguity': {
        # Placeholder: currently loads the same checkpoint as 'bert'
        # TODO: Replace with actual ambiguity-focused BERT model from HuggingFace
        'name': 'bert-base-uncased',
        'display': 'BERT-Ambiguity'
    },
    'roberta': {
        'name': 'roberta-base',
        'display': 'RoBERTa'
    },
    'deberta': {
        'name': 'microsoft/deberta-v3-base',
        'display': 'DeBERTa'
    },
    'xlnet': {
        'name': 'xlnet-base-cased',
        'display': 'XLNet'
    }
}

TASKS = ['clarity', 'evasion']

# Configure device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"Models to process: {list(MODELS.keys())}")
print(f"Tasks: {TASKS}")


In [None]:
# ============================================================================
# EXTRACT FEATURES FOR EACH MODEL AND TASK
# ============================================================================
# Iterates through each transformer model and extracts Context Tree features
# Features are extracted for Train and Dev splits only
# Test split features will be extracted in the final evaluation notebook

for model_key, model_info in MODELS.items():
    print(f"\n{'='*80}")
    print(f"Processing {model_info['display']} ({model_info['name']})")
    print(f"{'='*80}")
    
    # Load tokenizer and model from HuggingFace Hub
    print(f"Loading {model_info['display']} model and tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_info['name'])
    model = AutoModel.from_pretrained(model_info['name'])
    model.to(device)
    model.eval()
    print(f"Model loaded and moved to {device}")
    
    for task in TASKS:
        print(f"\n{'='*60}")
        print(f"Task: {task.upper()}")
        print(f"{'='*60}")
        
        # Extract features for Train and Dev splits
        for split_name, split_ds in [('train', train_ds), ('dev', dev_ds)]:
            print(f"\nExtracting {split_name} features...")
            
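            # Extract 19 Context Tree features using the model's attention mechanism
            # Features include attention patterns, lexical properties, and semantic relationships
            # Note: `task` is not passed to the featurizer; it only determines
            # the name under which the matrix is saved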
            X, feature_names, _ = featurize_hf_dataset_in_batches_v2(
                split_ds,
                tokenizer,
                model,
                device,
                batch_size=8,              # Batch size for feature extraction
                max_sequence_length=256,    # Maximum sequence length
                question_key='question',    # Key for question text in dataset
                answer_key='answer',        # Key for answer text in dataset
                show_progress=True          # Show progress bar
            )
            
            # Save features to persistent storage (Google Drive)
            storage.save_features(
                X, model_key, task, split_name, feature_names
            )
            
            print(f"  Saved: {X.shape[0]} samples, {X.shape[1]} features")
            print(f"  Feature names: {len(feature_names)} features")
    
    # Free up GPU memory after processing each model
    del model, tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print(f"\nMemory cleared after processing {model_info['display']}")

print(f"\n{'='*80}")
print("Feature extraction complete for all models and tasks")
print(f"{'='*80}")
print("\nSummary:")
print("  - Features extracted for Train and Dev splits")
print("  - Features saved to Google Drive for each model/task/split combination")
print("  - Test split features will be extracted in final evaluation notebook")
