<a href="https://colab.research.google.com/github/EonTechie/semeval-context-tree-modular/blob/main/notebooks/01_data_split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Split: Train / Dev / Test

================================================================================
PURPOSE: Split HuggingFace train split into Train/Dev, keep test split separate
================================================================================

This notebook loads the QEvasion dataset from HuggingFace, which already has
train and test splits. The HuggingFace test split is kept untouched and will
ONLY be used in final evaluation. Only the HuggingFace train split is divided:

- **Train**: 80% of HuggingFace train split (used for training models)
- **Dev**: 20% of HuggingFace train split (used for model/feature selection)
- **Test**: HuggingFace test split (ONLY used in final evaluation notebook)

**CRITICAL**: The HuggingFace test split is NEVER used for training, model
selection, or any development decisions. It is only accessed in the final
evaluation notebook (05_final_evaluation.ipynb).

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory

**From HuggingFace Hub:**
- QEvasion dataset (`ailsntua/QEvasion`)
  - Train split (3448 samples)
  - Test split (308 samples, kept untouched)

**From Google Drive:**
- Nothing (this is the first notebook in the pipeline)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
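- Dataset splits, one file per task: `splits/dataset_splits_clarity.pkl` and `splits/dataset_splits_evasion.pkl`
  - Train split (80% of HuggingFace train split)
  - Dev split (20% of HuggingFace train split)
  - Test split (HuggingFace test split; untouched for Clarity, filtered by majority voting for Evasion)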

**To GitHub:**
- Split metadata: `metadata/splits.json`
  - Train/Dev/Test sizes
  - Timestamp and data paths

**What gets passed to next notebook:**
- Train, Dev, and Test splits for each task (Clarity and Evasion), saved to persistent storage
- Subsequent notebooks load these splits via `storage.load_split()`, specifying `task='clarity'` or `task='evasion'`
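
The split metadata listed above (sizes, timestamp, data paths) could be assembled roughly as follows. This is a minimal sketch, not the repository's `StorageManager` code: the function name `build_split_metadata` and the field names `sizes`, `timestamp`, and `data_path` are assumptions chosen for illustration. The sizes are the Clarity counts this notebook reports when it runs.

```python
import json
from datetime import datetime, timezone


def build_split_metadata(sizes, data_path):
    """Assemble a JSON-serializable record of split sizes (hypothetical schema)."""
    return {
        "sizes": dict(sizes),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_path": str(data_path),
    }


# Clarity split sizes as reported by this notebook's output
meta = build_split_metadata(
    {"train": 2758, "dev": 690, "test": 308},
    "splits/dataset_splits_clarity.pkl",
)
print(json.dumps(meta, indent=2))
```

Keeping only sizes and paths in the JSON file keeps it small enough to commit, while the pickled splits themselves stay on Drive.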


In [1]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False

    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
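            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
            print(f"git clone failed (attempt {attempt + 1}/{max_retries}): {result.stderr.strip()}")
        except Exception as e:
            print(f"git clone raised (attempt {attempt + 1}/{max_retries}): {e}")
        # Remove any partial clone so a retry or the ZIP fallback starts clean
        shutil.rmtree(repo_dir, ignore_errors=True)
        if attempt < max_retries - 1:
            time.sleep(3)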

    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        zip_path = '/tmp/repo.zip'
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
            extracted_dir = '/content/semeval-context-tree-modular-main'
            if os.path.exists(extracted_dir):
                os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"Failed to obtain repository: {e}") from e

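# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    # Report the failure instead of silently continuing without persistent storage
    print(f"Warning: could not mount Google Drive: {e}")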

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify import works
try:
    from src.storage.manager import StorageManager
except ImportError as e:
    raise ImportError(
        f"Failed to import StorageManager. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    ) from e

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print("  Repository verified: src/ directory exists")
print(f"  Python path configured: {BASE_PATH} added to sys.path")


Cloning repository from GitHub...
Repository cloned successfully via git
Mounted at /content/drive
Setup complete
  Repository: /content/semeval-context-tree-modular
  Data storage: /content/drive/MyDrive/semeval_data
  Repository verified: src/ directory exists
  Python path configured: /content/semeval-context-tree-modular added to sys.path
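
The setup cell above follows a retry-then-fallback pattern: try `git clone` a fixed number of times, then fall back to the ZIP download. A stripped-down sketch of that control flow is shown here with the network calls replaced by plain callables so it runs anywhere. The helper names `run_with_retries` and `obtain` are hypothetical and do not exist in the repository.

```python
import time


def run_with_retries(action, max_retries=2, delay_seconds=3.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns True on success, False if every attempt failed or raised.
    """
    for attempt in range(max_retries):
        try:
            if action():
                return True
        except Exception as exc:
            print(f"Attempt {attempt + 1} raised: {exc}")
        # Pause between attempts, but not after the last one
        if attempt < max_retries - 1:
            sleep(delay_seconds)
    return False


def obtain(primary, fallback, **retry_kwargs):
    """Try the primary action with retries; run the fallback if it never succeeds."""
    if run_with_retries(primary, **retry_kwargs):
        return "primary"
    fallback()
    return "fallback"


# Example: a primary action that fails once and then succeeds
calls = []


def flaky():
    calls.append("try")
    return len(calls) >= 2


result = obtain(flaky, lambda: calls.append("fallback"), sleep=lambda s: None)
print(result, calls)
```

Passing `sleep` as a parameter is only a convenience for trying the pattern without waiting; the notebook cell calls `time.sleep(3)` directly.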


In [2]:
# ============================================================================
# LOAD DATASET FROM HUGGINGFACE HUB
# ============================================================================
# Loads the QEvasion dataset from HuggingFace Hub
# The dataset already has train and test splits - we keep test untouched

from src.data.loader import load_dataset

dataset = load_dataset(dataset_name="ailsntua/QEvasion")
train_raw = dataset['train']
test_raw = dataset['test']  # HuggingFace test split - kept untouched

print("Dataset loaded:")
print(f"  Train split: {len(train_raw)} samples")
print(f"  Test split: {len(test_raw)} samples (will be used ONLY in final evaluation)")
print(f"  Features: {list(train_raw.features.keys())}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Dataset loaded: ailsntua/QEvasion
   Train: 3448 samples
   Test: 308 samples
Dataset loaded:
  Train split: 3448 samples
  Test split: 308 samples (will be used ONLY in final evaluation)
  Features: ['title', 'date', 'president', 'url', 'question_order', 'interview_question', 'interview_answer', 'gpt3.5_summary', 'gpt3.5_prediction', 'question', 'annotator_id', 'annotator1', 'annotator2', 'annotator3', 'inaudible', 'multiple_questions', 'affirmative_questions', 'index', 'clarity_label', 'evasion_label']


In [3]:
# ============================================================================
# SPLIT TRAIN INTO TRAIN / DEV (80-20) - CLARITY TASK
# ============================================================================
# Splits HuggingFace train split into Train (80%) and Dev (20%) for Clarity task
# HuggingFace test split is kept untouched and will be used as final test
# Clarity task uses all samples (no filtering)

from src.data.splitter import split_train_into_train_dev

train_ds_clarity, dev_ds_clarity = split_train_into_train_dev(
    train_dataset=train_raw,
    dev_ratio=0.20,   # 20% of train data becomes dev (train/dev split: 80-20)
    seed=42           # Fixed seed for reproducibility
)

# HuggingFace test split is kept as-is (no modification) for clarity
test_ds_clarity = test_raw

print("\n" + "="*80)
print("CLARITY TASK SPLITS (no filtering - all samples used)")
print("="*80)
print(f"  Train: {len(train_ds_clarity)} samples ({len(train_ds_clarity)/len(train_raw)*100:.1f}% of train split)")
print(f"  Dev: {len(dev_ds_clarity)} samples ({len(dev_ds_clarity)/len(train_raw)*100:.1f}% of train split)")
print(f"  Test: {len(test_ds_clarity)} samples (HuggingFace test split - untouched)")


Dataset split:
   Train: 2758 samples (80.0%)
   Dev: 690 samples (20.0%)

CLARITY TASK SPLITS (no filtering - all samples used)
  Train: 2758 samples (80.0% of train split)
  Dev: 690 samples (20.0% of train split)
  Test: 308 samples (HuggingFace test split - untouched)
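
The sizes above follow directly from the ratio: 3448 × 0.20 = 689.6, which rounds to 690 dev samples and leaves 3448 − 690 = 2758 for train. A seeded split of this kind can be sketched with the standard library alone. This is an illustration of the idea, not the implementation of `split_train_into_train_dev`: the function name `split_indices` and the use of simple rounding are assumptions, and the indices it selects will not match the repository's.

```python
import random


def split_indices(n_samples, dev_ratio=0.20, seed=42):
    """Return (train_idx, dev_idx): disjoint, sorted index lists covering range(n_samples)."""
    n_dev = round(n_samples * dev_ratio)  # assumption: simple rounding of the dev size
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed gives the same shuffle every run
    dev_idx = sorted(indices[:n_dev])
    train_idx = sorted(indices[n_dev:])
    return train_idx, dev_idx


train_idx, dev_idx = split_indices(3448)
print(len(train_idx), len(dev_idx))  # 2758 690

# Sanity checks worth running on any split: no overlap, nothing lost
assert set(train_idx).isdisjoint(dev_idx)
assert len(train_idx) + len(dev_idx) == 3448
```

Because the shuffle depends only on the seed, rerunning the cell reproduces the same train/dev membership, which is what keeps results comparable across notebooks.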


In [4]:
# ============================================================================
# EVASION TASK: APPLY MAJORITY VOTING AND SPLIT
# ============================================================================
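# For the Evasion task, labels come from majority voting over the three annotators
# (annotator1, annotator2, annotator3). Samples where fewer than 2 of 3 annotators agree are dropped.
# A split that already has evasion_label is used as-is, so only splits labeled by
# voting can end up smaller than their Clarity counterparts.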

from src.data.splitter import build_evasion_majority_dataset, split_train_into_train_dev

print("\n" + "="*80)
print("EVASION TASK: APPLYING MAJORITY VOTING")
print("="*80)

# Apply majority voting to train and test splits
print("\nApplying majority voting to train split...")
train_raw_evasion = build_evasion_majority_dataset(train_raw, verbose=True)

print("\nApplying majority voting to test split...")
test_raw_evasion = build_evasion_majority_dataset(test_raw, verbose=True)

# Split filtered train into train/dev (80-20)
print("\nSplitting filtered train into train/dev (80-20)...")
train_ds_evasion, dev_ds_evasion = split_train_into_train_dev(
    train_dataset=train_raw_evasion,
    dev_ratio=0.20,
    seed=42
)

# Test split (already filtered by majority voting)
test_ds_evasion = test_raw_evasion

print("\n" + "="*80)
print("EVASION TASK SPLITS (filtered by majority voting)")
print("="*80)
print(f"  Train: {len(train_ds_evasion)} samples ({len(train_ds_evasion)/len(train_raw_evasion)*100:.1f}% of filtered train)")
print(f"  Dev: {len(dev_ds_evasion)} samples ({len(dev_ds_evasion)/len(train_raw_evasion)*100:.1f}% of filtered train)")
print(f"  Test: {len(test_ds_evasion)} samples (filtered HuggingFace test split)")
print("\n  NOTE: Evasion dataset is smaller than Clarity because samples without")
print("        strict majority (2/3 or 3/3) were dropped.")



EVASION TASK: APPLYING MAJORITY VOTING

Applying majority voting to train split...
[EVASION MAJORITY] Existing evasion_label found → using dataset as-is.

Applying majority voting to test split...


[EVASION MAJORITY] Original size: 308
[EVASION MAJORITY] Kept (majority): 275
[EVASION MAJORITY] Dropped (no majority): 33

Splitting filtered train into train/dev (80-20)...
Dataset split:
   Train: 2758 samples (80.0%)
   Dev: 690 samples (20.0%)

EVASION TASK SPLITS (filtered by majority voting)
  Train: 2758 samples (80.0% of filtered train)
  Dev: 690 samples (20.0% of filtered train)
  Test: 275 samples (filtered HuggingFace test split)

  NOTE: Evasion dataset is smaller than Clarity because samples without
        strict majority (2/3 or 3/3) were dropped.
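
In the output above, the test split shrinks from 308 to 275 because 33 samples had no label shared by at least two annotators, while the train split already carried `evasion_label` and was used unchanged. The voting rule itself can be sketched as below. This is a hedged illustration rather than the code in `build_evasion_majority_dataset`: the helper names `majority_label` and `filter_by_majority` are hypothetical, and the label values `"A"`, `"B"`, `"C"` are placeholders, not real dataset labels. Only the column names come from the dataset's feature list.

```python
from collections import Counter


def majority_label(votes, min_votes=2):
    """Return the label chosen by at least min_votes annotators, else None."""
    counts = Counter(v for v in votes if v)  # ignore missing or empty votes
    if not counts:
        return None
    label, count = counts.most_common(1)[0]
    return label if count >= min_votes else None


def filter_by_majority(rows):
    """Keep rows with a 2/3 or 3/3 majority and attach the winning label."""
    kept = []
    for row in rows:
        votes = [row["annotator1"], row["annotator2"], row["annotator3"]]
        label = majority_label(votes)
        if label is not None:
            kept.append({**row, "evasion_label": label})
    return kept


# Placeholder rows: 3/3 agreement, 2/3 agreement, and no agreement
rows = [
    {"annotator1": "A", "annotator2": "A", "annotator3": "A"},
    {"annotator1": "A", "annotator2": "B", "annotator3": "A"},
    {"annotator1": "A", "annotator2": "B", "annotator3": "C"},
]
kept = filter_by_majority(rows)
print(len(rows), len(kept), [r["evasion_label"] for r in kept])
```

With three annotators and a threshold of two, at most one label can qualify, so ties need no special handling.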


In [5]:
# ============================================================================
# SAVE SPLITS TO PERSISTENT STORAGE (TASK-SPECIFIC)
# ============================================================================
# Saves splits for both Clarity and Evasion tasks separately
# Clarity and Evasion have different splits because Evasion uses majority voting
# which drops samples without strict majority

print("\n" + "="*80)
print("SAVING SPLITS TO PERSISTENT STORAGE")
print("="*80)

# Save Clarity splits
print("\nSaving Clarity task splits...")
storage.save_splits(
    train_ds_clarity,
    dev_ds_clarity,
    test_ds_clarity,
    train_raw=train_raw,
    dev_ratio=0.20,
    seed=42,
    task='clarity'
)

# Save Evasion splits
print("\nSaving Evasion task splits...")
storage.save_splits(
    train_ds_evasion,
    dev_ds_evasion,
    test_ds_evasion,
    train_raw=train_raw_evasion,  # Use filtered train_raw for evasion
    dev_ratio=0.20,
    seed=42,
    task='evasion'
)

print("\n" + "="*80)
print("SPLITS SAVED SUCCESSFULLY")
print("="*80)
print("\nSummary:")
print(f"  Clarity - Train: {len(train_ds_clarity)}, Dev: {len(dev_ds_clarity)}, Test: {len(test_ds_clarity)}")
print(f"  Evasion - Train: {len(train_ds_evasion)}, Dev: {len(dev_ds_evasion)}, Test: {len(test_ds_evasion)}")
print("\nIMPORTANT:")
print("  - Clarity and Evasion have DIFFERENT splits (Evasion is filtered)")
print("  - Always specify task='clarity' or task='evasion' when loading splits")
print("  - Test set will ONLY be used in final evaluation notebook (05_final_evaluation.ipynb)")
print("  - Do not use test set for training or development!")



SAVING SPLITS TO PERSISTENT STORAGE

Saving Clarity task splits...
  Converting datasets to dict format for serialization...
Saved splits (indices) for task 'clarity': /content/drive/MyDrive/semeval_data/splits/dataset_splits_clarity.pkl
  Dataset: ailsntua/QEvasion
  Train: 2758 samples
  Dev: 690 samples
  Test: 308 samples

Saving Evasion task splits...
  Converting datasets to dict format for serialization...
Saved splits (indices) for task 'evasion': /content/drive/MyDrive/semeval_data/splits/dataset_splits_evasion.pkl
  Dataset: ailsntua/QEvasion
  Train: 2758 samples
  Dev: 690 samples
  Test: 275 samples
  NOTE: Evasion splits are filtered (majority voting applied)

SPLITS SAVED SUCCESSFULLY

Summary:
  Clarity - Train: 2758, Dev: 690, Test: 308
  Evasion - Train: 2758, Dev: 690, Test: 275

IMPORTANT:
  - Clarity and Evasion have DIFFERENT splits (Evasion is filtered)
  - Always specify task='clarity' or task='evasion' when loading splits
  - Test set will ONLY be used in fina