# Data Split: Train / Dev / Test

================================================================================
PURPOSE: Split HuggingFace train split into Train/Dev, keep test split separate
================================================================================

This notebook loads the QEvasion dataset from HuggingFace, which already has
train and test splits. The HuggingFace test split is kept untouched and is
used ONLY in the final evaluation. Only the HuggingFace train split is divided:

- **Train**: 80% of HuggingFace train split (used for training models)
- **Dev**: 20% of HuggingFace train split (used for model/feature selection)
- **Test**: HuggingFace test split (ONLY used in final evaluation notebook)

**CRITICAL**: The HuggingFace test split is NEVER used for training, model
selection, or any development decisions. It is only accessed in the final
evaluation notebook (05_final_evaluation.ipynb).
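
The 80/20 division can be illustrated with a small standalone sketch. It is not the project's `split_train_into_train_dev`, whose internals are not shown in this notebook. `split_indices` is a hypothetical helper that shuffles row indices with a fixed seed, and the sample count of 3400 is only an illustrative round number.

```python
import random

def split_indices(n_samples, dev_ratio=0.20, seed=42):
    """Return (train_idx, dev_idx) for a seeded random split (hypothetical helper)."""
    indices = list(range(n_samples))
    rng = random.Random(seed)      # local RNG, so global random state is untouched
    rng.shuffle(indices)
    n_dev = int(round(n_samples * dev_ratio))
    dev_idx = sorted(indices[:n_dev])
    train_idx = sorted(indices[n_dev:])
    return train_idx, dev_idx

# Illustrative size only; the real train split size is reported when the dataset loads
train_idx, dev_idx = split_indices(3400)
print(len(train_idx), len(dev_idx))
```

Because the seed is fixed, rerunning the sketch yields the same partition, which is the property the notebook relies on for reproducibility.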

================================================================================
INPUTS (What this notebook loads)
================================================================================

**From GitHub:**
- Repository code (cloned automatically if not present)
- Source modules from `src/` directory

**From HuggingFace Hub:**
- QEvasion dataset (`ailsntua/QEvasion`)
  - Train split (approximately 3400 samples)
  - Test split (308 samples, kept untouched)

**From Google Drive:**
- Nothing (this is the first notebook in the pipeline)

================================================================================
OUTPUTS (What this notebook saves)
================================================================================

**To Google Drive:**
- Dataset splits: `splits/dataset_splits.pkl`
  - Train split (80% of HuggingFace train split)
  - Dev split (20% of HuggingFace train split)
  - Test split (HuggingFace test split, untouched)

**To GitHub:**
- Split metadata: `metadata/splits.json`
  - Train/Dev/Test sizes
  - Timestamp and data paths
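
As a rough illustration of what such a metadata file might contain, the sketch below assembles sizes, a timestamp, and a data path into JSON. The field names and the `build_split_metadata` helper are assumptions made for this example and may differ from what `StorageManager` actually writes. The train and dev sizes are illustrative; only the test size (308) comes from the dataset description above.

```python
import json
from datetime import datetime, timezone

def build_split_metadata(n_train, n_dev, n_test, data_path):
    """Assemble a JSON-serializable summary of the three splits (hypothetical schema)."""
    return {
        "sizes": {"train": n_train, "dev": n_dev, "test": n_test},
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_path": str(data_path),
    }

# Illustrative train/dev sizes; 308 is the stated test split size
metadata = build_split_metadata(2720, 680, 308, "splits/dataset_splits.pkl")
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```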

**What gets passed to the next notebook:**
- Train, Dev, and Test splits, saved to persistent storage
- Subsequent notebooks load these splits via `storage.load_split()`
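
The save-then-load handoff can be sketched with a plain pickle round trip. The `save_splits` and `load_split` functions below are hypothetical stand-ins written for this example. The real `StorageManager` methods and their signatures are not shown in this notebook and may differ. Toy lists stand in for the dataset objects.

```python
import pickle
import tempfile
from pathlib import Path

def save_splits(path, train, dev, test):
    """Pickle the three splits together in one file (stand-in, not StorageManager)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump({"train": train, "dev": dev, "test": test}, f)

def load_split(path, name):
    """Load one named split ('train', 'dev', or 'test') from the pickle file."""
    with open(path, "rb") as f:
        splits = pickle.load(f)
    if name not in splits:
        raise KeyError(f"Unknown split: {name!r}")
    return splits[name]

# Toy data standing in for the real dataset objects
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "splits" / "dataset_splits.pkl"
    save_splits(pkl_path, train=[1, 2, 3, 4], dev=[5], test=[6])
    dev_loaded = load_split(pkl_path, "dev")

print(dev_loaded)
```

Loading by name keeps the test split out of reach unless a notebook asks for it explicitly, which matches the rule that only the final evaluation notebook reads it.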


In [None]:
# ============================================================================
# SETUP: Repository Clone, Drive Mount, and Path Configuration
# ============================================================================
# This cell performs minimal setup required for the notebook to run:
# 1. Clones repository from GitHub (if not already present)
# 2. Mounts Google Drive for persistent data storage
# 3. Configures Python paths and initializes StorageManager

import shutil
import os
import subprocess
import time
import requests
import zipfile
import sys
from pathlib import Path
from google.colab import drive

# Repository configuration
repo_dir = '/content/semeval-context-tree-modular'
repo_url = 'https://github.com/EonTechie/semeval-context-tree-modular.git'
zip_url = 'https://github.com/EonTechie/semeval-context-tree-modular/archive/refs/heads/main.zip'

# Clone repository (if not already present)
if not os.path.exists(repo_dir):
    print("Cloning repository from GitHub...")
    max_retries = 2
    clone_success = False
    
    for attempt in range(max_retries):
        try:
            result = subprocess.run(
                ['git', 'clone', repo_url],
                cwd='/content',
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                print("Repository cloned successfully via git")
                clone_success = True
                break
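            else:
                # Report why the attempt failed instead of retrying silently
                print(f"  git clone attempt {attempt + 1} failed: {result.stderr.strip()}")
                if attempt < max_retries - 1:
                    time.sleep(3)
        except Exception as e:
            print(f"  git clone attempt {attempt + 1} raised: {e}")
            if attempt < max_retries - 1:
                time.sleep(3)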
    
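    # Fallback: Download as ZIP if git clone fails
    if not clone_success:
        print("Git clone failed. Downloading repository as ZIP archive...")
        # Remove any partial clone left behind so the rename below cannot collide
        shutil.rmtree(repo_dir, ignore_errors=True)
        zip_path = '/tmp/repo.zip'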
        try:
            response = requests.get(zip_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(zip_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall('/content')
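            extracted_dir = '/content/semeval-context-tree-modular-main'
            # Fail loudly if the archive did not unpack to the expected folder
            if not os.path.exists(extracted_dir):
                raise FileNotFoundError(f"Expected extracted directory not found: {extracted_dir}")
            os.rename(extracted_dir, repo_dir)
            os.remove(zip_path)
            print("Repository downloaded and extracted successfully")
        except Exception as e:
            raise RuntimeError(f"Failed to obtain repository: {e}") from e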

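# Mount Google Drive (if not already mounted)
try:
    drive.mount('/content/drive', force_remount=False)
except Exception as e:
    # Surface the error: later cells write to Drive and need a working mount
    print(f"Warning: Google Drive mount raised: {e}")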

# Configure paths
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')

# Verify repository structure exists
if not BASE_PATH.exists():
    raise RuntimeError(f"Repository directory not found: {BASE_PATH}")
if not (BASE_PATH / 'src').exists():
    raise RuntimeError(f"src directory not found in repository: {BASE_PATH / 'src'}")
if not (BASE_PATH / 'src' / 'storage' / 'manager.py').exists():
    raise RuntimeError(f"Required file not found: {BASE_PATH / 'src' / 'storage' / 'manager.py'}")

# Add repository to Python path
sys.path.insert(0, str(BASE_PATH))

# Verify import works
try:
    from src.storage.manager import StorageManager
except ImportError as e:
    # Chain the original exception so the full traceback is preserved
    raise ImportError(
        f"Failed to import StorageManager. "
        f"Repository path: {BASE_PATH}, "
        f"Python path: {sys.path[:3]}, "
        f"Error: {e}"
    ) from e

# Initialize StorageManager
storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)

print("Setup complete")
print(f"  Repository: {BASE_PATH}")
print(f"  Data storage: {DATA_PATH}")
print("  Repository verified: src/ directory exists")
print(f"  Python path configured: {BASE_PATH} added to sys.path")


In [None]:
# ============================================================================
# LOAD DATASET FROM HUGGINGFACE HUB
# ============================================================================
# Loads the QEvasion dataset from HuggingFace Hub
# The dataset already has train and test splits - we keep test untouched

from src.data.loader import load_dataset

dataset = load_dataset(dataset_name="ailsntua/QEvasion")
train_raw = dataset['train']
test_raw = dataset['test']  # HuggingFace test split - kept untouched

print("Dataset loaded:")
print(f"  Train split: {len(train_raw)} samples")
print(f"  Test split: {len(test_raw)} samples (will be used ONLY in final evaluation)")
print(f"  Features: {list(train_raw.features.keys())}")


In [None]:
# ============================================================================
# SPLIT TRAIN INTO TRAIN / DEV (80-20)
# ============================================================================
# Splits HuggingFace train split into Train (80%) and Dev (20%)
# HuggingFace test split is kept untouched and will be used as final test

from src.data.splitter import split_train_into_train_dev

train_ds, dev_ds = split_train_into_train_dev(
    train_dataset=train_raw,
    dev_ratio=0.20,   # 20% of train data becomes dev (train/dev split: 80-20)
    seed=42           # Fixed seed for reproducibility
)

# HuggingFace test split is kept as-is (no modification)
test_ds = test_raw

print("\nFinal splits:")
print(f"  Train: {len(train_ds)} samples ({len(train_ds)/len(train_raw)*100:.1f}% of train split)")
print(f"  Dev: {len(dev_ds)} samples ({len(dev_ds)/len(train_raw)*100:.1f}% of train split)")
print(f"  Test: {len(test_ds)} samples (HuggingFace test split - untouched)")


In [None]:
# ============================================================================
# SAVE SPLITS TO PERSISTENT STORAGE
# ============================================================================
# Saves the three splits to Google Drive for use in subsequent notebooks
# Splits are saved in a format that preserves all dataset features and metadata

storage.save_splits(train_ds, dev_ds, test_ds)

print("Splits saved to persistent storage")
print(f"  Train: {len(train_ds)} samples")
print(f"  Dev: {len(dev_ds)} samples")
print(f"  Test: {len(test_ds)} samples (HuggingFace test split)")
print("\nIMPORTANT: Test set (HuggingFace test split) will ONLY be used in")
print("           final evaluation notebook (05_final_evaluation.ipynb).")
print("           Do not use it for training or development!")
