# BookNLP Maximum Performance Quote Attribution (Unified)

**Goal**: Train the max-performance quote attribution model (80–90% accuracy) in either Kaggle or Colab via one RUN_ENV-aware notebook.

**Features**
- DeBERTa-v3-large with quote/candidate masks + [QUOTE], [ALTQUOTE], [PAR]
- Candidate-level softmax with label smoothing; optional R-Drop; optional temperature scaling
- Optional multi-source loading + genre-balanced sampler; PDNC fallback; configurable hard negatives
- Curriculum sampler + light augmentation; gradient checkpointing + FP16
- Auto checkpoint/resume (model/optimizer/scheduler/best_acc) with cadence set by RUN_ENV; bucketed eval + placeholder postprocess hook
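
Temperature scaling, listed above as optional, rescales candidate logits by a single scalar T fitted on held-out data. The following is a minimal stdlib sketch, assuming a plain grid search over T; the repo's `TemperatureScaling` class may fit T differently, and `fit_temperature`, `nll`, and the grid are illustrative names rather than part of the training code.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over candidate scores after dividing by the temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(batch_scores, gold_indices, temperature):
    """Mean negative log-likelihood of the gold candidates at a given temperature."""
    losses = [
        -math.log(softmax(scores, temperature)[gold])
        for scores, gold in zip(batch_scores, gold_indices)
    ]
    return sum(losses) / len(losses)

def fit_temperature(batch_scores, gold_indices, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL (illustrative)."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # roughly 0.5 .. 5.0
    return min(grid, key=lambda t: nll(batch_scores, gold_indices, t))

# Overconfident toy logits: the third prediction is confidently wrong.
val_scores = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [8.0, 0.0, 0.0]]
val_gold = [0, 1, 1]
T = fit_temperature(val_scores, val_gold)
print(f"fitted T={T:.1f}, NLL before={nll(val_scores, val_gold, 1.0):.3f}, "
      f"after={nll(val_scores, val_gold, T):.3f}")
```

Scaling by T does not change the argmax, so accuracy is unaffected; only the confidence attached to each prediction moves.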

**Requirements**
- Kaggle: T4 x2 accelerator; storage under `/kaggle/working`
- Colab: T4 GPU; storage in Drive at `/content/drive`

**Quick start**
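1) Select the environment: the setup cell reads the `RUN_ENV` environment variable (`"kaggle"` or `"colab"`; default: kaggle). Set it before running, or edit the default in that cell.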
2) Kaggle: no Drive mount; repo/output in `/kaggle/working`; multi-GPU via `accelerate`; checkpoints/evals every 500 steps; auto-resume from latest `checkpoint_*.pt`.
3) Colab: mounts Drive to `/content/drive`; repo in `/content`; outputs in Drive; single-GPU (no DDP) with gradient accumulation; checkpoints/evals every 300 steps; auto-resume from latest `checkpoint_*.pt`.
4) Run all cells; data is cloned automatically from the repo.
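
Auto-resume depends on finding the newest `checkpoint_*.pt` file. Below is a minimal sketch of that lookup, assuming the step number is embedded in the file name as `checkpoint_<step>.pt`. The exact naming used by the training loop is not shown in this section, and `find_latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: checkpoint_<step>.pt
_STEP_RE = re.compile(r"checkpoint_(\d+)\.pt$")

def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint_*.pt"):
        match = _STEP_RE.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing numeric steps rather than sorting names avoids `checkpoint_1000.pt` being ordered before `checkpoint_500.pt`.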



## RUN_ENV toggle
The setup cell below reads `RUN_ENV` from the environment (`"kaggle"` or `"colab"`; default: kaggle), so set the variable before running or edit the default in the cell. Paths, checkpoint cadence, and mounts adjust automatically. Kaggle uses multi-GPU via `accelerate` (no Drive mount); Colab mounts Drive and runs single-GPU with gradient accumulation.
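
Because the setup cell reads `RUN_ENV` from the process environment, one way to switch to Colab without editing that cell is to set the variable beforehand. Here is a small sketch of the lookup, assuming a trimmed-down config table; `resolve_env` and `DEMO_CFG` are illustrative names, not part of the notebook.

```python
import os

# Trimmed-down copy of the per-environment settings, for illustration only
DEMO_CFG = {
    "kaggle": {"checkpoint_every": 500, "grad_accum": 4, "mount_drive": False},
    "colab": {"checkpoint_every": 300, "grad_accum": 16, "mount_drive": True},
}

def resolve_env(cfg, default="kaggle"):
    """Read RUN_ENV (case-insensitive), validate it, and return (name, settings)."""
    name = os.environ.get("RUN_ENV", default).strip().lower()
    if name not in cfg:
        raise ValueError(f"Unsupported RUN_ENV: {name!r}; expected one of {sorted(cfg)}")
    return name, cfg[name]

os.environ["RUN_ENV"] = "Colab"  # set before the setup cell runs
name, settings = resolve_env(DEMO_CFG)
print(name, settings["checkpoint_every"], settings["grad_accum"])  # colab 300 16
```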



In [1]:
import os, sys, torch

# CURSOR: Toggle once; everything else keys off this value
RUN_ENV = os.environ.get("RUN_ENV", "kaggle").strip().lower()
ENV_CFG = {
    "kaggle": {
        "base_dir": "/kaggle/working",
        "repo_dir": "/kaggle/working/speaker-attribution-acl2023",
        "training_repo_dir": "/kaggle/working/quote-attribution-training",
        "output_root": "/kaggle/working",
        "checkpoint_every": 500,
        "eval_every": 500,
        "grad_accum": 4,
        "mount_drive": False,
    },
    "colab": {
        "base_dir": "/content/drive/MyDrive/quote_attribution",
        "repo_dir": "/content/speaker-attribution-acl2023",
        "training_repo_dir": "/content/quote-attribution-training",
        "output_root": "/content/drive/MyDrive/quote_attribution",
        "checkpoint_every": 300,
        "eval_every": 300,
        "grad_accum": 16,
        "mount_drive": True,
    },
}
assert RUN_ENV in ENV_CFG, f"Unsupported RUN_ENV: {RUN_ENV}"
ENV = ENV_CFG[RUN_ENV]

if ENV["mount_drive"]:
    from google.colab import drive
    drive.mount("/content/drive")
    os.makedirs(ENV["base_dir"], exist_ok=True)

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    raise RuntimeError("GPU not available; enable a GPU runtime.")

# Data root (datasets auto-downloaded later based on CONFIG['datasets'])
REPO_DIR = ENV["repo_dir"]
os.makedirs(REPO_DIR, exist_ok=True)
print(f"Data repo root: {REPO_DIR}")

# Clone training code repository
TRAINING_REPO_DIR = ENV["training_repo_dir"]
if not os.path.exists(TRAINING_REPO_DIR):
    print("\n📥 Cloning training code repository...")
    !git clone https://github.com/bohdan-natsevych/quote-attribution-training.git {TRAINING_REPO_DIR}
else:
    print(f"✅ Training repository present at {TRAINING_REPO_DIR}")

# Add training repo to Python path for imports
if TRAINING_REPO_DIR not in sys.path:
    sys.path.insert(0, TRAINING_REPO_DIR)
    print(f"✅ Added {TRAINING_REPO_DIR} to Python path")

DATA_ROOT = f"{REPO_DIR}/data"
os.makedirs(DATA_ROOT, exist_ok=True)
print(f"Data root: {DATA_ROOT}")

BASE_DIR = ENV["base_dir"]
OUTPUT_ROOT = ENV["output_root"]
os.makedirs(OUTPUT_ROOT, exist_ok=True)
print(f"Output root: {OUTPUT_ROOT}")


Python: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
PyTorch: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
GPUs: 2
  GPU 0: Tesla T4
  GPU 1: Tesla T4
Data repo root: /kaggle/working/speaker-attribution-acl2023

📥 Cloning training code repository...
Cloning into '/kaggle/working/quote-attribution-training'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 60 (delta 26), reused 51 (delta 17), pack-reused 0 (from 0)
Receiving objects: 100% (60/60), 100.37 KiB | 3.72 MiB/s, done.
Resolving deltas: 100% (26/26), done.
✅ Added /kaggle/working/quote-attribution-training to Python path
Data root: /kaggle/working/speaker-attribution-acl2023/data
Output root: /kaggle/working


In [2]:
# Quote the version specifiers: unquoted, the shell treats ">" as output redirection
%pip install -q "transformers>=4.30.0" "accelerate>=0.20.0" datasets scikit-learn tqdm nlpaug nltk


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
bigframes 2.12.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
libcugraph-cu12 25.6.0 requires libraft-cu12==25.6.*, but you have libraft-cu12 25.2.0 which is incompatible.
cudf-polars-cu12 25.6.0 requires pylibcudf-cu12==25.6.*, but you have pylibcudf-cu12 25.2.2 which is incompatible.
pylibcugraph-cu12 25.6.0 requires pylibraft-cu12==25.6.*, but you have pylibraft-cu12 25.2.0 which is incompatible.
pylibcugraph-cu12 

In [3]:
# =============================================================================
# CONFIGURATION
# =============================================================================

# CURSOR: Import to get available datasets
import sys
from enum import Enum
if TRAINING_REPO_DIR not in sys.path:
    sys.path.insert(0, TRAINING_REPO_DIR)

from data.multi_source_data import MultiSourceDataLoader

# Define Dataset enum from available datasets
class Dataset(str, Enum):
    """Available datasets for quote attribution training."""
    PDNC = "pdnc"
    LITBANK = "litbank"
    DIRECTQUOTE = "directquote"
    # CURSOR: QUOTEBANK = "quotebank"
    
    @classmethod
    def get_all(cls):
        """Get all dataset values."""
        return [d.value for d in cls]
    
    @classmethod
    def validate(cls, datasets: list):
        """Validate that all datasets are in the enum."""
        valid = cls.get_all()
        for ds in datasets:
            if ds not in valid:
                raise ValueError(f"Invalid dataset '{ds}'. Must be one of: {valid}")
        return True

TARGET_LEVEL = 1  # 1=PDNC, 2=multi-source, 3=ensemble placeholder

# CURSOR: For best generalization on unknown/new books, train all 5 folds
# Set to "all" for all folds, or list of fold indices [0, 1, 2, 3, 4] or [0, 2] etc.
FOLD_SELECTION = "all"  # "all" or list like [0, 1, 2] or [3]

CONFIGS = {
    1: {
        'name': 'Target 1: DeBERTa-large + Augmentation',
        'epochs': 15, 'batch_size': 8, 'lr': 5e-6,
        'use_augmentation': True, 'use_curriculum': True,
        'focal_gamma': 2.0, 'label_smoothing': 0.1, 'r_drop_alpha': 0.0,
        'target_accuracy': 0.85,
        'hard_negative_topk': 2,
        'calibrate_temperature': True,
        'datasets': [Dataset.PDNC.value],  # Use enum value
        'use_postprocess': False,
        'balance_genres': False,
        'fold_selection': FOLD_SELECTION,
    },
    2: {
        'name': 'Target 2: Multi-Source + Genre Balancing',
        'epochs': 15, 'batch_size': 8, 'lr': 2e-6,
        'use_augmentation': True, 'use_curriculum': True,
        'balance_genres': True, 'min_genre_acc': 0.75,
        'target_accuracy': 0.88,
        'hard_negative_topk': 2,
        'calibrate_temperature': True,
        'datasets': [Dataset.PDNC.value, Dataset.LITBANK.value, Dataset.DIRECTQUOTE.value],  # Use enum values
        'use_postprocess': False,
        'fold_selection': FOLD_SELECTION,
    },
    3: {
        'name': 'Target 3: Ensemble + Distillation',
        'ensemble_models': ['microsoft/deberta-v3-large', 'roberta-large'],
        'student_model': 'microsoft/deberta-v3-base',
        'epochs': 15, 'batch_size': 4, 'lr': 5e-6,
        'distill_epochs': 10, 'temperature': 3.0, 'alpha': 0.7,
        'target_accuracy': 0.90,
        'datasets': [Dataset.PDNC.value, Dataset.LITBANK.value, Dataset.DIRECTQUOTE.value],  # Use enum values
        'use_augmentation': True,
        'use_curriculum': True,
        'balance_genres': True,
        'hard_negative_topk': 2,
        'calibrate_temperature': True,
        'use_postprocess': False,
        'fold_selection': FOLD_SELECTION,
    }
}

CONFIG = CONFIGS[TARGET_LEVEL].copy()

# Validate datasets
Dataset.validate(CONFIG.get('datasets', [Dataset.PDNC.value]))

multi_source_base = f"{REPO_DIR}/data"

CONFIG.update({
    'base_model': 'microsoft/deberta-v3-large',
    'max_length': 512,
    'gradient_accumulation_steps': ENV['grad_accum'],
    'checkpoint_every': ENV['checkpoint_every'],  # CURSOR: env-specific cadence
    'eval_every': ENV['eval_every'],
    'fp16': True,
    'gradient_checkpointing': True,
    'seed': 42,
    'output_dir': f"{OUTPUT_ROOT}/target_{TARGET_LEVEL}",
    'multi_source_base': multi_source_base,  # CURSOR: All datasets loaded via MultiSourceDataLoader
    # CURSOR: Feature toggles (all enabled here; 'use_postprocess' overrides the per-target False set above)
    'use_combined_loss': True,
    'use_postprocess': True,
    'postprocess_confidence': 0.6,
    'use_ensemble_eval': True,
    'ensemble_model_names': ['microsoft/deberta-v3-large'],
    'ensemble_voting_strategy': 'weighted_average',
    'run_cross_domain_validation': True,
    'run_genre_adaptation': True,
    'run_error_analysis': True,
    'run_model_optimization': True,
    'optimize_quantize': True,
    'optimize_export_onnx': True,
})

os.makedirs(CONFIG['output_dir'], exist_ok=True)
print(f"Selected: {CONFIG['name']}")
print(f"Target accuracy: {CONFIG['target_accuracy']:.0%}")
print(f"Datasets: {CONFIG.get('datasets', [Dataset.PDNC.value])}")
print(f"Output dir: {CONFIG['output_dir']}")
print(f"RUN_ENV: {RUN_ENV} | checkpoint_every={CONFIG['checkpoint_every']} | grad_accum={CONFIG['gradient_accumulation_steps']}")

Selected: Target 1: DeBERTa-large + Augmentation
Target accuracy: 85%
Datasets: ['pdnc']
Output dir: /kaggle/working/target_1
RUN_ENV: kaggle | checkpoint_every=500 | grad_accum=4
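
`FOLD_SELECTION` accepts either `"all"` or a list of fold indices. The following is a minimal sketch of normalizing and validating that value up front, assuming five folds. `resolve_folds` is a hypothetical helper; the notebook performs a similar expansion in a later cell without the range check.

```python
from typing import List, Union

def resolve_folds(selection: Union[str, int, List[int]], n_folds: int = 5) -> List[int]:
    """Expand "all" to every fold index, drop duplicates, and reject out-of-range entries."""
    if selection == "all":
        return list(range(n_folds))
    folds = [selection] if isinstance(selection, int) else list(selection)
    bad = [f for f in folds if not isinstance(f, int) or not 0 <= f < n_folds]
    if bad:
        raise ValueError(f"Fold indices must be integers in 0..{n_folds - 1}, got: {bad}")
    return sorted(set(folds))

print(resolve_folds("all"))    # [0, 1, 2, 3, 4]
print(resolve_folds([3, 0]))   # [0, 3]
```

Failing early on an index such as 7 is cheaper than discovering it after the first folds have already trained.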


In [4]:
# =============================================================================
# AUTO-DOWNLOAD AND PREPARE ALL DATASETS
# =============================================================================

from data.multi_source_data import download_datasets

downloaded_datasets = download_datasets(
    base_path=CONFIG["multi_source_base"],
    datasets=CONFIG.get('datasets')
)


AUTO-DOWNLOAD DATASETS
Datasets to download: ['pdnc']

📦 Processing PDNC (pdnc)...
   Description: Pride and Prejudice Dialog Novel Corpus - 22 novels, literature focus
   Target directory: /kaggle/working/speaker-attribution-acl2023/data/pdnc


Cloning into '/kaggle/working/speaker-attribution-acl2023/data/pdnc'...
Updating files:  97% (740/762)


DATASET DOWNLOAD COMPLETE
📚 Downloaded datasets: ['pdnc']
📁 Base directory: /kaggle/working/speaker-attribution-acl2023/data



Updating files: 100% (762/762), done.


In [5]:
# =============================================================================
# GPU SETUP
# =============================================================================

import os
import torch

NUM_GPUS = torch.cuda.device_count()
print(f"🔍 Detected {NUM_GPUS} GPU(s)")
for i in range(NUM_GPUS):
    props = torch.cuda.get_device_properties(i)
    print(f"   GPU {i}: {torch.cuda.get_device_name(i)} ({props.total_memory / 1024**3:.1f} GB)")

if NUM_GPUS > 1:
    print(f"\n✅ Multi-GPU training via HF Trainer (DDP)")
    print(f"   Effective batch: {CONFIG['batch_size']} x {NUM_GPUS} x {CONFIG['gradient_accumulation_steps']} = {CONFIG['batch_size'] * NUM_GPUS * CONFIG['gradient_accumulation_steps']}")


🔍 Detected 2 GPU(s)
   GPU 0: Tesla T4 (14.7 GB)
   GPU 1: Tesla T4 (14.7 GB)

✅ Multi-GPU training via HF Trainer (DDP)
   Effective batch: 8 x 2 x 4 = 64
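
The effective batch size printed above is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. Here is a quick worked check for both environments, assuming the Target 1 batch size of 8 and a single GPU on Colab; `effective_batch` is an illustrative helper.

```python
def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * max(num_gpus, 1) * grad_accum

kaggle = effective_batch(per_device=8, num_gpus=2, grad_accum=4)    # 8 x 2 x 4  = 64
colab = effective_batch(per_device=8, num_gpus=1, grad_accum=16)    # 8 x 1 x 16 = 128
print(kaggle, colab)
```

With the settings shown, the two environments therefore take optimizer steps over different effective batch sizes (64 vs. 128), which is worth keeping in mind when comparing runs that share a learning rate.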


In [6]:
import glob, random, numpy as np, pandas as pd
from pathlib import Path
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Dict, Any

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset  # NOTE: shadows the `Dataset` enum defined in the configuration cell

from transformers import Trainer, TrainingArguments, EvalPrediction
from sklearn.metrics import accuracy_score

from data.multi_source_data import MultiSourceDataLoader
from data.data_augmentation import QuoteAugmenter
from data.curriculum_loader import DifficultyClassifier, CurriculumSampler, CurriculumConfig
from evaluation.confidence_calibration import TemperatureScaling
from models.max_performance_model import MaxPerformanceSpeakerModel
from losses.focal_loss import CombinedLoss

# CURSOR: Deterministic setup for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(CONFIG['seed'])
print(f"GPUs: {NUM_GPUS} | FP16: {CONFIG['fp16']}")


2025-12-10 05:03:48.434798: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765343028.649424      88 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765343028.708201      88 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

GPUs: 2 | FP16: True


In [None]:
# Use the richer model with span/mask support

class QuoteDataset(Dataset):
    def __init__(self, samples, tokenizer, max_length=512, augment=False, augmenter: QuoteAugmenter = None):
        self.samples, self.tok, self.max_len = samples, tokenizer, max_length
        self.augment = augment
        self.augmenter = augmenter
        self.par_id = self.tok.convert_tokens_to_ids("[PAR]")
        self.altq_id = self.tok.convert_tokens_to_ids("[ALTQUOTE]")

    def __len__(self):
        return len(self.samples)

    def _maybe_augment_text(self, text: str) -> str:
        if not self.augment or not self.augmenter:
            return text
        
        # CURSOR: Multi-strategy augmentation for better generalization
        # Apply augmentation with 50% probability (increased from 20%)
        if random.random() > 0.5:
            return text
        
        augmented = text
        try:
            # CURSOR: Strategy 1 - Synonym replacement (40% chance, 3-5 words)
            if random.random() < 0.4:
                n_synonyms = random.randint(3, 5)
                augmented = self.augmenter.synonym_replace(augmented, protected_spans=[], n=n_synonyms)
            
            # CURSOR: Strategy 2 - Random word insertion (25% chance)
            if random.random() < 0.25:
                augmented = self.augmenter.random_insert(augmented, protected_spans=[], n=1)
            
            # CURSOR: Strategy 3 - Random word swap (20% chance)
            if random.random() < 0.2:
                augmented = self.augmenter.random_swap(augmented, protected_spans=[], n=1)
            
            # CURSOR: Strategy 4 - Random word deletion (15% chance, very light)
            if random.random() < 0.15:
                augmented = self.augmenter.random_delete(augmented, protected_spans=[], p=0.03)
            
            return augmented
        except Exception:
            return text

    def _encode(self, sample):
        base_text = self._maybe_augment_text(sample['text'])
        base_ids = self.tok.encode(base_text, add_special_tokens=False)
        candidates = sample['candidates']
        cand_ids = [self.tok.encode(c, add_special_tokens=False) for c in candidates]

        reserved = 1 + 1 + sum(1 + len(ci) for ci in cand_ids)
        room = max(self.max_len - reserved, 8)
        if len(base_ids) > room:
            base_ids = base_ids[:room]

        tokens = [self.par_id] + base_ids + [self.altq_id]
        quote_mask = [1] * len(tokens)

        # CURSOR: Track candidate start/end positions, build tokens first
        cand_spans = []
        for ci in cand_ids:
            tokens.append(self.par_id)
            start = len(tokens)
            tokens.extend(ci)
            end = len(tokens)
            cand_spans.append((start, end))

        if not cand_spans:
            tokens.append(self.par_id)
            cand_spans.append((len(tokens), len(tokens)))

        # CURSOR: Create masks with uniform length (final token length)
        final_len = len(tokens)
        cand_masks = []
        for start, end in cand_spans:
            mask = [0] * final_len
            for i in range(start, min(end, final_len)):
                mask[i] = 1
            cand_masks.append(mask)

        # CURSOR: Extend quote_mask to match final token length (candidates added 0s)
        if len(quote_mask) < final_len:
            quote_mask += [0] * (final_len - len(quote_mask))

        tokens = tokens[: self.max_len]
        attention = [1] * len(tokens)
        if len(tokens) < self.max_len:
            pad_len = self.max_len - len(tokens)
            tokens += [self.tok.pad_token_id] * pad_len
            attention += [0] * pad_len
            quote_mask += [0] * pad_len
            cand_masks = [cm + [0] * pad_len for cm in cand_masks]
        else:
            quote_mask = quote_mask[: self.max_len]
            cand_masks = [cm[: self.max_len] for cm in cand_masks]

        return tokens, attention, quote_mask, cand_masks

    def __getitem__(self, idx):
        sample = self.samples[idx]
        tokens, attention, quote_mask, cand_masks = self._encode(sample)
        label_idx = sample['gold_index'] if sample['gold_index'] >= 0 else -100
        return {
            'input_ids': torch.tensor(tokens, dtype=torch.long),
            'attention_mask': torch.tensor(attention, dtype=torch.long),
            'quote_mask': torch.tensor(quote_mask, dtype=torch.long),
            'candidate_masks': [torch.tensor(cm, dtype=torch.long) for cm in cand_masks],
            'label_idx': torch.tensor(label_idx, dtype=torch.long),
            'quote_id': sample['quote_id']
        }


class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, label_smoothing=0.1):
        super().__init__()
        self.gamma, self.ls = gamma, label_smoothing
    def forward(self, inputs, targets):
        smoothed = targets.float() * (1 - self.ls) + 0.5 * self.ls
        probs = torch.sigmoid(inputs)
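        # Use the logits form: F.binary_cross_entropy on probabilities is not autocast-safe under FP16
        ce = F.binary_cross_entropy_with_logits(inputs, smoothed, reduction='none')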
        pt = torch.where(targets > 0, probs, 1 - probs).clamp(min=1e-6, max=1-1e-6)
        return ((1 - pt) ** self.gamma * ce).mean()


# Data helpers

def _add_hard_negatives(samples, topk):
    """Add hard negative candidates from frequent speakers across the dataset."""
    if not topk or not samples:
        return samples
    freq = {}
    for s in samples:
        for c in s['candidates']:
            freq[c] = freq.get(c, 0) + 1
    sorted_cands = [c for c, _ in sorted(freq.items(), key=lambda x: -x[1])]
    for s in samples:
        existing = set(s['candidates'])
        extras = [c for c in sorted_cands if c not in existing and c != s['gold']][:topk]
        s['candidates'] = s['candidates'] + extras
    return samples


def _convert_multi_source_sample(s, idx, source='unknown'):
    """Convert MultiSourceDataLoader sample format to QuoteDataset format."""
    gold = s.get('speaker', '')
    text = s.get('text') or s.get('quote') or ''
    source = s.get('source', source)
    genre = s.get('genre', source)
    book_id = s.get('book_id', '')
    
    if not gold or not text:
        return None
    
    qid = f"{source}:{book_id}:{idx}" if book_id else f"{source}:{idx}"
    return {
        'quote_id': qid,
        'text': text,
        'candidates': [gold],
        'gold': gold,
        'genre': genre,
        'source': source,
        'book_id': book_id
    }


def load_pdnc_data_via_multi_source(base_path: str, n_folds: int = 5, seed: int = 42):
    """
    Load PDNC data using MultiSourceDataLoader and create k-fold splits by book.
    
    Returns:
        List of (train, val, test) tuples for each fold
    """
    loader = MultiSourceDataLoader(base_path=base_path, datasets=['pdnc'], seed=seed)
    loader.load_all()
    
    all_samples = []
    for genre_samples in loader.data_by_genre.values():
        all_samples.extend(genre_samples)
    
    if not all_samples:
        print("⚠️ No PDNC samples loaded via MultiSourceDataLoader")
        return []
    
    # CURSOR: Group samples by book_id for leave-book-out cross-validation
    by_book = defaultdict(list)
    for i, s in enumerate(all_samples):
        book_id = s.get('book_id', 'unknown')
        by_book[book_id].append((i, s))
    
    book_ids = sorted(by_book.keys())
    n_books = len(book_ids)
    
    if n_books < n_folds:
        print(f"⚠️ Only {n_books} books, adjusting to {n_books} folds")
        n_folds = max(1, n_books)
    
    # CURSOR: Assign books to folds for leave-x-out cross-validation
    random.seed(seed)
    shuffled_books = book_ids.copy()
    random.shuffle(shuffled_books)
    
    fold_book_assignments = [[] for _ in range(n_folds)]
    for i, book_id in enumerate(shuffled_books):
        fold_book_assignments[i % n_folds].append(book_id)
    
    folds_data = []
    for fold_idx in range(n_folds):
        test_books = set(fold_book_assignments[fold_idx])
        val_books = set(fold_book_assignments[(fold_idx + 1) % n_folds])
        train_books = set(shuffled_books) - test_books - val_books
        
        train_samples, val_samples, test_samples = [], [], []
        
        for book_id, samples_list in by_book.items():
            for i, s in samples_list:
                converted = _convert_multi_source_sample(s, i, 'pdnc')
                if converted is None:
                    continue
                if book_id in test_books:
                    test_samples.append(converted)
                elif book_id in val_books:
                    val_samples.append(converted)
                else:
                    train_samples.append(converted)
        
        folds_data.append((train_samples, val_samples, test_samples))
    
    return folds_data


def load_datasets(base_path: str, datasets: list):
    """
    Load datasets using MultiSourceDataLoader and convert to training format.
    Works for single or multiple datasets. Uses unified sample conversion.
    """
    loader = MultiSourceDataLoader(base_path=base_path, datasets=datasets, seed=CONFIG['seed'])
    loader.load_all()
    
    # Use the proper split_by_genre method from the module
    train_samples, val_samples, test_samples = loader.split_by_genre(
        val_ratio=0.1,
        test_ratio=0.1
    )
    
    # CURSOR: Convert using unified helper function
    train_converted = [_convert_multi_source_sample(s, i) for i, s in enumerate(train_samples)]
    train_converted = [s for s in train_converted if s is not None]
    
    val_converted = [_convert_multi_source_sample(s, i) for i, s in enumerate(val_samples)]
    val_converted = [s for s in val_converted if s is not None]
    
    test_converted = [_convert_multi_source_sample(s, i) for i, s in enumerate(test_samples)]
    test_converted = [s for s in test_converted if s is not None]
    
    # Add hard negatives
    train_converted = _add_hard_negatives(train_converted, CONFIG.get('hard_negative_topk', 0))
    val_converted = _add_hard_negatives(val_converted, CONFIG.get('hard_negative_topk', 0))
    test_converted = _add_hard_negatives(test_converted, CONFIG.get('hard_negative_topk', 0))
    
    # Update gold_index after adding hard negatives
    for s in train_converted + val_converted + test_converted:
        s['gold_index'] = s['candidates'].index(s['gold']) if s['gold'] in s['candidates'] else -1
    
    return train_converted, val_converted, test_converted
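
One thing to watch in the helpers above: `_convert_multi_source_sample` starts each candidate list with the gold speaker and `_add_hard_negatives` appends extras after it, so `gold_index` comes out as 0 for every sample unless the list is reordered elsewhere. Below is a hedged sketch of shuffling candidates per sample before computing `gold_index`. `shuffle_candidates` is a hypothetical helper, not part of the training repo.

```python
import random

def shuffle_candidates(samples, seed=42):
    """Return copies of the samples with candidates shuffled and gold_index recomputed."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    shuffled = []
    for s in samples:
        cands = list(s["candidates"])
        rng.shuffle(cands)
        gold_index = cands.index(s["gold"]) if s["gold"] in cands else -1
        shuffled.append({**s, "candidates": cands, "gold_index": gold_index})
    return shuffled

# Toy samples in the same layout: gold first, then two extra candidates
demo = [{"gold": f"CHAR_{i}", "candidates": [f"CHAR_{i}", "A", "B"]} for i in range(50)]
positions = {s["gold_index"] for s in shuffle_candidates(demo)}
print(f"gold positions seen after shuffling: {sorted(positions)}")
```

Returning copies rather than mutating in place also keeps repeated calls on the same fold from compounding changes.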


In [8]:
# Load Data
datasets_to_load = CONFIG.get('datasets', ['pdnc'])
# CURSOR: Use PDNC folds when pdnc is in datasets (via MultiSourceDataLoader)
use_pdnc_folds = 'pdnc' in datasets_to_load
other_datasets = [d for d in datasets_to_load if d != 'pdnc']

if use_pdnc_folds:
    fold_selection = CONFIG.get('fold_selection', [0])

    if fold_selection == "all":
        FOLDS_TO_TRAIN = list(range(5))
    elif isinstance(fold_selection, list):
        FOLDS_TO_TRAIN = fold_selection
    else:
        FOLDS_TO_TRAIN = [int(fold_selection)]

    print(f"📋 PDNC Folds to train: {FOLDS_TO_TRAIN}")
else:
    FOLDS_TO_TRAIN = [0]  # Single iteration for multi-dataset
    print(f"📋 Training with datasets: {datasets_to_load}")

# CURSOR: Pre-load all PDNC folds via MultiSourceDataLoader (unified approach)
pdnc_folds_data = None
if use_pdnc_folds:
    print(f"📂 Loading PDNC data via MultiSourceDataLoader...")
    pdnc_folds_data = load_pdnc_data_via_multi_source(
        base_path=CONFIG['multi_source_base'],
        n_folds=5,
        seed=CONFIG['seed']
    )
    if pdnc_folds_data:
        print(f"   ✅ Created {len(pdnc_folds_data)} folds from PDNC data")
    else:
        print("   ⚠️ Failed to load PDNC folds, falling back to load_datasets")
        use_pdnc_folds = False

def load_pdnc_fold(fold_idx: int):
    """Load train/val/test data for a specific PDNC fold via MultiSourceDataLoader."""
    if pdnc_folds_data is None or fold_idx >= len(pdnc_folds_data):
        print(f"   ⚠️ Fold {fold_idx} not available, using load_datasets fallback")
        return load_datasets(CONFIG['multi_source_base'], ['pdnc'])
    
    train_data, val_data, test_data = pdnc_folds_data[fold_idx]
    
    # CURSOR: Add hard negatives and gold_index
    train_data = _add_hard_negatives(train_data, CONFIG.get('hard_negative_topk', 0))
    val_data = _add_hard_negatives(val_data, CONFIG.get('hard_negative_topk', 0))
    test_data = _add_hard_negatives(test_data, CONFIG.get('hard_negative_topk', 0))
    
    for s in train_data + val_data + test_data:
        s['gold_index'] = s['candidates'].index(s['gold']) if s['gold'] in s['candidates'] else -1
    
    # CURSOR: Combine with other datasets if any
    if other_datasets:
        print(f"   + Adding datasets: {other_datasets}")
        other_train, other_val, other_test = load_datasets(CONFIG['multi_source_base'], other_datasets)
        train_data = train_data + other_train
        val_data = val_data + other_val
        test_data = test_data + other_test
    
    return train_data, val_data, test_data

# CURSOR: Data loading is now handled inside the training loop for multi-fold support
# Preview first fold stats only
if use_pdnc_folds and pdnc_folds_data:
    preview_train, preview_val, preview_test = pdnc_folds_data[FOLDS_TO_TRAIN[0]]
    print(f"\n📊 Preview - Fold {FOLDS_TO_TRAIN[0]} stats (before hard negatives):")
    print(f"   Train quotes: {len(preview_train)}")
    print(f"   Val quotes: {len(preview_val)}")
    print(f"   Test quotes: {len(preview_test)}")
    print(f"\n🔄 Will train {len(FOLDS_TO_TRAIN)} fold(s): {FOLDS_TO_TRAIN}")
else:
    print(f"📂 Will load datasets: {datasets_to_load}")
    print(f"🔄 Single training run (no cross-validation folds)")


📋 PDNC Folds to train: [0, 1, 2, 3, 4]
📂 Loading PDNC data via MultiSourceDataLoader...

📂 Loading PDNC from /kaggle/working/speaker-attribution-acl2023/data/pdnc...
  Found PDNC training data at /kaggle/working/speaker-attribution-acl2023/data/pdnc/training/data/pdnc
  Found 15 quote files in leave-x-out splits
  Processing: quotes.dev.txt (4223 lines)
    First line preview: AgeOfInnocence	Q1-0	CHAR_35	52	[[18, 19, 1, "CHAR_35"], [92, 94, 0, "CHAR_15"]]	glance flitting back to the young girl w...
  Loaded 161542 samples, parse_errors=16788
   ✅ Loaded 161,542 samples from PDNC
   ✅ Created 5 folds from PDNC data

📊 Preview - Fold 0 stats (before hard negatives):
   Train quotes: 89705
   Val quotes: 29085
   Test quotes: 42752

🔄 Will train 5 fold(s): [0, 1, 2, 3, 4]


In [9]:
!rm -rf /kaggle/working/quote-attribution-training

In [None]:
# =============================================================================
# TRAINING SETUP AND HELPERS
# =============================================================================

# CURSOR: Disable gradient checkpointing for multi-GPU (causes backward graph conflicts)
USE_GRADIENT_CHECKPOINTING = NUM_GPUS == 1 and CONFIG.get('gradient_checkpointing', False)
print(f"⚙️ Gradient checkpointing: {USE_GRADIENT_CHECKPOINTING} (disabled for multi-GPU)")

# CURSOR: Collate function for variable-length candidate masks
def collate_fn(batch):
    max_cands = max(len(item['candidate_masks']) for item in batch)
    input_ids = torch.stack([b['input_ids'] for b in batch])
    attention_mask = torch.stack([b['attention_mask'] for b in batch])
    quote_mask = torch.stack([b['quote_mask'] for b in batch])
    cand_masks, cand_attn = [], []
    for b in batch:
        masks = b['candidate_masks']
        orig_len = len(masks) or 1
        if not masks:
            masks = [torch.zeros_like(b['input_ids'])]
        pad_count = max_cands - orig_len
        if pad_count > 0:
            masks = masks + [torch.zeros_like(masks[0])] * pad_count
        cand_masks.append(torch.stack(masks))
        cand_attn.append(torch.tensor([1] * orig_len + [0] * pad_count, dtype=torch.long))
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'quote_mask': quote_mask,
        'candidate_masks': torch.stack(cand_masks),
        'candidate_attention_mask': torch.stack(cand_attn),
        'labels': torch.stack([b['label_idx'] for b in batch]),
    }

# CURSOR: Simple label smoothing cross entropy (stable with gradient checkpointing)
class LabelSmoothingCE(nn.Module):
    def __init__(self, smoothing=0.1, ignore_index=-100):
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
    
    def forward(self, logits, targets):
        # CURSOR: Mask out ignored indices
        mask = targets != self.ignore_index
        if not mask.any():
            return torch.tensor(0.0, device=logits.device, requires_grad=True)
        
        logits = logits[mask]
        targets = targets[mask]
        
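        logits = logits.float()
        log_probs = F.log_softmax(logits, dim=-1)
        
        # CURSOR: Create smoothed targets, spreading the smoothing mass only over real
        # candidates (padded slots arrive masked to a large negative logit)
        with torch.no_grad():
            valid = logits > -1e8
            n_other = (valid.sum(dim=-1, keepdim=True) - 1).clamp(min=1)
            smooth_targets = (self.smoothing / n_other) * valid.to(log_probs.dtype)
            smooth_targets.scatter_(1, targets.unsqueeze(1), 1 - self.smoothing)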
        
        loss = -(smooth_targets * log_probs).sum(dim=-1).mean()
        return loss

# CURSOR: Custom Trainer for our model
class QuoteAttributionTrainer(Trainer):
    """Custom trainer that handles our model's unique input format."""
    
    def __init__(self, loss_fn=None, **kwargs):
        super().__init__(**kwargs)
        self.custom_loss_fn = loss_fn
        self.ce_loss = nn.CrossEntropyLoss(ignore_index=-100)
    
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop('labels')
        logits, _ = model(
            inputs['input_ids'],
            inputs['attention_mask'],
            inputs['quote_mask'],
            inputs['candidate_masks'],
            inputs['candidate_attention_mask']
        )
        if inputs['candidate_attention_mask'] is not None:
            logits = logits.masked_fill(inputs['candidate_attention_mask'] == 0, -1e9)
        
        if self.custom_loss_fn is not None:
            loss = self.custom_loss_fn(logits, labels)
        else:
            loss = self.ce_loss(logits, labels)
        
        return (loss, {'logits': logits}) if return_outputs else loss

# CURSOR: Metrics function
def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    mask = labels >= 0
    acc = (preds[mask] == labels[mask]).mean()
    return {'accuracy': float(acc)}

# CURSOR: Loss function - use simple label smoothing CE for stability
def get_loss_fn():
    smoothing = CONFIG.get('label_smoothing', 0.1)
    if smoothing > 0:
        return LabelSmoothingCE(smoothing=smoothing)
    return None

os.makedirs(CONFIG['output_dir'], exist_ok=True)
print("✅ Training helpers ready!")

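**Label smoothing with padded candidates (illustration)**

`compute_loss` fills padded candidate slots with `-1e9` before the loss is computed, so it matters where the smoothing mass goes. The sketch below works through the arithmetic in plain Python rather than on tensors, spreading the smoothing mass over real candidates only. The function names (`smoothed_targets`, `masked_log_softmax`, `smoothed_ce`) are hypothetical and exist only for this illustration.

```python
import math

def smoothed_targets(num_candidates, target_idx, valid, smoothing=0.1):
    """Build a smoothed target distribution over the valid candidates only."""
    n_valid = sum(valid)
    if n_valid <= 1:
        # Only one real candidate: nothing to smooth towards
        return [1.0 if i == target_idx else 0.0 for i in range(num_candidates)]
    off_value = smoothing / (n_valid - 1)
    return [
        (1.0 - smoothing) if i == target_idx else (off_value if valid[i] else 0.0)
        for i in range(num_candidates)
    ]

def masked_log_softmax(logits, valid):
    """Log-softmax restricted to valid slots; padded slots get -inf."""
    m = max(l for l, v in zip(logits, valid) if v)
    log_z = m + math.log(sum(math.exp(l - m) for l, v in zip(logits, valid) if v))
    return [l - log_z if v else float("-inf") for l, v in zip(logits, valid)]

def smoothed_ce(logits, target_idx, valid, smoothing=0.1):
    """Cross entropy against the smoothed targets for one sample."""
    targets = smoothed_targets(len(logits), target_idx, valid, smoothing)
    log_probs = masked_log_softmax(logits, valid)
    # Skip zero-weight slots so 0 * -inf never appears
    return -sum(t * lp for t, lp in zip(targets, log_probs) if t > 0.0)

# Four candidate slots; the last one is padding
logits = [2.0, 0.5, -1.0, 0.0]
valid = [True, True, True, False]
targets = smoothed_targets(4, target_idx=0, valid=valid)
loss = smoothed_ce(logits, target_idx=0, valid=valid)
print(targets, loss)
```

The padded slot receives zero target mass, so it contributes nothing to the loss regardless of how negative its logit is.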

In [None]:
# =============================================================================
# MULTI-FOLD TRAINING LOOP
# =============================================================================
# CURSOR: Trains all selected folds sequentially
# Each fold gets fresh model weights for proper cross-validation
# Models are saved per-fold for ensemble inference

import gc

# CURSOR: Track results across all folds
fold_results = {}
all_fold_accuracies = []

print("=" * 70)
print(f"🚀 MULTI-FOLD TRAINING: {CONFIG['name']}")
print(f"   Folds to train: {FOLDS_TO_TRAIN}")
print(f"   GPUs: {NUM_GPUS} | Batch/GPU: {CONFIG['batch_size']}")
print(f"   Effective batch: {CONFIG['batch_size'] * NUM_GPUS * CONFIG['gradient_accumulation_steps']}")
print("=" * 70)

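# CURSOR: fold_num is the 1-based position in FOLDS_TO_TRAIN; fold_idx is the actual fold index
for fold_num, fold_idx in enumerate(FOLDS_TO_TRAIN, start=1):
    print(f"\n{'='*70}")
    print(f"📂 FOLD {fold_num}/{len(FOLDS_TO_TRAIN)} (index={fold_idx})")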
    print(f"{'='*70}")
    
    # CURSOR: Load data for this fold
    if use_pdnc_folds:
        print(f"   Loading PDNC fold {fold_idx}...")
        train_samples, val_samples, test_samples = load_pdnc_fold(fold_idx)
    else:
        print(f"   Loading datasets: {datasets_to_load}...")
        train_samples, val_samples, test_samples = load_datasets(CONFIG['multi_source_base'], datasets_to_load)
    
    print(f"   Train: {len(train_samples)} | Val: {len(val_samples)} | Test: {len(test_samples)}")
    
    # CURSOR: Create fresh model for each fold (important for proper cross-validation)
    print(f"   Initializing fresh model...")
    set_seed(CONFIG['seed'] + fold_idx)  # Different seed per fold for diversity
    model = MaxPerformanceSpeakerModel(CONFIG.get('base_model', 'microsoft/deberta-v3-large'))
    if USE_GRADIENT_CHECKPOINTING:
        model.encoder.gradient_checkpointing_enable()
    tokenizer = model.get_tokenizer()
    
    # CURSOR: Create augmenter
    augmenter = QuoteAugmenter(seed=CONFIG['seed'] + fold_idx) if CONFIG.get('use_augmentation', False) else None
    
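    # CURSOR: Apply curriculum sorting if enabled (shortest samples first).
    # Note: Trainer's default train sampler shuffles the dataset, so this ordering
    # only survives if the train sampler is overridden to be sequential.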
    if CONFIG.get('use_curriculum', False):
        train_samples = sorted(train_samples, key=lambda s: len(s['text']))
    
    # CURSOR: Create datasets
    train_dataset = QuoteDataset(
        train_samples, tokenizer, CONFIG['max_length'],
        augment=CONFIG.get('use_augmentation', False), augmenter=augmenter
    )
    val_dataset = QuoteDataset(val_samples, tokenizer, CONFIG['max_length'])
    
    # CURSOR: Fold-specific output directory
    fold_output_dir = f"{CONFIG['output_dir']}/fold_{fold_idx}"
    os.makedirs(fold_output_dir, exist_ok=True)
    
    # CURSOR: Without gradient checkpointing, cap the per-GPU batch at 2
    # DeBERTa-large needs ~6GB per sample, T4 has 15GB, so batch=2 is safe
    effective_batch = CONFIG['batch_size'] if USE_GRADIENT_CHECKPOINTING else min(2, CONFIG['batch_size'])
    # CURSOR: Increase grad accum to maintain effective batch size (never below the configured value)
    effective_grad_accum = CONFIG['gradient_accumulation_steps'] * max(1, CONFIG['batch_size'] // effective_batch)
    print(f"   Batch/GPU: {effective_batch} | Grad accum: {effective_grad_accum} | Effective: {effective_batch * NUM_GPUS * effective_grad_accum}")
    
    # CURSOR: Training arguments for this fold
    training_args = TrainingArguments(
        output_dir=fold_output_dir,
        num_train_epochs=CONFIG['epochs'],
        per_device_train_batch_size=effective_batch,
        per_device_eval_batch_size=effective_batch,
        gradient_accumulation_steps=effective_grad_accum,
        learning_rate=CONFIG['lr'],
        weight_decay=0.01,
        fp16=CONFIG['fp16'],
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        
        # Evaluation and saving
        eval_strategy="steps",
        eval_steps=CONFIG['eval_every'],
        save_strategy="steps",
        save_steps=CONFIG['checkpoint_every'],
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        
        # Logging
        logging_steps=100,
        logging_first_step=True,
        report_to="none",
        
        # Performance - use 0 workers for multi-GPU stability
        dataloader_num_workers=0,
        dataloader_pin_memory=True,
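        remove_unused_columns=False,
        # Name the label key explicitly so evaluation passes labels to compute_metrics
        # even if the model's forward() has no `labels` argument
        label_names=["labels"],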
        
        # Seed
        seed=CONFIG['seed'] + fold_idx,
    )
    
    # CURSOR: Create trainer for this fold
    trainer = QuoteAttributionTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=collate_fn,
        compute_metrics=compute_metrics,
        loss_fn=get_loss_fn(),
    )
    
    # CURSOR: Train this fold
    print(f"\n   🏋️ Training fold {fold_idx}...")
    train_result = trainer.train()
    
    # CURSOR: Evaluate on validation set
    eval_results = trainer.evaluate()
    fold_accuracy = eval_results.get('eval_accuracy', 0.0)
    all_fold_accuracies.append(fold_accuracy)
    
    # CURSOR: Save best model for this fold
    best_model_path = f"{fold_output_dir}/best_model"
    trainer.save_model(best_model_path)
    
    # CURSOR: Store results
    fold_results[fold_idx] = {
        'accuracy': fold_accuracy,
        'train_loss': train_result.training_loss,
        'model_path': best_model_path,
    }
    
    print(f"\n   ✅ Fold {fold_idx} complete!")
    print(f"      Accuracy: {fold_accuracy:.4f}")
    print(f"      Model saved: {best_model_path}")
    
    # CURSOR: Clean up GPU memory before next fold
    del model, trainer, train_dataset, val_dataset
    gc.collect()
    torch.cuda.empty_cache()

# =============================================================================
# TRAINING SUMMARY
# =============================================================================
print(f"\n{'='*70}")
print("🏆 MULTI-FOLD TRAINING COMPLETE!")
print(f"{'='*70}")

print(f"\n📊 Results per fold:")
for fold_idx, results in fold_results.items():
    print(f"   Fold {fold_idx}: Accuracy = {results['accuracy']:.4f}")

if len(all_fold_accuracies) > 1:
    mean_acc = np.mean(all_fold_accuracies)
    std_acc = np.std(all_fold_accuracies)
    print(f"\n📈 Cross-validation summary:")
    print(f"   Mean accuracy: {mean_acc:.4f} ± {std_acc:.4f}")
    print(f"   Min: {min(all_fold_accuracies):.4f} | Max: {max(all_fold_accuracies):.4f}")
else:
    print(f"\n📈 Single fold accuracy: {all_fold_accuracies[0]:.4f}")

print(f"\n📁 Models saved to: {CONFIG['output_dir']}/fold_*/best_model")
print(f"{'='*70}")

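**Effective batch size (illustration)**

The training loop above lowers the per-GPU batch size when gradient checkpointing is off and raises gradient accumulation to compensate. The arithmetic can be sketched as follows. `plan_batching` and its parameters are hypothetical names used only here, and the ceiling division is an assumption about how to treat batch sizes that do not divide evenly.

```python
def plan_batching(batch_per_gpu, grad_accum, num_gpus, max_batch_per_gpu):
    """Cap the per-GPU batch and scale gradient accumulation to compensate."""
    capped = max(1, min(batch_per_gpu, max_batch_per_gpu))
    # Ceiling division: never accumulate fewer samples than originally configured
    scale = -(-batch_per_gpu // capped)
    new_accum = grad_accum * scale
    effective = capped * num_gpus * new_accum
    return capped, new_accum, effective

# Configured: 8 samples per GPU, 2 accumulation steps, 2 GPUs (effective batch 32)
plan = plan_batching(batch_per_gpu=8, grad_accum=2, num_gpus=2, max_batch_per_gpu=2)
print(plan)  # (per-GPU batch, accumulation steps, effective batch)
```

In this example the effective batch stays at 32 while the per-GPU memory footprint drops to two samples per step.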

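**Ensembling the per-fold models (illustration)**

The per-fold models are saved for ensemble inference, but this section does not show how their outputs are combined. One common option, assumed here rather than taken from the notebook, is to average each fold's softmax probabilities over the real candidates and pick the highest average. The sketch uses plain Python lists in place of tensors; `softmax` and `ensemble_predict` are hypothetical helpers, and real candidates are assumed to occupy the first `num_valid` slots, matching the padding layout produced by `collate_fn`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(fold_logits, num_valid):
    """Average per-fold probabilities over the first `num_valid` candidates."""
    avg = [0.0] * num_valid
    for logits in fold_logits:
        probs = softmax(logits[:num_valid])  # drop padded slots before normalizing
        avg = [a + p / len(fold_logits) for a, p in zip(avg, probs)]
    best = max(range(num_valid), key=avg.__getitem__)
    return best, avg

# Logits for one quote from three fold models; the fourth slot is padding
fold_logits = [
    [1.2, 0.3, -0.5, -1e9],
    [0.1, 0.9, -0.2, -1e9],
    [1.5, 0.2, 0.0, -1e9],
]
best, avg = ensemble_predict(fold_logits, num_valid=3)
print(best, avg)
```

Averaging probabilities rather than raw logits keeps one overconfident fold from dominating the vote. Whether that trade-off suits this task would need to be checked on the validation folds.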