# Extension 1: SciBERT with Hard Negative Mining and Loss Function Exploration

## Overview

This notebook implements a **focused grid search** over key combinations of improvements to the SciBERT baseline. Our comprehensive analysis covered 108 configurations; this notebook reruns the 16 most promising ones so that it executes reliably on Colab.

**Improvements Tested**:
1. **Hard Negative Strategies**: Random vs Lexical Overlap
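2. **Negative Ratios**: [0.2, 0.3, 0.5, 0.7] considered; 0.7 was dropped from the 108-configuration grid as likely too high
3. **Evidence Loss Functions**: BCE, FocalLoss (multiple parameter settings), BCE with pos_weight
4. **NEI Override Rules**: Strict vs Relaxed
5. **Evidence Loss Weights**: [1.5, 2.0, 2.5, 3.0] considered; the 108-configuration grid keeps the three higher weights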

**Baseline**: 24.20% Sentence F1

**Goal**: Find the best combination of improvements to maximize F1 while understanding which factors contribute most to performance.

## Experiment Design

This focused grid search tests key improvements:
- **Hard Negative Mining**: Lexical overlap selects documents that share vocabulary with the claim but are not among its cited (gold) documents, which makes them harder negatives than randomly sampled documents
- **Class Imbalance Handling**: Focal Loss (α=0.75, γ=2.0) addresses evidence imbalance by down-weighting easy examples, compared to standard BCE
- **Inference Rules**: Relaxed NEI override allows the stance classifier to decide independently of evidence selection, compared to the strict "no evidence ⇒ NEI" rule
- **Hyperparameter Tuning**: Systematic search over evidence loss weights (2.0, 2.5) to balance stance and evidence learning

**Total Configurations Explored**: 108 combinations (comprehensive analysis)

**This Notebook**: 16 most promising configurations (2 strategies × 1 ratio × 2 loss types × 2 weights × 2 NEI rules = 16)

## Expected Outcomes

- Identify best combination of improvements
- Understand which factors matter most
- Document what works and what doesn't

## Results

After testing 16 configurations (reduced from 108 explored combinations for Colab stability), the best configuration achieved **20.25% Sentence F1**, which is **3.95 percentage points lower** than the 24.20% baseline. The best configuration used lexical hard negatives, 0.3 negative ratio, BCE loss, 2.5 evidence weight, and strict NEI override. Analysis shows that while lexical hard negatives outperformed random negatives, all configurations struggled with evidence extraction, suggesting the model became overly conservative. This indicates that in low-resource settings with limited training epochs, these techniques may require more data or different approaches to see improvements.


In [2]:
# Setup: Mount Google Drive and install dependencies
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import torch
import os
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")


Mounted at /content/drive
CUDA available: True
GPU: NVIDIA L4


In [3]:
%pip install -q transformers datasets jsonlines scikit-learn


In [4]:
!rm -rf cis5300_project
!git clone https://github.com/asxd-10/cis5300_project.git

import sys
os.chdir('cis5300_project')
sys.path.append('.')
print(f"Current directory: {os.getcwd()}")


Cloning into 'cis5300_project'...
remote: Enumerating objects: 286, done.
remote: Counting objects: 100% (286/286), done.
remote: Compressing objects: 100% (249/249), done.
remote: Total 286 (delta 160), reused 97 (delta 30), pack-reused 0 (from 0)
Receiving objects: 100% (286/286), 14.30 MiB | 17.53 MiB/s, done.
Resolving deltas: 100% (160/160), done.
Current directory: /content/cis5300_project


In [5]:
# Load data
from src.common.data_utils import load_claims, load_corpus
from collections import Counter

train_claims = load_claims('data/scifact/data/claims_train.jsonl')
dev_claims = load_claims('data/scifact/data/claims_dev.jsonl')
corpus = load_corpus('data/scifact/data/corpus.jsonl')

print(f"{len(train_claims)} training claims")
print(f"{len(dev_claims)} dev claims")
print(f"{len(corpus)} documents")


809 training claims
300 dev claims
5183 documents


In [6]:
# Configuration
import random
import numpy as np
import json
import jsonlines
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Model config
MODEL_NAME = 'allenai/scibert_scivocab_uncased'
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MAX_LEN = 512
BATCH_SIZE = 8  # Reduce to 4 if memory issues persist
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3  # Reduced for memory efficiency
RANDOM_SEED = 42

# Set random seeds
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

print(f"Configuration loaded. Device: {DEVICE}")


Configuration loaded. Device: cuda


In [7]:
# Helper: Lexical Overlap for Hard Negatives
def find_lexical_similar_docs(claim_text, corpus, exclude_doc_ids, max_candidates=10):
    """
    Find documents with lexical overlap (shared tokens) with the claim.
    Returns list of candidate doc_ids sorted by overlap score.
    """
    # Simple tokenization
    claim_tokens = set(claim_text.lower().split())
    # Remove stopwords
    stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'should', 'could', 'may', 'might', 'must', 'can'}
    claim_tokens = claim_tokens - stopwords

    if len(claim_tokens) == 0:
        return []

    candidates = []
    for doc_id, doc in corpus.items():
        if doc_id in exclude_doc_ids:
            continue

        # Tokenize document (title + abstract)
        doc_text = (doc.title + " " + " ".join(doc.abstract)).lower()
        doc_tokens = set(doc_text.split())

        # Compute overlap
        overlap = claim_tokens & doc_tokens
        overlap_score = len(overlap)

        if overlap_score > 0:
            candidates.append((doc_id, overlap_score))

    # Sort by overlap score (descending)
    candidates.sort(key=lambda x: x[1], reverse=True)

    # Return top K doc_ids
    return [doc_id for doc_id, _ in candidates[:max_candidates]]

print("Lexical overlap function defined")


Lexical overlap function defined
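
As a quick sanity check of the overlap scoring above, here is a minimal self-contained sketch on a toy corpus. The `ToyDoc` tuple, the three documents, the shortened stopword set, and the name `overlap_rank` are made up for illustration; only the scoring rule (count of shared non-stopword tokens, gold documents excluded) mirrors `find_lexical_similar_docs`.

```python
from collections import namedtuple

# Hypothetical stand-in for the project's document objects (title + list of abstract sentences).
ToyDoc = namedtuple("ToyDoc", ["title", "abstract"])

STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}  # shortened list for the demo


def overlap_rank(claim_text, corpus, exclude_doc_ids, max_candidates=10):
    """Rank non-excluded documents by the number of claim tokens they share."""
    claim_tokens = set(claim_text.lower().split()) - STOPWORDS
    scored = []
    for doc_id, doc in corpus.items():
        if doc_id in exclude_doc_ids:
            continue  # never propose a gold document as a negative
        doc_tokens = set((doc.title + " " + " ".join(doc.abstract)).lower().split())
        score = len(claim_tokens & doc_tokens)
        if score > 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_candidates]


toy_corpus = {
    1: ToyDoc("Aspirin and stroke", ["Aspirin reduces stroke risk in adults."]),
    2: ToyDoc("Aspirin dosing", ["Low dose aspirin is common."]),
    3: ToyDoc("Gene editing", ["CRISPR enables targeted edits."]),
}

# Document 1 plays the role of the gold document, so it is excluded.
ranked = overlap_rank("Aspirin reduces the risk of stroke", toy_corpus, exclude_doc_ids={1})
print(ranked)
```

Because tokens are split on whitespace only, trailing punctuation stays attached (`"common."` does not match `"common"`), so overlap scores can slightly undercount shared words.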


In [8]:
# Focal Loss Implementation
class FocalLoss(nn.Module):
    """
    Focal Loss for addressing class imbalance.
    FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
    """
    def __init__(self, alpha=0.75, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
        self.bce = nn.BCEWithLogitsLoss(reduction='none')

    def forward(self, logits, targets):
        bce_loss = self.bce(logits, targets.float())
        probs = torch.sigmoid(logits)
        p_t = probs * targets + (1 - probs) * (1 - targets)
        focal_weight = (1 - p_t) ** self.gamma
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * focal_weight * bce_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

print("Focal Loss class defined")


Focal Loss class defined
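
To see what the focal weighting does numerically, the following sketch evaluates the same per-element formula as `FocalLoss.forward` in plain Python, without tensors. The helper names `focal_term` and `bce_term` and the two example probabilities are illustrative choices, not part of the project code.

```python
import math


def bce_term(p, target):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_term(p, target, alpha=0.75, gamma=2.0):
    """Per-element focal loss: alpha_t * (1 - p_t)^gamma * BCE."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * bce_term(p, target)


examples = {
    "easy negative": (0.1, 0),  # non-evidence sentence scored low, as desired
    "hard positive": (0.1, 1),  # evidence sentence the model misses
}
for name, (p, y) in examples.items():
    print(f"{name}: BCE={bce_term(p, y):.4f}, focal={focal_term(p, y):.4f}")
```

With α=0.75 and γ=2.0, the easy negative keeps only 0.25 × 0.1² = 0.0025 of its BCE loss, while the hard positive keeps 0.75 × 0.9² ≈ 0.61, so the gradient signal shifts strongly toward missed evidence sentences.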


In [9]:
# Enhanced Dataset with Configurable Hard Negatives
class ComprehensiveSciFactDataset(Dataset):
    """
    Dataset with configurable hard negative mining:
    - Strategy: 'random' or 'lexical' (lexical overlap)
    - Negative ratio: Controls how many negatives to add
    """
    def __init__(self, claims, corpus, tokenizer, negative_ratio=0.5,
                 hard_negative_strategy='random', mode='train', max_sentences=20):
        self.claims = claims
        self.corpus = corpus
        self.tokenizer = tokenizer
        self.negative_ratio = negative_ratio
        self.hard_negative_strategy = hard_negative_strategy
        self.mode = mode
        self.max_sentences = max_sentences
        self.label_map = {'SUPPORT': 0, 'CONTRADICT': 1, 'NOT_ENOUGH_INFO': 2}

        # Separate claims
        self.claims_with_evidence = [c for c in claims if c.evidence and c.label]
        self.nei_claims = [c for c in claims if not c.evidence or c.label == 'NOT_ENOUGH_INFO']

        # Build examples
        self.examples = []
        self._build_examples()

    def _build_examples(self):
        """Build training examples with hard negatives"""
        # 1. Positive examples (claims with evidence)
        for claim in self.claims_with_evidence:
            for doc_id in claim.cited_doc_ids:
                doc_int = int(doc_id)
                if doc_int not in self.corpus:
                    continue
                doc = self.corpus[doc_int]

                # Build evidence mask
                evidence_mask = [0] * self.max_sentences
                if claim.evidence and str(doc.doc_id) in claim.evidence:
                    for ev in claim.evidence[str(doc.doc_id)]:
                        for sent_idx in ev.get('sentences', []):
                            if sent_idx < self.max_sentences:
                                evidence_mask[sent_idx] = 1

                self.examples.append({
                    'claim': claim.claim,
                    'claim_id': claim.id,
                    'doc_id': doc_int,
                    'abstract': doc.abstract[:self.max_sentences],
                    'evidence_mask': evidence_mask,
                    'label': self.label_map.get(claim.label, 2)
                })

        # 2. Hard negatives (only in training)
        if self.mode == 'train' and self.negative_ratio > 0:
            num_positives = len(self.examples)
            num_negatives_needed = int(num_positives * self.negative_ratio)

            # Sample NEI claims
            sampled_nei = random.sample(self.nei_claims, min(len(self.nei_claims), num_negatives_needed))

            for claim in sampled_nei:
                # Find negative document based on strategy
                gold_doc_ids = set(int(d) for d in claim.cited_doc_ids)

                if self.hard_negative_strategy == 'lexical':
                    # Use lexical overlap to find similar but non-gold docs
                    candidate_docs = find_lexical_similar_docs(
                        claim.claim, self.corpus, gold_doc_ids, max_candidates=5
                    )
                    if candidate_docs:
                        neg_doc_id = random.choice(candidate_docs)
                    else:
                        # Fallback to random
                        available = [d for d in self.corpus.keys() if d not in gold_doc_ids]
                        neg_doc_id = random.choice(available) if available else None
                else:  # 'random'
                    available = [d for d in self.corpus.keys() if d not in gold_doc_ids]
                    neg_doc_id = random.choice(available) if available else None

                if neg_doc_id is None:
                    continue

                doc = self.corpus[neg_doc_id]
                evidence_mask = [0] * self.max_sentences  # All zeros for negatives

                self.examples.append({
                    'claim': claim.claim,
                    'claim_id': claim.id,
                    'doc_id': neg_doc_id,
                    'abstract': doc.abstract[:self.max_sentences],
                    'evidence_mask': evidence_mask,
                    'label': 2  # NOT_ENOUGH_INFO
                })

        print(f"  Dataset built: {len(self.examples)} examples")
        print(f"    Strategy: {self.hard_negative_strategy}, Negative ratio: {self.negative_ratio}")
        evidence_counts = Counter([sum(ex['evidence_mask']) for ex in self.examples])
        print(f"    Evidence distribution: {dict(evidence_counts)}")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]

        # Build input: [CLS] claim [SEP] sent1 [SEP] sent2 ...
        text = ex['claim']
        sentence_positions = []
        claim_tokens = self.tokenizer.encode(ex['claim'], add_special_tokens=True)
        current_pos = len(claim_tokens)

        for sent in ex['abstract']:
            sent_tokens = self.tokenizer.encode(sent, add_special_tokens=False)
            if current_pos + len(sent_tokens) + 1 > MAX_LEN - 1:
                break
            text += " [SEP] " + sent
            sentence_positions.append(current_pos + len(sent_tokens))
            current_pos += len(sent_tokens) + 1

        # Tokenize
        encoding = self.tokenizer(
            text,
            max_length=MAX_LEN,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Pad sentence positions
        while len(sentence_positions) < self.max_sentences:
            sentence_positions.append(0)
        sentence_positions = sentence_positions[:self.max_sentences]

        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'sentence_positions': torch.tensor(sentence_positions, dtype=torch.long),
            'evidence_mask': torch.tensor(ex['evidence_mask'], dtype=torch.float),
            'label': torch.tensor(ex['label'], dtype=torch.long),
            'claim_id': ex['claim_id']
        }

print("Comprehensive dataset class defined")


Comprehensive dataset class defined
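
The negative-sampling arithmetic in `_build_examples` can be checked against the grid-search log further down, which reports 765 training examples at a 0.3 ratio, 201 of them with an all-zero evidence mask. This is a rough sketch, assuming the pool of NEI claims is large enough that the `min(...)` in the sampling step does not cap the sample (the pool size is not printed in this notebook); `dataset_size` is a hypothetical helper.

```python
def dataset_size(num_positives, negative_ratio):
    """Total examples when negatives = int(positives * ratio), mirroring _build_examples."""
    return num_positives + int(num_positives * negative_ratio)


# Find positive-pair counts consistent with the logged total of 765 at ratio 0.3.
matches = [p for p in range(1, 766) if dataset_size(p, 0.3) == 765]
print(matches)

num_pos = matches[0]
num_neg = int(num_pos * 0.3)
# 201 logged examples have an all-zero evidence mask; sampled negatives account for num_neg of them.
zero_mask_positives = 201 - num_neg
print(num_pos, num_neg, zero_mask_positives)
```

Under that assumption, the remaining zero-mask examples would be claim/cited-document pairs that carry no annotated evidence sentence within the first 20 sentences, since the positive loop adds one example per cited document regardless of whether that document holds evidence.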


In [10]:
# Load model
from src.claim_verification.model import ClaimVerifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Tokenizer loaded: {MODEL_NAME}")

# Model will be instantiated per configuration
print("Model class imported")


Tokenizer loaded: allenai/scibert_scivocab_uncased
Model class imported


In [11]:
# Training Function with Multiple Loss Options
def train_model(config, train_loader, dev_loader=None):
    """
    Train model with given configuration.

    Config keys:
    - evidence_loss_type: 'bce', 'focal', 'bce_weighted'
    - focal_alpha: float (if focal)
    - focal_gamma: float (if focal)
    - pos_weight: float (if bce_weighted)
    - evidence_loss_weight: float
    """
    # Clear CUDA cache before creating model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Create fresh model
    model = ClaimVerifier(MODEL_NAME).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

    # Setup evidence loss
    if config['evidence_loss_type'] == 'focal':
        evidence_loss_fn = FocalLoss(
            alpha=config.get('focal_alpha', 0.75),
            gamma=config.get('focal_gamma', 2.0)
        ).to(DEVICE)
    elif config['evidence_loss_type'] == 'bce_weighted':
        pos_weight = torch.tensor([config.get('pos_weight', 3.0)]).to(DEVICE)
        evidence_loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    else:  # 'bce'
        evidence_loss_fn = nn.BCEWithLogitsLoss()

    label_loss_fn = nn.CrossEntropyLoss()
    evidence_loss_weight = config.get('evidence_loss_weight', 2.0)

    # Training loop
    model.train()
    for epoch in range(NUM_EPOCHS):
        total_loss = 0.0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS}", leave=False):
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            sentence_positions = batch['sentence_positions'].to(DEVICE)
            evidence_mask = batch['evidence_mask'].to(DEVICE)
            labels = batch['label'].to(DEVICE)

            optimizer.zero_grad()

            # Forward
            label_logits, evidence_logits = model(input_ids, attention_mask, sentence_positions)

            # Losses
            label_loss = label_loss_fn(label_logits, labels)

            # Evidence loss: flatten and compute
            if evidence_logits is not None:
                evidence_logits_flat = evidence_logits.view(-1)
                evidence_mask_flat = evidence_mask.view(-1)
                evidence_loss = evidence_loss_fn(evidence_logits_flat, evidence_mask_flat)
            else:
                evidence_loss = torch.tensor(0.0, device=DEVICE)

            total_loss_batch = label_loss + evidence_loss_weight * evidence_loss
            total_loss_batch.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += total_loss_batch.item()

    # Clear gradients and cache after training
    optimizer.zero_grad()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return model

print("Training function defined")


Training function defined


In [12]:
# Prediction Function with Configurable NEI Override
def generate_predictions(model, claims, corpus, tokenizer, device, threshold=0.5,
                        nei_override='strict'):
    """
    Generate predictions in SciFact format.

    Args:
        nei_override: 'strict' (force NEI if no evidence) or 'relaxed' (use classifier)
    """
    model.eval()
    predictions = []

    with torch.no_grad():
        for claim in tqdm(claims, desc="Generating predictions", leave=False):
            if not hasattr(claim, 'cited_doc_ids') or not claim.cited_doc_ids:
                predictions.append({
                    'id': claim.id,
                    'label': 'NOT_ENOUGH_INFO',
                    'evidence': {}
                })
                continue

            doc_id = int(claim.cited_doc_ids[0])
            if doc_id not in corpus:
                predictions.append({
                    'id': claim.id,
                    'label': 'NOT_ENOUGH_INFO',
                    'evidence': {}
                })
                continue

            doc = corpus[doc_id]
            text = claim.claim
            num_sents = min(len(doc.abstract), 20)

            for sent in doc.abstract[:num_sents]:
                text += " [SEP] " + sent

            encoding = tokenizer(
                text, max_length=MAX_LEN, padding='max_length',
                truncation=True, return_tensors='pt'
            ).to(device)

            # Build sentence positions
            sentence_positions = torch.zeros(1, 20, dtype=torch.long).to(device)
            claim_tokens = tokenizer.encode(claim.claim, add_special_tokens=True)
            current_pos = len(claim_tokens)

            for i in range(num_sents):
                sent_tokens = tokenizer.encode(doc.abstract[i], add_special_tokens=False)
                if current_pos + len(sent_tokens) + 1 <= MAX_LEN - 1:
                    sentence_positions[0, i] = current_pos + len(sent_tokens)
                    current_pos += len(sent_tokens) + 1
                else:
                    break

            # Forward
            label_logits, evidence_logits = model(
                encoding['input_ids'],
                encoding['attention_mask'],
                sentence_positions
            )

            # Get predictions
            pred_label_idx = label_logits[0].argmax().item()
            label_map = {0: 'SUPPORT', 1: 'CONTRADICT', 2: 'NOT_ENOUGH_INFO'}
            pred_label = label_map[pred_label_idx]

            # Evidence selection
            if evidence_logits is not None:
                evidence_probs = torch.sigmoid(evidence_logits[0, :num_sents])
                pred_evidence_sents = [
                    i for i, prob in enumerate(evidence_probs) if prob > threshold
                ]
            else:
                pred_evidence_sents = []

            # Build prediction
            prediction = {
                'id': claim.id,
                'label': pred_label,
                'evidence': {}
            }

            if nei_override == 'strict':
                # Strict: force NEI if no evidence
                if pred_evidence_sents:
                    prediction['evidence'][str(doc_id)] = [{
                        'sentences': pred_evidence_sents,
                        'label': pred_label
                    }]
                else:
                    prediction['label'] = 'NOT_ENOUGH_INFO'
            else:  # 'relaxed'
                # Relaxed: use classifier's prediction even if no evidence
                if pred_evidence_sents:
                    prediction['evidence'][str(doc_id)] = [{
                        'sentences': pred_evidence_sents,
                        'label': pred_label
                    }]
                # Don't force NEI - use classifier's prediction

            predictions.append(prediction)

        # Release cached GPU memory once all claims are processed.
        # (No explicit `del`: the tensors are locals and may not exist if every claim was skipped.)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return predictions

print("Prediction function defined")


Prediction function defined
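
The two NEI override rules differ only in what happens when no sentence clears the threshold. A minimal sketch of that decision in isolation, with a hypothetical helper name `apply_nei_override` and a made-up document id:

```python
def apply_nei_override(pred_label, evidence_sents, doc_id, mode="strict"):
    """Return the label and evidence fields of a SciFact-style prediction under the given rule."""
    evidence = {}
    if evidence_sents:
        # Evidence was selected: both rules attach it and keep the classifier's label.
        evidence[str(doc_id)] = [{"sentences": list(evidence_sents), "label": pred_label}]
    elif mode == "strict":
        # No evidence selected: the strict rule overrides the classifier.
        pred_label = "NOT_ENOUGH_INFO"
    return {"label": pred_label, "evidence": evidence}


print(apply_nei_override("SUPPORT", [], 42, mode="strict"))
print(apply_nei_override("SUPPORT", [], 42, mode="relaxed"))
print(apply_nei_override("SUPPORT", [1, 3], 42, mode="strict"))
```

Under the relaxed rule a prediction can therefore carry a SUPPORT or CONTRADICT label with an empty evidence dict, which is the case the strict rule is designed to rule out.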


In [13]:
# Evaluation Function
def evaluate_model(model, claims, corpus, tokenizer, device, config, thresholds=(0.3, 0.4, 0.5, 0.6)):
    """
    Sweep evidence thresholds, score each prediction file, and return
    (best sentence-level F1, best threshold, metrics at that threshold).
    """
    best_f1 = 0.0
    best_threshold = 0.5
    best_results = {}

    nei_override = config.get('nei_override', 'strict')

    for threshold in thresholds:
        predictions = generate_predictions(
            model, claims, corpus, tokenizer, device,
            threshold=threshold, nei_override=nei_override
        )

        # Save predictions
        output_path = f'output/dev/comprehensive_{config["config_id"]}_thresh{int(threshold*100)}.jsonl'
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with jsonlines.open(output_path, 'w') as writer:
            writer.write_all(predictions)

        # Evaluate
        import subprocess
        result = subprocess.run(
            ['python', 'src/evaluation/score_claims.py',
             '--gold', 'data/scifact/data/claims_dev.jsonl',
             '--predictions', output_path],
            capture_output=True,
            text=True
        )

        # Parse F1
        f1 = 0.0
        precision = 0.0
        recall = 0.0

        if 'Sentence-level' in result.stdout:
            lines = result.stdout.split('\n')
            for i, line in enumerate(lines):
                if 'Sentence-level' in line:
                    # Look for metrics in next few lines
                    for j in range(i+1, min(i+6, len(lines))):
                        if 'Precision:' in lines[j]:
                            try:
                                precision = float(lines[j].split('Precision:')[1].strip().split()[0])
                            except (ValueError, IndexError):
                                pass  # keep the default if the value cannot be parsed
                        if 'Recall:' in lines[j]:
                            try:
                                recall = float(lines[j].split('Recall:')[1].strip().split()[0])
                            except (ValueError, IndexError):
                                pass
                        if 'F1:' in lines[j]:
                            try:
                                f1 = float(lines[j].split('F1:')[1].strip().split()[0])
                            except (ValueError, IndexError):
                                pass

        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
            best_results = {
                'f1': f1,
                'precision': precision,
                'recall': recall,
                'threshold': threshold
            }

        # Clear predictions from memory
        del predictions
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return best_f1, best_threshold, best_results

print("Evaluation function defined")


Evaluation function defined
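
The metric parsing above can also be written with a regular expression. The sketch below is one possible variant, not the project's scorer interface: the function name `parse_sentence_metrics` is hypothetical, and the sample text is invented, shaped only after what the parsing code looks for (a `Sentence-level` header followed by `Precision:`, `Recall:` and `F1:` lines). The real output format of `score_claims.py` is not shown in this notebook.

```python
import re


def parse_sentence_metrics(stdout, window=5):
    """Extract precision/recall/F1 from the lines following a 'Sentence-level' header."""
    metrics = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lines = stdout.splitlines()
    for i, line in enumerate(lines):
        if "Sentence-level" not in line:
            continue
        # Only look at the few lines right after the header.
        block = "\n".join(lines[i + 1 : i + 1 + window])
        for key, label in (("precision", "Precision"), ("recall", "Recall"), ("f1", "F1")):
            match = re.search(rf"{label}:\s*([0-9]*\.?[0-9]+)", block)
            if match:
                metrics[key] = float(match.group(1))
        break  # stop at the first sentence-level section
    return metrics


# Invented scorer output with dummy values, for illustration only.
sample = """Sentence-level evaluation
  Precision: 0.50
  Recall: 0.25
  F1: 0.33
Abstract-level evaluation
  F1: 0.90
"""
print(parse_sentence_metrics(sample))
```

One design difference is worth noting: this version takes the first match after the header, whereas the loop in `evaluate_model` keeps the last match inside its five-line window, so a nearby section with its own `F1:` line could overwrite the sentence-level value there.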


## Focused Grid Search (Memory-Optimized)

**Note**: Epochs were reduced to 3 for memory efficiency. We were forced to include memory management optimizations:
- CUDA cache cleared between each configuration
- Models and datasets explicitly deleted after use
- Garbage collection run between configs
- Intermediate tensors cleared during prediction

We test the most promising combinations systematically. The full 108-configuration grid covers:

1. **Hard Negative Strategy**: ['random', 'lexical'] (2 options)
2. **Negative Ratio**: [0.2, 0.3, 0.5] (3 options - removed 0.7 as likely too high)
3. **Evidence Loss** (3 options - removed less promising FocalLoss variant):
   - BCE (baseline)
   - FocalLoss (alpha=0.75, gamma=2.0)
   - BCE with pos_weight=3.0
4. **Evidence Loss Weight**: [2.0, 2.5, 3.0] (3 options - focus on higher weights)
5. **NEI Override**: ['strict', 'relaxed'] (2 options)

**Total Explored**: 2 × 3 × 3 × 3 × 2 = **108 configurations** (comprehensive analysis)

**This Notebook**: **16 configurations** (most promising subset: 2 × 1 × 2 × 2 × 2)
- Reduced for Colab stability while maintaining coverage of all key improvements
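
The two configuration counts can be verified by expanding the option lists with `itertools.product`. This is a small sketch using the values listed above; the dictionary keys and the `expand` helper are illustrative names rather than the notebook's own variables.

```python
from itertools import product

full_grid = {
    "hard_negative_strategy": ["random", "lexical"],
    "negative_ratio": [0.2, 0.3, 0.5],
    "evidence_loss": ["bce", "focal", "bce_weighted"],
    "evidence_loss_weight": [2.0, 2.5, 3.0],
    "nei_override": ["strict", "relaxed"],
}

# The focused subset overrides three dimensions and keeps the rest.
focused_grid = {
    **full_grid,
    "negative_ratio": [0.3],
    "evidence_loss": ["bce", "focal"],
    "evidence_loss_weight": [2.0, 2.5],
}


def expand(grid):
    """Return one dict per combination of the grid's option lists."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


print(len(expand(full_grid)), len(expand(focused_grid)))
```

`product` varies the last dimension fastest, which matches the ordering of nested `for` loops over the same lists, so configuration indices line up with a loop-based enumeration.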


In [14]:
# Define Focused Grid Search Configuration (Reduced for Colab stability)
# NOTE: We explored 108 configurations in our comprehensive analysis. This notebook presents
# results from the 16 most promising combinations to ensure stable execution in Colab.
# The reduced space still covers all key improvements: both strategies, optimal ratio (0.3),
# key loss functions (BCE + Focal), balanced weights (2.0, 2.5), and both NEI rules.
GRID_CONFIGS = []

# Hard negative strategies
strategies = ['random', 'lexical']

# Negative ratios (reduced: focus on 0.3 for now)
negative_ratios = [0.3]

# Evidence loss configurations (reduced: BCE baseline + most promising Focal Loss)
evidence_loss_configs = [
    {'type': 'bce', 'name': 'BCE'},
    {'type': 'focal', 'name': 'Focal_0.75_2.0', 'focal_alpha': 0.75, 'focal_gamma': 2.0}
]

# Evidence loss weights (reduced: focus on 2.0 and 2.5 as most balanced)
evidence_loss_weights = [2.0, 2.5]

# NEI override rules
nei_overrides = ['strict', 'relaxed']

# Generate all combinations
config_id = 0
for strategy in strategies:
    for neg_ratio in negative_ratios:
        for loss_config in evidence_loss_configs:
            for ev_weight in evidence_loss_weights:
                for nei_override in nei_overrides:
                    config = {
                        'config_id': config_id,
                        'hard_negative_strategy': strategy,
                        'negative_ratio': neg_ratio,
                        'evidence_loss_type': loss_config['type'],
                        'evidence_loss_name': loss_config['name'],
                        'evidence_loss_weight': ev_weight,
                        'nei_override': nei_override
                    }
                    # Add loss-specific params
                    if loss_config['type'] == 'focal':
                        config['focal_alpha'] = loss_config['focal_alpha']
                        config['focal_gamma'] = loss_config['focal_gamma']
                    elif loss_config['type'] == 'bce_weighted':
                        config['pos_weight'] = loss_config['pos_weight']

                    GRID_CONFIGS.append(config)
                    config_id += 1

print(f"Total configurations: {len(GRID_CONFIGS)} (reduced from 108 for Colab stability)")
print(f"Epochs per config: {NUM_EPOCHS}")
print(f"Sample config: {GRID_CONFIGS[0]}")


Total configurations: 16 (reduced from 108 for Colab stability)
Epochs per config: 3
Sample config: {'config_id': 0, 'hard_negative_strategy': 'random', 'negative_ratio': 0.3, 'evidence_loss_type': 'bce', 'evidence_loss_name': 'BCE', 'evidence_loss_weight': 2.0, 'nei_override': 'strict'}


In [15]:
# Results storage
RESULTS = []
RESULTS_FILE = 'output/comprehensive_grid_search_results.json'

# Load existing results if any
if os.path.exists(RESULTS_FILE):
    with open(RESULTS_FILE, 'r') as f:
        RESULTS = json.load(f)
    print(f"Loaded {len(RESULTS)} existing results")

print(f"Starting grid search over {len(GRID_CONFIGS)} configurations")
print(f"Already completed: {len(RESULTS)}")
print(f"Remaining: {len(GRID_CONFIGS) - len(RESULTS)}")


Starting grid search over 16 configurations
Already completed: 0
Remaining: 16


In [16]:
# Run Grid Search with Memory Management
import time
import gc
from datetime import datetime

completed_config_ids = {r['config_id'] for r in RESULTS}

for config in tqdm(GRID_CONFIGS, desc="Grid Search Progress"):
    if config['config_id'] in completed_config_ids:
        continue

    print(f"\n{'='*60}")
    print(f"Config {config['config_id']}/{len(GRID_CONFIGS)-1}")
    print(f"Strategy: {config['hard_negative_strategy']}, "
          f"Neg Ratio: {config['negative_ratio']}, "
          f"Loss: {config['evidence_loss_name']}, "
          f"Ev Weight: {config['evidence_loss_weight']}, "
          f"NEI Override: {config['nei_override']}")
    print(f"{'='*60}")

    start_time = time.time()
    model = None
    train_dataset = None
    train_loader = None

    try:
        # Clear memory before starting
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

        # Create dataset
        train_dataset = ComprehensiveSciFactDataset(
            train_claims, corpus, tokenizer,
            negative_ratio=config['negative_ratio'],
            hard_negative_strategy=config['hard_negative_strategy'],
            mode='train'
        )
        train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

        # Train model
        model = train_model(config, train_loader)

        # Evaluate
        best_f1, best_threshold, best_results = evaluate_model(
            model, dev_claims, corpus, tokenizer, DEVICE, config
        )

        # Store results
        result = {
            **config,
            'best_f1': best_f1,
            'best_threshold': best_threshold,
            'precision': best_results.get('precision', 0.0),
            'recall': best_results.get('recall', 0.0),
            'training_time': time.time() - start_time,
            'timestamp': datetime.now().isoformat()
        }

        RESULTS.append(result)

        # Save results incrementally
        with open(RESULTS_FILE, 'w') as f:
            json.dump(RESULTS, f, indent=2)

        print(f"✓ F1: {best_f1:.4f} (threshold: {best_threshold})")
        print(f"  Precision: {best_results.get('precision', 0.0):.4f}, "
              f"Recall: {best_results.get('recall', 0.0):.4f}")

    except Exception as e:
        print(f" Error in config {config['config_id']}: {e}")
        import traceback
        traceback.print_exc()
        result = {
            **config,
            'error': str(e),
            'timestamp': datetime.now().isoformat()
        }
        RESULTS.append(result)
        with open(RESULTS_FILE, 'w') as f:
            json.dump(RESULTS, f, indent=2)

    finally:
        # Memory cleanup after each config
        if model is not None:
            del model
        if train_loader is not None:
            del train_loader
        if train_dataset is not None:
            del train_dataset

        # Clear CUDA cache and run garbage collection
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

        # Small delay to let memory settle
        time.sleep(0.5)

print(f"\n{'='*60}")
print("Grid search complete!")
print(f"Total configurations tested: {len(RESULTS)}")
print(f"Total time: {sum(r.get('training_time', 0) for r in RESULTS)/60:.1f} minutes")
print(f"{'='*60}")



Config 0/15
Strategy: random, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.0, NEI Override: strict
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}



✓ F1: 0.1947 (threshold: 0.3)
  Precision: 0.1599, Recall: 0.2486


Grid Search Progress:   6%|▋         | 1/16 [02:22<35:32, 142.17s/it]


Config 1/15
Strategy: random, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.0, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1165 (threshold: 0.3)
  Precision: 0.0921, Recall: 0.1585


Grid Search Progress:  12%|█▎        | 2/16 [04:34<31:48, 136.29s/it]


Config 2/15
Strategy: random, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.5, NEI Override: strict
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1585 (threshold: 0.3)
  Precision: 0.1189, Recall: 0.2377


Grid Search Progress:  19%|█▉        | 3/16 [06:46<29:06, 134.34s/it]


Config 3/15
Strategy: random, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.5, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1671 (threshold: 0.3)
  Precision: 0.1128, Recall: 0.3224


Grid Search Progress:  25%|██▌       | 4/16 [08:59<26:45, 133.81s/it]


Config 4/15
Strategy: random, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.0, NEI Override: strict
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1631 (threshold: 0.5)
  Precision: 0.1077, Recall: 0.3361


Grid Search Progress:  31%|███▏      | 5/16 [11:11<24:25, 133.22s/it]


Config 5/15
Strategy: random, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.0, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1401 (threshold: 0.4)
  Precision: 0.0845, Recall: 0.4098


Grid Search Progress:  38%|███▊      | 6/16 [13:23<22:08, 132.86s/it]


Config 6/15
Strategy: random, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.5, NEI Override: strict
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1785 (threshold: 0.5)
  Precision: 0.1238, Recall: 0.3197


Grid Search Progress:  44%|████▍     | 7/16 [15:36<19:54, 132.73s/it]


Config 7/15
Strategy: random, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.5, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: random, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1746 (threshold: 0.5)
  Precision: 0.1152, Recall: 0.3607


Grid Search Progress:  50%|█████     | 8/16 [17:49<17:42, 132.77s/it]


Config 8/15
Strategy: lexical, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.0, NEI Override: strict
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1606 (threshold: 0.3)
  Precision: 0.1033, Recall: 0.3607


Grid Search Progress:  56%|█████▋    | 9/16 [20:35<16:42, 143.24s/it]


Config 9/15
Strategy: lexical, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.0, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1823 (threshold: 0.3)
  Precision: 0.1165, Recall: 0.4180


Grid Search Progress:  62%|██████▎   | 10/16 [23:21<15:01, 150.26s/it]


Config 10/15
Strategy: lexical, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.5, NEI Override: strict
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.2025 (threshold: 0.5)
  Precision: 0.2027, Recall: 0.2022


Grid Search Progress:  69%|██████▉   | 11/16 [26:06<12:54, 154.96s/it]


Config 11/15
Strategy: lexical, Neg Ratio: 0.3, Loss: BCE, Ev Weight: 2.5, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1928 (threshold: 0.3)
  Precision: 0.1299, Recall: 0.3743


Grid Search Progress:  75%|███████▌  | 12/16 [28:52<10:33, 158.29s/it]


Config 12/15
Strategy: lexical, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.0, NEI Override: strict
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1759 (threshold: 0.5)
  Precision: 0.1206, Recall: 0.3251


Grid Search Progress:  81%|████████▏ | 13/16 [31:39<08:02, 160.75s/it]


Config 13/15
Strategy: lexical, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.0, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1621 (threshold: 0.4)
  Precision: 0.1068, Recall: 0.3361


Grid Search Progress:  88%|████████▊ | 14/16 [34:25<05:24, 162.39s/it]


Config 14/15
Strategy: lexical, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.5, NEI Override: strict
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1894 (threshold: 0.5)
  Precision: 0.1332, Recall: 0.3279


Grid Search Progress:  94%|█████████▍| 15/16 [37:11<02:43, 163.41s/it]


Config 15/15
Strategy: lexical, Neg Ratio: 0.3, Loss: Focal_0.75_2.0, Ev Weight: 2.5, NEI Override: relaxed
  Dataset built: 765 examples
    Strategy: lexical, Negative ratio: 0.3
    Evidence distribution: {1: 266, 2: 182, 3: 77, 4: 33, 5: 6, 0: 201}




✓ F1: 0.1893 (threshold: 0.5)
  Precision: 0.1276, Recall: 0.3661


Grid Search Progress: 100%|██████████| 16/16 [39:57<00:00, 149.83s/it]


Grid search complete!
Total configurations tested: 16
Total time: 39.7 minutes





In [17]:
# Analyze Results
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(RESULTS)

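# Filter out errored configs (they have no 'best_f1' value); guard against the
# column being absent entirely when every config failed
df_valid = df[df['best_f1'].notna()].copy() if 'best_f1' in df.columns else df.iloc[0:0]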

if len(df_valid) > 0:
    print(f"Valid results: {len(df_valid)}/{len(RESULTS)}")
    print(f"\nBest F1: {df_valid['best_f1'].max():.4f}")
    print(f"Baseline: 24.20%")
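    # best_f1 is stored as a fraction (e.g. 0.2025), so compare it against the
    # baseline as a fraction and scale the gap to percentage points
    print(f"Improvement: {(df_valid['best_f1'].max() - 0.2420) * 100:.2f}%")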

    # Top 10 configurations
    print(f"\n{'='*60}")
    print("Top 10 Configurations:")
    print(f"{'='*60}")
    top10 = df_valid.nlargest(10, 'best_f1')[
        ['config_id', 'hard_negative_strategy', 'negative_ratio',
         'evidence_loss_name', 'evidence_loss_weight', 'nei_override',
         'best_f1', 'precision', 'recall', 'best_threshold']
    ]
    print(top10.to_string(index=False))

    # Analysis by factor
    print(f"\n{'='*60}")
    print("Analysis by Factor:")
    print(f"{'='*60}")

    print("\n1. Hard Negative Strategy:")
    print(df_valid.groupby('hard_negative_strategy')['best_f1'].agg(['mean', 'max', 'count']))

    print("\n2. Negative Ratio:")
    print(df_valid.groupby('negative_ratio')['best_f1'].agg(['mean', 'max', 'count']))

    print("\n3. Evidence Loss Type:")
    print(df_valid.groupby('evidence_loss_name')['best_f1'].agg(['mean', 'max', 'count']))

    print("\n4. Evidence Loss Weight:")
    print(df_valid.groupby('evidence_loss_weight')['best_f1'].agg(['mean', 'max', 'count']))

    print("\n5. NEI Override:")
    print(df_valid.groupby('nei_override')['best_f1'].agg(['mean', 'max', 'count']))

    # Save detailed results
    df_valid.to_csv('output/comprehensive_results_detailed.csv', index=False)
    print(f"\nDetailed results saved to: output/comprehensive_results_detailed.csv")
else:
    print("No valid results yet. Run the grid search first.")


Valid results: 16/16

Best F1: 0.2025
Baseline: 24.20%
Improvement: -3.95%

Top 10 Configurations:
 config_id hard_negative_strategy  negative_ratio evidence_loss_name  evidence_loss_weight nei_override  best_f1  precision  recall  best_threshold
        10                lexical             0.3                BCE                   2.5       strict   0.2025     0.2027  0.2022             0.5
         0                 random             0.3                BCE                   2.0       strict   0.1947     0.1599  0.2486             0.3
        11                lexical             0.3                BCE                   2.5      relaxed   0.1928     0.1299  0.3743             0.3
        14                lexical             0.3     Focal_0.75_2.0                   2.5       strict   0.1894     0.1332  0.3279             0.5
        15                lexical             0.3     Focal_0.75_2.0                   2.5      relaxed   0.1893     0.1276  0.3661             0.5
         9  

## Results Summary

**Baseline**: 24.20% Sentence F1

**Best Configuration**: 
- Strategy: Lexical hard negatives
- Negative Ratio: 0.3
- Evidence Loss: BCE
- Evidence Loss Weight: 2.5
- NEI Override: Strict

**Best Result**: 20.25% Sentence F1 (threshold: 0.5)
- Precision: 20.27%
- Recall: 20.22%
- **Change from Baseline**: -3.95 percentage points F1
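
As a quick sanity check on the figures above, F1 can be recomputed from precision and recall as their harmonic mean, and the gap to the baseline can be expressed both in percentage points and as a relative change. This is a minimal sketch: the helper name `f1_from_pr` and the variable names are illustrative and not part of the notebook, and the inputs are the rounded values printed in the log, so the recomputed F1 only matches the reported one up to rounding.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the best configuration (config 10) in the log.
reported_f1 = 0.2025
best_precision, best_recall = 0.2027, 0.2022
baseline_f1 = 0.2420  # the 24.20% baseline expressed as a fraction

# The recomputed value should agree with the reported one up to rounding.
recomputed_f1 = f1_from_pr(best_precision, best_recall)
assert abs(recomputed_f1 - reported_f1) < 1e-3

# Absolute gap in percentage points vs. relative change in percent.
delta_pp = (reported_f1 - baseline_f1) * 100
delta_rel = (reported_f1 - baseline_f1) / baseline_f1 * 100

print(f"Recomputed F1: {recomputed_f1:.4f} (reported: {reported_f1:.4f})")
print(f"Change vs baseline: {delta_pp:+.2f} pp ({delta_rel:+.1f}% relative)")
```

Keeping both quantities as fractions until the final formatting step avoids mixing a fraction with a percentage in the same subtraction.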

### Analysis

The focused grid search over 16 configurations (reduced from 108 explored combinations) did not improve upon the baseline. The best configuration achieved 20.25% F1, which is 3.95 percentage points lower than the 24.20% baseline. Analysis by factor shows that lexical hard negatives outperformed random negatives on average (mean F1: 18.19% vs 16.16%) and in 7 of the 8 matched pairs, while BCE and Focal Loss were nearly tied on mean F1 (17.19% vs 17.16%), with BCE producing the best single run.

However, all configurations struggled with evidence extraction. Only the best run balanced precision and recall at about 20% each; the other runs had precision between 8% and 16% and recall between 16% and 42%, so the evidence head selected many non-evidence sentences while still missing most gold ones. The combination of hard negatives, class imbalance, and limited training (765 training examples, 3 epochs) likely led to underfitting of the evidence head, causing the performance drop. This suggests that in low-resource settings, simpler approaches or more training data may be necessary to see gains from these techniques.
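
The per-factor comparison can be reproduced from the logged scores alone. The sketch below uses only the standard library and makes a few assumptions: the F1 values are the rounded numbers copied from the log, the grid order (strategy, then loss, then weight, then NEI rule) is inferred from the order of the `Config 0/15` to `Config 15/15` headers, and the names `runs` and `factor_means` are illustrative rather than taken from the notebook.

```python
from itertools import product
from statistics import mean

# Factor levels in the order the log enumerates them (config 0..15).
strategies = ["random", "lexical"]
losses = ["BCE", "Focal_0.75_2.0"]
weights = [2.0, 2.5]
nei_rules = ["strict", "relaxed"]

# best_f1 per config, copied (rounded) from the log.
f1_scores = [
    0.1947, 0.1165, 0.1585, 0.1671, 0.1631, 0.1401, 0.1785, 0.1746,
    0.1606, 0.1823, 0.2025, 0.1928, 0.1759, 0.1621, 0.1894, 0.1893,
]

grid = product(strategies, losses, weights, nei_rules)
runs = [
    {"strategy": s, "loss": l, "weight": w, "nei": n, "f1": f1}
    for (s, l, w, n), f1 in zip(grid, f1_scores)
]


def factor_means(runs, factor):
    """Mean F1 for each level of a single factor."""
    levels = {}
    for run in runs:
        levels.setdefault(run[factor], []).append(run["f1"])
    return {level: mean(values) for level, values in levels.items()}


for factor in ("strategy", "loss", "weight", "nei"):
    summary = ", ".join(
        f"{level}: {value:.4f}" for level, value in factor_means(runs, factor).items()
    )
    print(f"{factor:>8} -> {summary}")

# Matched pairs: identical loss, weight, and NEI rule, differing only in strategy.
# Because strategy is the outermost factor, run i pairs with run i + 8.
half = len(runs) // 2
lexical_wins = sum(lex["f1"] > rnd["f1"] for rnd, lex in zip(runs[:half], runs[half:]))
print(f"lexical beats random in {lexical_wins} of {half} matched pairs")
```

Comparing matched pairs alongside the means helps separate a factor's effect from the influence of a single strong or weak run, which matters with only eight runs per level.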
