<a href="https://colab.research.google.com/github/XianWanLo/COMP90042-Group-Project/blob/main/GroupID__COMP90042_Project_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2025 COMP90042 Project
*Make sure you change the file name with your group id.*

# Readme
*If there is something to be noted for the marker, please mention here.*

*If you are planning to implement a program with Object Oriented Programming style, please put those the bottom of this ipynb file*

# 1.DataSet Processing
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

In [6]:
# === Standard-library imports =================================================
import os, json, pickle, math
from typing import Dict, List, Tuple
!pip install rank_bm25

# === Third-party imports ======================================================
import spacy                                # We use this for Tokenisation
from rank_bm25 import BM25Okapi             # We use this for BM25 retrieval (Supposed to be better than TF-IDF)
from sentence_transformers import SentenceTransformer, util  # SBERT encoding & cos-sim utils

# === paths ================================================
TRAIN_CLAIMS_PATH  = "data/train-claims.json"
DEV_CLAIMS_PATH    = "data/dev-claims.json"
EVIDENCE_PATH      = "data/evidence.json"

nlp = spacy.load("en_core_web_sm") # SpaCy model

def load_claims(claims_path: str) -> Dict:
    """Return claims from JSON file."""
    with open(claims_path) as f:
        claims = json.load(f)
    return claims

def load_evidences(evidence_path: str) -> Dict:
    """Return evidences from JSON file."""
    with open(evidence_path) as f:
        evidences = json.load(f)
    return evidences

def tokenise_cached(texts: List[str], cache_file: str) -> List[List[str]]:
    """Simple spaCy tokenisation that caches to disk."""
    # Attempt to retrieve cached data
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)

    # Prepare to tokenise the list of documents and save to a file
    out: List[List[str]] = []
    for doc in nlp.pipe(texts, batch_size=64):
        tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]
        out.append(tokens)
    with open(cache_file, "wb") as f:
        pickle.dump(out, f)
    return out


Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


# 2. Model Implementation
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

In [7]:
# === BM25 filtering =================================================
def bm25_candidates(
    claims: Dict[str, dict], evidences: Dict[str, str],
    top_k: int, ratio: float
) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
    """Return claim_ids, claim_texts, and BM25 top-k evidence IDs per claim."""
    claim_ids   = list(claims)
    claim_texts = [claims[cid]["claim_text"] for cid in claim_ids]

    evidence_ids, evidence_texts = zip(*[
        (eid, txt) for eid, txt in evidences.items() if txt
    ]) if evidences else ([], [])

    tok_e = tokenise_cached(list(evidence_texts), f"cache/evid_{len(evidence_texts)}.pkl")
    tok_c = tokenise_cached(claim_texts,          f"cache/claim_{len(claim_texts)}.pkl")

    bm25 = BM25Okapi(tok_e)
    k = top_k if len(evidence_ids) >= 500 else max(1, math.ceil(len(evidence_ids)*ratio))

    cand_map: Dict[str, List[str]] = {}
    for cid, toks in zip(claim_ids, tok_c):
        scores  = bm25.get_scores(toks)
        top_idx = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        cand_map[cid] = [evidence_ids[i] for i in top_idx]
    return claim_ids, claim_texts, cand_map

# === Sentence-Bert re-ranking =================================================
def sbert_rerank(
    claim_ids: List[str], claim_texts: List[str], cand_map: Dict[str, List[str]],
    evidences: Dict[str, str], model: SentenceTransformer, score_th: float
) -> Dict[str, dict]:
    """Rerank BM25 candidates with SBERT cosine similarity."""
    results: Dict[str, dict] = {}

    emb_claims = model.encode(claim_texts, convert_to_tensor=True)
    idx_of = {cid: i for i, cid in enumerate(claim_ids)}

    for cid in claim_ids:
        c_vec = emb_claims[idx_of[cid]]
        ev_scores = []
        for eid in cand_map[cid]:
            if eid not in evidences:
                continue
            e_vec = model.encode(evidences[eid], convert_to_tensor=True)
            score = util.cos_sim(c_vec, e_vec).item()
            ev_scores.append((eid, score))
        kept = [p for p in ev_scores if p[1] >= score_th] or sorted(ev_scores, key=lambda x:x[1], reverse=True)[:1]
        results[cid] = {
            "evidences": [eid for eid, _ in kept],
            "scores":    [round(s,4) for _, s in kept],
        }
    return results

# 3.Testing and Evaluation
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

In [None]:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

# === Parameters ================================================
BASE_MODEL_NAME    = "all-MiniLM-L6-v2"     # SBERT checkpoint (no fine-tune), TODO: Fine tune

TOP_K_FIXED        = 100    # BM25 candidates per claim (upper bound)
TOP_K_RATIO        = 0.20   # Ratio fallback when corpus is tiny
SBERT_SCORE_TH     = 0.93   # Cosine-similarity threshold for sentence-bert

LIMIT_DEV_CLAIMS   = True  # Quick-iteration switch, to limit the size of evidences for faster processing
LIMIT_COUNT        = 100

def evaluate(pred: dict, actual: dict):
    gold_sets = [set(actual[c]["evidences"])          for c in actual]
    pred_sets = [set(pred.get(c, {}).get("evidences", [])) for c in actual]

    mlb = MultiLabelBinarizer()                     # ← turns set-of-IDs → multi-hot vector
    y_true = mlb.fit_transform(gold_sets)
    y_pred = mlb.transform(pred_sets)               # use same classes_

    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro", zero_division=0
    )
    return rec, prec, f1

def main():
    print("Loading datasets…")
    train_claims = load_claims(TRAIN_CLAIMS_PATH) # Todo: TRAIN S-BERT
    development_claims   = load_claims(DEV_CLAIMS_PATH)
    evidence_corpus = load_evidences(EVIDENCE_PATH)

    if LIMIT_DEV_CLAIMS:
        selected_claim_ids = list(development_claims)[:LIMIT_COUNT]
        development_claims = {cid: development_claims[cid] for cid in selected_claim_ids}

        # Gather only the evidence IDs actually referenced by the claims in the limited dev claims
        required_evidence_ids = {
            e_id for claim in development_claims.values() for e_id in claim["evidences"]
        }
        evidence_corpus = {
            e_id: evidence_corpus[e_id]                    # keep text
            for e_id in required_evidence_ids
            if e_id in evidence_corpus
        }

    # --- BM25 retrieval ---
    print("BM25 candidate selection")
    candidate_ids, candidate_texts, candidate_map = bm25_candidates(
        development_claims,
        evidence_corpus,
        TOP_K_FIXED,
        TOP_K_RATIO
    )
    # candidate_ids – ascending order of claim IDs
    # candidate_texts – parallel list of raw claim strings
    # candidate_map – each claim mapped to its K best evidence IDs

    # --- SBERT rerank ---
    print("Loading SBERT model")
    sentence_bert = SentenceTransformer(BASE_MODEL_NAME)    # encodes claim & evidence texts

    print("SBERT reranking")
    predictions = sbert_rerank(
        candidate_ids,
        candidate_texts,
        candidate_map,
        evidence_corpus,
        sentence_bert,
        SBERT_SCORE_TH
    )
    # --- evaluation ---
    rec, prec, f1 = evaluate(predictions, development_claims)
    print(f"\nRecall: {rec:.4f} Precision: {prec:.4f} F1: {f1:.4f}")

    # --- save file ---
    with open("dev-claims-predictions.json", "w") as f:
        json.dump(predictions, f, indent=2)
    print("Exported predictions to dev-claims-predictions.json")

if __name__ == "__main__":
    main()


Loading datasets…
BM25 candidate selection
Loading SBERT model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

SBERT reranking

Recall: 0.1863 Precision: 0.6000 F1: 0.2844
Exported predictions to dev-claims-predictions.json


## Object Oriented Programming codes here

*You can use multiple code snippets. Just add more if needed*

Build Claim-Evidence Pair for classification training task.

In [None]:
import json

train_claims = load_claims(TRAIN_CLAIMS_PATH) # Todo: TRAIN S-BERT
development_claims   = load_claims(DEV_CLAIMS_PATH)
evidence_corpus = load_evidences(EVIDENCE_PATH)

# --- build claim evidence ---
print("build claim evidence set")

def build_claim_evidence_set(claims):

  data = []

  for claim_id, claim_info in claims.items():
      claim_text = claim_info["claim_text"]
      claim_label = claim_info["claim_label"]
      evidence_ids = claim_info["evidences"]

      # Get text for each evidence ID (if available)
      evidence_texts = []
      for eid in evidence_ids:
          if eid in evidence_corpus:
              evidence_texts.append(evidence_corpus[eid])
          else:
              print(f"Warning: Evidence ID {eid} not found in evidence.json")

      data.append({
          "claim_id": claim_id,
          "claim_text": claim_text,
          "evidences": evidence_texts,
          "claim_label": claim_label
      })

  return data

# --- save  train & dev claim evidence set ---
print("save  train & dev claim evidence set")

training_data = build_claim_evidence_set(train_claims)
dev_data = build_claim_evidence_set(development_claims)

with open("data/claim-evidence-train_set.json", "w") as f:
    json.dump(training_data, f, indent=2)

with open("data/claim-evidence-dev_set.json", "w") as f:
    json.dump(dev_data, f, indent=2)


build claim evidence set
save  train & dev claim evidence set


Data Preprocess & Loading

In [8]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch
import json

# Custom Dataset
class ClaimClassificationDataset(Dataset):
    def __init__(self, data_path, tokenizer, max_len=512):
        self.data = self.load_data(data_path)
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.label_map = {'SUPPORTS': 0, 'REFUTES': 1, 'NOT_ENOUGH_INFO': 2, 'DISPUTED': 3}

    def load_data(self, data_path):
        """Load and preprocess data from JSON."""
        with open(data_path, "r") as f:
            data = json.load(f)
        return data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        claim = item['claim_text']
        evidence = " [SEP] ".join(item['evidences'])
        inputs = self.tokenizer(claim + " [SEP] " + evidence,
                                truncation=True, padding='max_length',
                                max_length=self.max_len, return_tensors='pt')
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs['labels'] = torch.tensor(self.label_map[item['claim_label']])
        return inputs


def create_dataloaders(train_path, dev_path, tokenizer, batch_size=16, max_len=512):
    """
    Creates training and validation data loaders.

    Args:
        train_path (str): Path to training data JSON file.
        dev_path (str): Path to development data JSON file.
        tokenizer (transformers.Tokenizer): BERT tokenizer.
        batch_size (int): Batch size for data loaders.
        max_len (int): Maximum length for tokenization.

    Returns:
        DataLoader: Training and validation data loaders.
    """
    train_dataset = ClaimClassificationDataset(train_path, tokenizer, max_len)
    dev_dataset = ClaimClassificationDataset(dev_path, tokenizer, max_len)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)

    return train_loader, dev_loader



Classification Model 1 - Basic Bert Sequence Classification Model trained on claims & evidence

In [10]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW

# Define hyperparameters
BATCH_SIZE = 16
EPOCHS = 5
LEARNING_RATE = 2e-5
MAX_LEN = 512
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Paths
# TRAIN_PATH = "data/claim-evidence-set/train_set.json"
# DEV_PATH = "data/claim-evidence-set/dev_set.json"
TRAIN_PATH = "data/claim-evidence-train_set.json"
DEV_PATH = "data/claim-evidence-dev_set.json"

# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Create data loaders
train_loader, dev_loader = create_dataloaders(TRAIN_PATH, DEV_PATH, tokenizer, batch_size=BATCH_SIZE, max_len=MAX_LEN)

# Initialize BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
model = model.to(DEVICE)

# Define optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Loss function
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, train_loader, optimizer, scheduler):
    model.train()
    total_loss = 0

    for batch in train_loader:
        # Move data to device
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass
        loss.backward()
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    return avg_loss

def evaluate(model, dev_loader):
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for batch in dev_loader:
            input_ids = batch["input_ids"].to(DEVICE)
            attention_mask = batch["attention_mask"].to(DEVICE)
            labels = batch["labels"].to(DEVICE)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            # Compute accuracy
            logits = outputs.logits
            _, preds = torch.max(logits, dim=1)
            correct_predictions += (preds == labels).sum().item()
            total_samples += labels.size(0)

    avg_loss = total_loss / len(dev_loader)
    accuracy = correct_predictions / total_samples

    return avg_loss, accuracy

def train_model(model, train_loader, dev_loader, epochs):
    best_accuracy = 0

    print("START TRAINING: ")
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")

        # Training
        train_loss = train_one_epoch(model, train_loader, optimizer, scheduler)
        print(f"Training Loss: {train_loss:.4f}")

        # Evaluation
        val_loss, val_accuracy = evaluate(model, dev_loader)
        print(f"Validation Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}")

        # Save the best model
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            best_model = model.state_dict()
            torch.save(best_model, "best_model.pt")
            print("Model saved as 'best_model.pt'")

        #torch.save(best_model,"/content/drive/MyDrive/NLP/group_project/best_model.pt")

if __name__ == "__main__":
    train_model(model, train_loader, dev_loader, EPOCHS)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


START TRAINING: 
Epoch 1/5
Training Loss: 1.1487
Validation Loss: 0.9875, Accuracy: 0.6364
Model saved as 'best_model.pt'
Epoch 2/5
Training Loss: 0.9235
Validation Loss: 1.1163, Accuracy: 0.5260
Epoch 3/5
Training Loss: 0.7578
Validation Loss: 0.9710, Accuracy: 0.6104
Epoch 4/5
Training Loss: 0.6610
Validation Loss: 1.0345, Accuracy: 0.6234
Epoch 5/5
Training Loss: 0.6010
Validation Loss: 1.0348, Accuracy: 0.6169


Classification Model 2 -- In-context learning with Tiny LLAMA

In [1]:
### Generate Few-shot Examples ###

import os
import json
import random
import torch
import numpy as np
from collections import defaultdict
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_prototypical_examples(train_data, n_examples=12):
    """
    Select diverse, prototypical examples from training data only.
    Uses clustering to find representative examples of each class.
    """
    # Group training examples by class
    examples_by_class = defaultdict(list)
    for example in train_data:
        examples_by_class[example["claim_label"]].append(example)

    # Load sentence transformer
    encoder = SentenceTransformer('all-MiniLM-L6-v2')

    few_shot_examples = []
    classes = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]
    examples_per_class = max(1, n_examples // len(classes))

    for label in classes:
        if label not in examples_by_class or len(examples_by_class[label]) <= examples_per_class:
            few_shot_examples.extend(examples_by_class[label])
            continue

        # Encode all examples of this class
        candidates = examples_by_class[label]
        texts = [c["claim_text"] + " " + " ".join(c["evidences"]) for c in candidates]
        embeddings = encoder.encode(texts)

        # Use KMeans to find diverse examples
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=examples_per_class, random_state=42)
        kmeans.fit(embeddings)

        # Find examples closest to each cluster center
        centers = kmeans.cluster_centers_
        closest_indices = []

        for center in centers:
            distances = np.linalg.norm(embeddings - center, axis=1)
            closest_idx = np.argmin(distances)
            closest_indices.append(closest_idx)

        # Add these examples
        for idx in closest_indices:
            few_shot_examples.append(candidates[idx])

    # Format examples
    print("Few shot examples count:", len(few_shot_examples))
    formatted_examples = []
    for ex in few_shot_examples:
        formatted_examples.append({
            "claim": ex["claim_text"],
            "evidence": " ".join(ex["evidences"]),
            "label": ex["claim_label"]
        })

    return formatted_examples

In [4]:
## Training and Evaluation Functions

def save_model(model, tokenizer, save_path='./saved_model'):
    os.makedirs(save_path, exist_ok=True)
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Model and tokenizer saved to {save_path}")

def load_model(save_path='./saved_model'):
    if not os.path.exists(save_path):
        print(f"No saved model found at {save_path}. Loading from Hugging Face...")
        model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

        # Save for future use
        save_model(model, tokenizer, save_path)
    else:
        print(f"Loading saved model from {save_path}...")
        tokenizer = AutoTokenizer.from_pretrained(save_path)
        model = AutoModelForCausalLM.from_pretrained(save_path, torch_dtype=torch.float16)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    return model, tokenizer, device

def format_prompt(few_shot_examples, claim, evidence, max_examples=4):
    prompt = "<|system|>\nYou are a fact-checking assistant. Classify claims based on evidence.\n</s>\n"
    prompt += "<|user|>\n"

    # Add explanation of each category
    # prompt += "Classification guidelines:\n"
    # prompt += "- SUPPORTS (0): Evidence directly confirms the claim\n"
    # prompt += "- REFUTES (1): Evidence directly contradicts the claim\n"
    # prompt += "- NOT_ENOUGH_INFO (2): Evidence is insufficient to support or refute\n"
    # prompt += "- DISPUTED (3): Evidence presents conflicting information about the claim\n\n"

    # Add few-shot examples
    prompt += "Here are some examples:\n\n"

    # Use a subset of examples to ensure we fit within context window
    # Start with fewer examples than provided
    #limited_examples = few_shot_examples[:max_examples]

    for i, ex in enumerate(few_shot_examples, 1):
        prompt += f"Example {i}:\n"
        prompt += f"Claim: {ex['claim']}\n"
        prompt += f"Evidence: {ex['evidence']}\n"
        prompt += f"Classification: {ex['label']}\n\n"

    # Add target example
    prompt += "Now classify this:\n"
    prompt += f"Claim: {claim}\n"
    prompt += f"Evidence: {evidence}\n"
    # prompt += "Classification: "
    prompt += "Provide your answer in one word from following options: SUPPORTS, REFUTES, NOT_ENOUGH_INFO, DISPUTED."
    prompt += "\n</s>\n<|assistant|>\n"

    print(prompt)

    return prompt

def classify_claim(model, tokenizer, device, claim, evidence, few_shot_examples):

    max_examples = min(4, len(few_shot_examples))

    # Try to create a prompt that fits
    while max_examples > 0:
        prompt = format_prompt(few_shot_examples, claim, evidence, max_examples)

        # Check token length before sending to model
        tokens = tokenizer(prompt, return_tensors="pt", truncation=False)
        token_length = len(tokens["input_ids"][0])

        if token_length <= 2000:  # Leave some buffer space below the 2048 limit
            break

        # If still too long, reduce number of examples
        max_examples -= 1

        # If we need to reduce examples, prioritize keeping a balanced set
        # Ensure we keep at least one example of each class if possible
        if max_examples < len(set(ex["label"] for ex in few_shot_examples)):
            # Keep one example of each class if possible
            class_examples = {}
            for ex in few_shot_examples:
                if ex["label"] not in class_examples:
                    class_examples[ex["label"]] = ex

            # Create a list with one example per class
            few_shot_examples = list(class_examples.values())
            max_examples = len(few_shot_examples)

    # If we can't fit even with one example, truncate evidence
    if token_length > 2000 and evidence:
        # Truncate evidence to fit
        tokens_to_remove = token_length - 1900  # Leave more buffer
        evidence_tokens = tokenizer(evidence, truncation=False)["input_ids"]

        if len(evidence_tokens) > tokens_to_remove:
            # Truncate evidence
            truncated_evidence = tokenizer.decode(evidence_tokens[:-tokens_to_remove], skip_special_tokens=True)
            evidence = truncated_evidence + "..."

        # Try again with truncated evidence
        prompt = format_prompt(few_shot_examples[:max_examples], claim, evidence, max_examples)

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(device)

    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_new_tokens=10,
            do_sample=True,  # Enable sampling
            temperature=0.1,  # Low temperature for more deterministic outputs
            top_p=0.9
        )

    result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Extract just the label
    result = result.strip()
    print("result:", result)
    if "SUPPORTS" in result:
        return "SUPPORTS"
    elif "REFUTES" in result:
        return "REFUTES"
    elif "NOT_ENOUGH_INFO" in result:
        return "NOT_ENOUGH_INFO"
    elif "DISPUTED" in result:
        return "DISPUTED"
    else:
        # Fall back to first word if format isn't as expected
        return result.split()[0]

def evaluate_model(model, tokenizer, device, dev_data, train_data):

    few_shot_examples = select_prototypical_examples(train_data)

    correct = 0
    total = 0
    results = []

    for i, dev_instance in enumerate(tqdm(dev_data)):
        try:
            # Join evidences with space
            evidence = " ".join(dev_instance["evidences"])

            # Get model prediction
            prediction = classify_claim(
                model, tokenizer, device,
                dev_instance["claim_text"], evidence, few_shot_examples
            )

            # Check if correct
            is_correct = prediction == dev_instance["claim_label"]
            if is_correct:
                correct += 1
            total += 1

            # Save result
            results.append({
                "claim_id": dev_instance["claim_id"],
                "prediction": prediction,
                "ground_truth": dev_instance["claim_label"],
                "is_correct": is_correct
            })

            # Save checkpoint every 10 examples
            if (i + 1) % 10 == 0:
                with open('partial_results.json', 'w') as f:
                    json.dump(results, f)
                # Print progress
                current_accuracy = correct / total
                print(f"Processed {i+1}/{len(dev_data)}, Current accuracy: {current_accuracy:.4f}")
                # Clear GPU cache
                torch.cuda.empty_cache()

        except Exception as e:
            print(f"Error processing example {dev_instance['claim_id']}: {e}")
            # Save what we have so far
            with open('partial_results.json', 'w') as f:
                json.dump(results, f)

    if total > 0:
        accuracy = correct / total
        print(f"Final accuracy: {accuracy:.4f}")
    else:
        accuracy = 0
        print("No examples were successfully processed")

    return results, accuracy




In [5]:
## Full Pipeline

def main():
    # Load data
    with open('data/claim-evidence-train_set.json', 'r') as f:
        train_data = json.load(f)

    with open('data/claim-evidence-train_set.json', 'r') as f:
        dev_data = json.load(f)

    # Load model (will save it after first run)
    model, tokenizer, device = load_model()

    # Evaluate model
    results, accuracy = evaluate_model(model, tokenizer, device, dev_data, train_data)

    # Save results
    with open('results.json', 'w') as f:
        json.dump(results, f, indent=2)

    print(f"Final accuracy: {accuracy:.4f}")

if __name__ == "__main__":
    main()

Loading saved model from ./saved_model...
Few shot examples count: 12


  0%|          | 0/1228 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3399 > 2048). Running this sequence through the model will result in indexing errors


<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and other heat-trapping ("greenhou

  0%|          | 1/1228 [00:00<09:00,  2.27it/s]

result: SUPPORTS

Example 1:
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and o

  0%|          | 2/1228 [00:00<08:56,  2.28it/s]

result: Option 1: SUPPORTS
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and oth

  0%|          | 3/1228 [00:01<08:57,  2.28it/s]

result: SUPPORTS: The evidence presented in the
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon diox

  0%|          | 4/1228 [00:01<08:57,  2.28it/s]

result: Option 1: SUPPORTS
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and oth

  0%|          | 5/1228 [00:02<08:56,  2.28it/s]

result: Classification: SUPPORTS

Example
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, m

  0%|          | 6/1228 [00:02<08:48,  2.31it/s]

result: Classification: SUPPORTS
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, a

  1%|          | 7/1228 [00:03<08:47,  2.32it/s]

result: Example 1:
Claim: Many lines
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methan

  1%|          | 8/1228 [00:03<08:52,  2.29it/s]

result: Classification: SUPPORTS

Example
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, m

  1%|          | 9/1228 [00:03<08:43,  2.33it/s]

result: SUPPORTS

Example 1:
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and o

  1%|          | 10/1228 [00:04<08:57,  2.27it/s]

result: Option 1: SUPPORTS
Processed 10/1228, Current accuracy: 0.2000
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the conce

  1%|          | 11/1228 [00:04<09:00,  2.25it/s]

result: Option 1: SUPPORTS
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, methane, and oth

  1%|          | 12/1228 [00:05<09:13,  2.20it/s]

result: Classification: SUPPORTS

Example
<|system|>
You are a fact-checking assistant. Classify claims based on evidence.
</s>
<|user|>
Here are some examples:

Example 1:
Claim: Many lines of evidence, including simple accounting, demonstrate beyond a shadow of a doubt that the increase in atmospheric CO2 is due to human fossil fuel burning.
Evidence: While CO 2 absorption and release is always happening as a result of natural processes, the recent rise in CO 2 levels in the atmosphere is known to be mainly due to human (anthropogenic) activity. The introduction includes this statement: There is strong evidence that the warming of the Earth over the last half-century has been caused largely by human activity, such as the burning of fossil fuels and changes in land use, including agriculture and deforestation. Human activities, primarily the burning of fossil fuels (coal, oil, and natural gas), and secondarily the clearing of land, have increased the concentration of carbon dioxide, m

  1%|          | 12/1228 [00:05<09:18,  2.18it/s]


KeyboardInterrupt: 