# Task 4: Model Evaluation

Evaluate the predictions from both fine-tuned models using the ConflictQA test set.

This notebook computes Exact Match (EM) and F1 scores for:
- Overall performance (Total)
- Conflicting questions (C)
- Non-conflicting questions (NC)

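The evaluation logic follows the original `evaluate_conflictqa.py` script: each prediction is scored against every reference answer for its question, and the best score is kept.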

---

## Setup Instructions

1. Enable GPU runtime (optional for evaluation)
2. Upload the required files to Google Drive under `MyDrive/reproducing_project/` (the `BASE_PATH` used in Step 1):
   - `data/ConflictQA_Dataset.json` (ground truth)
   - `predictions_A.json` (Model A predictions)
   - `predictions_B.json` (Model B predictions)
3. Execute the cells in order

Expected runtime: Less than 5 minutes

## Step 1: Environment Detection & Path Setup

In [5]:
import os
import sys

try:
    from google.colab import drive
    IN_COLAB = True
    print("Environment: Google Colab")
except ImportError:
    IN_COLAB = False
    print("Environment: Local (VS Code)")

if IN_COLAB:
    drive.mount('/content/drive')
    BASE_PATH = "/content/drive/MyDrive/reproducing_project"
    print("Drive mounted successfully")
else:
    # Parent of the working directory; assumes the notebook runs one level below the repository root
    BASE_PATH = os.path.abspath("..")
    print("Using local repository path")

print(f"Base path: {BASE_PATH}")

Environment: Local (VS Code)
Using local repository path
Base path: /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context


## Step 2: Install Dependencies & Import Libraries

In [6]:
!pip install -q numpy pandas

import numpy as np
import json
import pandas as pd
import string
import re

print("Dependencies installed successfully")

Dependencies installed successfully


## Step 3: Define Evaluation Functions

These functions are adapted from the original `evaluate_conflictqa.py` script.

In [7]:
def normalize_text(s):
    """Normalize text for comparison: lowercase, remove punctuation and articles."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    
    def white_space_fix(text):
        return ' '.join(text.split())
    
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    
    def lower(text):
        return text.lower()
    
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, ground_truth):
    """Check if normalized prediction exactly matches normalized ground truth."""
    return int(normalize_text(prediction) == normalize_text(ground_truth))

def compute_f1(prediction, ground_truth):
    """Compute F1 score between prediction and ground truth tokens."""
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(ground_truth).split()
    
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    num_common = len(common_tokens)
    
    if num_common == 0:
        return 0
    
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    
    return f1

print("Helper functions defined")

Helper functions defined
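
As a quick sanity check on the helpers above, the sketch below re-implements them in compact form and compares the set-based token overlap used in `compute_f1` with a multiset (`collections.Counter`) overlap, which is how the SQuAD v1.1 evaluation script counts shared tokens. The names `f1_set` and `f1_multiset` are illustrative and not part of the original script, and which of the two variants `evaluate_conflictqa.py` uses is not confirmed here.

```python
import re
import string
from collections import Counter

def normalize_text(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    exclude = set(string.punctuation)
    s = ''.join(ch for ch in s.lower() if ch not in exclude)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def _f1(num_common, n_pred, n_truth):
    """Harmonic mean of token precision and recall."""
    if num_common == 0:
        return 0.0
    precision = num_common / n_pred
    recall = num_common / n_truth
    return 2 * precision * recall / (precision + recall)

def f1_set(prediction, ground_truth):
    """Overlap counted over distinct tokens, as in compute_f1 above."""
    pred = normalize_text(prediction).split()
    truth = normalize_text(ground_truth).split()
    if not pred or not truth:
        return float(pred == truth)
    return _f1(len(set(pred) & set(truth)), len(pred), len(truth))

def f1_multiset(prediction, ground_truth):
    """Overlap counted with multiplicity (Counter intersection)."""
    pred = normalize_text(prediction).split()
    truth = normalize_text(ground_truth).split()
    if not pred or not truth:
        return float(pred == truth)
    common = Counter(pred) & Counter(truth)
    return _f1(sum(common.values()), len(pred), len(truth))

print(normalize_text("The Eiffel Tower!"))               # eiffel tower
print(round(f1_set("La La Land", "la la land"), 3))      # 0.667
print(round(f1_multiset("La La Land", "la la land"), 3)) # 1.0
```

The two variants agree unless a token repeats in both the prediction and the reference. In that case the set-based count can score an exact match below 1.0, as the `La La Land` example shows.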


## Step 4: Load Ground Truth and Predictions

In [8]:
dataset_path = f"{BASE_PATH}/data/ConflictQA_Dataset.json"
predictions_a_path = f"{BASE_PATH}/predictions_A.json"
predictions_b_path = f"{BASE_PATH}/predictions_B.json"

try:
    full_df = pd.read_json(dataset_path)
    ground_truth_df = full_df[full_df['split'] == 'test'].reset_index(drop=True)
    print(f"Loaded {len(ground_truth_df)} test examples")
except FileNotFoundError:
    print(f"ERROR: Dataset not found at {dataset_path}")
    raise  # stop here: later cells need ground_truth_df

try:
    with open(predictions_a_path, 'r') as f:
        predictions_A = json.load(f)
    print(f"Loaded {len(predictions_A)} predictions for Model A")
except FileNotFoundError:
    print(f"ERROR: predictions_A.json not found at {predictions_a_path}")
    raise  # stop here: later cells need predictions_A

try:
    with open(predictions_b_path, 'r') as f:
        predictions_B = json.load(f)
    print(f"Loaded {len(predictions_B)} predictions for Model B")
except FileNotFoundError:
    print(f"ERROR: predictions_B.json not found at {predictions_b_path}")
    raise  # stop here: later cells need predictions_B

assert len(ground_truth_df) == len(predictions_A) == len(predictions_B), "Data length mismatch"
print(f"\nValidation passed: All datasets have {len(ground_truth_df)} examples")

Loaded 813 test examples
Loaded 813 predictions for Model A
Loaded 813 predictions for Model B

Validation passed: All datasets have 813 examples
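
The length assertion above confirms that the three sources line up in size, but not that each prediction is a string, which `normalize_text` requires. A minimal sketch of a stricter check follows. `validate_predictions` is a hypothetical helper, and it assumes each predictions file is a flat JSON list of strings in test-set order.

```python
def validate_predictions(predictions, expected_len, name="predictions"):
    """Raise ValueError unless `predictions` is a list of `expected_len` strings.

    Returns the number of empty (whitespace-only) predictions.
    """
    if not isinstance(predictions, list):
        raise ValueError(f"{name}: expected a list, got {type(predictions).__name__}")
    if len(predictions) != expected_len:
        raise ValueError(f"{name}: expected {expected_len} items, got {len(predictions)}")
    bad = [i for i, p in enumerate(predictions) if not isinstance(p, str)]
    if bad:
        raise ValueError(f"{name}: non-string entries at indices {bad[:5]}")
    return sum(1 for p in predictions if not p.strip())

# Toy data for illustration only
print(validate_predictions(["Rachel Bilson", "", "1989"], 3))  # 1
```

In the notebook this could be called as `validate_predictions(predictions_A, len(ground_truth_df), "predictions_A")` right after loading.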


## Step 5: Evaluate Both Models

In [9]:
def evaluate_model(predictions, ground_truth_df):
    """
    Evaluate predictions against ground truth using EM and F1 metrics.
    Returns scores for Total, Conflict, and Non-Conflict subsets.
    """
    em_total = []
    f1_total = []
    em_c = []
    em_nc = []
    f1_c = []
    f1_nc = []

    for i in range(len(predictions)):
        pred = predictions[i]
        ground_truth_list = ground_truth_df.iloc[i]["ambigqa_answer"]
        is_conflict = ground_truth_df.iloc[i]["secondAnswerExist"] == "A"

        max_em = max(compute_exact_match(pred, str(ref)) for ref in ground_truth_list)
        max_f1 = max(compute_f1(pred, str(ref)) for ref in ground_truth_list)

        em_total.append(max_em)
        f1_total.append(max_f1)
        
        if is_conflict:
            em_c.append(max_em)
            f1_c.append(max_f1)
        else:
            em_nc.append(max_em)
            f1_nc.append(max_f1)

    return {
        "EM-T": np.round(np.mean(em_total) * 100, 2),
        "F1-T": np.round(np.mean(f1_total) * 100, 2),
        "EM-C": np.round(np.mean(em_c) * 100, 2),
        "F1-C": np.round(np.mean(f1_c) * 100, 2),
        "EM-NC": np.round(np.mean(em_nc) * 100, 2),
        "F1-NC": np.round(np.mean(f1_nc) * 100, 2),
    }

print("Evaluating Model A (Context-Only)...")
results_A = evaluate_model(predictions_A, ground_truth_df)
print(f"Model A Results: {results_A}")

print("\nEvaluating Model B (Explain-and-Answer)...")
results_B = evaluate_model(predictions_B, ground_truth_df)
print(f"Model B Results: {results_B}")

Evaluating Model A (Context-Only)...
Model A Results: {'EM-T': np.float64(47.23), 'F1-T': np.float64(58.93), 'EM-C': np.float64(42.03), 'F1-C': np.float64(53.41), 'EM-NC': np.float64(49.01), 'F1-NC': np.float64(60.81)}

Evaluating Model B (Explain-and-Answer)...
Model B Results: {'EM-T': np.float64(0.62), 'F1-T': np.float64(13.72), 'EM-C': np.float64(0.0), 'F1-C': np.float64(12.19), 'EM-NC': np.float64(0.83), 'F1-NC': np.float64(14.24)}
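
To make the aggregation in `evaluate_model` concrete, the sketch below scores three invented examples: each prediction is compared with every reference answer, the best score is kept, and the scores are averaged overall and per conflict flag. The rows and the helper names (`exact_match`, `pct`) are made up for illustration. Only the max-over-references and bucketing logic mirrors the cell above.

```python
import re
import string

def normalize_text(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    exclude = set(string.punctuation)
    s = ''.join(ch for ch in s.lower() if ch not in exclude)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def exact_match(pred, ref):
    return int(normalize_text(pred) == normalize_text(ref))

def pct(values):
    """Mean of 0/1 scores as a percentage, rounded to two decimals."""
    return round(100 * sum(values) / len(values), 2)

# Invented rows: prediction, list of acceptable references, conflict flag
rows = [
    {"pred": "Rachel Bilson", "refs": ["Rachel Bilson"], "conflict": False},
    {"pred": "1989", "refs": ["20 November 1989", "1989"], "conflict": True},
    {"pred": "Paris", "refs": ["London"], "conflict": True},
]

# Best score over all references, paired with the conflict flag
scores = [
    (max(exact_match(r["pred"], ref) for ref in r["refs"]), r["conflict"])
    for r in rows
]

em_total = pct([s for s, _ in scores])
em_c = pct([s for s, c in scores if c])
em_nc = pct([s for s, c in scores if not c])
print(em_total, em_c, em_nc)  # 66.67 50.0 100.0
```

The second row counts as a match only because the maximum is taken over both references. Note that `pct` would fail on an empty subset, whereas `np.mean` in the notebook returns `nan` with a warning in that case.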


## Step 6: Display Results

In [10]:
results_df = pd.DataFrame(
    [results_A, results_B], 
    index=["Model A (Context-Only)", "Model B (Explain-and-Answer)"]
)

print("\n" + "="*50)
print("FINAL EVALUATION RESULTS")
print("="*50 + "\n")
print(results_df.to_string())
print("\n" + "="*50)


FINAL EVALUATION RESULTS

                               EM-T   F1-T   EM-C   F1-C  EM-NC  F1-NC
Model A (Context-Only)        47.23  58.93  42.03  53.41  49.01  60.81
Model B (Explain-and-Answer)   0.62  13.72   0.00  12.19   0.83  14.24



## Step 7: Update Reproducibility Log

In [11]:
from datetime import datetime

log_path = os.path.join(BASE_PATH, "REPRODUCIBILITY_LOG.md")

log_entry = f"""
---
**Evaluation Completed - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}**

### Model A: Context-Only (fdangolo/flan-t5-context-only)
- EM-Total: {results_A['EM-T']}%
- F1-Total: {results_A['F1-T']}%
- EM-Conflict: {results_A['EM-C']}%
- F1-Conflict: {results_A['F1-C']}%
- EM-Non-Conflict: {results_A['EM-NC']}%
- F1-Non-Conflict: {results_A['F1-NC']}%

### Model B: Explain-and-Answer (fdangolo/flan-t5-exp-ans)
- EM-Total: {results_B['EM-T']}%
- F1-Total: {results_B['F1-T']}%
- EM-Conflict: {results_B['EM-C']}%
- F1-Conflict: {results_B['F1-C']}%
- EM-Non-Conflict: {results_B['EM-NC']}%
- F1-Non-Conflict: {results_B['F1-NC']}%

Environment: {'Google Colab' if IN_COLAB else 'Local (VS Code)'}
Evaluated on: {len(ground_truth_df)} test examples
"""

try:
    with open(log_path, 'a') as f:
        f.write(log_entry)
    print(f"Successfully updated {log_path}")
except Exception as e:
    print(f"Error updating log: {e}")

Successfully updated /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/REPRODUCIBILITY_LOG.md


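## Step 8: Inspect Model B Outputs

Model B's EM is close to zero, so this cell regenerates the first 10 test samples and prints the raw output next to the parsed answer used in the evaluation.

In [14]: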
import torch
import gc
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# --- Configuration ---
base_model_id = "google/flan-t5-base"
adapter_B_id = "fdangolo/flan-t5-exp-ans"
device = "cuda" if torch.cuda.is_available() else "cpu"

print("--- STARTING DIAGNOSTIC ---")
print("Loading base model and adapter for Model B...")

# --- Load Model & Tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id).to(device)
model = PeftModel.from_pretrained(model, adapter_B_id).to(device)
model.eval()

# --- Load Test Data ---
# We need the 'input' and the 'output' (ground truth)
data_path = os.path.join(BASE_PATH, "data/splits/test_context_only.jsonl")
ground_truth_dataset = load_dataset('json', data_files={'test': data_path})['test']
test_inputs = ground_truth_dataset['input']
test_truths = ground_truth_dataset['output']

print("Generating 10 samples...")
print("="*40)

# --- Generate 10 Samples for Manual Inspection ---
with torch.no_grad():
    for i in range(10):
        input_text = test_inputs[i]
        ground_truth = test_truths[i]
        
        # Prepare input
        inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True).to(device)
        
        # Generate output (raw text). early_stopping only affects beam search,
        # so it is omitted here (no num_beams is set in this call).
        outputs = model.generate(
            **inputs,
            max_new_tokens=300
        )
        raw_prediction_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Apply our parser
        parsed_answer = raw_prediction_text.split("\n\n")[-1].strip()
        
        # --- Print Comparison ---
        print(f"\n--- SAMPLE {i+1} ---")
        print(f"INPUT:\n{input_text[:300]}...\n")
        print(f"GROUND TRUTH ANSWER:\n{ground_truth}\n")
        print(f"MODEL B RAW OUTPUT:\n{raw_prediction_text}\n")
        print(f"PARSED ANSWER (what we evaluated):\n{parsed_answer}")
        print("="*40)

# Clean up memory
del model
gc.collect()
torch.cuda.empty_cache()

--- STARTING DIAGNOSTIC ---
Loading base model and adapter for Model B...
Generating 10 samples...

--- SAMPLE 1 ---
INPUT:
question: Who created the convention on the rights of the child?

contexts: The Convention on the Rights of the Child was adopted by the General Assembly of the United Nations by its resolution 44/25 of 20 November 1989.
---
Nov 17, 2014 ... The Convention on the Rights of the Child is an internatio...

GROUND TRUTH ANSWER:
most common

MODEL B RAW OUTPUT:
The General Assembly of the United Nations

PARSED ANSWER (what we evaluated):
The General Assembly of the United Nations

--- SAMPLE 2 ---
INPUT:
question: Who played zoe hart on hart of dixie?

contexts: Jan 18, 2019 ... Dr. Zoe Hart-Kinsella is the protagonist and titular character. She is a girl with a plan. That plan is focused around her dedication to ...
---
Aug 25, 2022 ... Rachel Bilson (Zoe Hart) · Jaime King (Lemon Breeland) · Cress...

GROUND TRUTH ANSWER:
Rachel Bilson

MODEL B RAW OUTPUT:
Only few context contain the correct answers. Others are not Rachel Bilson

PARSED ANSWER (what we evaluated