## Evaluation of the fine-tuned models

1. Create a dataset of 500 claims and summaries, excluding the ones used for training.
2. Run the prompt with both models, and store the results and latency in a dataset.
3. Compare the outputs and calculate the average latency.

In [62]:
import json

# Collect every claim from the JSONL file used for training (a set removes duplicates)
excluded_claims = set()

with open("Data/socratic_questions_GPTOSS3000.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        excluded_claims.add(obj["claim"])

excluded_claims = list(excluded_claims)

Create a filtered set that excludes the training data and drops every article_id that occurs more than once.

In [64]:
import pandas as pd

# Load the data
factors_df = pd.read_csv("Data/FACTors.csv")

# Exclude the claims used for training
filtered_df = factors_df[~factors_df['claim'].isin(excluded_claims)]

# Identify article_ids that occur only once (after exclusion)
article_counts = filtered_df['article_id'].value_counts()
unique_article_ids = article_counts[article_counts == 1].index

# Keep only unique article_ids
clean_factors_df = filtered_df[filtered_df['article_id'].isin(unique_article_ids)]

# Take a subset of the largest fact-checking organisations
organisations = ["PolitiFact", "AFP Fact Check", "Snopes", "WebQoof", "FactCheck.org"]
factors_sub_df = clean_factors_df[clean_factors_df["organisation"].isin(organisations)]
factors_sample_df = factors_sub_df.sample(n=500, random_state=12)

# Confirm removal
print(f"Original rows: {len(factors_df)}")
print(f"Rows after excluding specific claims: {len(filtered_df)}")
print(f"Rows after removing duplicates: {len(clean_factors_df)}")
print(factors_sample_df[["claim","date_published"]])

Original rows: 118112
Rows after excluding specific claims: 115112
Rows after removing duplicates: 114981
                                                    claim       date_published
90392   In January 2021, House Speaker Nancy Pelosi an...  2021-01-05T13:16:13
76334   "Pipe bomber suspect pictured last year with I...  2018-11-01T00:00:00
61393   This video shows a cartel member in Mexico car...  2023-06-15T00:00:00
2794    Family received $1,400 per week in government ...  2019-05-13T11:02:00
99252   The Obama administration has ordered $1 billio...  2014-01-25T12:00:00
...                                                   ...                  ...
38213   Florida Gov. Charlie Crist's position on the n...  2010-07-29T14:14:27
89479   U.S. President Joe Biden's son Hunter is guest...  2021-04-28T11:31:00
88197   Denzel Washington said: "You'll never be criti...  2022-04-01T11:00:30
91521   A bottle of hand sanitizer will spontaneously ...  2020-05-22T16:46:29
101692                   

Now we will process the data in 3 steps:

1. Generate a summary and alerts from the claim, using GPT OSS 120B.
2. Generate a critical question using the fine-tuned Llama and record the latency.
3. Generate a critical question using the fine-tuned Mistral and record the latency.

In [65]:
from langchain_core.messages import HumanMessage
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama
from langchain_groq import ChatGroq
from typing_extensions import List

load_dotenv(dotenv_path=".env", override=True)

class MoreInfoResult(BaseModel):
    alerts: List[str] = Field([], description="Any alerts or warnings about the claim")
    summary: str = Field("", description="A concise summary of the claim")

# lower temperature for more factual answers
llmGPTOSS = ChatGroq(model_name="openai/gpt-oss-120b", model_kwargs={"tool_choice": "none"}, temperature=0.1)

# higher temperature for more creativity in questions
llmMistral = ChatOllama(model="mistral7b-q4km:latest", temperature=0.5, base_url="http://localhost:11434")
llmLlama = ChatOllama(model="llama3_1-8b-q4km:latest", temperature=0.5, base_url="http://localhost:11434")

In [66]:
get_information_prompt = """
### Role
You are a neutral, guiding assistant that helps students through the fact-checking process step by step. 
In this step you are tasked with extracting detailed information about a claim to determine its checkability.

### Claim
{claim}

The claim has already been fact-checked and the outcome was published on this date:
### Date published
{date_published}

### Steps
1. Identify the subject.
2. Determine if the claim is *quantitative*. 
3. Assess precision: "precise", "vague", or "absolute (100%)". 
4. Identify what the claim is *based on* (e.g., "survey …", "official statistics"). 
5. Identify the geography and time period mentioned in the claim, if provided. You may assume that the date_published occurs shortly after the claim was made.
6. Identify *alerts/warnings*: unclear subject, qualitative claim, vague quantitative claim, geography missing, time period missing, methodological details absent. 
Don't mention an alert when the information is present.
7. *Summarize concisely* what is currently known about the claim.
   - Include: the information found in the first 5 steps such as subject, type (quantitative/qualitative), precision, basis, and uncertainties.
   - Mention any active alerts or missing information.

Keep your tone neutral and analytical.

### Output Format
Return a single JSON object with exactly these fields:

- "alerts": array of strings. Each alert as a short string; use [] if none.
- "summary": string. A concise summary of the claim and its checkability status.

The response must be valid JSON and contain **only** this JSON object, with no extra text before or after it.

### Examples
Example A (qualitative):
{{
  "alerts": ["qualitative claim", "methodological details absent", "geography present", "time period present"],
  "summary": "A qualitative claim about a specific legal event; methodology implied but not fully detailed."
}}

Example B (quantitative but vague):
{{
  "alerts": ["vague quantitative claim", "time period missing", "source/methodology missing", "geography: EU (present)"],
  "summary": "A quantitative claim lacking precision and methodological details; several key elements are missing for checkability."
}}
"""

get_socratic_question = """
### Role
Pedagogical Facilitator and Socratic Coach.

### Objective
Your goal is to be a "thought partner." Instead of pointing out errors, you ask one question that leads the student to discover gaps in the claim's logic or evidence themselves.

### Inputs
- {claim}
- {summary}

- Gaps in claim:
{alerts}

### Output rules (IMPORTANT)
- Output exactly ONE Socratic question.
- Output ONLY the question text.
- Do NOT include explanations, prefixes, labels, or markdown.
- The output must be a single string ending with a question mark (?).

### Question:
"""

In [68]:
import time

def retrieve_info(claim: str, date_published: str) -> dict:

    """Generate a summary and alerts from a claim """

    # Use structured output
    structured_llm = llmGPTOSS.with_structured_output(MoreInfoResult,method="json_mode")

    # Create a prompt
    prompt = get_information_prompt.format(
        claim=claim,
        date_published=date_published
    )

    #invoke the LLM and store the output
    result = structured_llm.invoke(prompt)

    # return a Python dict instead of a Pydantic model
    return result.model_dump()


def critical_question(llm, claim: str, summary: str, alerts: List):

    """Ask a Socratic question that makes the user think about the consequences of fact-checking a claim.

    Returns a tuple (question text, latency in milliseconds).
    """

    # retrieve alerts and format to string for the prompt
    alerts_str= "\n".join(f"- {a}" for a in alerts)

    # Create a prompt
    prompt  =  get_socratic_question.format(
        alerts=alerts_str,
        claim=claim,
        summary=summary,
    )

    #invoke the LLM and return the question + calculate latency
    t0 = time.perf_counter()
    result = llm.invoke([HumanMessage(content=prompt)])
    latency = (time.perf_counter() - t0) * 1000

    return (result.content, latency)


def retrieve_info_and_question(claim: str, date_published: str):
    """Function to run all the steps"""

    # run the first LLM to create a summary and alerts
    info = retrieve_info(claim, date_published)

    # retrieve fields
    summary = info.get("summary", "")
    alerts = info.get("alerts", [])

    # Create questions with both models + latency
    q_llama, l_llama = critical_question(llmLlama, claim, summary, alerts)
    q_mistral, l_mistral = critical_question(llmMistral, claim, summary, alerts)

    return (q_llama, l_llama,q_mistral, l_mistral)


In [None]:
# Run loop and retrieve questions and latencies
results = []
claims = []

for _, row in factors_sample_df.iterrows():
    claims.append(row["claim"])
    results.append(retrieve_info_and_question(row["claim"], row["date_published"]))

# add them to a dataset
results_df = pd.DataFrame(
    results,
    columns=["q_llama", "l_llama", "q_mistral", "l_mistral"]
)

results_df.insert(0, "claim", claims)

results_df.to_excel("Data/eval_question_latency.xlsx", index=False)
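
The loop above stops at the first exception, so a single failed API call or a malformed JSON response would lose all results collected so far. A minimal sketch of a retry wrapper follows; `call_with_retry` and the stand-in function `flaky` are hypothetical names that are not part of the notebook, and the retry count is an assumption.

```python
import time

def call_with_retry(fn, *args, retries=2, wait_s=0.0):
    """Call fn(*args); retry on any exception and return None if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            time.sleep(wait_s)
    return None

# Hypothetical stand-in for retrieve_info_and_question: fails on the first call only
calls = {"n": 0}

def flaky(claim, date_published):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("temporary failure")
    return ("Which source supports this?", 410.0, "What is the time frame?", 395.0)

result = call_with_retry(flaky, "some claim", "2021-01-05")
print(result)
```

In the loop, a `None` result could be skipped or stored as an empty row, so that one failing claim does not abort the whole run.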

## We run some evaluations

1. Trustworthiness: does each model always generate a question?
2. Latency: generation time in ms.
3. Pattern diversity: how many distinct patterns occur in the set of questions.
4. Semantic diversity: how much the questions differ in meaning.

In [None]:
# Trustworthiness: count the blank answers per model
empty_llama_count = (results_df["q_llama"].astype(str).str.strip() == "").sum()
empty_mistral_count = (results_df["q_mistral"].astype(str).str.strip() == "").sum()

print(f"For llama there are {empty_llama_count} empty rows")
print(f"For mistral there are {empty_mistral_count} empty rows")

For llama there are 49 empty rows
For mistral there are 0 empty rows
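
The check above only counts blank strings. The prompt also requires exactly one question that ends with a question mark, so a stricter check is possible. The sketch below is a heuristic under that assumption; `is_single_question` is a hypothetical helper, and counting question marks will reject a valid question that quotes another question.

```python
def is_single_question(text) -> bool:
    """Heuristic: one line, ends with '?', and contains exactly one question mark."""
    if not isinstance(text, str):
        return False
    stripped = text.strip()
    return (
        stripped.endswith("?")
        and stripped.count("?") == 1
        and "\n" not in stripped
    )

examples = [
    "What evidence supports this figure?",  # valid
    "",                                     # empty answer
    "Is it recent? Who measured it?",       # two questions
    "Here is a question:\nWho said this?",  # extra text on a separate line
]
flags = [is_single_question(e) for e in examples]
print(flags)
```

Applied to the results, something like `results_df["q_llama"].map(is_single_question).mean()` would give the share of answers that follow the output rules.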


In [87]:
# Calculate average latency
llama_latency=results_df["l_llama"].mean()
mistral_latency=results_df["l_mistral"].mean()

print(f"Latency for llama: {llama_latency:.0f} ms")
print(f"Latency for mistral: {mistral_latency:.0f} ms")


Latency for llama: 440 ms
Latency for mistral: 412 ms
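
A mean can be pulled upward by a few slow calls, so the median and a high percentile are useful companions. The following is a minimal sketch using only the standard library; `latency_summary` is a hypothetical helper and the sample values are invented for illustration, not taken from the run above.

```python
import statistics

def latency_summary(latencies_ms):
    """Return mean, median and 95th percentile of a list of latencies in ms."""
    values = sorted(latencies_ms)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the result within the observed range.
    p95 = statistics.quantiles(values, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p95": p95,
    }

# Invented sample: four typical calls and one slow outlier
sample = [400.0, 410.0, 420.0, 430.0, 2000.0]
summary = latency_summary(sample)
print(summary)
```

In this sample the single outlier moves the mean far above the median, which is the effect the extra statistics are meant to expose.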


In [89]:
import numpy as np

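# Number of distinct question strings and the Shannon entropy (in bits) of their frequencies.
# Note: questions are compared as exact strings, so empty answers count as one pattern.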
def pattern_diversity(series):
    counts = series.value_counts()
    probs = counts / counts.sum()
    entropy = -(probs * np.log2(probs)).sum()
    return {
        "unique_patterns": len(counts),
        "entropy": entropy,
    }


div_llama = pattern_diversity(results_df["q_llama"])
div_mistral = pattern_diversity(results_df["q_mistral"])

print(
    f"For LLaMA there are {div_llama['unique_patterns']} unique patterns "
    f"and the entropy is {div_llama['entropy']:.2f}"
)

print(
    f"For Mistral there are {div_mistral['unique_patterns']} unique patterns "
    f"and the entropy is {div_mistral['entropy']:.2f}"
)

For LLaMA there are 452 unique patterns and the entropy is 8.42
For Mistral there are 499 unique patterns and the entropy is 8.96
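
To interpret these values, it helps to know the upper bound: when all n strings are different, the entropy equals log2(n), which is about 8.97 for 500 questions. The sketch below reproduces that bound and adds a normalized variant in the range 0 to 1; `shannon_entropy` and `normalized_entropy` are hypothetical helpers, not part of the notebook.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy in bits of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalized_entropy(items):
    """Entropy divided by its maximum log2(n); 1.0 means all items are distinct."""
    n = len(items)
    return shannon_entropy(items) / math.log2(n) if n > 1 else 0.0

all_distinct = [f"question {i}?" for i in range(500)]
print(f"{shannon_entropy(all_distinct):.2f}")
print(f"{normalized_entropy(all_distinct):.2f}")
```

Because the comparison is on exact strings, this measures how often a model repeats itself verbatim rather than how varied the question templates are.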


In [90]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

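# Semantic diversity: mean pairwise cosine distance between sentence embeddings
def embedding_diversity(texts):
    # convert to a plain list so the encoder does not depend on the pandas index
    texts = texts.dropna().tolist()
    emb = model.encode(texts, normalize_embeddings=True)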
    dists = cosine_distances(emb)
    return dists[np.triu_indices_from(dists, k=1)].mean()

div_llama = embedding_diversity(results_df["q_llama"])
div_mistral = embedding_diversity(results_df["q_mistral"])

print(f"Semantic diversity (LLaMA): {div_llama:.2f}")
print(f"Semantic diversity (Mistral): {div_mistral:.2f}")

Semantic diversity (LLaMA): 0.81
Semantic diversity (Mistral): 0.74
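
The score is the mean cosine distance over all pairs of questions: 0 means two embeddings point in the same direction, and 1 means they are orthogonal. The toy example below illustrates the computation with hand-made two-dimensional vectors instead of real embeddings, using only the standard library so that it runs without downloading the model; the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Two identical vectors and one orthogonal vector: the pair distances are 0, 1 and 1
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = mean_pairwise_distance(toy)
print(round(score, 2))
```

Note that `dropna()` in the cell above does not remove empty strings, so the empty Llama answers are embedded as well and may influence the Llama score.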
