## Setting up a manually validated reference dataset

This notebook outlines the methodology for constructing a manually validated reference dataset. A semi-automated annotation pipeline was employed, in which the initial data processing is handled programmatically and the results are verified manually.

First, we start with some preprocessing.

In [None]:
import pandas as pd

eufactcheck=pd.read_csv('../EUfactcheckData/eufactcheck_posts_2019_2025.csv', encoding="utf-8")

#get rid of leading and trailing quotes in titles
eufactcheck['title']= eufactcheck['title'].str.replace(r'^“+|”+$', '', regex=True).str.strip()

#filter only factcheck URLs
eufactcheck = eufactcheck[eufactcheck["url"].str.contains("https://eufactcheck.eu/factcheck", na=False)]

eufactcheck.to_csv('eufactcheck_factchecks_2019_2025.csv', index=False, encoding="utf-8")
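
To illustrate what the quote-stripping pattern in the preprocessing cell does, the sketch below applies the same regular expression with the standard `re` module. The titles are made up, and `clean_title` is a hypothetical helper name; for this pattern, `re.sub` should behave like the pandas `str.replace(..., regex=True)` call.

```python
import re

# Same pattern as in the preprocessing cell: runs of “ at the start
# or runs of ” at the end of the title.
QUOTE_PATTERN = r'^“+|”+$'

def clean_title(title: str) -> str:
    """Strip curly quotes at the edges, then surrounding whitespace."""
    return re.sub(QUOTE_PATTERN, '', title).strip()

# Made-up titles, for illustration only
examples = [
    '“Wages grew more than prices”',
    '““Double quoted””',
    'No quotes here ',
]
cleaned = [clean_title(t) for t in examples]
print(cleaned)
```

Quotes inside a title are left untouched, because the pattern is anchored to the start and end of the string.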

In [1]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
#from langchain_ollama import ChatOllama
from tavily import TavilyClient

# Load all the API keys
load_dotenv(dotenv_path="../.env", override=True)

# Initialize Tavily client 
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY", ""))


#llama 3.3 70b is good at multilingual tasks
llm_llama = ChatGroq(model_name="llama-3.3-70b-versatile", temperature=0.1)

#GPT OSS 120B is good for reasoning tasks
llmGPTOSS = ChatGroq(model_name="openai/gpt-oss-120b", model_kwargs={"tool_choice": "none"}, temperature=0.1)



A sample of 152 fact-checks from the EUfactcheck platform was used. For each fact-check, a source URL indicating where the claim was originally published online was added manually, when such information was available in the corresponding EUfactcheck article.

This step was essential because incorporating source context, also accessible to the assistant, substantially increases the cognitive load on the LLM, making the presence or absence of a source link a critical variable in our evaluation. A manual audit of the 152 most recent fact-check articles revealed considerable heterogeneity in source availability and format. Many articles linked to non-textual sources, such as videos or social media posts (e.g., X/Twitter), while others contained no external links at all. Moreover, even when primary URLs were provided, access was often restricted by paywalls or limited by technical scraping constraints. As a result, only 14 articles ultimately included a link to an original, retrievable source.

In [29]:
import pandas as pd

factchecks = pd.read_csv('eufactcheck_factchecks_2019_2025.csv', sep=';', encoding="utf-8")

factchecks

Unnamed: 0,url,origin,claim,rating,year
0,https://eufactcheck.eu/factcheck/mostly-false-...,,We have to provide our soldier with basic equi...,MOSTLY FALSE,2021
1,https://eufactcheck.eu/factcheck/mostly-true-h...,https://www.vice.com/nl/article/horrorfans-kun...,Horror fans are better at coping during the gl...,MOSTLY TRUE,2021
2,https://eufactcheck.eu/factcheck/mostly-true-g...,,Government measures positively affected Sloven...,MOSTLY TRUE,2021
3,https://eufactcheck.eu/factcheck/mostly-true-w...,,Wages grew more than prices in countries that ...,MOSTLY TRUE,2022
4,https://eufactcheck.eu/factcheck/mostly-false-...,,The 2022 FIFA World Cup in Qatar is fully carb...,MOSTLY FALSE,2022
...,...,...,...,...,...
147,https://eufactcheck.eu/factcheck/mostly-false-...,,Germany faces the highest energy costs worldwi...,MOSTLY FALSE,2025
148,https://eufactcheck.eu/factcheck/mostly-true-o...,,"Of the men who arrived in Germany in 2015/16, ...",MOSTLY TRUE,2025
149,https://eufactcheck.eu/factcheck/mostly-false-...,https://www.spiegel.de/politik/deutschland/bij...,Permanent border controls are “a necessity” in...,MOSTLY FALSE,2025
150,https://eufactcheck.eu/factcheck/true-germanys...,,Germany’s defense expenditures have increased ...,TRUE,2025
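
As a rough sketch of how source availability could be tallied from the `origin` column shown above, the snippet below counts rows whose `origin` is a usable link. The three-row frame is a hypothetical stand-in for the manually edited file, and the "starts with http" test is an assumption borrowed from the check used later in `process_and_translate`.

```python
import pandas as pd

# Hypothetical miniature of the manually edited file: 'origin' is empty
# when no retrievable source was found.
sample = pd.DataFrame({
    "claim": ["claim A", "claim B", "claim C"],
    "origin": [None, "https://example.org/article", None],
})

# A row counts as having a source when 'origin' is a string starting with http
has_source = sample["origin"].apply(
    lambda u: isinstance(u, str) and u.startswith("http")
)

n_with_source = int(has_source.sum())
share = n_with_source / len(sample)
print(f"{n_with_source} of {len(sample)} claims have a source link ({share:.0%})")
```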


As this is a European project, the majority of the 14 articles linked within the fact-checks were not published in English. However, the scope of this study is limited to the English language. Consequently, all non-English source texts were retrieved using the Tavily client, translated into English (with Llama 3.3 70B, which performs well on multilingual tasks), and then incorporated back into the dataset for analysis. Errors introduced during this process were identified through manual review and removed.

In [30]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def process_and_translate(url):

    # Only process if it's a valid string and starts with http
    if not isinstance(url, str) or not url.startswith('http'):
        return None
    
    try:
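        # Step 1: Extract content using Tavily
        extract_response = tavily_client.extract(urls=[url])
        # An empty 'results' list raises IndexError here; the except block
        # below turns it into an "Error: ..." string for that row
        raw_text = extract_response['results'][0].get('raw_content', "")

        if not raw_text:
            return "No content extracted."

        # Step 2: Set up the translation chain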
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are a professional translator. If the following text is in English, return it exactly as is. If it is in any other language, translate it into clear, fluent English."),
            ("user", "{text}")
        ])
        
        # Build the chain: Prompt -> LLM -> String Output
        translation_chain = prompt | llm_llama | StrOutputParser()
        
        # Execute (truncating to 4000 chars to save context/costs)
        return translation_chain.invoke({"text": raw_text[:4000]})
        
    except Exception as e:
        return f"Error: {str(e)}"
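
The `['results'][0]` lookup in `process_and_translate` raises an `IndexError` when the extraction returns an empty results list. A more defensive lookup might look like the sketch below. It assumes only what the function above already relies on: that the response is a dict with a `results` list whose items carry a `raw_content` field. `first_raw_content` is a hypothetical helper, and the responses are hand-written stand-ins rather than real API output.

```python
from typing import Any, Dict, Optional

def first_raw_content(extract_response: Dict[str, Any]) -> Optional[str]:
    """Return the first non-empty 'raw_content' string, or None if there is none."""
    results = extract_response.get("results") or []
    for item in results:
        text = item.get("raw_content") or ""
        if text.strip():
            return text
    return None

# Hand-written stand-ins for extraction responses (not real API output)
empty_response = {"results": []}
ok_response = {
    "results": [{"url": "https://example.org", "raw_content": "Some article text"}]
}

print(first_raw_content(empty_response))
print(first_raw_content(ok_response))
```

Returning `None` instead of an error string keeps failed extractions distinguishable from translated text in the resulting column.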

In [None]:
# Apply to your 'factchecks' dataset
factchecks['translated'] = factchecks['origin'].apply(process_and_translate)

# Save to file
factchecks[["url","claim","rating","translated","year"]].to_csv('eval_ground.csv', sep=';', encoding="utf-8", index=False)

factchecks

Unnamed: 0,url,origin,claim,rating,year,translated
0,https://eufactcheck.eu/factcheck/mostly-false-...,,We have to provide our soldier with basic equi...,MOSTLY FALSE,2021,
1,https://eufactcheck.eu/factcheck/mostly-true-h...,https://www.vice.com/nl/article/horrorfans-kun...,Horror fans are better at coping during the gl...,MOSTLY TRUE,2021,Error: list index out of range
2,https://eufactcheck.eu/factcheck/mostly-true-g...,,Government measures positively affected Sloven...,MOSTLY TRUE,2021,
3,https://eufactcheck.eu/factcheck/mostly-true-w...,,Wages grew more than prices in countries that ...,MOSTLY TRUE,2022,
4,https://eufactcheck.eu/factcheck/mostly-false-...,,The 2022 FIFA World Cup in Qatar is fully carb...,MOSTLY FALSE,2022,
...,...,...,...,...,...,...
147,https://eufactcheck.eu/factcheck/mostly-false-...,,Germany faces the highest energy costs worldwi...,MOSTLY FALSE,2025,
148,https://eufactcheck.eu/factcheck/mostly-true-o...,,"Of the men who arrived in Germany in 2015/16, ...",MOSTLY TRUE,2025,
149,https://eufactcheck.eu/factcheck/mostly-false-...,https://www.spiegel.de/politik/deutschland/bij...,Permanent border controls are “a necessity” in...,MOSTLY FALSE,2025,Error: list index out of range
150,https://eufactcheck.eu/factcheck/true-germanys...,,Germany’s defense expenditures have increased ...,TRUE,2025,
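
The `translated` column above mixes empty cells, placeholder strings such as "Error: list index out of range", and real translations. In this notebook the cleanup was done by manual review; the sketch below only shows how the obvious failures could be flagged beforehand. `is_usable_translation` and `FAILURE_PREFIXES` are hypothetical names, and the prefixes are taken from the return values of `process_and_translate`.

```python
from typing import Any

# Placeholder strings returned by process_and_translate on failure
FAILURE_PREFIXES = ("Error:", "No content extracted.")

def is_usable_translation(value: Any) -> bool:
    """True only for non-empty text that is not a failure placeholder."""
    if not isinstance(value, str):  # None or NaN from an empty cell
        return False
    text = value.strip()
    return bool(text) and not text.startswith(FAILURE_PREFIXES)

# Made-up cell values, for illustration only
cells = [
    float("nan"),
    "Error: list index out of range",
    "No content extracted.",
    "Horror fans cope better with the pandemic, a study suggests.",
]
flags = [is_usable_translation(c) for c in cells]
print(flags)
```

A check like this cannot detect a poor or truncated translation, so it complements rather than replaces the manual review.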


### Step 1: is a claim checkable?

The manually edited dataset of fact-checks was loaded. Of its entries, 14 contain translated versions of the original articles where the student found the claim.

In this first step, each claim is classified as POTENTIALLY CHECKABLE or UNCHECKABLE, which is also the first step in the assistant after the user provides the claim. The model also gives a short explanation and a question for the user.

In [2]:
from typing import Literal
from pydantic import BaseModel, Field

class SubjectResult(BaseModel):
    checkable: Literal["POTENTIALLY CHECKABLE", "UNCHECKABLE"]
    explanation: str = Field("")
    question: str = Field("")

In [5]:
from langchain_core.prompts import ChatPromptTemplate

checkable_check_prompt = """
### Role
Neutral Fact-Checking Analyst.

### Inputs
Claim: {claim}
Dataset rating (validated reference label): {rating}

### Task
Classify the claim and determine if it can be fact-checked.

### Classification Logic
- **UNCHECKABLE**:
  - Opinion or value judgment
  - Prediction or future-oriented statement
  - If rating is UNCHECKABLE, it probably is one of the above
- **POTENTIALLY CHECKABLE**:
  - Factual claims about the past or present
  - Rating is not UNCHECKABLE

### Task
Use the dataset rating to set the checkability label:
- If rating is "UNCHECKABLE" -> checkable MUST be "UNCHECKABLE"
- Otherwise -> checkable MUST be "POTENTIALLY CHECKABLE"

Then:
1) Write a brief explanation why the claim is classified this way, don't mention the link with the rating, ONLY explain why you think it is UNCHECKABLE.
2) Ask a polite confirmation question to the user (do not offer help).

### Output (JSON)
{{
  "checkable": "POTENTIALLY CHECKABLE | UNCHECKABLE",
  "explanation": "Brief justification",
  "question": "Polite confirmation question."
}}
""".strip()

prompt = ChatPromptTemplate.from_template(checkable_check_prompt)

In [6]:
import pandas as pd

# Build the langchain chain
def build_chain(llm):
    structured_llm = llm.with_structured_output(SubjectResult, method="json_mode")
    return prompt | structured_llm

# Function to add checkable columns to DataFrame
def add_checkable_columns(df: pd.DataFrame, chain, claim_col: str = "claim", rating_col: str = "rating") -> pd.DataFrame:
    
    # Copy the dataframe
    out = df.copy()

    # For each row in the dataset call the llm
    def _run_row(row):
        claim = row[claim_col]
        rating = row[rating_col]

        return chain.invoke({
            "claim": claim,
            "rating": rating,
        })

    results = out.apply(_run_row, axis=1)

    # Add everything to the dataset
    out["checkable"] = results.apply(lambda r: r.checkable)
    out["explanation"] = results.apply(lambda r: r.explanation)
    out["question"] = results.apply(lambda r: r.question)

    return out

# Call the chain building
chain = build_chain(llmGPTOSS)

# Run it for every row
step1_df = pd.read_csv("eval_ground_v2.csv", sep=';', encoding="utf-8")
step1_df = add_checkable_columns(step1_df, chain)

# Save the dataset
step1_df.to_csv("eval_ground_step1.csv", index=False)
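
Because the prompt fixes the checkability label deterministically from the dataset rating, the model output can be cross-checked with a few lines of plain Python. The sketch below is one possible check, not part of the original pipeline: `expected_checkable` and `find_mismatches` are hypothetical helpers, the whitespace and case normalisation is an assumption, and the rows are made up.

```python
from typing import List, Literal, Tuple

Checkable = Literal["POTENTIALLY CHECKABLE", "UNCHECKABLE"]

def expected_checkable(rating: str) -> Checkable:
    """Label implied by the prompt rule: only an UNCHECKABLE rating yields UNCHECKABLE."""
    # Assumption: ratings are compared after trimming and upper-casing
    if rating.strip().upper() == "UNCHECKABLE":
        return "UNCHECKABLE"
    return "POTENTIALLY CHECKABLE"

def find_mismatches(rows: List[Tuple[str, str]]) -> List[int]:
    """Return the positions where the model label differs from the rule."""
    return [
        i for i, (rating, label) in enumerate(rows)
        if expected_checkable(rating) != label
    ]

# Made-up (rating, model label) pairs, for illustration only
rows = [
    ("MOSTLY FALSE", "POTENTIALLY CHECKABLE"),
    ("UNCHECKABLE", "UNCHECKABLE"),
    ("TRUE", "UNCHECKABLE"),  # deliberately inconsistent
]
mismatches = find_mismatches(rows)
print(mismatches)
```

Rows flagged this way are natural candidates to look at first during manual verification.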


### Step 2: Retrieve all info from a claim and, if available, its source

Step 2 retrieves all information from the claim, which is also the second step in the assistant. In the assistant, the user can also provide a URL to the source where the claim was published (if it was published in an article). Here, the translated text is used instead whenever such a source was found in the fact-check article. As mentioned before, this was the case for only 14 claims.

In [8]:
from typing import Literal, List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate

class RetrieveInfoResult(BaseModel):
    claim_source: str = Field("unknown")
    primary_source: bool = Field(False)
    source_description: str = Field("")
    subject: str = Field("unclear")
    quantitative: str = Field("") 
    precision: str = Field("")
    based_on: str = Field("")
    question: str = Field("")
    alerts: List[str] = Field(default_factory=list)
    geography: str = Field("unclear")
    time_period: str = Field("unclear")
    details: str = Field("")

In [9]:
retrieve_info_prompt = """
### Role
Neutral Fact-Checking Analyst. Focus on objective evaluation and guiding the user's reasoning through reflective inquiry rather than providing definitive answers.

### Context
- Claim: {claim}
- Year: {year}

### Additional context the user provided
"{additional_context}"

### Task 1: Source & Intent Extraction
1. **claim_source**: Identify the person or organization who originated the claim.
2. **primary_source**: Set to true ONLY if the evidence confirms this is the original/foundational origin.
3. **source_description**: Describe the medium (e.g., "Official PDF", "Social Media Post").

### Task 2: Factual Dimension Analysis
1. **Subject**: Identify the core entity or event.
2. **Quantitative/Qualitative**: Explain if it is measurable data or a description.
3. **Precision**: Categorize as Precise, Vague, or Absolute (100%), and provide specific numbers, or names from the evidence.
4. **Based On**: Identify the likely methodology (e.g., Official stats, Survey, research). Provide a brief explanation.
5. **Geography**: Identify the geographic scope of the claim.
6. **Time Period**: Identify the time frame relevant to the claim, if nothing available use {year}.

### Task 3: Guidance & Risk
1. **Alerts**: Flag missing Geography, Time Period, unclear subject, qualitative claim, vague quantitative claim, methodological details absent. Do not flag if the info is present.
2. **The Question**: Formulate exactly **one** polite, open-ended question to help the user refine the claim.
3. **details**: Include specific details (dates, numbers, names) from the additional context if available:
"{additional_context}"

### Output Format (JSON)
{{
  "claim_source": "Person/Organisation" or "unknown",
  "primary_source": true/false,
  "source_description": "medium description",
  "subject": "subject text" or "unclear",
  "quantitative": "quantitative/qualitative + short explanation",
  "precision": "precise/vague/absolute + specifics",
  "based_on": "methodology + short explanation" or "unclear",
  "question": "one polite open question",
  "alerts": ["..."],
  "geography": "..." or "unclear",
  "time_period": "..." or "unclear",
  "details": "specific extracted details"
}}
""".strip()

retrieve_prompt = ChatPromptTemplate.from_template(retrieve_info_prompt)


In [12]:
import pandas as pd

# Build the langchain chain
def build_retrieve_chain(llm):
    structured_llm = llm.with_structured_output(RetrieveInfoResult, method="json_mode")
    return retrieve_prompt | structured_llm

# Function to add the retrieved-info columns to the DataFrame
def add_retrieved_info_columns(df: pd.DataFrame, chain, claim_c: str = "claim", context_c: str = "translated", year_c: str = "year") -> pd.DataFrame:
    
    # Copy the dataframe
    out = df.copy()

    # For each row in the dataset call the llm
    def _run_row(row):
        claim = row[claim_c]
        translated = row[context_c]
        year = row[year_c]
        additional_context = translated if isinstance(translated, str) and translated.strip() else ""

        return chain.invoke({
            "claim": claim,
            "year":year,
            "additional_context": additional_context,
        })

    results = out.apply(_run_row, axis=1)

    # Build the human-readable summary per row
    def _details_text(r: RetrieveInfoResult) -> str:
        text = (
            f"- claim_source: {r.claim_source or 'unknown'}\n"
            f"- primary_source: {r.primary_source}\n"
            f"- source_description: {r.source_description or 'not clearly specified'}\n"
            f"- subject: {r.subject or 'unclear'}\n"
            f"- quantitative: {r.quantitative or 'not clearly specified'}\n"
            f"- precision: {r.precision or 'not clearly specified'}\n"
            f"- based_on: {r.based_on or 'unclear'}\n"
            f"- geography: {r.geography or 'unclear'}\n"
            f"- time_period: {r.time_period or 'unclear'}\n"
        )
        return text

    # Add everything to the dataset
    out["details_text"] = results.apply(_details_text)
    out["alerts"] = results.apply(lambda r: r.alerts)
    out["question"] = results.apply(lambda r: r.question)

    return out

In [13]:
# Call the chain building
chain = build_retrieve_chain(llmGPTOSS)

# Run it for every row
step2_df = pd.read_csv("eval_ground_step1.csv", encoding="utf-8")
step2_df = add_retrieved_info_columns(step2_df, chain)

# Save the dataset
step2_df.to_csv("eval_ground_step2.csv", index=False)
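
One detail to keep in mind when the step 2 file is read back: the `alerts` column holds Python lists, and a CSV round trip stores each list as its string representation. The sketch below shows one way to recover the lists with `ast.literal_eval`. `parse_alerts` is a hypothetical helper, and the cell values are made up.

```python
import ast
from typing import Any, List

def parse_alerts(cell: Any) -> List[str]:
    """Turn a CSV cell such as "['a', 'b']" back into a list of strings."""
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str) or not cell.strip():
        return []  # NaN or empty cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]  # keep unparseable text as a single alert
    if isinstance(parsed, (list, tuple)):
        return [str(a) for a in parsed]
    return [str(parsed)]

# Made-up cell values, for illustration only
print(parse_alerts("['missing geography', 'vague quantitative claim']"))
print(parse_alerts("[]"))
```

After reloading the CSV, the helper could be applied to the column with `apply(parse_alerts)` before the alerts are displayed for review.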

Finally, a Streamlit interface was created with vibe coding to verify all the answers manually.