## LLM as judge evaluation

In **part 1** different models will perform the same 3 tasks as are in the ground truth model (and in the assistant). 
1. The first task is to check if a claim is checkable.
2. The next task is to retrieve all relevant information, and if available also additional context from the original source.
3. Finally a confirmation prompt will try to decide is the user confirms and wants to continue or not.

In **part 2**
Two larger different models will score the generated output of each model and task against the ground truth (the manually verified output). 
All tasks will be scored on a binary scale
1. The first task will be scored on *Reasoning*, with 0="reasoning not that clear", 1="clear reasoning"
2. The second task will be scored on:
    - *Completeness*, with 0="missing facts", 1="complete"
    - *Hallucination*, with 0="contains hallucinated facts", 1="no hallucinations found"
3. The last task will be scored on *Intend* (did the user want to proceed or not), with 0="confirmed differs", 1="confirmed is the same"

### Part 1 Generating output

In [63]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_ollama import ChatOllama
from tavily import TavilyClient

# Load alle the API keys
load_dotenv(dotenv_path="../.env", override=True)

# Initialize Tavily client 
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY", ""))

# Medium Models on Groq, to evaluate
llama8 = ChatGroq(model_name="llama-3.1-8b-instant", temperature=0.1)
GPTOSS20 = ChatGroq(model_name="openai/gpt-oss-20b", model_kwargs={"tool_choice": "none"}, temperature=0.1)

# Judges
llama70 = ChatGroq(model_name="llama-3.3-70b-versatile", temperature=0.1)
GPTOSS120 = ChatGroq(model_name="openai/gpt-oss-120b", model_kwargs={"tool_choice": "none"}, temperature=0.1)

In [None]:
from typing import Literal,List
from pydantic import BaseModel, Field

class CheckResult(BaseModel):
    checkable: Literal["POTENTIALLY CHECKABLE", "UNCHECKABLE"]
    explanation: str = Field("")

class RetrieveInfoResult(BaseModel):
    claim_source: str = Field("unknown")
    primary_source: bool = Field(False)
    source_description: str = Field("")
    subject: str = Field("unclear")
    quantitative: str = Field("") 
    precision: str = Field("")
    based_on: str = Field("")
    alerts: List[str] = Field(default_factory=list)
    geography: str = Field("unclear")
    time_period: str = Field("unclear")

class ConfirmResult(BaseModel):
    confirmed: bool = Field(False)

In [None]:
from langchain_core.prompts import ChatPromptTemplate

checkable_check_prompt = """
### Role
Neutral Fact-Checking Analyst.

### Inputs
Claim: {claim}

### Task
Classify the claim and determine if it can be fact-checked.

### Classification Logic
- **UNCHECKABLE**:
  - Opinion or value judgment
  - Prediction or future-oriented statement
- **POTENTIALLY CHECKABLE**:
  - Factual claims about the past or present
  - Rating is not UNCHECKABLE

### Task
Use the dataset rating to set the checkability label:
- If rating is "UNCHECKABLE" -> checkable MUST be "UNCHECKABLE"
- Otherwise -> checkable MUST be "POTENTIALLY CHECKABLE"
- Write a brief explanation why the claim is classified this way, don't mention the link with the rating, ONLY explain why you think it is UNCHECKABLE.

### Output (JSON)
{{
  "checkable": "POTENTIALLY CHECKABLE | UNCHECKABLE",
  "explanation": "Brief justification",
}}
""".strip()

c_prompt = ChatPromptTemplate.from_template(checkable_check_prompt)

retrieve_info_prompt = """
### Role
Neutral Fact-Checking Analyst. Focus on objective evaluation.

### Context
- Claim: {claim}
- Year: {year}

### Additional context the user provided
"{additional_context}"

### Task 1: Source & Intent Extraction
1. **claim_source**: Identify the person or organization who originated the claim.
2. **primary_source**: Set to true ONLY if the evidence confirms this is the original/foundational origin.
3. **source_description**: Describe the medium (e.g., "Official PDF", "Social Media Post").

### Task 2: Factual Dimension Analysis
1. **Subject**: Identify the core entity or event.
2. **Quantitative/Qualitative**: Explain if it is measurable data or a description.
3. **Precision**: Categorize as Precise, Vague, or Absolute (100%), and provide specific numbers, or names from the evidence.
4. **Based On**: Identify the likely methodology (e.g., Official stats, Survey, research). Provide a brief explanation.
5. **Geography**: Identify the geographic scope of the claim.
6. **Time Period**: Identify the time frame relevant to the claim, if nothing available use {year}.

### Task 3: Guidance & Risk
1. **Alerts**: Flag missing Geography, Time Period, unclear subject, qualitative claim, vague quantitative claim, methodological details absent. Do not flag if the info is present.
3. Include specific details (dates, numbers, names) from the additional context if available:
"{additional_context}"

### Output Format (JSON)
{{
  "claim_source": "Person/Organisation" or "unknown",
  "primary_source": true/false,
  "source_description": "medium description",
  "subject": "subject text" or "unclear",
  "quantitative": "quantitative/qualitative + short explanation",
  "precision": "precise/vague/absolute + specifics",
  "based_on": "methodology + short explanation" or "unclear",
  "alerts": ["..."],
  "geography": "..." or "unclear",
  "time_period": "..." or "unclear",
}}
""".strip()

r_prompt = ChatPromptTemplate.from_template(retrieve_info_prompt)

# Prompt to confirm the extracted information with the user
intent_prompt = """
### Role
Linguistic Analyst specializing in intent detection.

### Context
- User's Response: "{user_answer}"

### Task
Determine if the User's Response provides a "Green Light" to proceed.

### Decision Rules
**Set "confirmed": true IF:**
- User explicitly agrees (e.g., "Yes," "Correct," "Exactly").
- User provides a neutral command somwhere in the answer to proceed (e.g., "Continue," "Next" "Proceed" "Move on").
- User admits they have no more information (e.g., "I don't know," "That's all I have," "No more details").

**Set "confirmed": false IF:**
- User provides **new additional context or corrections** (even if they agree with the rest).
- User expresses uncertainty or asks a new question.

### Important Rules
- Maintain a neutral, analytical tone.

### Output (JSON)
{{
  "confirmed": boolean
}}
""".strip()

i_prompt = ChatPromptTemplate.from_template(intent_prompt)


In [None]:
# Function to add checkable columns to DataFrame
def add_model_columns(
        df: pd.DataFrame, 
        retrieval_chain, 
        check_chain,
        intent_chain,
        model_name: str, 
        claim_col: str = "claim", 
        context_col: str = "translated", 
        year_col: int = "year"
        user_answer_col: str = "user_answer"
    ) -> pd.DataFrame:
    
    # Copy the dataframe
    out = df.copy()

    # For each row in the dataset call the llm
    def _run_row(row):
        claim = row[claim_col]
        translated = row[context_col]
        year = row[year_col]
        user_answer= row[user_answer]
        additional_context = translated if isinstance(translated, str) and translated.strip() else ""

        check= check_chain.invoke({
            "claim": claim,
        })

        retrieval= retrieval_chain.invoke({
            "claim": claim,
            "year":year,
            "additional_context": additional_context
        })

         intent= intent_chain.invoke({
            "user_answer": user_answer,
        })

        # Build the human-readable summary per row
        details_text = (
            f"- claim_source: {retrieval.claim_source}\n"
            f"- primary_source: {retrieval.primary_source}\n"
            f"- source_description: {retrieval.source_description}\n"
            f"- subject: {retrieval.subject}\n"
            f"- quantitative: {retrieval.quantitative or 'not clearly specified'}\n"
            f"- precision: {retrieval.precision}\n"
            f"- based_on: {retrieval.based_on}\n"
            f"- geography: {retrieval.geography}\n"
            f"- time_period: {retrieval.time_period}\n"
        )
        

        # Return a Series so we can easily join it back to the dataframe
        return pd.Series({
            f"checkable_{model_name}": check.checkable,
            f"explanation_{model_name}": check.explanation,
            f"details_{model_name}": details_text,
            f"alerts_{model_name}": retrieval.alerts,
            f"confirmed_{model_name}": intent.confirmed,
        })

    # Apply the function and join the new columns to the original dataframe
    results = out.apply(_run_row, axis=1)
    out = pd.concat([out, results], axis=1)

    return out

In [None]:
import pandas as pd

# Build the langchain chain
def build_check_chain(llm):
    structured_llm = llm.with_structured_output(CheckResult, method="json_mode")
    return c_prompt | structured_llm

# Build the langchain chain
def build_retrieval_chain(llm):
    structured_llm = llm.with_structured_output(RetrieveInfoResult, method="json_mode")
    return r_prompt | structured_llm

# Build the langchain chain
def build_intent_chain(llm):
    structured_llm = llm.with_structured_output(ConfirmResult, method="json_mode")
    return i_prompt | structured_llm

# Build chains for Llama
llama_retrieval = build_retrieval_chain(llama8)
llama_check = build_check_chain(llama8)
llama_intent = build_intent_chain(llama8)

# Build chains for GPT
gpt_retrieval = build_retrieval_chain(GPTOSS20)
gpt_check = build_check_chain(GPTOSS20)
gpt_intent = build_intent_chain(GPTOSS20)

# Map the model nicknames to the chains
models_to_run = {
    "llama8": {"retrieval": llama_retrieval, "check": llama_check, "intent": llama_intent},
    "gptoss20": {"retrieval": gpt_retrieval, "check": gpt_check, "intent": gpt_intent}
}

# Load your ground dataset
df_models = pd.read_csv("eval_ground_step3.csv", encoding="utf-8")

# Loop through the dictionary
for name, chains in models_to_run.items():
    print(f"Running evaluation for: {name}...")
    df_models = add_model_columns(
        df_models,
        retrieval_chain=chains["retrieval"],
        check_chain=chains["check"],
        intent_chain=chains["check"],
        model_name=name
    )

# 4. Save the final combined results
df_models.to_csv("eval_models_output.csv", index=False)

Running evaluation for: llama8...
Running evaluation for: gptoss20...


### Part 2 Judging the output

In [86]:
from typing import Literal,List
from pydantic import BaseModel, Field

class Task1Result(BaseModel):
    reasoning: bool

class Task2Result(BaseModel):
    hallucination: bool
    completeness: bool

In [87]:
from langchain_core.prompts import ChatPromptTemplate

score_task1_prompt = """
### Role
You are an expert Logical Analyst acting as an LLM Judge. 
Your goal is to perform a semantic and logical comparison between a validated reference data and a generated candidate output.

### Inputs
- Claim: {claim}

### Reference
- **Explanation (Reference):** 
<RExplanation>{r_explanation}</RExplanation>

- **Rating (Reference):** {r_rating}

### Candidate 
- **Explanation (Candidate):** 
<CExplanation>{c_explanation}</CExplanation>

- **Rating (Candidate):** {c_rating}

### Evaluation Criteria
You must determine if the **core logical path** of the *Candidate explanation* matches the validated *Reference explanation*  
- **Focus on:** Underlying premises, causal links, and if the conclusion: the *rating* is the same.
- **Ignore:** Differences in tone, word choice, sentence structure, or level of verbosity, provided the facts remain the same.
- **Fail Criteria:** Mark as `false` if the candidate introduces factual hallucinations,  misses a critical step in the logical chain, or differs in rating. 
- **Succes Criteria:**  Mark as `true` if the reasoning is similar to the *Reference explanation*, or makes sense as an alternative reasoning (only based on the claim). 

### Classification Logic
- **true**: The generated reasoning follows a similar logical flow and reaches the same rating as the reference reasoning, or gives a valid alternative reasoning.
- **false**: The generated reasoning introduces logical fallacies, misses key constraints, or differs in the rating.

### Output Format
Return a JSON object with the following keys:
1. `comparison`: A brief point-by-point analysis of similarities and differences.
2. `reasoning`: A boolean (true/false) indicating if the core logic is preserved.
### Output (JSON)
{{
  "reasoning": true/false,
}}
""".strip()

t1_prompt = ChatPromptTemplate.from_template(score_task1_prompt)

score_task2_prompt = """
### Role
You are a Precision Auditor. Your task is to evaluate a generated response based on a set of provided source materials.

### Inputs
- Claim: {claim}
- Year: {year}

### Additional context the user provided
"{additional_context}"

### Reference
- **Details(Reference):** 
<RDetails>{r_details}</RDetails>

- **Alerts (Reference):** 
<RDetails>{r_alerts}</RDetails>

### Candidate 
- **Details (Candidate):**
<CDetails>{c_details}</CDetails>

- **Alerts (Candidate):**
<CDetails>{c_alerts}</CDetails>


### Evaluation Criteria

#### 1. Completeness
Check if the Generated Response from the **Candidate data** includes most critical information found in the **Reference** Details and Alerts.
- **Score 1 (Complete):** Most essential facts, constraints, and warnings from the *Reference data* are present.
- **Score 0 (Missing Facts):** The response omits key information and significant details from the *Reference data*. 

#### 2. Hallucination
Check if the Generated Response from the **Candidate data** contains information NOT supported by the sources.
- **Score 1 (No Hallucinations):** Every evidence in the Details or Alerts, exists in the *claim* or if available in the *additional context*.
- **Score 0 (Contains Hallucinations):** The response includes "made up" facts, external knowledge not found in the *claim* or *additional context*.

### Output Format
Return a JSON object. 

{{
  "completeness": true/false,
  "hallucination": true/false,
}}
"""
t2_prompt = ChatPromptTemplate.from_template(score_task2_prompt)


In [88]:
from typing import List

def run_multi_model_evaluation(
    df: pd.DataFrame, 
    task1_chain, 
    task2_chain, 
    model_suffixes: List[str],
) -> pd.DataFrame:
    """
    Runs evaluation for multiple models and returns a NEW dataframe
    with the results.
    """
    # Copy of the dataframe
    out_df = df.copy()

    for suffix in model_suffixes:
        print(f"Evaluating model: {suffix}...")
        
        # Define dynamic column names based on the suffix
        c_expl_col = f"explanation_{suffix}"
        c_rating_col = f"checkable_{suffix}"
        c_details_col = f"details_{suffix}"
        c_alerts_col = f"alerts_{suffix}"
        
        # Reference columns (assuming these are constant/static)
        r_expl_col = "explanation"
        r_rating_col = "checkable"
        r_details_col = "details_text"
        r_alerts_col = "alerts"

        # primary source data
        claim_col = "claim"
        translated = "translated"
        year_col = "year"
        additional_context_col = translated if isinstance(translated, str) and translated.strip() else ""

        def _evaluate_row(row):
            # Reasoning Comparison
            t1_output = task1_chain.invoke({
                "r_explanation": row[r_expl_col],
                "c_explanation": row[c_expl_col],
                "r_rating": row[r_rating_col],
                "c_rating": row[c_rating_col],
                "claim": row[claim_col],
            })

            # Completeness & Hallucination
            t2_output = task2_chain.invoke({
                "r_details": row[r_details_col],
                "c_details": row[c_details_col],
                "r_alerts": row[r_alerts_col],
                "c_alerts": row[c_alerts_col],
                "claim": row[claim_col],
                "year":row[year_col],
                "additional_context": row[additional_context_col],
            })

            return (
                t1_output.reasoning,
                t2_output.completeness,
                t2_output.hallucination
            )

        # Apply the evaluation
        temp_results = out_df.apply(_evaluate_row, axis=1)

        # Unpack results into new columns in our results_df
        out_df[f"reason_{suffix}"] = temp_results.apply(lambda t: t[0])
        out_df[f"complete_{suffix}"] = temp_results.apply(lambda t: t[1])
        out_df[f"halluci_{suffix}"] = temp_results.apply(lambda t: t[2])

    return out_df

First we will run an evaluation with GPT OSS 120B

In [89]:
# these models are being evaluated
models_to_test = ["llama8", "gptoss20"]

# Build the langchain chain
def build_task1_chain(llm):
    structured_llm = llm.with_structured_output(Task1Result, method="json_mode")
    return t1_prompt | structured_llm

# Build the langchain chain
def build_task2_chain(llm):
    structured_llm = llm.with_structured_output(Task2Result, method="json_mode")
    return t2_prompt | structured_llm

# Build chains for GPT
task1_chain = build_task1_chain(GPTOSS120)
task2_chain = build_task2_chain(GPTOSS120)

# Load your ground dataset
original_df = pd.read_csv("eval_models_output.csv", encoding="utf-8")

scores_df = run_multi_model_evaluation(original_df, task1_chain, task2_chain, models_to_test)

scores_df.to_csv("eval_gptoss_output.csv", index=False)

Evaluating model: llama8...
Evaluating model: gptoss20...


In [90]:
scores_df

Unnamed: 0,url,claim,rating,translated,year,checkable,explanation,details_text,alerts,question,...,checkable_gptoss20,explanation_gptoss20,details_gptoss20,alerts_gptoss20,reason_llama8,complete_llama8,halluci_llama8,reason_gptoss20,complete_gptoss20,halluci_gptoss20
0,https://eufactcheck.eu/factcheck/mostly-true-w...,Wages grew more than prices in countries that ...,MOSTLY TRUE,,2022.0,POTENTIALLY CHECKABLE,The statement makes a factual assertion about ...,- claim_source: unknown\n- primary_source: Fal...,['missing specific geography (list of countrie...,Could you provide more details about the sourc...,...,POTENTIALLY CHECKABLE,The claim states a specific economic trend—wag...,- claim_source: unknown\n- primary_source: Fal...,"['claim source not identified', 'no specific s...",True,False,False,True,True,False
1,https://eufactcheck.eu/factcheck/mostly-false-...,The 2022 FIFA World Cup in Qatar is fully carb...,MOSTLY FALSE,,2022.0,POTENTIALLY CHECKABLE,The claim asserts a factual statement about a ...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodological details abs...",Could you share where this claim originated or...,...,POTENTIALLY CHECKABLE,The claim asserts a factual condition about a ...,- claim_source: FIFA\n- primary_source: True\n...,[],True,False,True,True,False,True
2,https://eufactcheck.eu/factcheck/mostly-true-e...,EU now spends three times more than Russia on ...,MOSTLY TRUE,,2022.0,POTENTIALLY CHECKABLE,"The statement asserts a specific, quantitative...",- claim_source: unknown\n- primary_source: Fal...,"['Missing claim source', 'Methodology not prov...",Could you share where you saw this claim or an...,...,POTENTIALLY CHECKABLE,The claim states a specific quantitative compa...,- claim_source: unknown\n- primary_source: Fal...,[],True,False,False,True,False,True
3,https://eufactcheck.eu/factcheck/false-the-eu-...,The EU has not fulfilled its role as a guarant...,FALSE,,2022.0,POTENTIALLY CHECKABLE,The statement asserts a factual condition abou...,- claim_source: unknown\n- primary_source: Fal...,"['missing source information', 'missing method...",Could you provide more details about the sourc...,...,POTENTIALLY CHECKABLE,The claim asserts a factual statement about th...,- claim_source: unknown\n- primary_source: Fal...,[],False,False,True,True,False,False
4,https://eufactcheck.eu/factcheck/mostly-false-...,Herd immunity in Germany has already been reached,MOSTLY FALSE,,2022.0,POTENTIALLY CHECKABLE,The claim asserts a factual condition about th...,- claim_source: unknown\n- primary_source: Fal...,"['missing claim source', 'missing methodology ...",Could you provide more information about where...,...,POTENTIALLY CHECKABLE,The claim asserts a specific factual state abo...,- claim_source: German Health Minister Jens Sp...,[],True,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,https://eufactcheck.eu/factcheck/mostly-false-...,Germany faces the highest energy costs worldwi...,MOSTLY FALSE,,2025.0,POTENTIALLY CHECKABLE,The statement makes a factual assertion about ...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodological details abs...",Could you provide the source or any specific d...,...,POTENTIALLY CHECKABLE,The claim states a specific factual assertion ...,- claim_source: unknown\n- primary_source: Fal...,"['Missing source attribution', 'Missing quanti...",True,False,True,True,True,True
145,https://eufactcheck.eu/factcheck/mostly-true-o...,"Of the men who arrived in Germany in 2015/16, ...",MOSTLY TRUE,,2025.0,POTENTIALLY CHECKABLE,The statement asserts a specific statistic abo...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodology details absent']",Could you share where you saw this statistic o...,...,POTENTIALLY CHECKABLE,The claim states a specific employment statist...,- claim_source: unknown\n- primary_source: Fal...,"['source not identified', 'methodology not pro...",True,False,True,True,True,False
146,https://eufactcheck.eu/factcheck/mostly-false-...,Permanent border controls are “a necessity” in...,MOSTLY FALSE,The provided text is in German. Here is the tr...,2025.0,POTENTIALLY CHECKABLE,The statement makes a factual claim about the ...,- claim_source: Bijan Djir‑Sarai (FDP General ...,"['qualitative claim', 'vague claim', 'methodol...","Could you share any specific data, studies, or...",...,UNCHECKABLE,The claim expresses a normative judgment that ...,- claim_source: B. Djir-Sarai (FDP General Sec...,"['qualitative claim', 'methodological details ...",False,False,False,False,False,False
147,https://eufactcheck.eu/factcheck/true-germanys...,Germany’s defense expenditures have increased ...,TRUE,,2025.0,POTENTIALLY CHECKABLE,The statement asserts specific historical figu...,- claim_source: unknown\n- primary_source: Fal...,"['missing source information', 'missing method...",Could you share where you saw this claim or an...,...,POTENTIALLY CHECKABLE,The claim states a specific numerical change i...,- claim_source: unknown\n- primary_source: Fal...,"['No primary source identified', 'Methodology ...",True,False,True,True,True,False


In [None]:
models_to_test = ["llama70", "gptoss20"]

# Build the langchain chain
def build_task1_chain(llm):
    structured_llm = llm.with_structured_output(Task1Result, method="json_mode")
    return t1_prompt | structured_llm

# Build the langchain chain
def build_task2_chain(llm):
    structured_llm = llm.with_structured_output(Task2Result, method="json_mode")
    return t2_prompt | structured_llm

# Build chains for Llama
llama_task1 = build_task1_chain(llama70)
llama_task2 = build_task2_chain(llama70)

# Build chains for GPT
gpt_task1 = build_task1_chain(GPTOSS120)
gpt_task2 = build_task2_chain(GPTOSS120)

# Load your ground dataset
original_df = pd.read_csv("eval_models_output.csv", encoding="utf-8")

scores_df = run_multi_model_evaluation(original_df, t1_chain, t2_chain, models_to_test)

In [79]:
original_df

Unnamed: 0,url,claim,rating,translated,year,checkable,explanation,details_text,alerts,question,user_answer,confirmed,checkable_llama8,explanation_llama8,details_llama8,alerts_llama8,checkable_gptoss20,explanation_gptoss20,details_gptoss20,alerts_gptoss20
0,https://eufactcheck.eu/factcheck/mostly-true-w...,Wages grew more than prices in countries that ...,MOSTLY TRUE,,2022.0,POTENTIALLY CHECKABLE,The statement makes a factual assertion about ...,- claim_source: unknown\n- primary_source: Fal...,['missing specific geography (list of countrie...,Could you provide more details about the sourc...,The claim is based on the World Bank’s 2021 Gl...,False,POTENTIALLY CHECKABLE,The claim is about a general trend in multiple...,- claim_source: International Monetary Fund (I...,[],POTENTIALLY CHECKABLE,The claim states a specific economic trend—wag...,- claim_source: unknown\n- primary_source: Fal...,"['claim source not identified', 'no specific s..."
1,https://eufactcheck.eu/factcheck/mostly-false-...,The 2022 FIFA World Cup in Qatar is fully carb...,MOSTLY FALSE,,2022.0,POTENTIALLY CHECKABLE,The claim asserts a factual statement about a ...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodological details abs...",Could you share where this claim originated or...,"I don’t have a concrete source for that claim,...",True,POTENTIALLY CHECKABLE,The claim is about a specific event (2022 FIFA...,- claim_source: FIFA\n- primary_source: True\n...,['missing methodology details'],POTENTIALLY CHECKABLE,The claim asserts a factual condition about a ...,- claim_source: FIFA\n- primary_source: True\n...,[]
2,https://eufactcheck.eu/factcheck/mostly-true-e...,EU now spends three times more than Russia on ...,MOSTLY TRUE,,2022.0,POTENTIALLY CHECKABLE,"The statement asserts a specific, quantitative...",- claim_source: unknown\n- primary_source: Fal...,"['Missing claim source', 'Methodology not prov...",Could you share where you saw this claim or an...,I saw the claim in a New York Times article fr...,False,POTENTIALLY CHECKABLE,The claim is about a specific numerical compar...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim states a specific quantitative compa...,- claim_source: unknown\n- primary_source: Fal...,[]
3,https://eufactcheck.eu/factcheck/false-the-eu-...,The EU has not fulfilled its role as a guarant...,FALSE,,2022.0,POTENTIALLY CHECKABLE,The statement asserts a factual condition abou...,- claim_source: unknown\n- primary_source: Fal...,"['missing source information', 'missing method...",Could you provide more details about the sourc...,"I don’t have the source details, so let’s move...",True,POTENTIALLY CHECKABLE,The claim is a statement about the EU's fulfil...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim asserts a factual statement about th...,- claim_source: unknown\n- primary_source: Fal...,[]
4,https://eufactcheck.eu/factcheck/mostly-false-...,Herd immunity in Germany has already been reached,MOSTLY FALSE,,2022.0,POTENTIALLY CHECKABLE,The claim asserts a factual condition about th...,- claim_source: unknown\n- primary_source: Fal...,"['missing claim source', 'missing methodology ...",Could you provide more information about where...,I saw it in a German health blog that referenc...,False,POTENTIALLY CHECKABLE,The claim about herd immunity in Germany being...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim asserts a specific factual state abo...,- claim_source: German Health Minister Jens Sp...,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,https://eufactcheck.eu/factcheck/mostly-false-...,Germany faces the highest energy costs worldwi...,MOSTLY FALSE,,2025.0,POTENTIALLY CHECKABLE,The statement makes a factual assertion about ...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodological details abs...",Could you provide the source or any specific d...,I saw the figures in a Bloomberg report from M...,False,POTENTIALLY CHECKABLE,The claim is a factual statement about the cur...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim states a specific factual assertion ...,- claim_source: unknown\n- primary_source: Fal...,"['Missing source attribution', 'Missing quanti..."
145,https://eufactcheck.eu/factcheck/mostly-true-o...,"Of the men who arrived in Germany in 2015/16, ...",MOSTLY TRUE,,2025.0,POTENTIALLY CHECKABLE,The statement asserts a specific statistic abo...,- claim_source: unknown\n- primary_source: Fal...,"['missing source', 'methodology details absent']",Could you share where you saw this statistic o...,I found the statistic in the WHO 2022 report o...,False,POTENTIALLY CHECKABLE,The claim is about a specific percentage of me...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim states a specific employment statist...,- claim_source: unknown\n- primary_source: Fal...,"['source not identified', 'methodology not pro..."
146,https://eufactcheck.eu/factcheck/mostly-false-...,Permanent border controls are “a necessity” in...,MOSTLY FALSE,The provided text is in German. Here is the tr...,2025.0,POTENTIALLY CHECKABLE,The statement makes a factual claim about the ...,- claim_source: Bijan Djir‑Sarai (FDP General ...,"['qualitative claim', 'vague claim', 'methodol...","Could you share any specific data, studies, or...",According to the Federal Office for Migration ...,False,POTENTIALLY CHECKABLE,The claim is a subjective statement as it uses...,- claim_source: Bijan Djir-Sarai\n- primary_so...,"['Geography unclear', 'Time Period unclear', '...",UNCHECKABLE,The claim expresses a normative judgment that ...,- claim_source: B. Djir-Sarai (FDP General Sec...,"['qualitative claim', 'methodological details ..."
147,https://eufactcheck.eu/factcheck/true-germanys...,Germany’s defense expenditures have increased ...,TRUE,,2025.0,POTENTIALLY CHECKABLE,The statement asserts specific historical figu...,- claim_source: unknown\n- primary_source: Fal...,"['missing source information', 'missing method...",Could you share where you saw this claim or an...,"I don’t have the exact source right now, so le...",True,POTENTIALLY CHECKABLE,The claim is about a specific numerical increa...,- claim_source: unknown\n- primary_source: Fal...,"['missing Geography', 'missing Time Period', '...",POTENTIALLY CHECKABLE,The claim states a specific numerical change i...,- claim_source: unknown\n- primary_source: Fal...,"['No primary source identified', 'Methodology ..."
