## LLM as judge evaluation

In **part 1**, different models perform the same 3 tasks as the ground truth model (and the assistant).
1. The first task is to check whether a claim is checkable.
2. The next task is to retrieve all relevant information and, if available, additional context from the original source.
3. Finally, a confirmation prompt decides whether the user confirms and wants to continue.

In **part 2**, two larger, different models score the generated output of each model and task against the ground truth (the manually verified output).
All tasks are scored on a binary scale, except that the judge prompt for the second task also allows 0.5 for partial completeness.
1. The first task is scored on *Reasoning*, with 0="unclear reasoning", 1="clear reasoning"
2. The second task is scored on:
    - *Completeness*, with 0="missing facts", 1="complete"
    - *Hallucination*, with 0="contains hallucinated facts", 1="no hallucinations found"
3. The last task is scored on *Intent* (did the user want to proceed or not), with 0="confirmed differs", 1="confirmed is the same"
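
Because every criterion is coded so that 1 is the desirable outcome, the mean of a score column is the share of rows that passed. For *Hallucination* this means a higher value is better, since 1 stands for "no hallucinations found". A minimal sketch of that aggregation, using made-up scores and a hypothetical helper `pass_rates`:

```python
from statistics import mean

# Hypothetical per-row judge scores for one model (1 = pass, 0 = fail).
scores = {
    "reasoning":     [1, 1, 0, 1],
    "completeness":  [1, 0, 0, 1],
    "hallucination": [1, 1, 1, 0],  # 1 means "no hallucinations found"
    "intent":        [1, 1, 1, 1],
}

def pass_rates(per_row_scores):
    """Return the share of rows scored 1 for each criterion."""
    return {name: mean(values) for name, values in per_row_scores.items()}

rates = pass_rates(scores)
print(rates)
```

The summary tables at the end of the notebook apply the same idea with `.mean()` on the score columns.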

### Part 1 Generating output

In [None]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

# Load all the API keys
load_dotenv(dotenv_path="../.env", override=True)

# Models on Groq
llama8 = ChatGroq(model_name="llama-3.1-8b-instant", temperature=0.1)
GPTOSS20 = ChatGroq(model_name="openai/gpt-oss-20b", temperature=0.1)
qwen3_32 = ChatGroq(model_name="qwen/qwen3-32b", temperature=0.1)
llama70 = ChatGroq(model_name="llama-3.3-70b-versatile", temperature=0.1)
GPTOSS120 = ChatGroq(model_name="openai/gpt-oss-120b", temperature=0.1)

# Ollama Models
llama1 = ChatOllama(model="llama3.2:1b", temperature=0.1, base_url="http://localhost:11434")
llama3 = ChatOllama(model="llama3.2:3b", temperature=0.1, base_url="http://localhost:11434")
mistral7 = ChatOllama(model="mistral:7b", temperature=0.1, base_url="http://localhost:11434")
qwen3_1 = ChatOllama(model="qwen3:1.7b", temperature=0.1, base_url="http://localhost:11434")
qwen3_4 = ChatOllama(model="qwen3:4b", temperature=0.1, base_url="http://localhost:11434")

# Judge model
llmGPT5 = ChatOpenAI(model="gpt-5", temperature=0.1)

In [224]:
from typing import Literal
from pydantic import BaseModel, Field

class CheckResult(BaseModel):
    checkable: Literal["POTENTIALLY CHECKABLE", "UNCHECKABLE"]
    explanation: str = Field("")

class RetrieveInfoResult(BaseModel):
    claim_source: str = Field("unknown")
    primary_source: bool = Field(False)
    source_description: str = Field("")
    subject: str = Field("unclear")
    data_type: str = Field("")
    precision: str = Field("")
    based_on: str = Field("")
    alerts: str = Field("",description="A comma-separated list of flags or 'None' if no issues found")
    geography: str = Field("unclear")
    time_period: str = Field("unclear")

class ConfirmResult(BaseModel):
    confirmed: bool = Field(False)

In [225]:
from langchain_core.prompts import ChatPromptTemplate

checkable_check_prompt = """
### Role
Neutral Fact-Checking Analyst.

### Inputs
Claim: {claim}

### Task
Classify the claim and determine if it can be fact-checked.

### Classification Logic
- **UNCHECKABLE**: Opinion, value judgment, or prediction.
- **POTENTIALLY CHECKABLE**: Factual claims about the past or present.

### Output (JSON)
{{
  "checkable": "POTENTIALLY CHECKABLE",
  "explanation": "Brief justification for the classification."
}}
""".strip()

c_prompt = ChatPromptTemplate.from_template(checkable_check_prompt)

retrieve_info_prompt = """
### Role
Neutral Fact-Checking Analyst. Focus on objective evaluation.

### Context
- Claim: {claim}
- Year: {year}

### Additional context the user provided
"{additional_context}"

### Task 1: Source & Intent Extraction
1. **claim_source**: Identify the person or organization who originated the claim.
2. **primary_source**: Set to true ONLY if the evidence confirms this is the original/foundational origin.
3. **source_description**: Describe the medium (e.g., "Official PDF", "Social Media Post").

### Task 2: Factual Dimension Analysis
1. **Subject**: Identify the core entity or event.
2. **Data Type**: Quantitative or Qualitative, explain if it is measurable data or a descriptive data.
3. **Precision**: Categorize as Precise, Vague, or Absolute (100%), and provide specific numbers, or names from the evidence.
4. **Based On**: Identify the likely methodology (e.g., Official stats, Survey, research). Provide a brief explanation.
5. **Geography**: Identify the geographic scope of the claim.
6. **Time Period**: The time frame is normally {year}, unless otherwise stated in the claim or additional context.

### Task 3: Guidance & Risk
1. **Alerts**: Flag "missing geography", "qualitative claim", "missing methodological details", "missing source". 
  If the claim is quantitative but provides no numbers or precise textual indications (e.g., "All", "None"), *Flag in Alerts*: "vague quantitative claim".
  Do not flag if the info is present.
  Example output: "missing geography, missing source"
2. Include specific details (dates, numbers, names) from the additional context if available:
"{additional_context}"

### IMPORTANT OUTPUT CONSTRAINTS
- "alerts" MUST be a single STRING, NOT a list, NOT an object/dict.

### Output Format (JSON)
{{
  "claim_source": "a Person or an Organisation" or "unknown"
  "primary_source": true/false
  "source_description": "medium description"
  "subject": "subject text" or "unclear"
  "data_type": "STRING ONLY. Must start with one of: 'quantitative:', 'qualitative:', or 'unclear:' followed by a short explanation. Never boolean/null."
  "precision": "precise/vague/absolute and some specifics"
  "based_on": "methodology and short explanation" or "unclear"
  "alerts": "MUST be a String NOT a List, NOT a Dict. Either '' or a comma-separated(e.g., 'missing geography, missing source')"
  "geography": "Geography",
  "time_period": "Year or more precise timeframe"
}}
""".strip()

r_prompt = ChatPromptTemplate.from_template(retrieve_info_prompt)

# Prompt to confirm the extracted information with the user
intent_prompt = """
### Role
Linguistic Analyst specializing in intent detection.

### Context
- User's Response: "{user_answer}"

### Task
Analyze the "User's Response" to determine if the assistant has permission to proceed to the next stage of the process ("Green Light").

### Decision Rules
**Set "confirmed": true IF:**
1. **Explicit Command:** The user uses navigational keywords (e.g., "Next," "Continue," "Proceed," "Move on," "Go ahead").
2. **Affirmation:** The user agrees with a previous summary (e.g., "Yes," "Correct," "Exactly," "That's right").
3. **Closure:** The user indicates they have no further details to provide (e.g., "I don't know," "That's all I have," "No more info").

**Set "confirmed": false IF:**
1. **New Info Only:** The user provides additional facts or context but does NOT include a command to move on.
2. **Correction:** The user is correcting a previous mistake.
3. **Uncertainty:** The user asks a follow-up question or expresses confusion.

### Important Rules
- Focus on the **intent to progress**.
- If the response is ambiguous, default to `false`.

### Output (JSON)
{{
  "confirmed": true
}}
""".strip()

i_prompt = ChatPromptTemplate.from_template(intent_prompt)
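
The doubled braces in the `### Output (JSON)` parts of the prompts are escapes: `ChatPromptTemplate.from_template` uses f-string style formatting by default, so `{{` and `}}` are rendered as literal braces while `{user_answer}` is treated as a variable. The sketch below shows the same rule with plain `str.format` on a shortened, hypothetical version of the intent prompt.

```python
# Hypothetical, shortened version of the intent prompt.
template = """
User's Response: "{user_answer}"

### Output (JSON)
{{
  "confirmed": true
}}
""".strip()

# Single braces are filled in, doubled braces become literal JSON braces.
rendered = template.format(user_answer="Yes, go ahead")
print(rendered)
```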


In [227]:
import pandas as pd
from langchain_core.exceptions import OutputParserException

# Run the three task chains on every row and add the outputs as model-specific columns
def add_model_columns(
        df: pd.DataFrame,
        retrieval_chain,
        check_chain,
        intent_chain,
        model_name: str,
        claim_col: str = "claim",
        context_col: str = "translated",
        year_col: str = "year",
        user_answer_col: str = "user_answer"
    ) -> pd.DataFrame:
    
    # Copy the dataframe
    out = df.copy()

    # For each row in the dataset call the llm
    def _run_row(row):
        try:
            claim = row[claim_col]
            translated = row[context_col]
            year = row[year_col]
            user_answer= row[user_answer_col]
            additional_context = (
                translated if isinstance(translated, str) and translated.strip() else ""
            )

            check= check_chain.invoke({
                "claim": claim,
            })

            retrieval= retrieval_chain.invoke({
                "claim": claim,
                "year":year,
                "additional_context": additional_context
            })

            intent= intent_chain.invoke({
                "user_answer": user_answer,
            })

            # Build the human-readable summary per row
            details_text = (
                f"- claim_source: {retrieval.claim_source}\n"
                f"- primary_source: {retrieval.primary_source}\n"
                f"- source_description: {retrieval.source_description}\n"
                f"- subject: {retrieval.subject}\n"
                f"- quantitative: {retrieval.data_type or 'not clearly specified'}\n"
                f"- precision: {retrieval.precision}\n"
                f"- based_on: {retrieval.based_on}\n"
                f"- geography: {retrieval.geography}\n"
                f"- time_period: {retrieval.time_period}\n"
            )
            

            # Return a Series so we can easily join it back to the dataframe
            return pd.Series({
                f"checkable_{model_name}": check.checkable,
                f"explanation_{model_name}": check.explanation,
                f"details_{model_name}": details_text,
                f"alerts_{model_name}": retrieval.alerts,
                f"confirmed_{model_name}": intent.confirmed,
            })

        except (OutputParserException, ValueError, KeyError) as e:
            # Soft failure: record the error message and continue with the next row
            return pd.Series({
                f"checkable_{model_name}": None,
                f"explanation_{model_name}": "",
                f"details_{model_name}": "",
                f"alerts_{model_name}": "",
                f"confirmed_{model_name}": None,
                f"error_{model_name}": str(e),
            })

    # Apply the function and join the new columns to the original dataframe
    results = out.apply(_run_row, axis=1)
    out = pd.concat([out, results], axis=1)

    return out
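
The `additional_context` expression inside `_run_row` matters because most rows have no translated context, and pandas represents those empty cells as `NaN` (a float), not as an empty string. A minimal sketch of the same check as a standalone function; the name `normalize_context` is hypothetical:

```python
def normalize_context(value):
    """Return the context text, or "" when the cell is missing or blank."""
    return value if isinstance(value, str) and value.strip() else ""

# Missing cells read from CSV arrive as float NaN, not as "".
print(repr(normalize_context(float("nan"))))
print(repr(normalize_context("   ")))
print(repr(normalize_context("Report, p. 3")))
```

Without this step the literal text `nan` would be formatted into the prompt as if it were user-provided context.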

First, a number of models hosted on Groq perform the three tasks; their output is scored later on.

In [None]:
import pandas as pd

# Build the langchain chain
def build_check_chain(llm):
    structured_llm = llm.with_structured_output(CheckResult, method="json_mode")
    return c_prompt | structured_llm

# Build the langchain chain
def build_retrieval_chain(llm):
    structured_llm = llm.with_structured_output(RetrieveInfoResult, method="json_mode")
    return r_prompt | structured_llm

# Build the langchain chain
def build_intent_chain(llm):
    structured_llm = llm.with_structured_output(ConfirmResult, method="json_mode")
    return i_prompt | structured_llm

# Build chains for Llama8
llama8_retrieval = build_retrieval_chain(llama8)
llama8_check = build_check_chain(llama8)
llama8_intent = build_intent_chain(llama8)

# Build chains for GPT OSS-20B
gpt20_retrieval = build_retrieval_chain(GPTOSS20)
gpt20_check = build_check_chain(GPTOSS20)
gpt20_intent = build_intent_chain(GPTOSS20)

# Build chains for GPT OSS-120B
gpt120_retrieval = build_retrieval_chain(GPTOSS120)
gpt120_check = build_check_chain(GPTOSS120)
gpt120_intent = build_intent_chain(GPTOSS120)

# Build chains for Qwen3-32B
qwen_retrieval = build_retrieval_chain(qwen3_32)
qwen_check = build_check_chain(qwen3_32)
qwen_intent = build_intent_chain(qwen3_32)

# Build chains for Llama70
llama70_retrieval = build_retrieval_chain(llama70)
llama70_check = build_check_chain(llama70)
llama70_intent = build_intent_chain(llama70)


# Map the model nicknames to the chains
models_to_run = {
    "llama8": {"retrieval": llama8_retrieval, "check": llama8_check, "intent": llama8_intent},
    "llama70": {"retrieval": llama70_retrieval, "check": llama70_check, "intent": llama70_intent},
    "gptoss20": {"retrieval": gpt20_retrieval, "check": gpt20_check, "intent": gpt20_intent},
    "gptoss120": {"retrieval": gpt120_retrieval, "check": gpt120_check, "intent": gpt120_intent},
    "qwen3_32": {"retrieval": qwen_retrieval, "check": qwen_check, "intent": qwen_intent}
}

# Load your ground dataset
df_models = pd.read_csv("validated_reference_data.csv", encoding="utf-8")

# Loop through the dictionary
for name, chains in models_to_run.items():
    print(f"Running evaluation for: {name}...")
    df_models = add_model_columns(
        df_models,
        retrieval_chain=chains["retrieval"],
        check_chain=chains["check"],
        intent_chain=chains["intent"],
        model_name=name
    )

# Save the final combined results
df_models.to_csv("eval_models_output.csv", index=False)
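
The per-model chain setup above repeats the same three calls for every model. One possible way to shorten it, sketched here with stand-in values so it runs on its own, is a nested dict comprehension. The helper `build_all_chains` is hypothetical; in the notebook the `llms` values would be the `ChatGroq` or `ChatOllama` objects and the `builders` values the three `build_*_chain` functions.

```python
def build_all_chains(llms, builders):
    """Return {model_name: {task: chain}} for every model/task pair."""
    return {
        name: {task: build(llm) for task, build in builders.items()}
        for name, llm in llms.items()
    }

# Stand-ins (assumptions) so the sketch is self-contained.
llms = {"llama8": "llama8-llm", "llama70": "llama70-llm"}
builders = {
    "retrieval": lambda llm: f"retrieval({llm})",
    "check": lambda llm: f"check({llm})",
    "intent": lambda llm: f"intent({llm})",
}

models_to_run = build_all_chains(llms, builders)
print(models_to_run["llama8"]["check"])
```

The result has the same shape as the hand-written `models_to_run` dictionary, so the evaluation loop would not need to change.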

In [228]:
import pandas as pd

# Build the langchain chain
def build_check_chain(llm):
    structured_llm = llm.with_structured_output(CheckResult, method="json_mode")
    return c_prompt | structured_llm

# Build the langchain chain
def build_retrieval_chain(llm):
    structured_llm = llm.with_structured_output(RetrieveInfoResult, method="json_mode")
    return r_prompt | structured_llm

# Build the langchain chain
def build_intent_chain(llm):
    structured_llm = llm.with_structured_output(ConfirmResult, method="json_mode")
    return i_prompt | structured_llm

# Build chains for Llama3.2 1B
llama1_retrieval = build_retrieval_chain(llama1)
llama1_check = build_check_chain(llama1)
llama1_intent = build_intent_chain(llama1)

# Build chains for Llama3.2 3B
llama3_retrieval = build_retrieval_chain(llama3)
llama3_check = build_check_chain(llama3)
llama3_intent = build_intent_chain(llama3)

# Build chains for Mistral 7B
mistral7_retrieval = build_retrieval_chain(mistral7)
mistral7_check = build_check_chain(mistral7)
mistral7_intent = build_intent_chain(mistral7)

# Build chains for Qwen3-1.7B
qwen3_1_retrieval = build_retrieval_chain(qwen3_1)
qwen3_1_check = build_check_chain(qwen3_1)
qwen3_1_intent = build_intent_chain(qwen3_1)

# Build chains for Qwen3-4B
qwen3_4_retrieval = build_retrieval_chain(qwen3_4)
qwen3_4_check = build_check_chain(qwen3_4)
qwen3_4_intent = build_intent_chain(qwen3_4)


# Map the model nicknames to the chains
models_to_run = {
    "qwen3_1": {"retrieval": qwen3_1_retrieval, "check": qwen3_1_check, "intent": qwen3_1_intent},
    "qwen3_4": {"retrieval": qwen3_4_retrieval, "check": qwen3_4_check, "intent": qwen3_4_intent},
    "llama1": {"retrieval": llama1_retrieval, "check": llama1_check, "intent": llama1_intent},
    "llama3": {"retrieval": llama3_retrieval, "check": llama3_check, "intent": llama3_intent},
    "mistral7": {"retrieval": mistral7_retrieval, "check": mistral7_check, "intent": mistral7_intent},
}

# Load your ground dataset
df_models = pd.read_csv("validated_reference_data.csv", encoding="utf-8")

# Loop through the dictionary
for name, chains in models_to_run.items():
    print(f"Running evaluation for: {name}...")
    df_models = add_model_columns(
        df_models,
        retrieval_chain=chains["retrieval"],
        check_chain=chains["check"],
        intent_chain=chains["intent"],
        model_name=name
    )

# Save the final combined results (same file name as the Groq run, so that file is overwritten)
df_models.to_csv("eval_models_output.csv", index=False)

Running evaluation for: qwen3_1...
Running evaluation for: qwen3_4...
Running evaluation for: llama1...
Running evaluation for: llama3...
Running evaluation for: mistral7...


### Part 2 Judging the output

In [229]:
from pydantic import BaseModel

class Task1Result(BaseModel):
    reasoning: bool

class Task2Result(BaseModel):
    hallucination: bool
    completeness: float

In [230]:
from langchain_core.prompts import ChatPromptTemplate

score_task1_prompt = """
### Role
You are an expert Logical Analyst acting as an LLM Judge. 
Your goal is to perform a semantic and logical comparison between a validated reference data and a generated candidate output.

### Inputs
- Claim: {claim}

### Reference
- **Explanation (Reference):** 
<RExplanation>{r_explanation}</RExplanation>

- **Rating (Reference):** {r_rating}

### Candidate 
- **Explanation (Candidate):** 
<CExplanation>{c_explanation}</CExplanation>

- **Rating (Candidate):** {c_rating}

### Evaluation Objective
Determine if the **Candidate Explanation** aligns with the **Reference Explanation** or provides a logically sound alternative based solely on the provided Claim.

### Judgment Principles
- **Logical Path**: Prioritize the underlying premises and causal links over tone, vocabulary, or verbosity.
- **Rating Consistency**: The Candidate must reach the same conclusion or "rating" as the Reference.
- **Permissible Variance**: Mark as TRUE if the Candidate uses a different but valid logical approach to reach the same conclusion, provided it does not contradict known facts.

### Scoring Rubric
**Set "reasoning": true IF:**
- The core logical flow matches the Reference reasoning.
- OR the reasoning differs but is factually correct, logically consistent, and reaches the same final rating.
- All stated facts are accurate and no critical steps in the chain are missing.

**Set "reasoning": false IF:**
- The candidate introduces factual hallucinations (data or entities not in the claim/reference).
- The explanation contains logical fallacies or misses a critical constraint.
- The candidate suggests/implies a different fact-checking outcome than the Reference.

### Output Format
Return a JSON object with the following keys:
### Output (JSON)
{{
  "reasoning": true/false,
}}
""".strip()

t1_prompt = ChatPromptTemplate.from_template(score_task1_prompt)

score_task2_prompt = """
### Role
You are a Precision Auditor. Your task is to evaluate a generated response based on a set of provided source materials.

### Inputs
- Claim: {claim}
- Year: {year}

### Additional context the user provided
"{additional_context}"

### Reference
- **Details(Reference):** 
<RDetails>{r_details}</RDetails>

- **Alerts (Reference):** 
<RDetails>{r_alerts}</RDetails>

### Candidate 
- **Details (Candidate):**
<CDetails>{c_details}</CDetails>

- **Alerts (Candidate):**
<CDetails>{c_alerts}</CDetails>

### Evaluation Objective 1: Completeness
Assess if the **Candidate Data** includes the critical information, numbers, and alerts found in the **Reference Data**.

**Judgment Principles**:
  - Focus on essential facts: names, specific numbers, and methodological approach.
  - Check for Alignment: Ensure the most critical "Alerts" or "Flags" from the Reference are present.
  - **Leniency**: Do not penalize for minor differences in "Quantitative" vs "Qualitative" labeling, as these are often subjective.

### Evaluation Objective 2: Hallucination
Determine if the **Candidate Data** introduces information that is not supported by the provided sources.

**Judgment Principles**:
  - **No Hallucination (1)**: Every piece of evidence in the details or alerts exists strictly within the **Claim** or the **Additional Context**.
  - **Hallucination (0)**: The response includes "made up" facts, figures, or external knowledge not explicitly found in the provided context.

### Scoring Rubric

#### **1. Completeness Score (0.0 to 1.0)**
- **1.0**: All critical facts, numbers, and alerts from the Reference are present.
- **0.5**: Some key details are missing, but the core essence remains.
- **0.0**: Critical names, numbers, or alerts are entirely absent.

#### **2. Hallucination Free (Boolean)**
- **true**: Every claim in the Candidate is grounded in the Claim/Context.
- **false**: The Candidate introduces information from outside the provided text.

### Output (JSON)
{{
  "completeness": 0.0 to 1.0
  "hallucination": true/false
}}
""".strip()

t2_prompt = ChatPromptTemplate.from_template(score_task2_prompt)


In [None]:
from typing import List

def run_multi_model_evaluation(
    df: pd.DataFrame, 
    task1_chain, 
    task2_chain, 
    model_suffixes: List[str],
) -> pd.DataFrame:
    """
    Runs evaluation for multiple models and returns a NEW dataframe
    with the results.
    """
    # Copy of the dataframe
    out_df = df.copy()

    for suffix in model_suffixes:
        print(f"Evaluating model: {suffix}...")
        
        # Define dynamic column names based on the suffix
        c_expl_col = f"explanation_{suffix}"
        c_rating_col = f"checkable_{suffix}"
        c_details_col = f"details_{suffix}"
        c_alerts_col = f"alerts_{suffix}"
        c_confirmed_col = f"confirmed_{suffix}"
        
        # Reference columns (assuming these are constant/static)
        r_expl_col = "explanation"
        r_rating_col = "checkable"
        r_details_col = "details_text"
        r_alerts_col = "alerts"
        r_confirmed_col = "confirmed"

        # Primary source data
        claim_col = "claim"
        context_col = "translated"
        year_col = "year"

        def _evaluate_row(row):
            try:
                # Use the translated column as additional context when it holds non-empty text
                translated = row.get(context_col)
                additional_context = translated if isinstance(translated, str) and translated.strip() else ""

                # Reasoning Comparison
                t1_output = task1_chain.invoke({
                    "r_explanation": row[r_expl_col],
                    "c_explanation": row[c_expl_col],
                    "r_rating": row[r_rating_col],
                    "c_rating": row[c_rating_col],
                    "claim": row[claim_col],
                })

                # Completeness & Hallucination
                t2_output = task2_chain.invoke({
                    "r_details": row[r_details_col],
                    "c_details": row[c_details_col],
                    "r_alerts": row[r_alerts_col],
                    "c_alerts": row[c_alerts_col],
                    "claim": row[claim_col],
                    "year":row[year_col],
                    "additional_context": additional_context,
                })

                # Check confirmed columns
                intent = row[r_confirmed_col] == row[c_confirmed_col]

                return (
                    t1_output.reasoning,
                    t2_output.completeness,
                    t2_output.hallucination,
                    intent,
                )
            except (OutputParserException, ValueError, KeyError) as e:
                # Soft failure: report the error and continue with empty scores
                print(f"Row {row.name} failed for {suffix}: {e}")
                return (None, None, None, None)

        # Apply the evaluation
        temp_results = out_df.apply(_evaluate_row, axis=1)

        # Unpack results into new columns of out_df
        out_df[f"reason_{suffix}"] = temp_results.apply(lambda t: t[0])
        out_df[f"complete_{suffix}"] = temp_results.apply(lambda t: t[1])
        out_df[f"halluci_{suffix}"] = temp_results.apply(lambda t: t[2])
        out_df[f"intent_{suffix}"] = temp_results.apply(lambda t: t[3])

    return out_df
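
Note that the intent score is a plain equality check, so a row where the candidate model failed in part 1 (and `confirmed` is empty) is counted as a mismatch. If failed rows should instead be left out of the average, a comparison along the following lines could be used. This is only a sketch of an alternative, not what the notebook does; `intent_match` is a hypothetical helper and it assumes plain Python values.

```python
def intent_match(reference, candidate):
    """Compare two confirmation flags; return None when either is missing."""
    def to_bool(value):
        if isinstance(value, bool):
            return value
        if isinstance(value, str) and value.strip().lower() in {"true", "false"}:
            return value.strip().lower() == "true"
        return None  # None, NaN, or anything unparseable

    ref, cand = to_bool(reference), to_bool(candidate)
    if ref is None or cand is None:
        return None
    return ref == cand

print(intent_match(True, True))
print(intent_match(True, "False"))
print(intent_match(True, float("nan")))
```

Returning `None` keeps such rows out of a later `.mean()`, because pandas skips missing values when averaging.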

Finally, we run the evaluation and calculate all the scores.

In [239]:
# these models are being evaluated
# models_to_test = ["llama8", "llama70", "gptoss20", "gptoss120","qwen3_32"]
# models_to_test = ["llama1","llama3","mistral7","qwen3_1","qwen3_4"]
models_to_test = ["llama1"]

# Build the langchain chain
def build_task1_chain(llm):
    structured_llm = llm.with_structured_output(Task1Result, method="json_mode")
    return t1_prompt | structured_llm

# Build the langchain chain
def build_task2_chain(llm):
    structured_llm = llm.with_structured_output(Task2Result, method="json_mode")
    return t2_prompt | structured_llm

# Build chains for GPT
task1_chain = build_task1_chain(llmGPT5)
task2_chain = build_task2_chain(llmGPT5)

# Load the model output generated in part 1
original_df = pd.read_csv("eval_models_output.csv", encoding="utf-8")

scores_df = run_multi_model_evaluation(original_df, task1_chain, task2_chain, models_to_test)

scores_df.to_csv("scores_output-llama1.csv", index=False)

Evaluating model: llama1...


In [271]:
df3.info()
df3.to_csv("scores_output-ollama.csv", index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 59 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   url                   149 non-null    object 
 1   reason_llama1         149 non-null    bool   
 2   complete_llama1       149 non-null    float64
 3   halluci_llama1        149 non-null    bool   
 4   intent_llama1         149 non-null    bool   
 5   claim                 149 non-null    object 
 6   rating                148 non-null    object 
 7   translated            25 non-null     object 
 8   year                  148 non-null    float64
 9   checkable             149 non-null    object 
 10  explanation           149 non-null    object 
 11  details_text          149 non-null    object 
 12  alerts                149 non-null    object 
 13  question              149 non-null    object 
 14  user_answer           149 non-null    object 
 15  confirmed             1

In addition to the LLM-as-judge approach, BERTScore was used to measure the similarity between the reference and candidate texts.
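
BERTScore matches every token embedding of one text to its most similar token in the other text (cosine similarity). Precision averages these maxima over the candidate tokens, recall over the reference tokens, and F1 is their harmonic mean. With `rescale_with_baseline=True` each score `x` is mapped to `(x - b) / (1 - b)` for a baseline `b`, which is why rescaled values can be negative. The toy sketch below illustrates this with hand-made 2-d vectors and a made-up baseline; it leaves out the encoder and the optional idf weighting of the real library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rescale(score, baseline):
    """Linear baseline rescaling; scores below the baseline become negative."""
    return (score - baseline) / (1 - baseline)

# Hypothetical 2-d "token embeddings"; real ones come from the encoder model.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]

p, r, f1 = greedy_bertscore(candidate, reference)
print(p, r, f1)
print(rescale(p, baseline=0.6))  # baseline value is made up for illustration
```

Here the candidate contains one token with no counterpart in the reference, so precision drops while recall stays at its maximum.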

First for the Groq models

In [None]:
from bert_score import BERTScorer
scorer = BERTScorer(model_type="distilroberta-base",lang="en",rescale_with_baseline=True,device="cuda")

# Load the judge scores for the Groq models
metrics_df  = pd.read_csv("scores_output-groq.csv", encoding="utf-8")

# Models to evaluate
models_to_test = ["llama8", "llama70", "gptoss20", "gptoss120","qwen3_32"]

# Reference texts
r_explanation = metrics_df["explanation"].fillna("").astype(str).tolist()
r_details = metrics_df["details_text"].fillna("").astype(str).tolist()

for model in models_to_test:
    print(f"Computing BERTScore for {model}...")

    # Candidate texts
    c_explanation = metrics_df[f"explanation_{model}"].fillna("").astype(str).tolist()
    c_details = metrics_df[f"details_{model}"].fillna("").astype(str).tolist()

    # Explanation scores
    P_e, R_e, F1_e = scorer.score(c_explanation, r_explanation)

    # Details scores
    P_d, R_d, F1_d = scorer.score(c_details, r_details)

    # Store results
    metrics_df[f"P_explanation_{model}"] = P_e.cpu().numpy()
    metrics_df[f"R_explanation_{model}"] = R_e.cpu().numpy()
    metrics_df[f"F1_explanation_{model}"] = F1_e.cpu().numpy()

    metrics_df[f"P_details_{model}"] = P_d.cpu().numpy()
    metrics_df[f"R_details_{model}"] = R_d.cpu().numpy()
    metrics_df[f"F1_details_{model}"] = F1_d.cpu().numpy()

metrics_df.to_csv("metrics_output-groq.csv", index=False)

Computing BERTScore for llama8...
Computing BERTScore for llama70...
Computing BERTScore for gptoss20...
Computing BERTScore for gptoss120...
Computing BERTScore for qwen3_32...


In [237]:
rows = []

for model in models_to_test:
    rows.append({
        "model": model,
        "reasoning": metrics_df[f"reason_{model}"].mean(),
        "hallucination": metrics_df[f"halluci_{model}"].mean(),
        "completeness": metrics_df[f"complete_{model}"].mean(),
        "intent": metrics_df[f"intent_{model}"].mean(),

        "P_e": metrics_df[f"P_explanation_{model}"].mean(),
        "R_e": metrics_df[f"R_explanation_{model}"].mean(),
        "F1_e": metrics_df[f"F1_explanation_{model}"].mean(),

        "P_d": metrics_df[f"P_details_{model}"].mean(),
        "R_d": metrics_df[f"R_details_{model}"].mean(),
        "F1_d": metrics_df[f"F1_details_{model}"].mean(),
    })

summary_df = pd.DataFrame(rows).set_index("model")
summary_df

model,reasoning,hallucination,completeness,intent,P_e,R_e,F1_e,P_d,R_d,F1_d
llama8,0.885906,0.308725,0.667919,0.671141,0.562578,0.571306,0.566904,0.790965,0.651549,0.720314
llama70,0.872483,0.671141,0.789396,1.0,0.588661,0.608305,0.598483,0.737947,0.699739,0.718642
gptoss20,0.832215,0.503356,0.647315,1.0,0.678425,0.640113,0.659091,0.785242,0.673822,0.728718
gptoss120,0.791946,0.624161,0.895973,1.0,0.707879,0.674877,0.691228,0.815085,0.825006,0.819807
qwen3_32,0.838926,0.704698,0.77047,1.0,0.522467,0.608977,0.565265,0.732736,0.71806,0.725126


Next for the local Ollama models

In [None]:
from bert_score import BERTScorer
scorer = BERTScorer(model_type="distilroberta-base",lang="en",rescale_with_baseline=True,device="cuda")

# Load the judge scores for the Ollama models
metrics_df  = pd.read_csv("scores_output-ollama.csv", encoding="utf-8")

# Models to evaluate
models_to_test = ["llama1","llama3","mistral7","qwen3_1","qwen3_4"]

# Reference texts
r_explanation = metrics_df["explanation"].fillna("").astype(str).tolist()
r_details = metrics_df["details_text"].fillna("").astype(str).tolist()

for model in models_to_test:
    print(f"Computing BERTScore for {model}...")

    # Candidate texts
    c_explanation = metrics_df[f"explanation_{model}"].fillna("").astype(str).tolist()
    c_details = metrics_df[f"details_{model}"].fillna("").astype(str).tolist()

    # Explanation scores
    P_e, R_e, F1_e = scorer.score(c_explanation, r_explanation)

    # Details scores
    P_d, R_d, F1_d = scorer.score(c_details, r_details)

    # Store results
    metrics_df[f"P_explanation_{model}"] = P_e.cpu().numpy()
    metrics_df[f"R_explanation_{model}"] = R_e.cpu().numpy()
    metrics_df[f"F1_explanation_{model}"] = F1_e.cpu().numpy()

    metrics_df[f"P_details_{model}"] = P_d.cpu().numpy()
    metrics_df[f"R_details_{model}"] = R_d.cpu().numpy()
    metrics_df[f"F1_details_{model}"] = F1_d.cpu().numpy()

metrics_df.to_csv("metrics_output-ollama.csv", index=False)

Computing BERTScore for llama1...
Computing BERTScore for llama3...
Computing BERTScore for mistral7...
Computing BERTScore for qwen3_1...
Computing BERTScore for qwen3_4...


In [281]:
rows = []

for model in models_to_test:
    rows.append({
        "model": model,
        "reasoning": metrics_df[f"reason_{model}"].mean(),
        "hallucination": metrics_df[f"halluci_{model}"].mean(),
        "completeness": metrics_df[f"complete_{model}"].mean(),
        "intent": metrics_df[f"intent_{model}"].mean(),

        "P_e": metrics_df[f"P_explanation_{model}"].mean(),
        "R_e": metrics_df[f"R_explanation_{model}"].mean(),
        "F1_e": metrics_df[f"F1_explanation_{model}"].mean(),

        "P_d": metrics_df[f"P_details_{model}"].mean(),
        "R_d": metrics_df[f"R_details_{model}"].mean(),
        "F1_d": metrics_df[f"F1_details_{model}"].mean(),
    })

summary_df = pd.DataFrame(rows).set_index("model")
summary_df

model,reasoning,hallucination,completeness,intent,P_e,R_e,F1_e,P_d,R_d,F1_d
llama1,0.167785,0.134228,0.095973,0.449664,-0.018388,0.100064,0.040633,0.278482,0.189079,0.233333
llama3,0.771812,0.409396,0.263087,0.47651,0.366785,0.511736,0.438353,0.742092,0.593688,0.666777
mistral7,0.872483,0.09396,0.432886,0.838926,0.490872,0.548634,0.519646,0.728498,0.632122,0.679709
qwen3_1,0.718121,0.328859,0.455034,0.946309,0.185339,0.343355,0.263365,0.609139,0.354528,0.479303
qwen3_4,0.691275,0.657718,0.705168,1.0,0.389928,0.585888,0.486039,0.789125,0.680113,0.733916
