<a href="https://colab.research.google.com/github/deepakgarg08/llm-diary/blob/main/llm_chronicles_6_7_hallucination_detection_faithfulness_gpt4_llamaindex_ragas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 - LLM Chronicles: Hallucination Detection with Answer Faithfulness

This notebook is part of the **LLM Chronicles** series ([https://llm-chronicles.com/](https://llm-chronicles.com/)) and accompanies the episode on **LLM Hallucinations**, available here: [#6.6: Hallucination Detection and Reduction for RAG systems (RAGAS, Lynx)](https://www.youtube.com/watch?v=xsDNArrmyuo).

In this notebook, we'll focus on detecting the faithfulness (or groundedness) of LLM responses to ensure they are rooted in the source material provided. This means determining whether an LLM's answer stays within the scope of the given context, neither introducing new information nor contradicting what the context states.

**Key topics**:

- Overview of faithfulness detection methods
- Testing LLM responses against various faithfulness frameworks
- Applying faithfulness evaluation to detect ungrounded responses

This notebook is designed as a practical companion to the YouTube episode, with **hands-on evaluation** of faithfulness controls.


# 2 - Setup

## 2.1 - Imports and Environment Variables

We begin by importing the necessary libraries and setting the OpenAI key environment variable.


In [None]:
%pip install --upgrade openai langchain langchain_openai llama-index ragas

In [None]:
import os
from openai import OpenAI
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

## 2.2 Samples
The following cells include two sample outputs from a Retrieval-Augmented Generation (RAG) pipeline, each containing:

- **User Question**
- **Retrieved Context**
- **LLM Answer**

One sample contains a hallucination, while the other does not. These will be used in testing faithfulness methods.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/6.7%20-%20Lab%20-%20Hallucinations/rag-pipeline.png)


In [None]:
sample_no_hallucination = {
  "question": "When was the first super bowl?",
  "llm_answer": "The first superbowl was held on Jan 15, 1967.",
  "context": "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
}

sample_with_hallucination = {
  "question": "When was the first super bowl?",
  "llm_answer": "The first superbowl was held on Jan 15, 1989.",
  "context": "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
}


# 3 - Faithfulness Methods

The next section covers various methods to evaluate faithfulness in LLM responses.


## 3.1 LLM-as-judge (gpt-4o-mini)
This section demonstrates how to use an LLM as a judge to detect faithfulness, using the Lynx prompt ([PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct-v1.1](https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct-v1.1)) with the `gpt-4o-mini` model.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/6.7%20-%20Lab%20-%20Hallucinations/llm-judge.png)

This setup allows you to experiment with different LLMs to see how accurately they assess faithfulness. We do not load the Lynx model directly due to high resource requirements (e.g., A100 GPU), but instructions for loading it are available on its Hugging Face page.

The next cell contains the full prompt format that guides the LLM in assessing faithfulness.


In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

PROMPT = """
Given the following QUESTION, DOCUMENT and ANSWER you must analyze the provided answer and determine whether it is faithful to the contents of the DOCUMENT. The ANSWER must not offer new information beyond the context provided in the DOCUMENT. The ANSWER also must not contradict information provided in the DOCUMENT. Output your final verdict by strictly following this format: "PASS" if the answer is faithful to the DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT. Show your reasoning.

--
QUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):
{question}

--
DOCUMENT:
{context}

--
ANSWER:
{answer}

--

Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE":
{{"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}}
"""

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

hallucination_check_prompt = PromptTemplate(
    template=PROMPT,
    input_variables=["question", "context", "answer"]
)

hallucination_check_chain = (
    hallucination_check_prompt
    | llm
    | StrOutputParser()
)

In [None]:
test_sample = sample_no_hallucination

hallucination_check_chain.invoke({
    "question" : test_sample['question'],
    "context" : test_sample['context'],
    "answer" : test_sample['llm_answer'],
})

'{"REASONING": ["The DOCUMENT states that the first AFL–NFL World Championship Game was played on January 15, 1967, which is commonly known as the first Super Bowl.", "The ANSWER correctly identifies the date of the first Super Bowl as January 15, 1967.", "The ANSWER does not introduce any new information or contradict the information provided in the DOCUMENT."], "SCORE": "PASS"}'

In [None]:
from langchain_core.output_parsers import JsonOutputParser

hallucination_check_chain = (
    hallucination_check_prompt
    | llm
    | JsonOutputParser()
)

result = hallucination_check_chain.invoke({
    "question" : test_sample['question'],
    "context" : test_sample['context'],
    "answer" : test_sample['llm_answer'],
})

result.get('SCORE')

'PASS'
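LLM judges do not always return clean JSON: the verdict may arrive wrapped in extra text or with different casing. The following is a minimal sketch of a stricter verdict parser, assuming the judge emits a JSON object with a `SCORE` key as requested in the prompt above. The helper name `parse_judge_verdict` is hypothetical and not part of LangChain.

```python
import json
import re


def parse_judge_verdict(raw_output: str) -> bool:
    """Return True for a PASS verdict and False for a FAIL verdict.

    Raises ValueError when no valid verdict can be recovered
    (json.JSONDecodeError is a subclass of ValueError).
    """
    # Keep only the outermost {...} span, in case the model adds text around it.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")

    payload = json.loads(match.group(0))

    # Normalize casing and whitespace before comparing.
    score = str(payload.get("SCORE", "")).strip().upper()
    if score not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected SCORE value: {score!r}")
    return score == "PASS"


# Example with a hypothetical raw judge response
raw = 'Here is my verdict:\n{"REASONING": ["date matches"], "SCORE": "pass"}'
print(parse_judge_verdict(raw))
```

Raising on malformed output, rather than silently treating it as a FAIL, keeps parsing errors separate from genuine hallucination verdicts when results are aggregated later.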

## 3.1.1 - Llama-3-Patronus-Lynx-8B-Instruct-v1.1

**SKIP UNLESS YOU ACTUALLY WANT TO RUN THIS!**

Paper: [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

Model card: https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct-v1.1

In [None]:
import torch
import transformers

model_name = 'PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct-v1.1'
# Uncomment to run
#pipe = transformers.pipeline(
#          "text-generation",
#          model=model_name,
#          max_new_tokens=600,
#          device="cuda",
#          return_full_text=False
#        )

messages = [
    {"role": "user",
     "content": hallucination_check_prompt.format(
          question = test_sample['question'],
          context = test_sample['context'],
          answer = test_sample['llm_answer']
     )},
]

#result = pipe(messages)
#print(result[0]['generated_text'])

## 3.2 Llama-Index Faithfulness

[Llama-Index](https://www.llamaindex.ai/) includes a faithfulness evaluation specifically for RAG pipelines, described here: [faithfulness evaluation](https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval/). This method uses only the context and the LLM answer (the user question is ignored) and prompts an LLM judge to verify whether the answer is supported by the context.

Implementation: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/faithfulness.py


In [None]:
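from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator
import nest_asyncio

# Allow nested event loops so async evaluators can run inside the notebook
nest_asyncio.apply()

# Create the judge LLM under its own name so it does not overwrite the
# LangChain `llm` or the `openai.OpenAI` import from earlier cells
llamaindex_llm = LlamaIndexOpenAI(model="gpt-4o-mini", temperature=0.0)

# Define the evaluator
llamaindex_evaluator = FaithfulnessEvaluator(llm=llamaindex_llm)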

In [None]:
test_sample = sample_no_hallucination

eval_result = llamaindex_evaluator.evaluate(
        response=test_sample['llm_answer'],
        contexts=[test_sample['context']]
)

eval_result

EvaluationResult(query=None, contexts=['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.'], response='The first superbowl was held on Jan 15, 1967.', passing=True, feedback='YES', score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)

## 3.3 RAGAS Faithfulness

In this approach, [RAGAS](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/faithfulness/) evaluates answer faithfulness in two steps:
1. The LLM extracts individual claims from the answer.
2. Each claim is then verified against the context to see if it’s supported by the context.

The faithfulness score is calculated as the ratio of supported claims to total claims, indicating the extent to which the answer is grounded in the context.
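The ratio described above can be sketched in a few lines. This is an illustration of the arithmetic only, not RAGAS's implementation: the function name `faithfulness_score` and the example verdicts are hypothetical, and raising on an empty claim list is an assumption made for this sketch.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of extracted claims that the context supports."""
    if not claim_supported:
        # Assumption for this sketch: an answer with no claims has no defined score.
        raise ValueError("no claims were extracted from the answer")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical per-claim verdicts: two claims supported, one unsupported
verdicts = [True, True, False]
score = faithfulness_score(verdicts)

# A threshold turns the continuous score into a pass/fail decision;
# 0.5 is the cutoff this notebook applies in its wrapper function.
is_faithful = score >= 0.5
print(round(score, 3), is_faithful)
```

Because the score is a ratio, a single unsupported claim lowers it more in a short answer than in a long one, which is worth keeping in mind when choosing the threshold.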

Implementation: https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/6.7%20-%20Lab%20-%20Hallucinations/ragas-faithfulness.png)


In [None]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [None]:
# Create the RAGAS faithfulness metric (the sample itself is built in the next cell)
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
faithfulness_metric = Faithfulness(llm=evaluator_llm)

In [None]:
test_sample = sample_no_hallucination

sample = SingleTurnSample(
  user_input= test_sample['question'],
  response= test_sample['llm_answer'],
  retrieved_contexts=[test_sample['context']]
)

score = await faithfulness_metric.single_turn_ascore(sample=sample)
print(score)

1.0


## 3.4 RAGAS with Vectara’s HHEM-2.1-Open
RAGAS can also integrate with Vectara's **HHEM-2.1-Open**, a T5 classifier model specifically trained for hallucination detection in LLM-generated text. This model is used in the second step of faithfulness evaluation to verify claims against the context, offering an efficient, open-source solution for real-time applications.


In [None]:
from ragas.metrics import FaithfulnesswithHHEM
faithfulness_hhem = FaithfulnesswithHHEM(llm=evaluator_llm)

score = await faithfulness_hhem.single_turn_ascore(sample=sample)
print(score)

You are using a model of type HHEMv2Config to instantiate a model of type HHEMv2. This is not supported for all configurations of models and can yield errors.


tensor(0.)


## 3.5 Azure Groundedness Checks

The [Azure Groundedness API](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-groundedness?tabs=python) evaluates whether LLM responses are grounded in the user-provided source materials. Although it isn’t explicitly documented, the backend is likely implemented with an LLM judge, similar to Lynx. The API returns an ungroundedness verdict and percentage and can optionally return reasoning; the request below sets `"reasoning": False`, so only the verdict and percentage are used here.


In [None]:
import http.client
import json

def az_check_groundedness(query, answer, context):
    key = userdata.get('AZURE_CONTENT_SAFETY_KEY')
    endpoint = userdata.get('AZURE_CONTENT_SAFETY_ENDPOINT')

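    # HTTPSConnection expects a bare hostname, so the stored endpoint
    # must not include the "https://" prefix or a trailing path
    conn = http.client.HTTPSConnection(endpoint)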

    payload = json.dumps({
        "domain": "Generic",
        "task": "QnA",
        "qna": {
            "query": query
        },
        "text": answer,
        "groundingSources": [context],
        "reasoning": False
    })

    headers = {
        'Ocp-Apim-Subscription-Key': key,
        'Content-Type': 'application/json'
    }

    conn.request("POST", "/contentsafety/text:detectGroundedness?api-version=2024-09-15-preview", payload, headers)
    res = conn.getresponse()
    data = res.read()

    response_json = json.loads(data.decode("utf-8"))

    # Extracting only the required fields
    result = {
        "ungroundedDetected": response_json.get("ungroundedDetected"),
        "ungroundedPercentage": response_json.get("ungroundedPercentage")
    }

    return result

In [None]:
test_sample = sample_no_hallucination

az_check_groundedness(
    query=test_sample['question'],
    answer=test_sample['llm_answer'],
    context=test_sample['context']
)

{'ungroundedDetected': False, 'ungroundedPercentage': 0}

# 4 - Evaluation with HaluBench

[HaluBench](https://huggingface.co/datasets/PatronusAI/HaluBench) is a benchmark for evaluating hallucinations, containing roughly 15,000 context-question-answer triplets annotated for hallucination presence (the test split loaded below has 14,900 rows). It includes real-world contexts from domains like finance and medicine. Data is sourced from existing QA datasets such as FinanceBench, PubmedQA, CovidQA, HaluEval, DROP, and RAGTruth.


In [None]:
import pandas as pd

df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")

In [None]:
len(df)

14900

In [None]:
pd.set_option('display.max_colwidth', None)

df.iloc[1000]

Unnamed: 0,1000
id,25985014
passage,"We investigated the role of surgical ablation targeting the autonomous nervous system during a Cox-Maze IV procedure in the maintenance of sinus rhythm at long-term follow-up.\nThe patient population consisted of 519 subjects with persistent or long-standing persistent atrial fibrillation (AF) undergoing radiofrequency Maze IV during open heart surgery between January 2006 and July 2013 at three institutions without (Group 1) or with (Group 2) ganglionated plexi (GP) ablation. Recurrence of atrial fibrillation off-antiarrhythmic drugs was the primary outcome. Predictors of AF recurrence were evaluated by means of competing risk regression. Median follow-up was 36.7 months.\nThe percentage of patients in normal sinus rhythm (NSR) off-antiarrhythmic drugs did not differ between groups (Group 1-75.5%, Group 2-67.8%, p = 0.08). Duration of AF ≥ 38 months (p = 0.01), left atrial diameter ≥ 54 mm (0.001), left atrial area ≥ 33 cm(2) (p = 0.005), absence of connecting lesions (p= 0.04), and absence of right atrial ablation (p<0.001) were independently associated with high incidence of AF recurrence. In contrast the absence of GP ablation was not a significant factor (p = 0.12)."
question,Is ganglionated plexi ablation during Maze IV procedure beneficial for postoperative long-term stable sinus rhythm?
answer,No. GP ablation did not prove to be beneficial for postoperative stable NSR. A complete left atrial lesion set and biatrial ablation are advisable for improving rhythm outcomes. Randomized controlled trials are necessary to confirm our findings.
label,PASS
source_ds,pubmedQA


## 4.1 - HaluBenchMini

In this notebook, we create a **HaluBenchMini** subset by sampling 100 examples from HaluBench (50 labeled PASS and 50 labeled FAIL). We'll use this subset to test various faithfulness checks, including:

- GPT-4o-mini judge
- Llama-Index faithfulness check
- RAGAS faithfulness
- Azure Groundedness checks

In [None]:
# Filter the dataset by label
pass_items = df[df['label'] == 'PASS']
fail_items = df[df['label'] == 'FAIL']

# Sample 50 items from each group
pass_sample = pass_items.sample(n=50, random_state=42)
fail_sample = fail_items.sample(n=50, random_state=42)

# Small dataset
halu_bench_mini = pd.concat([pass_sample, fail_sample]).reset_index(drop=True)
len(halu_bench_mini)

100
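With only 100 examples, any accuracy measured on this subset is a rough estimate. As a hedged illustration, the sketch below computes a 95% Wilson score interval for a proportion; the function name `wilson_interval` and the example count of 80 correct predictions are hypothetical and not taken from the notebook's results.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")

    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 80 correct verdicts out of 100 samples
low, high = wilson_interval(correct=80, total=100)
print(f"accuracy 80.0%, 95% interval: {low:.1%} to {high:.1%}")
```

At this sample size the interval spans well over ten percentage points, so small accuracy gaps between evaluators should be read with caution.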

## 4.2 - Wrapper Evaluation Function

In [None]:
import json
import asyncio

def check_hallucination(evaluator_name, question, context, llm_answer):
    """
    Checks if the answer is faithful to the provided context using the specified evaluator.

    Parameters:
    - evaluator_name: str, the name of the evaluator to use
    - question: str, the question or prompt provided
    - context: str, the context passage used to verify faithfulness
    - llm_answer: str, the answer generated by the LLM

    Returns:
    - bool: True if the answer is faithful, False if it is a hallucination
    """

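    def run_async_func(async_func, *args, **kwargs):
        """Helper function to run an async function synchronously.

        Inside a notebook this relies on the earlier `nest_asyncio.apply()` call,
        because an event loop is already running.
        """
        return asyncio.run(async_func(*args, **kwargs))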

    if evaluator_name == 'llm-judge-gpt4o-mini':
        # Invoke hallucination_check_chain and parse JSON response
        eval_result = hallucination_check_chain.invoke({
            "question": question,
            "context": context,
            "answer": llm_answer
        })
        return eval_result.get("SCORE") == "PASS"

    elif evaluator_name == 'llama-index-faithfulness-gpt4o-mini':
        eval_result = llamaindex_evaluator.evaluate(
            response=llm_answer,
            contexts=[context]
        )
        return eval_result.passing

    elif evaluator_name == 'ragas-faithfulness-gpt4o-mini':
        sample = SingleTurnSample(
            user_input=question,
            response=llm_answer + ".",
            retrieved_contexts=[context]
        )
        score = run_async_func(faithfulness_metric.single_turn_ascore, sample=sample)
        return score >= 0.5

    elif evaluator_name == 'azure-groundedness':
        eval_result = az_check_groundedness(
            query=question,
            answer=llm_answer,
            context=context
        )
        return not eval_result['ungroundedDetected']

    else:
        raise ValueError("Unknown evaluator")


In [None]:
# List of evaluators to test
evaluators = [
    'llm-judge-gpt4o-mini',
    'llama-index-faithfulness-gpt4o-mini',
    'ragas-faithfulness-gpt4o-mini',
    'azure-groundedness'
]

# Loop through each evaluator, check hallucination, and print the result
for evaluator_name in evaluators:
    try:
        is_faithful = check_hallucination(
            evaluator_name,
            sample_no_hallucination['question'],
            sample_no_hallucination['context'],
            sample_no_hallucination['llm_answer']
        )
        print(f"Evaluator: {evaluator_name} - Faithful: {is_faithful}")
    except Exception as e:
        print(f"Evaluator: {evaluator_name} - Error: {e}")

Evaluator: llm-judge-gpt4o-mini - Faithful: True
Evaluator: llama-index-faithfulness-gpt4o-mini - Faithful: True
Evaluator: ragas-faithfulness-gpt4o-mini - Faithful: True
Evaluator: azure-groundedness - Faithful: True


## 4.3 - Evaluation loop

In [None]:
from tqdm import tqdm

def evaluate(dataset, evaluator_name):
    """
    Evaluates a dataset using a specified evaluator and returns a DataFrame of results.

    Parameters:
    - dataset: DataFrame, the dataset to evaluate
    - evaluator_name: str, the name of the evaluator to use

    Returns:
    - DataFrame: A DataFrame with the evaluation results
    """
    # Initialize a list to store results
    results = []

    # Loop through each row in the dataset with tqdm progress bar
    for i in tqdm(range(len(dataset)), desc=f"Evaluating with {evaluator_name}"):
        row = dataset.iloc[i]

        # Use check_hallucination to determine faithfulness
        is_faithful = check_hallucination(
            evaluator_name,
            row['question'],
            row['passage'],
            row['answer']
        )

        # Check if faithfulness matches the dataset's `label`
        is_correct = (is_faithful and row['label'] == 'PASS') or \
                     (not is_faithful and row['label'] == 'FAIL')

        # Append result to the list
        results.append({
            'evaluator': evaluator_name,
            'faithful': is_faithful,
            'label': row['label'],
            'is_correct': is_correct
        })

    # Convert results to a DataFrame for easy viewing
    results_df = pd.DataFrame(results)

    return results_df

In [None]:
# Dictionary to store results DataFrames for each evaluator
results_dict = {}

# Loop through each evaluator, evaluate the dataset, and store the results
for evaluator_name in evaluators:
    print(f"Evaluating with {evaluator_name}...")
    results_df = evaluate(halu_bench_mini, evaluator_name)
    results_dict[evaluator_name] = results_df  # Store the result in the dictionary

    # Display the first few rows of the result for this evaluator
    print(f"Results for {evaluator_name}:")
    print(results_df.head())
    print("\n" + "="*50 + "\n")

# Optionally, combine all results into a single DataFrame for easier comparison
all_results_df = pd.concat(
    [evaluator_df.assign(evaluator=name) for name, evaluator_df in results_dict.items()],
    ignore_index=True
)

# Display the combined results
print("Combined results across all evaluators:")
print(all_results_df.head())

Evaluating with llm-judge-gpt4o-mini...


Evaluating with llm-judge-gpt4o-mini: 100%|██████████| 100/100 [02:53<00:00,  1.74s/it]


Results for llm-judge-gpt4o-mini:
              evaluator  faithful label  is_correct
0  llm-judge-gpt4o-mini     False  PASS       False
1  llm-judge-gpt4o-mini      True  PASS        True
2  llm-judge-gpt4o-mini      True  PASS        True
3  llm-judge-gpt4o-mini      True  PASS        True
4  llm-judge-gpt4o-mini      True  PASS        True


Evaluating with llama-index-faithfulness-gpt4o-mini...


Evaluating with llama-index-faithfulness-gpt4o-mini: 100%|██████████| 100/100 [00:36<00:00,  2.76it/s]


Results for llama-index-faithfulness-gpt4o-mini:
                             evaluator  faithful label  is_correct
0  llama-index-faithfulness-gpt4o-mini     False  PASS       False
1  llama-index-faithfulness-gpt4o-mini      True  PASS        True
2  llama-index-faithfulness-gpt4o-mini      True  PASS        True
3  llama-index-faithfulness-gpt4o-mini      True  PASS        True
4  llama-index-faithfulness-gpt4o-mini      True  PASS        True


Evaluating with ragas-faithfulness-gpt4o-mini...


Evaluating with ragas-faithfulness-gpt4o-mini: 100%|██████████| 100/100 [06:38<00:00,  3.98s/it]


Results for ragas-faithfulness-gpt4o-mini:
                       evaluator  faithful label  is_correct
0  ragas-faithfulness-gpt4o-mini     False  PASS       False
1  ragas-faithfulness-gpt4o-mini      True  PASS        True
2  ragas-faithfulness-gpt4o-mini     False  PASS       False
3  ragas-faithfulness-gpt4o-mini     False  PASS       False
4  ragas-faithfulness-gpt4o-mini      True  PASS        True


Evaluating with azure-groundedness...


Evaluating with azure-groundedness: 100%|██████████| 100/100 [03:25<00:00,  2.05s/it]

Results for azure-groundedness:
            evaluator  faithful label  is_correct
0  azure-groundedness     False  PASS       False
1  azure-groundedness      True  PASS        True
2  azure-groundedness      True  PASS        True
3  azure-groundedness     False  PASS       False
4  azure-groundedness      True  PASS        True


Combined results across all evaluators:
              evaluator  faithful label  is_correct
0  llm-judge-gpt4o-mini     False  PASS       False
1  llm-judge-gpt4o-mini      True  PASS        True
2  llm-judge-gpt4o-mini      True  PASS        True
3  llm-judge-gpt4o-mini      True  PASS        True
4  llm-judge-gpt4o-mini      True  PASS        True





## 4.4 - Results

In [None]:
# Initialize a dictionary to store the accuracy (as percentage) for each evaluator
accuracy_results = {}

# Calculate accuracy for each evaluator based on the `is_correct` column
for evaluator_name, results_df in results_dict.items():
    # Calculate accuracy as the mean of `is_correct` column, convert to percentage, and round to 2 decimal points
    accuracy = round(results_df['is_correct'].mean() * 100, 2)
    accuracy_results[evaluator_name] = accuracy

# Convert accuracy results to a DataFrame for easy viewing
accuracy_df = pd.DataFrame(list(accuracy_results.items()), columns=['Evaluator', 'Accuracy (%)'])

# Display the accuracy table
print("Accuracy of each evaluator on the halu_bench_mini dataset:")
print(accuracy_df)


Accuracy of each evaluator on the halu_bench_mini dataset:
                             Evaluator  Accuracy (%)
0                 llm-judge-gpt4o-mini          84.0
1  llama-index-faithfulness-gpt4o-mini          76.0
2        ragas-faithfulness-gpt4o-mini          63.0
3                   azure-groundedness          70.0
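Accuracy alone does not show whether an evaluator errs by missing hallucinations or by flagging faithful answers. The sketch below derives precision, recall, and F1 for the hallucination (`FAIL`) class from `(faithful, label)` pairs like those stored in the results DataFrames. The function name and the toy rows are hypothetical; it is written in plain Python so it runs without the notebook state.

```python
from typing import Dict, Iterable, Tuple


def hallucination_detection_metrics(rows: Iterable[Tuple[bool, str]]) -> Dict[str, float]:
    """Compute metrics with 'hallucination' (label FAIL) as the positive class.

    Each row is (faithful, label): the evaluator's verdict and the gold label.
    """
    tp = fp = fn = tn = 0
    for faithful, label in rows:
        predicted_hallucination = not faithful
        actual_hallucination = label == "FAIL"
        if predicted_hallucination and actual_hallucination:
            tp += 1  # hallucination correctly flagged
        elif predicted_hallucination:
            fp += 1  # faithful answer wrongly flagged
        elif actual_hallucination:
            fn += 1  # hallucination missed
        else:
            tn += 1  # faithful answer correctly passed

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy rows for illustration only (not results from the notebook)
toy_rows = [(True, "PASS"), (False, "PASS"), (False, "FAIL"), (True, "FAIL"), (False, "FAIL")]
print(hallucination_detection_metrics(toy_rows))
```

In the notebook, the same function could be fed `zip(results_df['faithful'], results_df['label'])` for each entry in `results_dict`.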
