# Hallucination Detection with MLflow Faithfulness Metrics

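This notebook demonstrates how to detect hallucinations in LLM outputs with faithfulness metrics. The scores are computed with sentence-transformers models; integrating them with MLflow for monitoring is listed as a next step at the end. We'll walk through several test examples showing how faithful and hallucinated answers are scored differently.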

## What is Hallucination Detection?

Hallucination detection identifies when an LLM generates content that is:
- **Not supported by the provided context**
- **Factually incorrect** relative to the source
- **Fabricated or invented** information

### Faithfulness Score Interpretation
| Score Range | Interpretation | Risk Level |
|-------------|----------------|------------|
| >= 0.80 | Highly Faithful | Low - Safe |
| 0.60 to < 0.80 | Mostly Faithful | Low - Monitor |
| 0.40 to < 0.60 | Partially Faithful | Medium - Review |
| 0.20 to < 0.40 | Likely Hallucination | High - Flag |
| < 0.20 | Severe Hallucination | Critical - Block |
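
The bands in the table can be expressed as a small lookup. The following is a minimal sketch, not part of the detector defined later; the names `Band`, `BANDS`, and `interpret_score` are hypothetical, and it assumes each lower bound is inclusive.

```python
from typing import NamedTuple


class Band(NamedTuple):
    lower: float          # inclusive lower bound of the band
    interpretation: str
    risk: str


# Bands copied from the table above, ordered from highest to lowest.
BANDS = [
    Band(0.80, "Highly Faithful", "Low - Safe"),
    Band(0.60, "Mostly Faithful", "Low - Monitor"),
    Band(0.40, "Partially Faithful", "Medium - Review"),
    Band(0.20, "Likely Hallucination", "High - Flag"),
    Band(0.00, "Severe Hallucination", "Critical - Block"),
]


def interpret_score(score: float) -> Band:
    """Return the first band whose lower bound the score reaches."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    for band in BANDS:
        if score >= band.lower:
            return band
    return BANDS[-1]  # not reached for valid scores; kept for completeness


print(interpret_score(0.72).interpretation)  # Mostly Faithful
```

Keeping the bounds in one list makes it easier to adjust thresholds later without touching the branching logic.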


## 1. Setup and Installation


In [None]:
%pip install -q mlflow sentence-transformers pandas numpy scikit-learn


In [None]:
%restart_python


In [None]:
import numpy as np
import pandas as pd
import re
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Libraries loaded successfully!")


## 2. Faithfulness Metrics for Hallucination Detection

We'll use three approaches to detect hallucinations:
1. **Semantic Similarity**: Embedding-based comparison between answer and context
2. **NLI (Natural Language Inference)**: Checks if context entails the answer
3. **Token Overlap**: Simple baseline checking word overlap
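
To see why token overlap is only a baseline, here is a standalone sketch using just the standard library. The helper names and the tiny stop-word list are assumptions made for illustration, and returning 0.0 for an answer with no content tokens is a choice made in this sketch.

```python
import re

# A deliberately tiny stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}


def content_tokens(text: str) -> set:
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def token_overlap(answer: str, context: str) -> float:
    """Fraction of the answer's content tokens that also occur in the context."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0  # nothing to check; treated as unsupported in this sketch
    return len(answer_tokens & content_tokens(context)) / len(answer_tokens)


context = "Paris is the capital of France."
supported = "The capital of France is Paris."
contradicted = "Paris is not the capital of France."

print(token_overlap(supported, context))     # 1.0
print(token_overlap(contradicted, context))  # 0.75
```

The contradicted answer still scores 0.75 because negation changes only one token, which is why an entailment check is used alongside overlap.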


In [None]:
class HallucinationDetector:
    """
    Detects hallucinations by measuring faithfulness of generated answers to source context.
    Uses semantic similarity, NLI entailment, and token overlap approaches.
    """

    def __init__(self, 
                 embedding_model: str = "all-MiniLM-L6-v2",
                 nli_model: str = "cross-encoder/nli-deberta-v3-small"):
        """Initialize with embedding and NLI models."""
        print(f"Loading embedding model: {embedding_model}...")
        self.embedding_model = SentenceTransformer(embedding_model)
        
        print(f"Loading NLI model: {nli_model}...")
        self.nli_model = CrossEncoder(nli_model)
        
        print("✅ Hallucination detector ready!")

    def compute_semantic_faithfulness(self, answer: str, context: str) -> float:
        """Compute faithfulness using semantic similarity."""
        answer_embedding = self.embedding_model.encode([answer])[0]
        context_embedding = self.embedding_model.encode([context])[0]
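        similarity = cosine_similarity([answer_embedding], [context_embedding])[0][0]
        # Rescale cosine similarity from [-1, 1] to [0, 1]. Unrelated texts
        # (cosine near 0) therefore land near 0.5 rather than 0.
        return float(max(0.0, min(1.0, (similarity + 1) / 2)))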

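    def compute_nli_faithfulness(self, answer: str, context: str) -> float:
        """Compute faithfulness as the NLI entailment probability."""
        # The context is the premise and the answer is the hypothesis.
        scores = np.asarray(self.nli_model.predict([(context, answer)])[0], dtype=float)
        if scores.ndim == 0:
            # Single-score model: CrossEncoder applies a sigmoid by default.
            return float(scores)
        # Three-class NLI cross-encoders return raw logits, so apply a softmax.
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        # Label order on the model card: contradiction, entailment, neutral.
        return float(probs[1])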

    def compute_token_overlap(self, answer: str, context: str) -> float:
        """Compute simple token overlap faithfulness."""
        stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 
                      'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by', 'from',
                      'and', 'but', 'or', 'it', 'its', 'this', 'that'}
        
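        # Tokenize on word characters so trailing punctuation ("data.") does not block matches
        answer_tokens = set(re.findall(r"\w+", answer.lower())) - stop_words
        context_tokens = set(re.findall(r"\w+", context.lower())) - stop_words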
        
        if not answer_tokens:
            return 1.0
        
        overlap = answer_tokens.intersection(context_tokens)
        return len(overlap) / len(answer_tokens)

    def detect_hallucination(self, answer: str, context: str) -> Dict:
        """
        Detect if an answer is a hallucination.
        
        Returns:
            Dictionary with scores, combined score, and hallucination verdict
        """
        semantic = self.compute_semantic_faithfulness(answer, context)
        nli = self.compute_nli_faithfulness(answer, context)
        overlap = self.compute_token_overlap(answer, context)
        
        # Combined score: NLI weighted most (50%), then semantic (35%), then overlap (15%)
        combined = 0.5 * nli + 0.35 * semantic + 0.15 * overlap
        
        # Determine hallucination category
        if combined >= 0.8:
            category = "✅ Highly Faithful"
            is_hallucination = False
            risk = "Low"
        elif combined >= 0.6:
            category = "✓ Mostly Faithful"
            is_hallucination = False
            risk = "Low"
        elif combined >= 0.4:
            category = "⚠️ Partially Faithful"
            is_hallucination = True
            risk = "Medium"
        elif combined >= 0.2:
            category = "❌ Likely Hallucination"
            is_hallucination = True
            risk = "High"
        else:
            category = "🚫 Severe Hallucination"
            is_hallucination = True
            risk = "Critical"
        
        return {
            "semantic_score": semantic,
            "nli_score": nli,
            "overlap_score": overlap,
            "combined_score": combined,
            "category": category,
            "is_hallucination": is_hallucination,
            "risk_level": risk
        }

# Initialize the detector
detector = HallucinationDetector()
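
Two steps inside the detector are easy to check by hand: turning NLI outputs into an entailment probability, and the weighted combination. The sketch below uses only the standard library. The logits are made up, and the label order (contradiction, entailment, neutral) is an assumption that should be checked against the model card of whichever NLI model is loaded.

```python
import math

# Assumed label order for this sketch; verify it for the model you load.
LABELS = ("contradiction", "entailment", "neutral")


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_score(nli: float, semantic: float, overlap: float) -> float:
    """Weighted sum used by the detector: 50% NLI, 35% semantic, 15% overlap."""
    return 0.5 * nli + 0.35 * semantic + 0.15 * overlap


logits = [-2.0, 3.0, 0.5]  # made-up logits for one (context, answer) pair
probs = softmax(logits)
entailment = probs[LABELS.index("entailment")]

print(f"entailment probability: {entailment:.3f}")
print(f"combined score: {combined_score(entailment, 0.85, 0.70):.3f}")
```

One consequence of the weights is worth keeping in mind: because the semantic score is rescaled as `(similarity + 1) / 2`, an answer with zero entailment and zero overlap but a cosine similarity of 0 would still receive 0.35 * 0.5 = 0.175 from the semantic term alone.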


## 3. Test Examples: Detecting Hallucinations

Let's run through several examples showing how the faithfulness metrics detect hallucinations vs. faithful answers.


In [None]:
# Define test examples with context, faithful answer, and hallucinated answer
test_examples = [
    {
        "name": "Example 1: Machine Learning Definition",
        "context": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to identify patterns in data and make predictions.",
        "faithful_answer": "Machine learning is a subset of AI that allows computers to learn from data using algorithms to find patterns and make predictions.",
        "hallucinated_answer": "Machine learning was invented by Alan Turing in 1950 and requires quantum computers to function properly."
    },
    {
        "name": "Example 2: Climate Change Causes",
        "context": "Climate change is primarily driven by human activities, especially the burning of fossil fuels like coal, oil, and natural gas. These activities release greenhouse gases, particularly carbon dioxide, into the atmosphere.",
        "faithful_answer": "Climate change is mainly caused by burning fossil fuels which release greenhouse gases like CO2 into the atmosphere.",
        "hallucinated_answer": "Climate change is primarily caused by changes in Earth's orbit and volcanic activity, with human impact being minimal."
    },
    {
        "name": "Example 3: Capital of France",
        "context": "France is a country in Western Europe. Its capital city is Paris, which is also its largest city with a population of about 2.2 million. Paris is known for the Eiffel Tower.",
        "faithful_answer": "The capital of France is Paris, with about 2.2 million people, known for the Eiffel Tower.",
        "hallucinated_answer": "The capital of France is Lyon, which became the capital in 2010 after a national referendum."
    },
    {
        "name": "Example 4: DNA Structure",
        "context": "DNA (deoxyribonucleic acid) is a molecule that carries genetic instructions. It has a double helix structure discovered by Watson and Crick. DNA contains four bases: adenine, thymine, guanine, and cytosine.",
        "faithful_answer": "DNA is a molecule with a double helix structure containing four bases: adenine, thymine, guanine, and cytosine.",
        "hallucinated_answer": "DNA is a type of protein discovered in 1990. It has a single helix structure and contains only two bases."
    },
    {
        "name": "Example 5: Photosynthesis",
        "context": "Photosynthesis is the process by which plants convert light energy into chemical energy stored in glucose. Plants absorb carbon dioxide and water, using chlorophyll to capture light and produce glucose and oxygen.",
        "faithful_answer": "Plants use photosynthesis to convert light into glucose by absorbing CO2 and water with chlorophyll, producing oxygen.",
        "hallucinated_answer": "Photosynthesis is how plants breathe oxygen and release carbon dioxide at night through their roots."
    }
]

print(f"📋 Loaded {len(test_examples)} test examples for hallucination detection")


### 3.1 Running Hallucination Detection on All Examples


In [None]:
def print_detection_result(name: str, context: str, answer: str, result: Dict, answer_type: str):
    """Pretty print the hallucination detection result."""
    print(f"\n{'='*80}")
    print(f"📝 {name} - {answer_type}")
    print(f"{'='*80}")
    print(f"\n📄 Context: {context[:100]}...")
    print(f"\n💬 Answer: {answer}")
    print(f"\n📊 SCORES:")
    print(f"   • Semantic Similarity: {result['semantic_score']:.3f}")
    print(f"   • NLI Entailment:      {result['nli_score']:.3f}")
    print(f"   • Token Overlap:       {result['overlap_score']:.3f}")
    print(f"   • Combined Score:      {result['combined_score']:.3f}")
    print(f"\n🎯 VERDICT: {result['category']}")
    print(f"   • Is Hallucination: {'YES ❌' if result['is_hallucination'] else 'NO ✓'}")
    print(f"   • Risk Level: {result['risk_level']}")

# Run detection on all examples
print("🔍 HALLUCINATION DETECTION RESULTS")
print("="*80)

all_results = []

for example in test_examples:
    # Test faithful answer
    faithful_result = detector.detect_hallucination(
        example["faithful_answer"], 
        example["context"]
    )
    print_detection_result(
        example["name"], 
        example["context"], 
        example["faithful_answer"], 
        faithful_result, 
        "FAITHFUL ANSWER"
    )
    all_results.append({
        "example": example["name"],
        "type": "Faithful",
        "combined_score": faithful_result["combined_score"],
        "is_hallucination": faithful_result["is_hallucination"],
        "category": faithful_result["category"]
    })
    
    # Test hallucinated answer
    hallucinated_result = detector.detect_hallucination(
        example["hallucinated_answer"], 
        example["context"]
    )
    print_detection_result(
        example["name"], 
        example["context"], 
        example["hallucinated_answer"], 
        hallucinated_result, 
        "HALLUCINATED ANSWER"
    )
    all_results.append({
        "example": example["name"],
        "type": "Hallucinated",
        "combined_score": hallucinated_result["combined_score"],
        "is_hallucination": hallucinated_result["is_hallucination"],
        "category": hallucinated_result["category"]
    })


### 3.2 Summary Results Table


In [None]:
# Create summary DataFrame
results_df = pd.DataFrame(all_results)

print("\n" + "="*80)
print("📊 SUMMARY: HALLUCINATION DETECTION RESULTS")
print("="*80)

# Format the dataframe for display
display_df = results_df.copy()
display_df["combined_score"] = display_df["combined_score"].apply(lambda x: f"{x:.3f}")
display_df["is_hallucination"] = display_df["is_hallucination"].apply(lambda x: "❌ YES" if x else "✓ NO")
display_df.columns = ["Example", "Answer Type", "Score", "Hallucination?", "Category"]

print("\n")
print(display_df.to_string(index=False))

# Calculate accuracy
faithful_correct = sum(1 for r in all_results if r["type"] == "Faithful" and not r["is_hallucination"])
hallucinated_correct = sum(1 for r in all_results if r["type"] == "Hallucinated" and r["is_hallucination"])
total = len(test_examples)

print(f"\n{'='*80}")
print("📈 DETECTION ACCURACY")
print("="*80)
print(f"   • Faithful answers correctly identified:     {faithful_correct}/{total} ({faithful_correct/total*100:.0f}%)")
print(f"   • Hallucinations correctly detected:         {hallucinated_correct}/{total} ({hallucinated_correct/total*100:.0f}%)")
print(f"   • Overall accuracy:                          {(faithful_correct+hallucinated_correct)/(total*2)*100:.0f}%")
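
Accuracy alone hides which kind of error the detector makes. As a rough sketch, precision and recall for the "hallucination" class can be computed from records shaped like the entries of `all_results`. The function name `detection_metrics` and the sample records are made up for illustration and are not outputs of the detector.

```python
def detection_metrics(records):
    """Precision, recall, F1 and accuracy, treating 'Hallucinated' as the positive class."""
    tp = sum(1 for r in records if r["type"] == "Hallucinated" and r["is_hallucination"])
    fn = sum(1 for r in records if r["type"] == "Hallucinated" and not r["is_hallucination"])
    fp = sum(1 for r in records if r["type"] == "Faithful" and r["is_hallucination"])
    tn = sum(1 for r in records if r["type"] == "Faithful" and not r["is_hallucination"])

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(records) if records else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up records: three hallucinated answers (two flagged), three faithful (one flagged).
sample_records = [
    {"type": "Hallucinated", "is_hallucination": True},
    {"type": "Hallucinated", "is_hallucination": True},
    {"type": "Hallucinated", "is_hallucination": False},
    {"type": "Faithful", "is_hallucination": False},
    {"type": "Faithful", "is_hallucination": False},
    {"type": "Faithful", "is_hallucination": True},
]

print(detection_metrics(sample_records))
```

In this notebook the same function could be called as `detection_metrics(all_results)`, since each entry carries the `type` and `is_hallucination` keys.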


## 4. Interactive Testing: Try Your Own Examples


In [None]:
def test_hallucination(context: str, answer: str):
    """
    Test if an answer is a hallucination given a context.
    
    Usage:
        test_hallucination(
            context="The Eiffel Tower is located in Paris, France.",
            answer="The Eiffel Tower is in Paris."
        )
    """
    result = detector.detect_hallucination(answer, context)
    
    print("="*60)
    print("🔍 HALLUCINATION TEST")
    print("="*60)
    print(f"\n📄 Context:\n{context}")
    print(f"\n💬 Answer:\n{answer}")
    print(f"\n📊 SCORES:")
    print(f"   • Semantic: {result['semantic_score']:.3f}")
    print(f"   • NLI:      {result['nli_score']:.3f}")
    print(f"   • Overlap:  {result['overlap_score']:.3f}")
    print(f"   • Combined: {result['combined_score']:.3f}")
    print(f"\n🎯 RESULT: {result['category']}")
    print(f"   Risk Level: {result['risk_level']}")
    print("="*60)
    
    return result

# Example usage - test a faithful answer
print("TEST 1: Faithful Answer")
test_hallucination(
    context="The Great Wall of China is over 13,000 miles long and was built over many centuries to protect against invasions.",
    answer="The Great Wall of China is over 13,000 miles long and was built for protection against invasions."
)


In [None]:
# Example usage - test a hallucinated answer
print("TEST 2: Hallucinated Answer")
test_hallucination(
    context="The Great Wall of China is over 13,000 miles long and was built over many centuries to protect against invasions.",
    answer="The Great Wall of China is 500 miles long and was built in 1920 by the British Empire."
)


## 5. Score Comparison Visualization


In [None]:
# Calculate average scores by answer type
faithful_scores = [r["combined_score"] for r in all_results if r["type"] == "Faithful"]
hallucinated_scores = [r["combined_score"] for r in all_results if r["type"] == "Hallucinated"]

print("="*60)
print("📊 SCORE COMPARISON BY ANSWER TYPE")
print("="*60)

print("\n📗 FAITHFUL ANSWERS:")
print(f"   Average Score: {np.mean(faithful_scores):.3f}")
print(f"   Min Score:     {np.min(faithful_scores):.3f}")
print(f"   Max Score:     {np.max(faithful_scores):.3f}")

print("\n📕 HALLUCINATED ANSWERS:")
print(f"   Average Score: {np.mean(hallucinated_scores):.3f}")
print(f"   Min Score:     {np.min(hallucinated_scores):.3f}")
print(f"   Max Score:     {np.max(hallucinated_scores):.3f}")

print("\n📉 SCORE DIFFERENCE:")
score_diff = np.mean(faithful_scores) - np.mean(hallucinated_scores)
print(f"   Faithful vs Hallucinated Gap: {score_diff:.3f}")
print(f"   On these {len(faithful_scores)} example pairs, the mean scores differ by {score_diff:.3f} on a 0-1 scale.")

# Visual bar representation
print("\n📊 VISUAL COMPARISON:")
print(f"   Faithful:     {'█' * int(np.mean(faithful_scores) * 20):<20} {np.mean(faithful_scores):.2f}")
print(f"   Hallucinated: {'█' * int(np.mean(hallucinated_scores) * 20):<20} {np.mean(hallucinated_scores):.2f}")
print(f"   Threshold:    {'─' * 12}│{'─' * 7}  0.60 (is_hallucination cutoff)")
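
The cutoff between "faithful" and "hallucinated" does not have to be fixed in advance. With labelled scores, one option is to sweep candidate cutoffs and keep the one with the best balanced accuracy. This is a sketch under that assumption; `best_threshold` is a hypothetical helper and the scores are made up, not results from this notebook.

```python
def best_threshold(faithful, hallucinated, candidates):
    """Return (cutoff, balanced_accuracy) for the best candidate cutoff.

    An answer is flagged as a hallucination when its score is below the cutoff.
    Ties keep the earliest candidate.
    """
    best = None
    for cutoff in candidates:
        # Share of hallucinated answers that would be flagged.
        true_positive_rate = sum(s < cutoff for s in hallucinated) / len(hallucinated)
        # Share of faithful answers that would be let through.
        true_negative_rate = sum(s >= cutoff for s in faithful) / len(faithful)
        balanced = (true_positive_rate + true_negative_rate) / 2
        if best is None or balanced > best[1]:
            best = (cutoff, balanced)
    return best


# Made-up combined scores for illustration only.
faithful_demo = [0.82, 0.74, 0.66, 0.58]
hallucinated_demo = [0.45, 0.38, 0.31, 0.62]

cutoff, balanced = best_threshold(faithful_demo, hallucinated_demo, [0.3, 0.4, 0.5, 0.6, 0.7])
print(cutoff, balanced)  # 0.5 0.875
```

With only a handful of examples the chosen cutoff is unstable, so a sweep like this is more informative on a larger labelled set.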


## 6. Key Takeaways


In [None]:
print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                    KEY TAKEAWAYS: HALLUCINATION DETECTION                    ║
╚══════════════════════════════════════════════════════════════════════════════╝

✅ WHAT WE DEMONSTRATED:

1. MULTI-SIGNAL APPROACH
   • Semantic similarity alone is not enough
   • NLI (entailment) carries the largest weight (50%)
   • Token overlap serves as a simple baseline
   • Combined scoring is intended to improve robustness

2. SCORE SEPARATION
   • Faithful answers typically score > 0.60
   • Hallucinations typically score < 0.40
   • Validate this gap on your own data before relying on it

3. TYPES OF HALLUCINATIONS DETECTED:
   • Factual errors (wrong dates, numbers, names)
   • Contradictions (opposite of what context says)
   • Fabrications (invented information not in context)
   • Misattributions (assigning actions to wrong entities)

4. RECOMMENDED THRESHOLDS:
   • >= 0.60: Safe to use (faithful)
   • 0.40-0.60: Needs human review
   • 0.20-0.40: Flag as potential hallucination
   • < 0.20: Block from production

════════════════════════════════════════════════════════════════════════════════

🔗 NEXT STEPS:
   • Integrate with MLflow for production monitoring
   • Set up alerts for low-scoring responses
   • Use in RAG evaluation pipelines
   • Combine with relevance and answer quality metrics

════════════════════════════════════════════════════════════════════════════════
""")

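
As a starting point for the MLflow integration mentioned in the next steps, the sketch below aggregates scores into flat metric names and logs them to a run. The helper names, metric names, and run name are hypothetical. The MLflow import is done lazily inside `log_to_mlflow`, so the aggregation helper can be tried without MLflow installed.

```python
from statistics import mean


def summarize_scores(records):
    """Aggregate combined scores per answer type into flat metric names."""
    metrics = {}
    for answer_type in sorted({r["type"] for r in records}):
        scores = [r["combined_score"] for r in records if r["type"] == answer_type]
        key = answer_type.lower()
        metrics[f"{key}_mean_score"] = mean(scores)
        metrics[f"{key}_min_score"] = min(scores)
    return metrics


def log_to_mlflow(records, run_name="faithfulness-eval"):
    """Log the aggregated scores to an MLflow run (run name is a placeholder)."""
    import mlflow  # lazy import: only needed when actually logging

    with mlflow.start_run(run_name=run_name):
        mlflow.log_param("n_records", len(records))
        mlflow.log_metrics(summarize_scores(records))


# Made-up records shaped like the entries of `all_results`.
demo_records = [
    {"type": "Faithful", "combined_score": 0.8},
    {"type": "Faithful", "combined_score": 0.6},
    {"type": "Hallucinated", "combined_score": 0.3},
]
print(summarize_scores(demo_records))
```

In this notebook, `log_to_mlflow(all_results)` would record one run per evaluation, which makes it possible to compare detector settings across runs in the MLflow UI.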