

# Faithfulness Metric Implementation from Scratch

This notebook demonstrates how to implement faithfulness metrics for evaluating RAG (Retrieval-Augmented Generation) systems **without using MLflow's built-in faithfulness metric**.

## What is Faithfulness?

Faithfulness measures whether a generated response is grounded in the provided context, without introducing hallucinations or fabricated information.

### Implementation Approaches in this Notebook:

1. **Keyword-Based Faithfulness**: Measures overlap between context and response keywords
2. **NLI-Based Faithfulness**: Uses Natural Language Inference to check entailment
3. **Semantic Similarity Faithfulness**: Uses sentence embeddings to measure semantic alignment
4. **Ensemble Faithfulness**: Combines the three scores above into a weighted average

### Faithfulness Score Range:
- 0.0 to 1.0 (or scaled to 1-5)
- Higher scores indicate better grounding in context
- A score of 1.0 means fully faithful with no hallucinations
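
The 1-5 scale used throughout this notebook is a linear rescaling of the 0-1 score (`1 + score * 4`). A minimal sketch of that conversion is shown here; the helper name `to_1_5_scale` is hypothetical and the clipping step is an added safeguard.

```python
def to_1_5_scale(score: float) -> float:
    """Map a faithfulness score in [0, 1] linearly onto [1, 5]."""
    # Clip first so that out-of-range inputs cannot leave the 1-5 band.
    clipped = max(0.0, min(1.0, score))
    return round(1 + clipped * 4, 2)


for raw in (0.0, 0.5, 1.0):
    print(raw, "->", to_1_5_scale(raw))
```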


## 1. Installation and Setup


In [None]:
# Install required packages
!pip install -q transformers sentence-transformers torch pandas numpy matplotlib seaborn plotly nltk scikit-learn


In [None]:
import pandas as pd
import numpy as np
import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Transformers for NLI and embeddings
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print("✅ All imports successful!")


## 2. Sample Evaluation Data

Let's create diverse examples with varying levels of faithfulness to test our implementations.


In [None]:
# Create diverse RAG examples with varying faithfulness levels
evaluation_data = [
    {
        "id": 1,
        "question": "What is the main function of mitochondria?",
        "context": "Mitochondria are membrane-bound organelles found in the cytoplasm of eukaryotic cells. They are often referred to as the 'powerhouse of the cell' because they generate most of the cell's supply of adenosine triphosphate (ATP), used as a source of chemical energy.",
        "faithful_response": "Mitochondria are the powerhouse of the cell, responsible for generating most of the cell's ATP, which serves as chemical energy.",
        "partially_faithful": "Mitochondria generate ATP and are found in most animal and plant cells. They have a double membrane structure with their own DNA.",
        "unfaithful_response": "Mitochondria are responsible for protein synthesis and DNA replication in cells. They were discovered in 1950."
    },
    {
        "id": 2,
        "question": "When was the Eiffel Tower built?",
        "context": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. It was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair. Gustave Eiffel's company designed and built the tower.",
        "faithful_response": "The Eiffel Tower was constructed between 1887 and 1889. It was built as the centerpiece of the 1889 World's Fair and was designed by Gustave Eiffel's company.",
        "partially_faithful": "The Eiffel Tower was built in Paris around 1889. It stands 324 meters tall and attracts millions of visitors annually.",
        "unfaithful_response": "The Eiffel Tower was built in 1920 by the French government to celebrate the end of World War I. It is 500 meters tall."
    },
    {
        "id": 3,
        "question": "How does photosynthesis work?",
        "context": "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy. This chemical energy is stored in carbohydrate molecules synthesized from carbon dioxide and water. Chlorophyll, the green pigment in plants, absorbs light for this process.",
        "faithful_response": "Photosynthesis converts light energy into chemical energy stored in carbohydrates. Plants use carbon dioxide and water, with chlorophyll absorbing the light needed for the process.",
        "partially_faithful": "Photosynthesis is how plants make food using sunlight. It occurs primarily in the leaves and produces glucose and oxygen as byproducts.",
        "unfaithful_response": "Photosynthesis is the process where plants absorb nutrients from the soil and convert them into oxygen. It happens mainly at night when temperatures are cooler."
    },
    {
        "id": 4,
        "question": "What causes earthquakes?",
        "context": "Earthquakes are caused by the sudden release of energy in the Earth's lithosphere that creates seismic waves. Most earthquakes are caused by geological faults, but they can also be caused by volcanic activity, landslides, or human activities like mining.",
        "faithful_response": "Earthquakes occur due to the sudden release of energy in Earth's lithosphere, creating seismic waves. They are primarily caused by geological faults, though volcanic activity, landslides, and human activities can also trigger them.",
        "partially_faithful": "Earthquakes happen when tectonic plates shift and release energy. The Richter scale measures their intensity, and they can cause tsunamis when occurring underwater.",
        "unfaithful_response": "Earthquakes are caused by changes in atmospheric pressure and solar activity. They typically occur every full moon and can be predicted weeks in advance."
    },
    {
        "id": 5,
        "question": "What is machine learning?",
        "context": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that can access data and use it to learn for themselves.",
        "faithful_response": "Machine learning is a branch of AI that allows systems to learn from experience without explicit programming. It develops algorithms that can access and learn from data autonomously.",
        "partially_faithful": "Machine learning is an AI technology that learns patterns from data. Popular frameworks include TensorFlow and PyTorch, and it powers applications like image recognition.",
        "unfaithful_response": "Machine learning is a type of database management system that stores and retrieves information using complex queries. It was invented in 2010."
    }
]

# Convert to DataFrame
df = pd.DataFrame(evaluation_data)

print(f"📊 Created {len(df)} evaluation examples")
print("\nExample structure:")
print(df[['id', 'question']].to_string(index=False))


## 3. Approach 1: Keyword-Based Faithfulness

This approach measures faithfulness based on keyword overlap between the context and response. While simple, it provides a fast baseline metric.
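
Later cells expect a `KeywordFaithfulness` class, a `keyword_scorer` instance, and result keys named `score`, `novel_keywords`, and `novel_entities`. A minimal sketch that satisfies that interface might look like the following. The stopword list, the regular expressions, and the entity penalty are illustrative assumptions, not the notebook's original implementation.

```python
import re
from typing import Dict, Set

# Illustrative stopword list (assumption); NLTK's English list would also work.
STOPWORDS: Set[str] = {
    "the", "and", "are", "was", "were", "for", "from", "that", "this", "with",
    "they", "their", "them", "its", "has", "have", "had", "but", "not", "can",
    "also", "most", "which", "who", "when", "where", "how", "what", "been",
    "being", "into", "than", "then", "between", "used", "use", "all", "any",
}


class KeywordFaithfulness:
    """Keyword-overlap faithfulness scorer (illustrative sketch)."""

    def __init__(self, entity_penalty: float = 0.1, max_penalty: float = 0.5):
        # Both penalty values are arbitrary defaults chosen for illustration.
        self.entity_penalty = entity_penalty
        self.max_penalty = max_penalty

    def extract_keywords(self, text: str) -> Set[str]:
        """Lowercased alphabetic tokens, minus stopwords and very short words."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}

    def extract_entities(self, text: str) -> Set[str]:
        """Numbers plus capitalized words that do not start a sentence (crude)."""
        numbers = re.findall(r"\d+(?:\.\d+)?", text)
        proper = re.findall(r"(?<=[a-z,;] )[A-Z][a-zA-Z]+", text)
        return set(numbers) | set(proper)

    def score(self, context: str, response: str) -> Dict:
        context_kw = self.extract_keywords(context)
        response_kw = self.extract_keywords(response)
        if not response_kw:
            # Nothing to judge: return a neutral score, as the other scorers do.
            return {"score": 0.5, "score_1_5": 3.0, "keyword_overlap": 0.5,
                    "novel_keywords": [], "novel_entities": []}

        # Fraction of response keywords that also appear in the context.
        overlap = len(response_kw & context_kw) / len(response_kw)
        novel_entities = sorted(
            self.extract_entities(response) - self.extract_entities(context)
        )
        # Each unsupported number or name lowers the score, up to a cap.
        penalty = min(self.max_penalty, self.entity_penalty * len(novel_entities))
        final_score = max(0.0, min(1.0, overlap * (1 - penalty)))
        return {
            "score": round(final_score, 4),
            "score_1_5": round(1 + final_score * 4, 2),
            "keyword_overlap": round(overlap, 4),
            "novel_keywords": sorted(response_kw - context_kw),
            "novel_entities": novel_entities,
        }


keyword_scorer = KeywordFaithfulness()
context = "The Eiffel Tower was constructed from 1887 to 1889 in Paris."
for response in ["The Eiffel Tower was constructed in Paris.",
                 "The tower was built in 1920."]:
    result = keyword_scorer.score(context, response)
    print(f"{result['score']:.2f}  novel={result['novel_keywords']}  "
          f"entities={result['novel_entities']}")
```

Scoring overlap relative to the response keywords (rather than the context keywords) means a short but fully grounded answer is not penalized for leaving parts of the context unused.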


## 4. Approach 2: NLI-Based Faithfulness

Natural Language Inference (NLI) models can determine if a hypothesis is entailed by a premise. We use this to check if response sentences are entailed by the context.
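
Before loading a model, it may help to see how per-sentence NLI probabilities turn into a single score. The sketch below applies the same combination rule as the scorer in this section (average entailment, discounted by 1.5 times the average contradiction, then clipped to 0-1). The function name `aggregate_nli` and the probabilities are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def aggregate_nli(sentence_probs: List[Dict[str, float]],
                  contradiction_weight: float = 1.5) -> float:
    """Combine per-sentence NLI probabilities into one faithfulness score."""
    avg_entailment = mean(p["entailment"] for p in sentence_probs)
    avg_contradiction = mean(p["contradiction"] for p in sentence_probs)
    # Entailment raises the score; contradiction discounts it more than linearly.
    raw = avg_entailment * (1 - contradiction_weight * avg_contradiction)
    return max(0.0, min(1.0, raw))


# Hypothetical probabilities for a two-sentence response:
# the first sentence is well supported, the second is partly contradicted.
probs = [
    {"entailment": 0.9, "neutral": 0.1, "contradiction": 0.0},
    {"entailment": 0.1, "neutral": 0.5, "contradiction": 0.4},
]
print(round(aggregate_nli(probs), 3))
```

With these inputs the average entailment is 0.5 and the average contradiction is 0.2, so the score is 0.5 × (1 − 0.3) = 0.35.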


In [None]:
class NLIFaithfulness:
    """
    NLI-based faithfulness scorer using Natural Language Inference.
    
    Checks if each sentence in the response is entailed by the context.
    Uses a pre-trained NLI model (DeBERTa or similar).
    """
    
    def __init__(self, model_name: str = "microsoft/deberta-base-mnli"):
        """
        Initialize with an NLI model.
        
        Args:
            model_name: HuggingFace model name for NLI
        """
        print(f"Loading NLI model: {model_name}...")
        self.nli_pipeline = pipeline(
            "text-classification", 
            model=model_name,
            top_k=None  # Return all scores
        )
        print("✅ NLI model loaded!")
        
        # Label mappings (model-specific)
        self.entailment_label = "ENTAILMENT"
        self.contradiction_label = "CONTRADICTION"
        self.neutral_label = "NEUTRAL"
    
    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences."""
        sentences = sent_tokenize(text)
        # Filter out very short sentences
        return [s.strip() for s in sentences if len(s.strip()) > 10]
    
    def check_entailment(self, premise: str, hypothesis: str) -> Dict:
        """
        Check if premise entails hypothesis.
        
        Returns:
            Dict with entailment probabilities
        """
        # NLI input format: premise [SEP] hypothesis
        input_text = f"{premise} [SEP] {hypothesis}"
        
        try:
            result = self.nli_pipeline(input_text)
            
            # Parse results into a dict
            scores = {item['label']: item['score'] for item in result[0]}
            
            return {
                'entailment': scores.get(self.entailment_label, scores.get('entailment', 0)),
                'contradiction': scores.get(self.contradiction_label, scores.get('contradiction', 0)),
                'neutral': scores.get(self.neutral_label, scores.get('neutral', 0))
            }
        except Exception as e:
            print(f"Warning: NLI error - {e}")
            return {'entailment': 0.33, 'contradiction': 0.33, 'neutral': 0.33}
    
    def score(self, context: str, response: str) -> Dict:
        """
        Calculate NLI-based faithfulness score.
        
        Each sentence in the response is checked for entailment against the context.
        """
        # Split response into sentences
        response_sentences = self.split_into_sentences(response)
        
        if not response_sentences:
            return {
                'score': 0.5,
                'score_1_5': 3.0,
                'sentence_scores': [],
                'entailment_ratio': 0.5,
                'contradiction_ratio': 0.0
            }
        
        sentence_results = []
        entailment_count = 0
        contradiction_count = 0
        
        for sentence in response_sentences:
            nli_result = self.check_entailment(context, sentence)
            
            # Determine the classification
            max_label = max(nli_result, key=nli_result.get)
            
            sentence_results.append({
                'sentence': sentence[:50] + '...' if len(sentence) > 50 else sentence,
                'entailment_prob': nli_result['entailment'],
                'contradiction_prob': nli_result['contradiction'],
                'classification': max_label
            })
            
            if nli_result['entailment'] > 0.5:
                entailment_count += 1
            if nli_result['contradiction'] > 0.5:
                contradiction_count += 1
        
        # Calculate overall score
        entailment_ratio = entailment_count / len(response_sentences)
        contradiction_ratio = contradiction_count / len(response_sentences)
        
        # Average entailment probability
        avg_entailment = np.mean([r['entailment_prob'] for r in sentence_results])
        avg_contradiction = np.mean([r['contradiction_prob'] for r in sentence_results])
        
        # Final score: high entailment is good, contradiction is very bad
        final_score = avg_entailment * (1 - avg_contradiction * 1.5)
        final_score = max(0, min(1, final_score))
        
        return {
            'score': round(final_score, 4),
            'score_1_5': round(1 + final_score * 4, 2),
            'avg_entailment': round(avg_entailment, 4),
            'avg_contradiction': round(avg_contradiction, 4),
            'entailment_ratio': round(entailment_ratio, 4),
            'contradiction_ratio': round(contradiction_ratio, 4),
            'sentence_scores': sentence_results
        }

# Initialize NLI scorer
print("Initializing NLI-based faithfulness scorer...")
nli_scorer = NLIFaithfulness()

# Test on first example
example = df.iloc[0]
print("\n📝 Testing NLI-Based Faithfulness")
print("="*70)
print(f"\nContext: {example['context'][:100]}...")

for resp_type in ['faithful_response', 'partially_faithful', 'unfaithful_response']:
    result = nli_scorer.score(example['context'], example[resp_type])
    label = resp_type.replace('_', ' ').title()
    print(f"\n{label}:")
    print(f"  Response: {example[resp_type][:60]}...")
    print(f"  Score: {result['score']:.3f} (1-5 scale: {result['score_1_5']})")
    print(f"  Avg Entailment: {result['avg_entailment']:.3f}")
    print(f"  Avg Contradiction: {result['avg_contradiction']:.3f}")


## 5. Approach 3: Semantic Similarity Faithfulness

This approach uses sentence embeddings to measure how semantically close the response is to the context. Each response sentence is compared with its most similar context sentence, and the result is blended with a document-level similarity.
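
The sentence-level matching step can be sketched with plain NumPy on toy vectors. The two-dimensional "embeddings" below are invented for illustration; real sentence embeddings have hundreds of dimensions, and the helper name `max_similarity_per_row` is hypothetical.

```python
import numpy as np


def max_similarity_per_row(response_vecs: np.ndarray,
                           context_vecs: np.ndarray) -> np.ndarray:
    """For each response vector, return its best cosine similarity to any context vector."""
    # Normalize rows to unit length so the dot product equals cosine similarity.
    r = response_vecs / np.linalg.norm(response_vecs, axis=1, keepdims=True)
    c = context_vecs / np.linalg.norm(context_vecs, axis=1, keepdims=True)
    similarities = r @ c.T  # shape: (n_response, n_context)
    return similarities.max(axis=1)


# Toy 2-D "embeddings" standing in for real sentence vectors.
context_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
response_vecs = np.array([[2.0, 0.0],    # same direction as the first context vector
                          [1.0, 1.0],    # halfway between the two context vectors
                          [-1.0, 0.0]])  # opposite to the first, orthogonal to the second
best = max_similarity_per_row(response_vecs, context_vecs)
print(np.round(best, 3))
```

A response sentence whose best match is low, like the third row here, is a candidate for unsupported content, which is why the scorer also tracks the minimum of these values.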


In [None]:
class SemanticSimilarityFaithfulness:
    """
    Semantic similarity-based faithfulness scorer.
    
    Uses sentence embeddings to measure how semantically similar
    the response is to the context.
    """
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize with a sentence transformer model.
        
        Args:
            model_name: SentenceTransformer model name
        """
        print(f"Loading sentence embedding model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        print("✅ Embedding model loaded!")
    
    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences."""
        sentences = sent_tokenize(text)
        return [s.strip() for s in sentences if len(s.strip()) > 10]
    
    def get_embeddings(self, texts: List[str]) -> np.ndarray:
        """Get embeddings for a list of texts."""
        return self.model.encode(texts, convert_to_numpy=True)
    
    def score(self, context: str, response: str) -> Dict:
        """
        Calculate semantic similarity-based faithfulness score.
        
        For each sentence in the response, find the maximum similarity
        to any sentence in the context.
        """
        # Split into sentences
        context_sentences = self.split_into_sentences(context)
        response_sentences = self.split_into_sentences(response)
        
        if not context_sentences or not response_sentences:
            return {
                'score': 0.5,
                'score_1_5': 3.0,
                'sentence_similarities': [],
                'avg_similarity': 0.5,
                'min_similarity': 0.5
            }
        
        # Get embeddings
        context_embeddings = self.get_embeddings(context_sentences)
        response_embeddings = self.get_embeddings(response_sentences)
        
        # Also get overall document embeddings
        context_full_emb = self.get_embeddings([context])[0]
        response_full_emb = self.get_embeddings([response])[0]
        
        # Calculate document-level similarity
        doc_similarity = cosine_similarity(
            context_full_emb.reshape(1, -1),
            response_full_emb.reshape(1, -1)
        )[0][0]
        
        # Calculate sentence-level similarities
        sentence_results = []
        max_similarities = []
        
        for i, resp_emb in enumerate(response_embeddings):
            # Find max similarity to any context sentence
            similarities = cosine_similarity(
                resp_emb.reshape(1, -1),
                context_embeddings
            )[0]
            
            max_sim = float(np.max(similarities))
            max_idx = int(np.argmax(similarities))
            
            max_similarities.append(max_sim)
            sentence_results.append({
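                # Append an ellipsis only when the text was actually truncated
                'response_sentence': response_sentences[i][:50] + '...' if len(response_sentences[i]) > 50 else response_sentences[i],
                'best_match': context_sentences[max_idx][:50] + '...' if len(context_sentences[max_idx]) > 50 else context_sentences[max_idx],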
                'similarity': round(max_sim, 4)
            })
        
        # Calculate aggregate scores
        avg_similarity = np.mean(max_similarities)
        min_similarity = np.min(max_similarities)
        
        # Final score combines document and sentence-level similarity
        # Penalize low minimum similarity (indicates hallucinated content)
        final_score = (doc_similarity * 0.3 + avg_similarity * 0.4 + min_similarity * 0.3)
        
        # Apply threshold: very low similarities should be penalized more
        if min_similarity < 0.3:
            final_score *= (0.5 + min_similarity)
        
        final_score = max(0, min(1, final_score))
        
        return {
            'score': round(final_score, 4),
            'score_1_5': round(1 + final_score * 4, 2),
            'doc_similarity': round(doc_similarity, 4),
            'avg_similarity': round(avg_similarity, 4),
            'min_similarity': round(min_similarity, 4),
            'sentence_similarities': sentence_results
        }

# Initialize semantic similarity scorer
print("Initializing Semantic Similarity faithfulness scorer...")
semantic_scorer = SemanticSimilarityFaithfulness()

# Test on first example
example = df.iloc[0]
print("\n📝 Testing Semantic Similarity Faithfulness")
print("="*70)
print(f"\nContext: {example['context'][:100]}...")

for resp_type in ['faithful_response', 'partially_faithful', 'unfaithful_response']:
    result = semantic_scorer.score(example['context'], example[resp_type])
    label = resp_type.replace('_', ' ').title()
    print(f"\n{label}:")
    print(f"  Response: {example[resp_type][:60]}...")
    print(f"  Score: {result['score']:.3f} (1-5 scale: {result['score_1_5']})")
    print(f"  Doc Similarity: {result['doc_similarity']:.3f}")
    print(f"  Avg Sentence Sim: {result['avg_similarity']:.3f}")


## 6. Ensemble Faithfulness Scorer

The ensemble combines the keyword, NLI, and semantic similarity scores into a single weighted average and reports how closely the three methods agree.
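
As a worked example of the weighting, the sketch below combines three hypothetical method scores using the default weights (0.2 keyword, 0.4 NLI, 0.4 semantic) and computes agreement as one minus the population standard deviation. The function name `combine_scores` and the weight normalization step are assumptions added for illustration.

```python
from statistics import pstdev
from typing import Dict


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of method scores plus a simple agreement measure."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so that weights which do not sum to 1 still give a 0-1 score.
    normalized = {name: w / total for name, w in weights.items()}
    ensemble = sum(scores[name] * normalized[name] for name in normalized)
    # Population standard deviation: lower spread means higher agreement.
    agreement = 1 - pstdev(list(scores.values()))
    return {"score": ensemble, "agreement": agreement}


# Hypothetical per-method scores for one response.
scores = {"keyword": 0.6, "nli": 0.9, "semantic": 0.8}
weights = {"keyword": 0.2, "nli": 0.4, "semantic": 0.4}
combined = combine_scores(scores, weights)
print({k: round(v, 3) for k, v in combined.items()})
```

Here the ensemble score is 0.6 × 0.2 + 0.9 × 0.4 + 0.8 × 0.4 = 0.80. A low agreement value is a signal to inspect the individual results before trusting the average.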


In [None]:
class EnsembleFaithfulness:
    """
    Ensemble faithfulness scorer combining multiple approaches.
    
    Combines keyword-based, NLI-based, and semantic similarity approaches
    for a more robust faithfulness evaluation.
    """
    
    def __init__(self, 
                 keyword_scorer: KeywordFaithfulness,
                 nli_scorer: NLIFaithfulness,
                 semantic_scorer: SemanticSimilarityFaithfulness,
                 weights: Optional[Dict[str, float]] = None):
        """
        Initialize ensemble with individual scorers.
        
        Args:
            keyword_scorer: Keyword-based scorer
            nli_scorer: NLI-based scorer  
            semantic_scorer: Semantic similarity scorer
            weights: Dict of weights for each scorer (must sum to 1)
        """
        self.keyword_scorer = keyword_scorer
        self.nli_scorer = nli_scorer
        self.semantic_scorer = semantic_scorer
        
        # Default weights
        self.weights = weights or {
            'keyword': 0.2,
            'nli': 0.4,
            'semantic': 0.4
        }
    
    def score(self, context: str, response: str) -> Dict:
        """
        Calculate ensemble faithfulness score.
        """
        # Get individual scores
        keyword_result = self.keyword_scorer.score(context, response)
        nli_result = self.nli_scorer.score(context, response)
        semantic_result = self.semantic_scorer.score(context, response)
        
        # Calculate weighted ensemble score
        ensemble_score = (
            keyword_result['score'] * self.weights['keyword'] +
            nli_result['score'] * self.weights['nli'] +
            semantic_result['score'] * self.weights['semantic']
        )
        
        # Also calculate agreement between methods
        scores = [keyword_result['score'], nli_result['score'], semantic_result['score']]
        agreement = 1 - np.std(scores)  # Higher agreement = lower std
        
        return {
            'score': round(ensemble_score, 4),
            'score_1_5': round(1 + ensemble_score * 4, 2),
            'keyword_score': keyword_result['score'],
            'nli_score': nli_result['score'],
            'semantic_score': semantic_result['score'],
            'method_agreement': round(agreement, 4),
            'individual_results': {
                'keyword': keyword_result,
                'nli': nli_result,
                'semantic': semantic_result
            }
        }
    
    def score_batch(self, contexts: List[str], responses: List[str]) -> List[Dict]:
        """Score multiple context-response pairs."""
        return [
            self.score(ctx, resp) 
            for ctx, resp in zip(contexts, responses)
        ]

# Initialize ensemble scorer
ensemble_scorer = EnsembleFaithfulness(
    keyword_scorer=keyword_scorer,
    nli_scorer=nli_scorer,
    semantic_scorer=semantic_scorer
)

# Test on first example
example = df.iloc[0]
print("📝 Testing Ensemble Faithfulness Scorer")
print("="*70)
print(f"\nContext: {example['context'][:100]}...")

for resp_type in ['faithful_response', 'partially_faithful', 'unfaithful_response']:
    result = ensemble_scorer.score(example['context'], example[resp_type])
    label = resp_type.replace('_', ' ').title()
    print(f"\n{label}:")
    print(f"  Response: {example[resp_type][:60]}...")
    print(f"  Ensemble Score: {result['score']:.3f} (1-5 scale: {result['score_1_5']})")
    print(f"  Keyword: {result['keyword_score']:.3f} | NLI: {result['nli_score']:.3f} | Semantic: {result['semantic_score']:.3f}")
    print(f"  Method Agreement: {result['method_agreement']:.3f}")


## 7. Comprehensive Evaluation on All Examples


In [None]:
# Evaluate all examples with all methods
print("🔄 Running comprehensive evaluation on all examples...")
print("="*70)

results = []

for idx, row in df.iterrows():
    context = row['context']
    
    for resp_type in ['faithful_response', 'partially_faithful', 'unfaithful_response']:
        response = row[resp_type]
        
        # Get ensemble score (includes all individual scores)
        ensemble_result = ensemble_scorer.score(context, response)
        
        results.append({
            'question_id': row['id'],
            'question': row['question'],
            'response_type': resp_type.replace('_', ' ').title().replace(' Response', ''),
            'keyword_score': ensemble_result['keyword_score'],
            'nli_score': ensemble_result['nli_score'],
            'semantic_score': ensemble_result['semantic_score'],
            'ensemble_score': ensemble_result['score'],
            'ensemble_1_5': ensemble_result['score_1_5'],
            'method_agreement': ensemble_result['method_agreement']
        })
    
    print(f"  ✓ Evaluated example {row['id']}")

# Convert to DataFrame
results_df = pd.DataFrame(results)

print(f"\n✅ Evaluation complete! {len(results_df)} evaluations performed.")

# Display summary
print("\n📊 Summary Statistics by Response Type:")
print("="*70)
summary = results_df.groupby('response_type').agg({
    'keyword_score': ['mean', 'std'],
    'nli_score': ['mean', 'std'],
    'semantic_score': ['mean', 'std'],
    'ensemble_score': ['mean', 'std']
}).round(3)
print(summary)


## 8. Visualizations


In [None]:
# Create comprehensive visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Ensemble Score by Response Type",
        "Method Comparison",
        "Score Distribution by Method",
        "Per-Question Heatmap"
    ),
    specs=[
        [{"type": "bar"}, {"type": "bar"}],
        [{"type": "box"}, {"type": "heatmap"}]
    ]
)

colors = {
    "Faithful": "#2ecc71",
    "Partially Faithful": "#f39c12",
    "Unfaithful": "#e74c3c"
}

# Plot 1: Ensemble Score by Response Type
avg_ensemble = results_df.groupby('response_type')['ensemble_score'].mean().reset_index()
avg_ensemble = avg_ensemble.sort_values('ensemble_score', ascending=False)

fig.add_trace(
    go.Bar(
        x=avg_ensemble['response_type'],
        y=avg_ensemble['ensemble_score'],
        marker_color=[colors.get(t, '#3498db') for t in avg_ensemble['response_type']],
        text=avg_ensemble['ensemble_score'].round(3),
        textposition='outside',
        name='Ensemble'
    ),
    row=1, col=1
)

# Plot 2: Method Comparison (grouped bar)
methods = ['keyword_score', 'nli_score', 'semantic_score']
method_names = ['Keyword', 'NLI', 'Semantic']
method_colors = ['#3498db', '#9b59b6', '#1abc9c']

for resp_type in ['Faithful', 'Partially Faithful', 'Unfaithful']:
    type_data = results_df[results_df['response_type'] == resp_type]
    avg_scores = [type_data[m].mean() for m in methods]
    
    fig.add_trace(
        go.Bar(
            name=resp_type,
            x=method_names,
            y=avg_scores,
            marker_color=colors[resp_type]
        ),
        row=1, col=2
    )

# Plot 3: Score Distribution (Box Plot)
for method, name, color in zip(methods, method_names, method_colors):
    fig.add_trace(
        go.Box(
            y=results_df[method],
            name=name,
            marker_color=color
        ),
        row=2, col=1
    )

# Plot 4: Heatmap
pivot_data = results_df.pivot(
    index='response_type', 
    columns='question_id', 
    values='ensemble_score'
)

fig.add_trace(
    go.Heatmap(
        z=pivot_data.values,
        x=[f"Q{i}" for i in pivot_data.columns],
        y=pivot_data.index,
        colorscale='RdYlGn',
        text=np.round(pivot_data.values, 2),
        texttemplate="%{text}",
        textfont={"size": 10}
    ),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title_text="Faithfulness Metric Analysis (Custom Implementation)",
    showlegend=True,
    height=800,
    width=1200,
    barmode='group'
)

fig.update_yaxes(title_text="Score (0-1)", row=1, col=1)
fig.update_yaxes(title_text="Score (0-1)", row=1, col=2)
fig.update_yaxes(title_text="Score (0-1)", row=2, col=1)

fig.show()


In [None]:
# Radar chart comparing methods
fig_radar = go.Figure()

# Plotly renders line breaks in labels with <br>, not \n
categories = ['Keyword<br>Score', 'NLI<br>Score', 'Semantic<br>Score', 'Ensemble<br>Score', 'Method<br>Agreement']

for resp_type in ['Faithful', 'Partially Faithful', 'Unfaithful']:
    type_data = results_df[results_df['response_type'] == resp_type]
    
    values = [
        type_data['keyword_score'].mean(),
        type_data['nli_score'].mean(),
        type_data['semantic_score'].mean(),
        type_data['ensemble_score'].mean(),
        type_data['method_agreement'].mean()
    ]
    
    fig_radar.add_trace(go.Scatterpolar(
        r=values,
        theta=categories,
        fill='toself',
        name=resp_type,
        marker_color=colors[resp_type]
    ))

fig_radar.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )
    ),
    showlegend=True,
    title="Faithfulness Dimensions by Response Type",
    height=500,
    width=700
)

fig_radar.show()


## 9. Results Comparison Table


In [None]:
# Create a comprehensive comparison table
print("📊 Detailed Results Comparison")
print("="*90)

# Average scores by response type
comparison_table = results_df.groupby('response_type').agg({
    'keyword_score': 'mean',
    'nli_score': 'mean',
    'semantic_score': 'mean',
    'ensemble_score': 'mean',
    'ensemble_1_5': 'mean'
}).round(3)

comparison_table.columns = ['Keyword', 'NLI', 'Semantic', 'Ensemble (0-1)', 'Ensemble (1-5)']
comparison_table = comparison_table.sort_values('Ensemble (0-1)', ascending=False)

print("\nAverage Scores by Response Type:")
print(comparison_table.to_string())

# Visual bar representation
print("\n\n📈 Visual Score Comparison:")
print("="*90)
for resp_type in comparison_table.index:
    score = comparison_table.loc[resp_type, 'Ensemble (0-1)']
    bar = '█' * int(score * 40)
    print(f"{resp_type:25s}: {bar:40s} {score:.3f}")

# Method correlation analysis
print("\n\n🔗 Method Correlation Analysis:")
print("="*90)
correlation = results_df[['keyword_score', 'nli_score', 'semantic_score', 'ensemble_score']].corr().round(3)
correlation.columns = ['Keyword', 'NLI', 'Semantic', 'Ensemble']
correlation.index = ['Keyword', 'NLI', 'Semantic', 'Ensemble']
print(correlation.to_string())


## 10. Hallucination Detection


In [None]:
class HallucinationDetector:
    """
    Detect potential hallucinations using our faithfulness scorers.
    """
    
    def __init__(self, ensemble_scorer: EnsembleFaithfulness, threshold: float = 0.5):
        self.ensemble_scorer = ensemble_scorer
        self.threshold = threshold
    
    def detect(self, context: str, response: str) -> Dict:
        """
        Detect hallucinations in a response.
        """
        result = self.ensemble_scorer.score(context, response)
        
        # Extract detailed info
        keyword_result = result['individual_results']['keyword']
        nli_result = result['individual_results']['nli']
        
        # Identify specific issues
        issues = []
        
        # Check for novel keywords (potential hallucinations)
        if keyword_result.get('novel_keywords'):
            issues.append(f"Novel keywords not in context: {keyword_result['novel_keywords'][:5]}")
        
        if keyword_result.get('novel_entities'):
            issues.append(f"Novel entities/numbers: {keyword_result['novel_entities']}")
        
        # Check NLI contradictions
        if nli_result.get('contradiction_ratio', 0) > 0.3:
            issues.append("High contradiction rate detected in NLI analysis")
        
        # Determine risk level
        score = result['score']
        if score >= 0.7:
            risk_level = "LOW"
            recommendation = "Response appears faithful. Safe to use."
        elif score >= 0.5:
            risk_level = "MEDIUM"
            recommendation = "Some concerns. Review before using."
        elif score >= 0.3:
            risk_level = "HIGH"
            recommendation = "Significant hallucination risk. Manual verification required."
        else:
            risk_level = "CRITICAL"
            recommendation = "Likely contains hallucinations. Do not use without revision."
        
        return {
            'faithfulness_score': result['score'],
            'score_1_5': result['score_1_5'],
            'risk_level': risk_level,
            'recommendation': recommendation,
            'issues': issues,
            'method_scores': {
                'keyword': result['keyword_score'],
                'nli': result['nli_score'],
                'semantic': result['semantic_score']
            }
        }

# Initialize detector
detector = HallucinationDetector(ensemble_scorer)

# Test on examples
print("🔍 Hallucination Detection Analysis")
print("="*80)

for idx, row in df.head(3).iterrows():
    print(f"\n{'='*80}")
    print(f"Question {row['id']}: {row['question']}")
    
    for resp_type, label in [('faithful_response', 'Faithful'), 
                              ('unfaithful_response', 'Unfaithful')]:
        result = detector.detect(row['context'], row[resp_type])
        
        print(f"\n  {label} Response:")
        print(f"    Response: {row[resp_type][:50]}...")
        print(f"    Risk Level: {result['risk_level']}")
        print(f"    Score: {result['faithfulness_score']:.3f} (1-5: {result['score_1_5']})")
        if result['issues']:
            print(f"    Issues: {result['issues'][0][:60]}...")
        print(f"    💡 {result['recommendation']}")


## 11. Best Practices and Recommendations

### Key Takeaways:

1. **Multiple Methods Are Better**:
   - Keyword-based: Fast, interpretable, but surface-level
   - NLI-based: Captures semantic entailment, detects contradictions
   - Semantic similarity: Good for overall alignment, may miss subtle issues
   - **Ensemble**: Combines strengths, most robust

2. **When to Use Each Method**:
   - **Keyword**: Quick sanity check, identifying obvious hallucinations
   - **NLI**: Detecting contradictions, factual inconsistencies
   - **Semantic**: Overall coherence, topic alignment
   - **Ensemble**: Production systems, comprehensive evaluation

3. **Limitations**:
   - Keyword-based misses paraphrases
   - NLI models have a fixed maximum input length, so long contexts must be chunked
   - Semantic similarity can be fooled by topically similar but factually incorrect content

4. **Production Considerations**:
   - Cache embeddings for repeated evaluations
   - Consider batching for throughput
   - Set thresholds based on your risk tolerance
   - Log scores for monitoring and improvement
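
The chunking noted under Limitations could look roughly like the sketch below. It assumes an NLI scorer is available as a callable that returns an entailment probability for a (premise, hypothesis) pair; `chunk_sentences`, `chunked_entailment`, and the `entail_fn` parameter are hypothetical names for illustration, not part of this notebook's classes. A simple regex sentence splitter stands in for `sent_tokenize` so the example runs on its own.

```python
import re
from typing import Callable, List


def chunk_sentences(text: str, max_sentences: int = 3, overlap: int = 1) -> List[str]:
    """Split text into overlapping windows of at most `max_sentences` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must satisfy 0 <= overlap < max_sentences")
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks


def chunked_entailment(
    context: str,
    claim: str,
    entail_fn: Callable[[str, str], float],
    max_sentences: int = 3,
    overlap: int = 1,
) -> float:
    """Score a claim against each context chunk and keep the best-supported one.

    Taking the max is one possible aggregation (an assumption here): a claim
    counts as grounded if any single chunk entails it.
    """
    chunks = chunk_sentences(context, max_sentences, overlap)
    if not chunks:
        return 0.0
    return max(entail_fn(chunk, claim) for chunk in chunks)


# Toy stand-in for an NLI model: fraction of claim words present in the chunk.
def toy_entail_fn(premise: str, hypothesis: str) -> float:
    premise_words = set(re.findall(r"\w+", premise.lower()))
    claim_words = re.findall(r"\w+", hypothesis.lower())
    if not claim_words:
        return 0.0
    return sum(w in premise_words for w in claim_words) / len(claim_words)


context = "Paris is in France. Berlin is in Germany. Rome is in Italy. Madrid is in Spain."
print(chunk_sentences(context, max_sentences=2, overlap=1))
print(chunked_entailment(context, "Rome is in Italy", toy_entail_fn, max_sentences=2))
```

Overlapping windows reduce the chance that a fact split across a chunk boundary goes unmatched, at the cost of more scorer calls per claim.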

### Score Interpretation:
- **0.7-1.0**: High faithfulness (LOW risk), safe to use
- **0.5 to below 0.7**: Moderate faithfulness (MEDIUM risk), review recommended
- **0.3 to below 0.5**: Low faithfulness (HIGH risk), manual verification required
- **Below 0.3**: Likely hallucinated (CRITICAL risk), do not use without revision
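
These bands mirror the thresholds used in `HallucinationDetector.detect`, where a score exactly on a boundary falls into the safer band. A minimal standalone sketch of that mapping follows; `risk_level` is a hypothetical helper name introduced here for illustration.

```python
def risk_level(score: float) -> str:
    """Map a 0-1 faithfulness score to a risk band (boundaries are inclusive on the low end)."""
    if score >= 0.7:
        return "LOW"
    if score >= 0.5:
        return "MEDIUM"
    if score >= 0.3:
        return "HIGH"
    return "CRITICAL"


for s in (0.85, 0.7, 0.55, 0.3, 0.1):
    print(f"{s:.2f} -> {risk_level(s)}")
```

The cut-offs are the notebook's defaults; tune them to your own risk tolerance.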


In [None]:
# Final summary
print("✅ Faithfulness Metric Demo Complete!")
print("="*70)
print("\n📋 Summary of Implementations:")
print("   1. KeywordFaithfulness    - Keyword overlap analysis")
print("   2. NLIFaithfulness        - Natural Language Inference")
print("   3. SemanticSimilarity     - Sentence embedding similarity")
print("   4. EnsembleFaithfulness   - Weighted combination")
print("   5. HallucinationDetector  - Risk assessment")

print("\n📊 Average Results by Response Type:")
summary_final = results_df.groupby('response_type')['ensemble_score'].mean().sort_values(ascending=False)
for resp_type, score in summary_final.items():
    print(f"   {resp_type:25s}: {score:.3f}")

print("\n🔗 Resources:")
print("   - NLI Models: https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=nli")
print("   - Sentence Transformers: https://www.sbert.net/")
print("   - Faithfulness in RAG: https://arxiv.org/abs/2307.15992")
print("   - RAGAS Framework: https://docs.ragas.io/")

print("\n💡 Next Steps:")
print("   - Fine-tune NLI model on your domain")
print("   - Add custom entity extraction for your use case")
print("   - Implement claim-level verification")
print("   - Integrate with your RAG pipeline")


In [None]:
class KeywordFaithfulness:
    """
    Keyword-based faithfulness scorer.
    
    Measures faithfulness based on how many keywords in the response
    can be found in the context (i.e., are grounded).
    """
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        # Add common words that don't indicate factual content
        self.stop_words.update(['also', 'however', 'therefore', 'thus', 'hence'])
    
    def extract_keywords(self, text: str) -> set:
        """Extract meaningful keywords from text."""
        # Tokenize and lowercase
        tokens = word_tokenize(text.lower())
        
        # Remove stopwords and non-alphabetic tokens
        keywords = set()
        for token in tokens:
            if (token.isalpha() and 
                token not in self.stop_words and 
                len(token) > 2):
                keywords.add(token)
        
        return keywords
    
    def extract_entities_and_numbers(self, text: str) -> set:
        """Extract named entities and numbers (important for factual accuracy)."""
        entities = set()
        
        # Extract numbers
        numbers = re.findall(r'\b\d+(?:\.\d+)?\b', text)
        entities.update(numbers)
        
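        # Extract potential proper nouns (capitalized words).
        # Strip surrounding punctuation first so tokens like '"Paris' or '(Paris)' are caught.
        # Note: this heuristic also picks up ordinary sentence-initial words.
        for word in text.split():
            cleaned = word.strip('.,!?;:"\'()[]')
            if len(cleaned) > 1 and cleaned[0].isupper():
                entities.add(cleaned.lower())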
        
        return entities
    
    def score(self, context: str, response: str) -> Dict:
        """
        Calculate keyword-based faithfulness score.
        
        Returns:
            Dict with score and detailed breakdown
        """
        # Extract keywords
        context_keywords = self.extract_keywords(context)
        response_keywords = self.extract_keywords(response)
        
        # Extract entities/numbers (weighted more heavily)
        context_entities = self.extract_entities_and_numbers(context)
        response_entities = self.extract_entities_and_numbers(response)
        
        # Calculate overlap
        keyword_overlap = response_keywords.intersection(context_keywords)
        entity_overlap = response_entities.intersection(context_entities)
        
        # Calculate scores
        if len(response_keywords) > 0:
            keyword_precision = len(keyword_overlap) / len(response_keywords)
        else:
            keyword_precision = 0.0
        
        if len(response_entities) > 0:
            entity_precision = len(entity_overlap) / len(response_entities)
        else:
            entity_precision = 1.0  # No entities to verify
        
        # Novel keywords (potential hallucinations)
        novel_keywords = response_keywords - context_keywords
        novel_entities = response_entities - context_entities
        
        # Penalize novel entities more heavily
        novelty_penalty = 0.0
        if len(response_keywords) > 0:
            novelty_penalty = len(novel_keywords) / len(response_keywords) * 0.3
        if len(response_entities) > 0:
            novelty_penalty += len(novel_entities) / len(response_entities) * 0.7
        
        # Combined score
        combined_score = (keyword_precision * 0.4 + entity_precision * 0.6) * (1 - novelty_penalty * 0.5)
        combined_score = max(0, min(1, combined_score))
        
        return {
            'score': round(combined_score, 4),
            'score_1_5': round(1 + combined_score * 4, 2),  # Scale to 1-5
            'keyword_precision': round(keyword_precision, 4),
            'entity_precision': round(entity_precision, 4),
            'novelty_penalty': round(novelty_penalty, 4),
            'grounded_keywords': list(keyword_overlap)[:10],
            'novel_keywords': list(novel_keywords)[:10],
            'novel_entities': list(novel_entities)[:5]
        }

# Initialize the scorer
keyword_scorer = KeywordFaithfulness()

# Test on first example
example = df.iloc[0]
print("📝 Testing Keyword-Based Faithfulness")
print("="*70)
print(f"\nContext: {example['context'][:100]}...")

for resp_type in ['faithful_response', 'partially_faithful', 'unfaithful_response']:
    result = keyword_scorer.score(example['context'], example[resp_type])
    label = resp_type.replace('_', ' ').title()
    print(f"\n{label}:")
    print(f"  Response: {example[resp_type][:60]}...")
    print(f"  Score: {result['score']:.3f} (1-5 scale: {result['score_1_5']})")
    print(f"  Keyword Precision: {result['keyword_precision']:.3f}")
    print(f"  Novel Keywords: {result['novel_keywords'][:5]}")
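
To make the combined score in `KeywordFaithfulness.score` easier to follow, the sketch below reproduces the same arithmetic from raw counts. `keyword_combined_score` is a hypothetical helper written for this illustration, and the counts in the example are invented rather than taken from the dataset.

```python
def keyword_combined_score(n_keywords: int, n_keywords_grounded: int,
                           n_entities: int, n_entities_grounded: int) -> float:
    """Recompute the keyword-based combined score from counts."""
    # Precision: share of response keywords/entities that also appear in the context.
    keyword_precision = n_keywords_grounded / n_keywords if n_keywords else 0.0
    entity_precision = n_entities_grounded / n_entities if n_entities else 1.0

    # Novelty penalty: novel entities are weighted more heavily (0.7) than novel keywords (0.3).
    penalty = 0.0
    if n_keywords:
        penalty += (n_keywords - n_keywords_grounded) / n_keywords * 0.3
    if n_entities:
        penalty += (n_entities - n_entities_grounded) / n_entities * 0.7

    # Weighted precision, scaled down by half of the novelty penalty, clamped to [0, 1].
    score = (keyword_precision * 0.4 + entity_precision * 0.6) * (1 - penalty * 0.5)
    return max(0.0, min(1.0, score))


# 10 keywords (8 grounded) and 2 entities (1 grounded):
#   precision part = 0.8 * 0.4 + 0.5 * 0.6 = 0.62
#   penalty        = 0.2 * 0.3 + 0.5 * 0.7 = 0.41
#   score          = 0.62 * (1 - 0.205)    = 0.4929
print(round(keyword_combined_score(10, 8, 2, 1), 4))
```

Because the novel fraction is one minus the precision whenever the set is non-empty, an ungrounded entity lowers the score twice: once through `entity_precision` and again through the penalty term.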
