<a href="https://colab.research.google.com/github/dimitarpg13/agentic_architectures_and_design_patterns/blob/main/notebooks/model_evaluation/mlflow_faithfulness_metric_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# MLflow Faithfulness Metric Demonstration

This notebook provides a comprehensive guide to using MLflow's Faithfulness metric for evaluating RAG (Retrieval Augmented Generation) and text generation systems.

## What is Faithfulness?

Faithfulness is a critical metric for evaluating whether generated text accurately represents the source information without introducing hallucinations. Key characteristics:

- **Factual Consistency**: Measures if claims in the output can be verified from the context
- **Hallucination Detection**: Identifies fabricated information not present in the source
- **LLM-as-Judge**: Uses a language model to assess faithfulness
- **RAG-Focused**: Essential for retrieval-augmented generation systems

### Faithfulness Score Range:
- MLflow's built-in metric grades on a 1-5 scale; other setups may use a binary label (faithful/unfaithful)
- Higher scores indicate better alignment with source context
- A score of 5 means fully faithful with no hallucinations
- A score of 1 indicates significant deviation from source material
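The two conventions above can be related with a small helper. This is a minimal sketch, not part of MLflow: the function names are hypothetical, and the cut-off of 4 for calling a response "faithful" is an illustrative assumption that should be tuned per use case.

```python
def normalize_score(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a score on the [low, high] scale onto [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


def to_binary(score: float, threshold: float = 4.0) -> str:
    """Collapse a 1-5 score into a binary label (threshold is an assumption)."""
    return "faithful" if score >= threshold else "unfaithful"


for s in (1, 3, 5):
    print(s, normalize_score(s), to_binary(s))
```

Note that dividing a 1-5 score by 5 would give a range of 0.2 to 1.0 rather than 0 to 1, which is why the helper subtracts the lower bound first.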


## 1. Installation and Setup


In [None]:
# Install required packages
# The version specifier is quoted so the shell does not treat ">" as a redirect
!pip install -q "mlflow>=2.8.0" openai pandas numpy matplotlib seaborn plotly


In [None]:
import mlflow
import mlflow.metrics
from mlflow.metrics import make_metric, MetricValue
from mlflow.metrics.genai import faithfulness, EvaluationExample
import pandas as pd
import numpy as np
import json
import os
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print(f"MLflow version: {mlflow.__version__}")


In [None]:
# Set up OpenAI API key for LLM-based evaluation
# You can set this via environment variable or directly here
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Verify API key is set
if "OPENAI_API_KEY" not in os.environ:
    print("⚠️ Warning: OPENAI_API_KEY not set. Please set it to use LLM-based faithfulness evaluation.")
    print("   You can set it using: os.environ['OPENAI_API_KEY'] = 'your-key'")
else:
    print("✅ OpenAI API key is configured")


## 2. Understanding MLflow's Built-in Faithfulness Metric


In [None]:
# MLflow provides a built-in faithfulness metric using LLM-as-judge
# Let's examine the default faithfulness metric

# Create the built-in faithfulness metric
faithfulness_metric = faithfulness(model="openai:/gpt-4o-mini")

print("📋 MLflow Faithfulness Metric Details:")
print("="*60)
print(f"Metric Name: {faithfulness_metric.name}")
print(f"Greater is Better: {faithfulness_metric.greater_is_better}")
print(f"\nThis metric evaluates whether the model output is")
print("faithful to the provided context without hallucinations.")


## 3. Creating Custom Faithfulness Metrics


In [None]:
class FaithfulnessMetrics:
    """
    Comprehensive faithfulness metric implementations for MLflow.
    """
    
    @staticmethod
    def create_custom_faithfulness_metric(model="openai:/gpt-4o-mini"):
        """
        Create a custom faithfulness metric with detailed evaluation examples.
        
        Args:
            model: The LLM model to use as judge
        """
        # Define evaluation examples for few-shot prompting
        examples = [
            EvaluationExample(
                input="What is the capital of France?",
                output="The capital of France is Paris, which is located along the Seine River.",
                grading_context={"context": "France is a country in Western Europe. Paris is the capital and largest city of France, situated on the Seine River."},
                score=5,
                justification="The answer is completely faithful to the context. It correctly states Paris as the capital and mentions the Seine River, both of which are present in the context."
            ),
            EvaluationExample(
                input="What is the population of Paris?",
                output="Paris has a population of approximately 2.1 million people in the city proper and over 12 million in the metropolitan area.",
                grading_context={"context": "Paris is the capital of France. It is known for its art, culture, and architecture."},
                score=1,
                justification="The output contains population statistics that are not present in the context. This is a hallucination since the context only mentions Paris's cultural aspects, not population data."
            ),
            EvaluationExample(
                input="What are the main features of Python?",
                output="Python is a high-level programming language known for its readability and simplicity.",
                grading_context={"context": "Python is a high-level, interpreted programming language known for its clear syntax, readability, and emphasis on code simplicity. It supports multiple programming paradigms."},
                score=4,
                justification="The answer is mostly faithful, capturing the key points about Python being high-level and readable. It slightly paraphrases 'clear syntax' as 'readability' which is acceptable. However, it omits the multi-paradigm aspect mentioned in context."
            )
        ]
        
        # Create metric with examples
        custom_faithfulness = faithfulness(
            model=model,
            examples=examples
        )
        
        return custom_faithfulness
    
    @staticmethod
    def create_strict_faithfulness_metric(model="openai:/gpt-4o-mini"):
        """
        Create a strict faithfulness metric that penalizes any deviation from context.
        """
        strict_examples = [
            EvaluationExample(
                input="Summarize the company's Q3 results.",
                output="The company reported revenue of $5.2 billion in Q3.",
                grading_context={"context": "In Q3, the company reported total revenue of $5.2 billion, representing a 15% year-over-year increase."},
                score=5,
                justification="The output only includes information directly stated in the context. No additional claims are made."
            ),
            EvaluationExample(
                input="What were the Q3 profits?",
                output="The company made strong profits of approximately $800 million in Q3.",
                grading_context={"context": "In Q3, the company reported total revenue of $5.2 billion."},
                score=1,
                justification="STRICT VIOLATION: The profit figure of $800 million is not mentioned in the context. Only revenue is discussed. This is a fabrication."
            ),
            EvaluationExample(
                input="Describe the weather conditions.",
                output="It was a sunny day with clear skies.",
                grading_context={"context": "The weather report indicated sunny conditions with temperatures around 75°F."},
                score=3,
                justification="Partially faithful. 'Sunny' is correct, but 'clear skies' is an inference not explicitly stated. Temperature was omitted."
            )
        ]
        
        return faithfulness(
            model=model,
            examples=strict_examples
        )

# Create metric instances
custom_faithfulness = FaithfulnessMetrics.create_custom_faithfulness_metric()
strict_faithfulness = FaithfulnessMetrics.create_strict_faithfulness_metric()

print("✅ Custom faithfulness metrics created successfully!")


## 4. Preparing Sample RAG Evaluation Data


In [None]:
# Create diverse RAG examples with varying faithfulness levels
rag_evaluation_data = [
    {
        "question": "What is the main function of mitochondria?",
        "context": "Mitochondria are membrane-bound organelles found in the cytoplasm of eukaryotic cells. They are often referred to as the 'powerhouse of the cell' because they generate most of the cell's supply of adenosine triphosphate (ATP), used as a source of chemical energy.",
        "faithful_response": "Mitochondria are the powerhouse of the cell, responsible for generating most of the cell's ATP, which serves as chemical energy.",
        "partially_faithful_response": "Mitochondria generate ATP and are found in most animal and plant cells. They have a double membrane structure.",
        "unfaithful_response": "Mitochondria are responsible for protein synthesis and DNA replication in cells."
    },
    {
        "question": "When was the Eiffel Tower built?",
        "context": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. It was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair. Gustave Eiffel's company designed and built the tower.",
        "faithful_response": "The Eiffel Tower was constructed between 1887 and 1889. It was built as the centerpiece of the 1889 World's Fair and was designed by Gustave Eiffel's company.",
        "partially_faithful_response": "The Eiffel Tower was built in 1889 and stands 324 meters tall, making it one of the tallest structures in Paris.",
        "unfaithful_response": "The Eiffel Tower was built in 1920 by the French government to celebrate the end of World War I."
    },
    {
        "question": "How does photosynthesis work?",
        "context": "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy. This chemical energy is stored in carbohydrate molecules synthesized from carbon dioxide and water. Chlorophyll, the green pigment in plants, absorbs light for this process.",
        "faithful_response": "Photosynthesis converts light energy into chemical energy stored in carbohydrates. Plants use carbon dioxide and water, with chlorophyll absorbing the light needed for the process.",
        "partially_faithful_response": "Photosynthesis is how plants make food using sunlight. It occurs primarily in the leaves and requires temperatures between 50-85°F for optimal efficiency.",
        "unfaithful_response": "Photosynthesis is the process where plants absorb nutrients from the soil and convert them into oxygen. It happens mainly at night."
    },
    {
        "question": "What are the symptoms of vitamin D deficiency?",
        "context": "Vitamin D deficiency can cause bone pain and muscle weakness. In severe cases, it can lead to osteomalacia in adults, characterized by soft bones. Risk factors include limited sun exposure and darker skin pigmentation.",
        "faithful_response": "Vitamin D deficiency can cause bone pain and muscle weakness. Severe deficiency may lead to osteomalacia, a condition where bones become soft.",
        "partially_faithful_response": "Vitamin D deficiency causes bone pain, muscle weakness, fatigue, depression, and increased susceptibility to infections.",
        "unfaithful_response": "Vitamin D deficiency primarily causes skin rashes, hair loss, and vision problems. It can be diagnosed through a simple urine test."
    },
    {
        "question": "What is machine learning?",
        "context": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that can access data and use it to learn for themselves.",
        "faithful_response": "Machine learning is a branch of AI that allows systems to learn from experience without explicit programming. It develops algorithms that can access and learn from data autonomously.",
        "partially_faithful_response": "Machine learning is an AI technology that learns from data. It was invented at Stanford in 1980 and is used primarily for image recognition.",
        "unfaithful_response": "Machine learning is a type of database management system that stores and retrieves information using neural networks."
    }
]

# Convert to DataFrame
df_rag = pd.DataFrame(rag_evaluation_data)

print(f"📊 Created {len(df_rag)} RAG evaluation examples")
print("\nExample data:")
for idx, row in df_rag.head(2).iterrows():
    print(f"\n{idx+1}. Question: {row['question']}")
    print(f"   Context: {row['context'][:100]}...")
    print(f"   Faithful: {row['faithful_response'][:80]}...")
    print(f"   Unfaithful: {row['unfaithful_response'][:80]}...")


## 5. Evaluating RAG Systems with MLflow Faithfulness


In [None]:
# Set up MLflow experiment
mlflow.set_experiment("faithfulness-metrics-demo")

def evaluate_rag_model(model_name, eval_data, model_config=None):
    """
    Comprehensive evaluation of a RAG model using faithfulness metrics.
    
    Args:
        model_name: Name of the model being evaluated
        eval_data: DataFrame with columns: inputs, outputs, context, ground_truth
        model_config: Optional configuration parameters
    """
    with mlflow.start_run(run_name=model_name):
        # Log model configuration
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("num_samples", len(eval_data))
        
        if model_config:
            for key, value in model_config.items():
                mlflow.log_param(key, value)
        
        # Run evaluation with faithfulness metric
        results = mlflow.evaluate(
            data=eval_data,
            targets="ground_truth",
            predictions="outputs",
            extra_metrics=[faithfulness_metric],
            evaluators="default",
            evaluator_config={
                "col_mapping": {
                    "inputs": "inputs",
                    "context": "context"
                }
            }
        )
        
        return results

# Prepare evaluation datasets for different response types
def prepare_eval_data(df, response_column):
    """Prepare evaluation data in the format expected by MLflow."""
    return pd.DataFrame({
        "inputs": df["question"],
        "outputs": df[response_column],
        "context": df["context"],
        "ground_truth": df["faithful_response"]  # Using faithful response as ground truth
    })

# Evaluate each response type
response_types = {
    "Faithful-RAG-Model": {
        "column": "faithful_response",
        "config": {"model_type": "rag", "retriever": "dense", "temperature": 0.1}
    },
    "Partial-Faithful-Model": {
        "column": "partially_faithful_response",
        "config": {"model_type": "rag", "retriever": "sparse", "temperature": 0.5}
    },
    "Unfaithful-Baseline": {
        "column": "unfaithful_response",
        "config": {"model_type": "base_llm", "retriever": "none", "temperature": 0.9}
    }
}

evaluation_results = {}
print("🔄 Evaluating RAG models with faithfulness metric...")
print("="*60)

for model_name, config in response_types.items():
    print(f"\n📊 Evaluating: {model_name}")
    eval_data = prepare_eval_data(df_rag, config["column"])
    
    try:
        results = evaluate_rag_model(
            model_name,
            eval_data,
            config["config"]
        )
        evaluation_results[model_name] = results.metrics
        print(f"   ✅ Completed - Faithfulness Score: {results.metrics.get('faithfulness/v1/mean', 'N/A')}")
    except Exception as e:
        print(f"   ⚠️ Error: {str(e)}")
        evaluation_results[model_name] = {"error": str(e)}

print("\n✅ Evaluation completed for all models!")
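The loop above stores either `results.metrics` or an `{"error": ...}` dict for each model, so any later comparison has to handle both shapes. The sketch below shows one way to pull out only the faithfulness entries. The helper name is hypothetical, the `faithfulness/v1/p90` key is an assumption about which aggregates MLflow reports alongside the mean, and the numbers are placeholders rather than real evaluation output.

```python
def faithfulness_summary(evaluation_results):
    """Keep only faithfulness entries per model; None marks a failed run."""
    summary = {}
    for model_name, metrics in evaluation_results.items():
        if "error" in metrics:
            summary[model_name] = None
            continue
        summary[model_name] = {
            key: value
            for key, value in metrics.items()
            if key.startswith("faithfulness/")
        }
    return summary


# Placeholder values for illustration only, not real evaluation output
example_results = {
    "Model-A": {
        "faithfulness/v1/mean": 4.5,
        "faithfulness/v1/p90": 5.0,
        "other_metric": 0.1,
    },
    "Model-B": {"error": "API key missing"},
}

print(faithfulness_summary(example_results))
```

In the notebook, the same helper could be called as `faithfulness_summary(evaluation_results)` once the evaluation loop has finished.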


## 6. Manual Faithfulness Evaluation (Without API)


In [None]:
# For demonstration without API access, let's create a rule-based faithfulness scorer

class RuleBasedFaithfulnessScorer:
    """
    A rule-based faithfulness scorer for demonstration purposes.
    Uses keyword-overlap heuristics only; it does not run an entailment (NLI) model.
    """
    
    def __init__(self):
        self.stopwords = set(['the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 
                              'been', 'being', 'have', 'has', 'had', 'do', 'does',
                              'did', 'will', 'would', 'could', 'should', 'may',
                              'might', 'must', 'shall', 'can', 'to', 'of', 'in',
                              'for', 'on', 'with', 'at', 'by', 'from', 'as', 'into',
                              'through', 'during', 'before', 'after', 'above',
                              'below', 'between', 'and', 'or', 'but', 'if', 'then',
                              'because', 'it', 'its', 'this', 'that', 'these', 'those'])
    
    def extract_keywords(self, text):
        """Extract meaningful keywords from text."""
        words = text.lower().replace('.', '').replace(',', '').split()
        return set(w for w in words if w not in self.stopwords and len(w) > 2)
    
    def calculate_overlap_score(self, context, response):
        """Calculate keyword overlap between context and response."""
        context_keywords = self.extract_keywords(context)
        response_keywords = self.extract_keywords(response)
        
        if not response_keywords:
            return 0.0
        
        overlap = response_keywords.intersection(context_keywords)
        
        # Precision: what fraction of response keywords are in context
        precision = len(overlap) / len(response_keywords)
        
        return precision
    
    def calculate_novelty_penalty(self, context, response):
        """Penalize novel information not in context (potential hallucinations)."""
        context_keywords = self.extract_keywords(context)
        response_keywords = self.extract_keywords(response)
        
        novel_keywords = response_keywords - context_keywords
        
        if not response_keywords:
            return 1.0
        
        novelty_ratio = len(novel_keywords) / len(response_keywords)
        
        # Higher novelty = lower faithfulness
        return 1.0 - novelty_ratio
    
    def score(self, context, response):
        """
        Calculate faithfulness score (1-5 scale).
        
        Returns:
            score: Faithfulness score (1-5)
            details: Dictionary with component scores
        """
        overlap = self.calculate_overlap_score(context, response)
        novelty_penalty = self.calculate_novelty_penalty(context, response)
        
        # Combined score (weighted average)
        combined = (overlap * 0.4 + novelty_penalty * 0.6)
        
        # Convert to 1-5 scale
        scaled_score = 1 + (combined * 4)
        
        return {
            "score": round(scaled_score, 2),
            "overlap": round(overlap, 3),
            "novelty_penalty_score": round(novelty_penalty, 3),
            "combined_raw": round(combined, 3)
        }

# Create scorer instance
rule_scorer = RuleBasedFaithfulnessScorer()

# Demonstrate on sample data
print("📊 Rule-Based Faithfulness Scoring (Demonstration)")
print("="*70)

for idx, row in df_rag.iterrows():
    print(f"\n{'='*70}")
    print(f"Question {idx+1}: {row['question']}")
    print(f"Context: {row['context'][:100]}...")
    
    # Score each response type
    responses = [
        ("Faithful", row['faithful_response']),
        ("Partially Faithful", row['partially_faithful_response']),
        ("Unfaithful", row['unfaithful_response'])
    ]
    
    for resp_type, response in responses:
        result = rule_scorer.score(row['context'], response)
        print(f"\n  {resp_type}:")
        print(f"    Response: {response[:60]}...")
        print(f"    Score: {result['score']}/5 (Overlap: {result['overlap']}, Novelty: {result['novelty_penalty_score']})")
    
    if idx >= 1:  # Show only first 2 examples
        print("\n... (showing first 2 examples)")
        break
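It helps to trace the arithmetic of `RuleBasedFaithfulnessScorer.score` by hand. The overlap term is the fraction of response keywords found in the context, and the novelty term is one minus the fraction not found in the context, so for any non-empty response the two are the same number and the weighted average reduces to that precision value. The sketch below reproduces the formula on hand-picked keyword sets (the sets are illustrative assumptions, not the output of `extract_keywords`).

```python
def score_from_keywords(context_kw, response_kw):
    """Mirror the scorer's arithmetic, starting from precomputed keyword sets."""
    if not response_kw:
        # Mirrors the early returns in the class: overlap 0.0, novelty term 1.0
        overlap, novelty_penalty = 0.0, 1.0
    else:
        overlap = len(response_kw & context_kw) / len(response_kw)
        novelty_penalty = 1.0 - len(response_kw - context_kw) / len(response_kw)
    combined = 0.4 * overlap + 0.6 * novelty_penalty
    return {
        "overlap": overlap,
        "novelty_penalty": novelty_penalty,
        "score": 1 + 4 * combined,  # map [0, 1] onto the 1-5 scale
    }


# Illustrative keyword sets (assumed, not produced by the notebook's tokenizer)
context_kw = {"mitochondria", "powerhouse", "cell", "atp", "energy"}
response_kw = {"mitochondria", "atp", "protein", "synthesis"}

print(score_from_keywords(context_kw, response_kw))
print(score_from_keywords(context_kw, set()))
```

Two of the four response keywords appear in the context, so both components are 0.5 and the score lands at the midpoint of the scale. The second call shows an edge case worth knowing about: an empty response scores about 3.4 rather than 1, because the two early-return values differ.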


## 7. Comprehensive Results Analysis


## 8. Visualizations


In [None]:
# Create comprehensive visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Faithfulness Score by Response Type",
        "Score Distribution",
        "Component Scores Breakdown",
        "Per-Question Faithfulness Heatmap"
    ),
    specs=[
        [{"type": "bar"}, {"type": "box"}],
        [{"type": "bar"}, {"type": "heatmap"}]
    ]
)

colors = {
    "Faithful": "#2ecc71",
    "Partially Faithful": "#f39c12",
    "Unfaithful": "#e74c3c"
}

# Plot 1: Average Faithfulness Score by Response Type
avg_by_type = results_df.groupby('response_type')['faithfulness_score'].mean().reset_index()
avg_by_type = avg_by_type.sort_values('faithfulness_score', ascending=False)

fig.add_trace(
    go.Bar(
        x=avg_by_type['response_type'],
        y=avg_by_type['faithfulness_score'],
        marker_color=[colors.get(t, '#3498db') for t in avg_by_type['response_type']],
        text=avg_by_type['faithfulness_score'].round(2),
        textposition='outside'
    ),
    row=1, col=1
)

# Plot 2: Score Distribution (Box Plot)
for resp_type in ['Faithful', 'Partially Faithful', 'Unfaithful']:
    data = results_df[results_df['response_type'] == resp_type]['faithfulness_score']
    fig.add_trace(
        go.Box(
            y=data,
            name=resp_type,
            marker_color=colors[resp_type]
        ),
        row=1, col=2
    )

# Plot 3: Component Scores (Overlap and Novelty)
component_data = results_df.groupby('response_type')[['overlap', 'novelty_score']].mean().reset_index()

fig.add_trace(
    go.Bar(
        name='Overlap Score',
        x=component_data['response_type'],
        y=component_data['overlap'],
        marker_color='#3498db'
    ),
    row=2, col=1
)

fig.add_trace(
    go.Bar(
        name='Novelty Score',
        x=component_data['response_type'],
        y=component_data['novelty_score'],
        marker_color='#9b59b6'
    ),
    row=2, col=1
)

# Plot 4: Heatmap of per-question scores
pivot_data = results_df.pivot(index='response_type', columns='question_id', values='faithfulness_score')

fig.add_trace(
    go.Heatmap(
        z=pivot_data.values,
        x=[f"Q{i}" for i in pivot_data.columns],
        y=pivot_data.index,
        colorscale='RdYlGn',
        text=np.round(pivot_data.values, 2),
        texttemplate="%{text}",
        textfont={"size": 10}
    ),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title_text="Comprehensive Faithfulness Metric Analysis",
    showlegend=True,
    height=800,
    width=1200,
    barmode='group'
)

fig.update_yaxes(title_text="Score (1-5)", row=1, col=1)
fig.update_yaxes(title_text="Score (1-5)", row=1, col=2)
fig.update_yaxes(title_text="Score (0-1)", row=2, col=1)

fig.show()


In [None]:
# Create a detailed radar chart for comparison
fig_radar = go.Figure()

# Plotly renders "<br>" as a line break in labels; "\n" is not rendered as one
categories = ['Faithfulness<br>Score', 'Keyword<br>Overlap', 'No<br>Hallucination', 'Factual<br>Consistency']

# Calculate scores for radar chart
for resp_type in ['Faithful', 'Partially Faithful', 'Unfaithful']:
    type_data = results_df[results_df['response_type'] == resp_type]
    
    # Normalize faithfulness from the 1-5 scale to 0-1
    faith_norm = (type_data['faithfulness_score'].mean() - 1) / 4
    overlap = type_data['overlap'].mean()
    novelty = type_data['novelty_score'].mean()
    factual = (faith_norm + overlap) / 2  # Combined metric
    
    fig_radar.add_trace(go.Scatterpolar(
        r=[faith_norm, overlap, novelty, factual],
        theta=categories,
        fill='toself',
        name=resp_type,
        marker_color=colors[resp_type]
    ))

fig_radar.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )
    ),
    showlegend=True,
    title="Faithfulness Dimensions Comparison",
    height=500,
    width=700
)

fig_radar.show()


## 9. Production-Ready Faithfulness Evaluation Pipeline


In [None]:
class FaithfulnessEvaluationPipeline:
    """
    Production-ready faithfulness evaluation pipeline with MLflow integration.
    """
    
    def __init__(self, experiment_name="faithfulness-evaluation", 
                 tracking_uri=None, use_llm_judge=True):
        self.experiment_name = experiment_name
        self.use_llm_judge = use_llm_judge
        
        if tracking_uri:
            mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(experiment_name)
        
        # Initialize scorers
        self.rule_scorer = RuleBasedFaithfulnessScorer()
        
        if use_llm_judge:
            # Created for later use; evaluate_model() below logs only rule-based scores
            self.llm_metric = faithfulness(model="openai:/gpt-4o-mini")
    
    def evaluate_model(self,
                       model_name: str,
                       questions: List[str],
                       responses: List[str],
                       contexts: List[str],
                       metadata: Optional[Dict] = None) -> Dict:
        """
        Evaluate a RAG model with comprehensive faithfulness metrics.
        """
        with mlflow.start_run(run_name=model_name):
            # Log metadata
            mlflow.log_param("model_name", model_name)
            mlflow.log_param("num_samples", len(questions))
            mlflow.log_param("use_llm_judge", self.use_llm_judge)
            
            if metadata:
                for key, value in metadata.items():
                    mlflow.log_param(key, value)
            
            # Calculate rule-based scores
            rule_results = self._calculate_rule_based_scores(
                questions, responses, contexts
            )
            
            # Log metrics
            for metric_name, value in rule_results['aggregate'].items():
                mlflow.log_metric(f"rule_{metric_name}", value)
            
            # Log artifacts
            self._log_evaluation_artifacts(
                questions, responses, contexts, rule_results
            )
            
            return rule_results
    
    def _calculate_rule_based_scores(self,
                                     questions: List[str],
                                     responses: List[str],
                                     contexts: List[str]) -> Dict:
        """Calculate rule-based faithfulness scores."""
        results = {
            'individual_scores': [],
            'aggregate': {}
        }
        
        for q, r, c in zip(questions, responses, contexts):
            score_result = self.rule_scorer.score(c, r)
            score_result['question'] = q
            score_result['response'] = r[:100]
            results['individual_scores'].append(score_result)
        
        # Calculate aggregates
        scores = [s['score'] for s in results['individual_scores']]
        results['aggregate'] = {
            'faithfulness_mean': np.mean(scores),
            'faithfulness_std': np.std(scores),
            'faithfulness_min': np.min(scores),
            'faithfulness_max': np.max(scores),
            'overlap_mean': np.mean([s['overlap'] for s in results['individual_scores']]),
            'novelty_mean': np.mean([s['novelty_penalty_score'] for s in results['individual_scores']])
        }
        
        return results
    
    def _log_evaluation_artifacts(self,
                                 questions: List[str],
                                 responses: List[str],
                                 contexts: List[str],
                                 results: Dict):
        """Log evaluation artifacts to MLflow."""
        report = {
            'summary': results['aggregate'],
            'sample_analysis': results['individual_scores'][:10]  # Log first 10
        }
        
        mlflow.log_dict(report, "faithfulness_report.json")
    
    def compare_models(self,
                      models: Dict[str, Tuple[List[str], Dict]],
                      questions: List[str],
                      contexts: List[str]) -> pd.DataFrame:
        """
        Compare multiple RAG models.
        
        Args:
            models: Dict of model_name -> (responses, metadata)
            questions: List of questions
            contexts: List of contexts
        """
        comparison_results = []
        
        for model_name, (responses, metadata) in models.items():
            results = self.evaluate_model(
                model_name, questions, responses, contexts, metadata
            )
            
            row = {
                'Model': model_name,
                'Faithfulness': results['aggregate']['faithfulness_mean'],
                'Std': results['aggregate']['faithfulness_std'],
                'Overlap': results['aggregate']['overlap_mean'],
                'Novelty': results['aggregate']['novelty_mean']
            }
            comparison_results.append(row)
        
        return pd.DataFrame(comparison_results)

# Demonstrate the pipeline
print("\n🚀 Production Pipeline Demonstration")
print("="*60)

pipeline = FaithfulnessEvaluationPipeline(
    experiment_name="faithfulness-production-pipeline",
    use_llm_judge=False  # Set to True if you have API access
)

# Prepare models for comparison
questions = df_rag['question'].tolist()
contexts = df_rag['context'].tolist()

models_to_compare = {
    "RAG-GPT4-Dense": (
        df_rag['faithful_response'].tolist(),
        {"architecture": "rag", "llm": "gpt-4", "retriever": "dense"}
    ),
    "RAG-GPT35-Sparse": (
        df_rag['partially_faithful_response'].tolist(),
        {"architecture": "rag", "llm": "gpt-3.5", "retriever": "sparse"}
    ),
    "Base-LLM-NoRAG": (
        df_rag['unfaithful_response'].tolist(),
        {"architecture": "base", "llm": "gpt-3.5", "retriever": "none"}
    )
}

# Run comparison
comparison_results = pipeline.compare_models(
    models_to_compare, questions, contexts
)

print("\n📊 Production Pipeline Results:")
print("="*70)
print(comparison_results.round(3).to_string(index=False))

# Identify best model
best_model = comparison_results.loc[comparison_results['Faithfulness'].idxmax()]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f"   - Faithfulness Score: {best_model['Faithfulness']:.3f}/5")
print(f"   - Keyword Overlap: {best_model['Overlap']:.3f}")


## 10. Detecting Hallucinations


In [None]:
class HallucinationDetector:
    """
    Detect potential hallucinations in generated responses.
    """
    
    def __init__(self):
        self.scorer = RuleBasedFaithfulnessScorer()
        # Common hallucination indicators
        self.hallucination_patterns = [
            "studies show",
            "research indicates",
            "according to",
            "statistics reveal",
            "experts say",
            "it is estimated",
            "approximately",
            "around",
            "roughly"
        ]
    
    def detect_hallucinations(self, context, response):
        """
        Analyze response for potential hallucinations.
        
        Returns:
            Dict with hallucination analysis
        """
        # Get faithfulness score
        faith_result = self.scorer.score(context, response)
        
        # Extract novel claims (potential hallucinations)
        context_keywords = self.scorer.extract_keywords(context)
        response_keywords = self.scorer.extract_keywords(response)
        novel_keywords = response_keywords - context_keywords
        
        # Check for hallucination patterns
        response_lower = response.lower()
        found_patterns = [
            pattern for pattern in self.hallucination_patterns
            if pattern in response_lower
        ]
        
        # Calculate hallucination risk score
        novelty_risk = len(novel_keywords) / max(len(response_keywords), 1)
        pattern_risk = len(found_patterns) * 0.1
        faith_risk = (5 - faith_result['score']) / 4
        
        overall_risk = min((novelty_risk * 0.4 + pattern_risk + faith_risk * 0.4), 1.0)
        
        return {
            'hallucination_risk': round(overall_risk, 3),
            'risk_level': self._get_risk_level(overall_risk),
            'faithfulness_score': faith_result['score'],
            'novel_claims': list(novel_keywords)[:10],
            'suspicious_patterns': found_patterns,
            'recommendation': self._get_recommendation(overall_risk)
        }
    
    def _get_risk_level(self, risk):
        if risk < 0.2:
            return "LOW"
        elif risk < 0.5:
            return "MEDIUM"
        elif risk < 0.7:
            return "HIGH"
        else:
            return "CRITICAL"
    
    def _get_recommendation(self, risk):
        if risk < 0.2:
            return "Response appears faithful to context. Safe to use."
        elif risk < 0.5:
            return "Minor concerns. Review for accuracy before use."
        elif risk < 0.7:
            return "Significant hallucination risk. Manual verification required."
        else:
            return "Critical hallucination risk. Do not use without substantial revision."

# Demonstrate hallucination detection
detector = HallucinationDetector()

print("\n🔍 Hallucination Detection Analysis")
print("="*70)

for idx, row in df_rag.head(3).iterrows():
    print(f"\n{'='*70}")
    print(f"Question: {row['question']}")
    
    for resp_type, response in [('Faithful', row['faithful_response']), 
                                 ('Unfaithful', row['unfaithful_response'])]:
        result = detector.detect_hallucinations(row['context'], response)
        
        print(f"\n  {resp_type} Response:")
        print(f"    Response: {response[:60]}...")
        print(f"    Risk Level: {result['risk_level']} ({result['hallucination_risk']:.1%})")
        print(f"    Faithfulness: {result['faithfulness_score']}/5")
        if result['novel_claims']:
            print(f"    Novel Claims: {', '.join(result['novel_claims'][:5])}")
        if result['suspicious_patterns']:
            print(f"    Suspicious Patterns: {result['suspicious_patterns']}")
        print(f"    💡 {result['recommendation']}")
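
The weighting used in `detect_hallucinations` can be checked by hand. The sketch below restates the same formula as a standalone function and evaluates it on made-up inputs. The name `hallucination_risk` and the input values are assumptions introduced for illustration; they are not part of the notebook's classes or data.

```python
def hallucination_risk(novel_count, response_keyword_count, pattern_count, faithfulness_score):
    """Recompute the detector's risk score from its four raw inputs."""
    # Share of response keywords that do not appear in the context
    novelty_risk = novel_count / max(response_keyword_count, 1)
    # Each matched indicator phrase adds a flat 0.1
    pattern_risk = pattern_count * 0.1
    # Map a 1-5 faithfulness score onto 1.0 (worst) .. 0.0 (best)
    faith_risk = (5 - faithfulness_score) / 4
    # Pattern risk is unbounded, so the total is capped at 1.0
    return min(novelty_risk * 0.4 + pattern_risk + faith_risk * 0.4, 1.0)


# Illustrative inputs: 3 of 10 response keywords are novel,
# one indicator phrase matched, faithfulness score of 4
risk = hallucination_risk(3, 10, 1, 4)
print(round(risk, 3))  # 0.12 + 0.1 + 0.1 = 0.32

# Worst case on every term exceeds 1.0 before the cap
capped = hallucination_risk(10, 10, 5, 1)
print(capped)  # 1.0
```

With these inputs the score lands in the MEDIUM band (0.2 to 0.5) used by `_get_risk_level`. Because pattern matches are added without a weight, five or more matched phrases alone contribute 0.5, enough to reach the HIGH band once any other risk is present.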


## 11. Best Practices and Recommendations

### Key Takeaways:

1. **Faithfulness Metric Selection**:
   - **LLM-as-Judge**: Most accurate but requires API access
   - **Rule-based**: Fast and deterministic for initial screening
   - **Hybrid**: Combine both for cost-effective evaluation

2. **RAG System Evaluation**:
   - Always evaluate faithfulness to retrieved context
   - Track both precision (relevant info) and hallucination rate
   - Consider multiple retrieval strategies

3. **Hallucination Detection**:
   - Monitor for claims not supported by context
   - Watch for statistical claims and citations
   - Use multiple detection methods for robustness

4. **MLflow Integration Benefits**:
   - Track faithfulness across model versions
   - Compare RAG configurations systematically
   - Log evaluation artifacts for debugging
   - Enable A/B testing for production systems

5. **Production Tips**:
   - Set faithfulness thresholds for automated rejection
   - Implement real-time monitoring in production
   - Create feedback loops for continuous improvement
   - Balance latency vs. evaluation depth
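
The hybrid approach and the threshold-based rejection mentioned above can be combined into one gate: score every response with the cheap method, and call the LLM judge only when the cheap score is ambiguous. The following is a minimal sketch under stated assumptions. `hybrid_faithfulness`, `toy_overlap_scorer`, `stub_judge`, and all threshold values are hypothetical; in practice the cheap scorer would be the notebook's rule-based scorer and the judge would wrap an LLM-based metric.

```python
def hybrid_faithfulness(context, response, cheap_scorer, judge,
                        accept_at=4.5, reject_at=2.0, judge_accept_at=4.0):
    """Screen with a cheap scorer; escalate to the judge only in the ambiguous band."""
    cheap = cheap_scorer(context, response)
    if cheap >= accept_at:
        return {"score": cheap, "source": "rule", "accepted": True}
    if cheap <= reject_at:
        return {"score": cheap, "source": "rule", "accepted": False}
    judged = judge(context, response)
    return {"score": judged, "source": "judge", "accepted": judged >= judge_accept_at}


def toy_overlap_scorer(context, response):
    """Hypothetical stand-in: fraction of response tokens found in the context, scaled to 1-5."""
    ctx = set(context.lower().split())
    resp = set(response.lower().split())
    if not resp:
        return 1.0
    return 1 + 4 * len(resp & ctx) / len(resp)


judge_calls = []

def stub_judge(context, response):
    """Hypothetical stand-in for an LLM judge; records each call and returns a fixed score."""
    judge_calls.append(response)
    return 3.0


context = "mlflow tracks experiments and metrics"
clear_pass = hybrid_faithfulness(context, "mlflow tracks experiments", toy_overlap_scorer, stub_judge)
ambiguous = hybrid_faithfulness(context, "mlflow tracks unicorns", toy_overlap_scorer, stub_judge)
clear_fail = hybrid_faithfulness(context, "purple unicorns dance", toy_overlap_scorer, stub_judge)

for name, result in [("clear_pass", clear_pass), ("ambiguous", ambiguous), ("clear_fail", clear_fail)]:
    print(name, result)
print("judge calls:", len(judge_calls))
```

Only the ambiguous response reaches the judge, so the expensive call runs once for three responses. Widening the gap between `reject_at` and `accept_at` trades cost and latency for evaluation depth.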

### Faithfulness Score Interpretation:
- **5 (Excellent)**: Fully faithful, no hallucinations
- **4 (Good)**: Minor omissions but no fabrications
- **3 (Moderate)**: Some unsupported claims present
- **2 (Poor)**: Significant hallucinations detected
- **1 (Critical)**: Mostly fabricated content
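
The interpretation table can be applied programmatically when reporting scores. The sketch below is one possible mapping; `SCORE_LABELS` and `interpret_score` are names introduced here, and rounding fractional scores down (so an average of 4.7 reads as "Good" rather than "Excellent") is an assumption chosen to keep the labels conservative.

```python
SCORE_LABELS = {
    5: "Excellent",
    4: "Good",
    3: "Moderate",
    2: "Poor",
    1: "Critical",
}


def interpret_score(score):
    """Map a faithfulness score in [1, 5] to its label, rounding fractions down."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return SCORE_LABELS[int(score)]


for s in (5, 4.7, 3, 1):
    print(s, "->", interpret_score(s))
```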


In [None]:
# Final MLflow tracking summary
print("\n📈 MLflow Tracking Summary:")
print("="*60)
print("To view all experiments and metrics in MLflow UI:")
print("\n1. Run in terminal:")
print("   mlflow ui --port 5000")
print("\n2. Open browser:")
print("   http://localhost:5000")
print("\n3. Navigate to experiments:")
print("   - faithfulness-metrics-demo")
print("   - faithfulness-production-pipeline")
print("\n✅ Demo completed successfully!")
print("\n🔗 Additional Resources:")
print("   - MLflow LLM Evaluation: https://mlflow.org/docs/latest/llms/llm-evaluate/index.html")
print("   - Faithfulness in RAG: https://arxiv.org/abs/2307.15992")
print("   - Hallucination Detection: https://arxiv.org/abs/2311.14648")


In [None]:
# Calculate faithfulness scores for all examples using rule-based scorer
def evaluate_all_responses(df, scorer):
    """Evaluate all response types for all questions."""
    results = []
    
    for idx, row in df.iterrows():
        for resp_type in ['faithful_response', 'partially_faithful_response', 'unfaithful_response']:
            score_result = scorer.score(row['context'], row[resp_type])
            results.append({
                'question_id': idx + 1,
                'question': row['question'],
                'response_type': resp_type.replace('_response', '').replace('_', ' ').title(),
                'faithfulness_score': score_result['score'],
                'overlap': score_result['overlap'],
                'novelty_score': score_result['novelty_penalty_score']
            })
    
    return pd.DataFrame(results)

# Generate results
results_df = evaluate_all_responses(df_rag, rule_scorer)

# Create summary statistics
summary = results_df.groupby('response_type').agg({
    'faithfulness_score': ['mean', 'std', 'min', 'max'],
    'overlap': 'mean',
    'novelty_score': 'mean'
}).round(3)

print("\n📊 Faithfulness Score Summary by Response Type:")
print("="*70)
print(summary.to_string())

# Display comparison table
print("\n\n📈 Average Scores Comparison:")
print("="*70)
avg_scores = results_df.groupby('response_type')['faithfulness_score'].mean().sort_values(ascending=False)
for resp_type, score in avg_scores.items():
    bar = '█' * int(score * 4)  # 4 blocks per point, so a score of 5 fills the 20-char column
    print(f"{resp_type:25s}: {bar:20s} {score:.2f}/5")
