# LLM Evaluation Using MLflow Traces and Datasets

## Overview
This notebook demonstrates how to:
1. **Simulate real-world chat interactions** between users and an LLM
2. **Capture traces** of each conversation turn using MLflow
3. **Add expectations** (ground truth annotations) to traces
4. **Build evaluation datasets** from annotated traces
5. **Evaluate LLM responses** systematically using MLflow's evaluation framework

## Why This Approach?
Building datasets from traces allows you to:
- **Capture real interactions**: Record actual user-LLM conversations
- **Evolve your test suite**: Continuously add new test cases from production
- **Evaluate systematically**: Measure performance consistently across iterations
- **Track improvements**: Compare model versions against the same test data

## The Evaluation Loop
```
User Question → LLM Response → Capture Trace → Add Expectations → 
Build Dataset → Run Evaluation → Analyze Results → Iterate
```

## Setup and Imports

**🎯 Configuration-Driven Approach:**
This notebook uses `config.yaml` for all parameters. You can customize:
- Groq models and API settings
- A/B testing configuration
- Sample questions and expected answers
- Evaluation metrics and thresholds
- Visualization settings

**No need to modify notebook code - just edit config.yaml!**

In [None]:
# Install required packages (uncomment if needed)
# !pip install mlflow groq pandas numpy matplotlib seaborn rouge-score nltk python-dotenv scikit-learn
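
Every later cell indexes into `config` directly, so a missing key in `config.yaml` surfaces as a `KeyError` far from its cause. The sketch below shows one way to check the file up front. The helper name `missing_config_keys` and the list of required paths are illustrative assumptions inferred from the keys this notebook reads; they are not part of any library.

```python
from typing import Any

# Dotted paths this notebook reads from config.yaml (inferred from the cells
# below; adjust the list to match your own file).
REQUIRED_PATHS = [
    "groq.models",
    "groq.default_model",
    "groq.api_key_env",
    "ab_testing.enabled",
    "evaluation.metrics",
    "mlflow.tracking_uri",
    "mlflow.experiment_name",
    "sample_questions",
    "expected_answers",
]


def missing_config_keys(config: dict[str, Any], paths: list[str]) -> list[str]:
    """Return the dotted paths that cannot be resolved in a nested dict."""
    missing = []
    for path in paths:
        node: Any = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Hypothetical, deliberately incomplete config used only for illustration.
example_config = {
    "groq": {"models": {}, "default_model": "fast"},
    "mlflow": {"tracking_uri": "file:./mlruns"},
}
print(missing_config_keys(example_config, REQUIRED_PATHS))
```

Calling the helper right after `yaml.safe_load` and raising if the returned list is non-empty would turn a late `KeyError` into a single readable message.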

In [None]:
import mlflow
from mlflow.genai.datasets import create_dataset
import pandas as pd
import numpy as np
import time
from datetime import datetime
import os
from dotenv import load_dotenv
import matplotlib.pyplot as plt
import seaborn as sns
from groq import Groq
import yaml

# For advanced evaluation metrics
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
import nltk

# Download required NLTK data
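# METEOR needs WordNet; check each resource separately so that one cached
# resource does not cause the other download to be skipped
for resource_path, package in [('tokenizers/punkt', 'punkt'), ('corpora/wordnet', 'wordnet')]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package, quiet=True)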

# Load environment variables from .env file
load_dotenv()

# Load configuration from config.yaml
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("📁 Configuration loaded from config.yaml")
print(f"   - Models available: {', '.join(config['groq']['models'].keys())}")
print(f"   - A/B Testing: {'Enabled' if config['ab_testing']['enabled'] else 'Disabled'}")
print(f"   - Evaluation metrics: {len([k for k, v in config['evaluation']['metrics'].items() if v.get('enabled', True)])} enabled")

# Set visualization style from config
sns.set_style(config['visualization']['style'])

# Set MLflow tracking URI from config
mlflow.set_tracking_uri(config['mlflow']['tracking_uri'])

# Create or get experiment from config
experiment_name = config['mlflow']['experiment_name']
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id

print(f"\n📊 MLflow Setup:")
print(f"   - Experiment: {experiment_name}")
print(f"   - Experiment ID: {experiment_id}")
print(f"   - Tracking URI: {mlflow.get_tracking_uri()}")
print(f"\n✓ All imports and configuration loaded successfully")

## Step 1: Initialize Groq API Client

We'll use Groq API for fast LLM inference.

**Configuration:**
- API key is loaded from `.env` file
- Available models are defined in `config.yaml`
- Model parameters (temperature, max_tokens, etc.) are in `config.yaml`

**Setup**: Add your API key to `.env` file:
```
GROQ_API_KEY=your_groq_api_key_here
```

Get your API key from: https://console.groq.com/

**To change models or settings, edit config.yaml instead of code!**

In [None]:
# Initialize Groq client with API key from .env file
# API key environment variable name comes from config
groq_api_key_env = config['groq']['api_key_env']
groq_api_key = os.getenv(groq_api_key_env)

if not groq_api_key:
    raise ValueError(f"{groq_api_key_env} not found in .env file. Please add it.")

groq_client = Groq(api_key=groq_api_key)

# Load available Groq models from config
GROQ_MODELS = {
    key: details['model_id'] 
    for key, details in config['groq']['models'].items()
}

print("✓ Groq client initialized")
print(f"\n🤖 Available models for A/B testing:")
for name, model_id in GROQ_MODELS.items():
    model_info = config['groq']['models'][name]
    print(f"  - {name}: {model_id}")
    print(f"    └─ {model_info['description']}")
    print(f"    └─ Recommended for: {model_info['recommended_for']}")

## Step 2: Groq-Powered Chat Function

This function:
- Accepts user questions
- Calls Groq API to generate responses
- Captures each interaction as an MLflow trace
- Supports multiple models for A/B testing

In [None]:
def groq_llm_response(question: str, model_key: str = None, temperature: float = None) -> tuple:
    """
    Call Groq API to generate response using specified model.
    Uses configuration from config.yaml for default values.
    
    Args:
        question: User's input question
        model_key: Key for model selection from GROQ_MODELS (uses config default if None)
        temperature: Sampling temperature (uses config default if None)
    
    Returns:
        Tuple of (response_text, model_id, tokens_used)
    """
    # Use defaults from config if not provided
    if model_key is None:
        model_key = config['groq']['default_model']
    if temperature is None:
        temperature = config['groq']['default_temperature']
    
    model_id = GROQ_MODELS.get(model_key, GROQ_MODELS[config['groq']['default_model']])
    
    try:
        # Call Groq API with settings from config
        chat_completion = groq_client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": config['groq']['system_prompt']
                },
                {
                    "role": "user",
                    "content": question
                }
            ],
            model=model_id,
            temperature=temperature,
            max_tokens=config['groq']['max_tokens'],
            top_p=config['groq']['top_p'],
            stream=False  # the full response is read in one piece below, so streaming must stay off
        )
        
        response_text = chat_completion.choices[0].message.content
        tokens_used = chat_completion.usage.total_tokens
        
        return response_text, model_id, tokens_used
        
    except Exception as e:
        print(f"Error calling Groq API: {e}")
        return f"Error: {str(e)}", model_id, 0


@mlflow.trace(name="chat_completion", span_type="CHAT_MODEL")
def chat_with_llm(user_question: str, conversation_id: str, model_key: str = None) -> dict:
    """
    Chat with Groq LLM and capture the interaction as an MLflow trace.
    Uses configuration from config.yaml for default model.
    
    Args:
        user_question: The question asked by the user
        conversation_id: Unique identifier for the conversation
        model_key: Model to use (uses config default if None)
    
    Returns:
        Dictionary containing question, answer, and metadata
    """
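    # Resolve the default model up front so the logged metadata never records None
    if model_key is None:
        model_key = config['groq']['default_model']

    # Get response from Groq
    llm_answer, model_id, tokens = groq_llm_response(user_question, model_key)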
    
    # Prepare response with metadata
    response = {
        "question": user_question,
        "answer": llm_answer,
        "conversation_id": conversation_id,
        "timestamp": datetime.now().isoformat(),
        "model_key": model_key,
        "model_id": model_id,
        "tokens_used": tokens
    }
    
    # Log metadata to MLflow
    mlflow.log_param("conversation_id", conversation_id)
    mlflow.log_param("model_key", model_key)
    mlflow.log_param("model_id", model_id)
    mlflow.log_metric("tokens_used", tokens)
    
    return response

print("✓ Groq-powered chat functions defined")

## Step 3: Generate Sample Conversations (A/B Testing)

Let's simulate user-LLM interactions using different models for A/B testing.
We'll test multiple Groq models to compare their performance.
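
The cell below alternates models across questions, so each model answers a different subset of questions. As a rough sketch of the trade-off, the snippet contrasts that round-robin assignment with a paired design in which every model answers every question. Both helper names are hypothetical, and the paired variant costs one API call per question per model.

```python
from itertools import product


def round_robin(questions: list[str], model_keys: list[str]) -> list[tuple[str, str]]:
    """Assign models to questions in rotation (what the next cell does)."""
    if not model_keys:
        raise ValueError("at least one model key is required")
    return [(q, model_keys[i % len(model_keys)]) for i, q in enumerate(questions)]


def paired(questions: list[str], model_keys: list[str]) -> list[tuple[str, str]]:
    """Send every question to every model so scores are directly comparable."""
    return list(product(questions, model_keys))


questions = ["q1", "q2", "q3"]
models = ["model_a", "model_b"]  # hypothetical model keys
print(round_robin(questions, models))
print(paired(questions, models))
```

Round-robin keeps the number of API calls equal to the number of questions. The paired design compares models on identical inputs, which matters when some questions are harder than others.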

In [None]:
# Load sample questions from config
sample_questions = config['sample_questions']

# For A/B testing: load test models from config
ab_testing_enabled = config['ab_testing']['enabled']
test_models = config['ab_testing']['test_models'] if ab_testing_enabled else [config['groq']['default_model']]

# Generate conversations and capture traces
print("Generating chat interactions with A/B testing...\n")
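# Label however many models are configured (avoids an IndexError when A/B testing is disabled)
for position, key in enumerate(test_models):
    print(f"Model {chr(ord('A') + position)}: {GROQ_MODELS[key]}")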
print("="*60 + "\n")

conversations = []
for idx, question in enumerate(sample_questions):
    # Alternate between models for A/B testing
    model_key = test_models[idx % len(test_models)]
    conversation_id = f"conv_{idx+1}"
    
    # Start an MLflow run for this conversation
    with mlflow.start_run(run_name=f"chat_{idx+1}_{model_key}"):
        # Chat with LLM (this creates a trace)
        response = chat_with_llm(question, conversation_id, model_key)
        conversations.append(response)
        
        print(f"[{conversation_id}] Model: {model_key}")
        print(f"Q: {question}")
        print(f"A: {response['answer'][:150]}..." if len(response['answer']) > 150 else f"A: {response['answer']}")
        print(f"Tokens: {response['tokens_used']}\n")

print(f"✓ Generated {len(conversations)} conversations with traces")
print(f"✓ Used {len(test_models)} different models for A/B testing")

## Step 4: Retrieve and Inspect Traces

Now let's retrieve the traces we just created.

In [None]:
# Search for traces in our experiment
traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string="attributes.name = 'chat_completion'",
    max_results=100,
    return_type="list"  # Returns list[Trace] for direct manipulation
)

print(f"Found {len(traces)} traces\n")

# Inspect the first trace
if traces:
    first_trace = traces[0]
    print("Sample Trace Structure:")
    print(f"  Trace ID: {first_trace.info.trace_id}")
    print(f"  Trace Name: {first_trace.info.tags.get('mlflow.traceName', 'N/A')}")  # the name is stored as a tag
    print(f"  Execution Time: {first_trace.info.execution_time_ms}ms")
    print(f"  Status: {first_trace.info.status}")
    print(f"\n  Span Count: {len(first_trace.data.spans)}")

## Step 5: Add Expectations (Ground Truth Annotations)

**Expectations** are the ground truth against which we evaluate the LLM's outputs.
They can be:
- Specific expected text/answers (reference responses)
- Quality metrics (relevance, accuracy scores)
- Boolean flags (contains_citation, is_helpful, etc.)
- Structured evaluation criteria

**Configuration:**
- Expected answers are defined in `config.yaml` under `expected_answers`
- Each answer includes the question, expected response, and quality metrics
- **To add/modify expectations, edit config.yaml!**

This approach allows:
- Easy updates without code changes
- Version control of test data
- Team collaboration on ground truth
- Centralized test case management

In [None]:
def add_expectations_to_trace(trace, expected_answer: str, quality_metrics: dict):
    """
    Add expectations (ground truth) to a trace for evaluation.
    
    Args:
        trace: MLflow Trace object
        expected_answer: The expected/ideal answer
        quality_metrics: Dictionary of quality scores and criteria
    """
    trace_id = trace.info.trace_id
    
    # Log expected answer
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_answer",
        value=expected_answer
    )
    
    # Log quality metrics as structured expectations
    mlflow.log_expectation(
        trace_id=trace_id,
        name="quality_metrics",
        value=quality_metrics
    )


# Load expectations (reference answers and quality metrics) from config
# In a real scenario, these would come from expert review, human annotation, or ground truth data
# Now centrally managed in config.yaml for easy updates
expectations_map = [
    {
        "expected_answer": exp['answer'],
        "quality_metrics": exp['quality_metrics']
    }
    for exp in config['expected_answers']
]

# Add expectations to traces
print("Adding expectations to traces...\n")

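# search_traces returns the newest traces first, while expectations_map follows
# the order of the questions. Keep only the traces from this run and restore
# chronological order so each trace is paired with its own expectation.
# (Assumes sample_questions and expected_answers are listed in the same order.)
recent_traces = sorted(traces, key=lambda t: t.info.timestamp_ms, reverse=True)[:len(conversations)]
ordered_traces = recent_traces[::-1]

annotated_count = 0
for idx, (trace, expectations) in enumerate(zip(ordered_traces, expectations_map)):
    add_expectations_to_trace(
        trace, 
        expectations["expected_answer"],
        expectations["quality_metrics"]
    )
    annotated_count += 1
    print(f"✓ Added expectations to trace {idx+1}")

print(f"\n✓ Added expectations to {annotated_count} traces")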
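
For reference, the sketch below shows what one entry of `expected_answers` might look like after `yaml.safe_load`, along with a small validator. The keys `question`, `answer`, and `quality_metrics` follow how the cell above reads the config. The concrete values and the `validate_expectation` helper are hypothetical.

```python
# Hypothetical entry from config['expected_answers']; values are illustrative.
example_expectation = {
    "question": "What is the capital of Spain?",
    "answer": "The capital of Spain is Madrid.",
    "quality_metrics": {"relevance": 1.0, "accuracy": 1.0},
}


def validate_expectation(entry: dict) -> list[str]:
    """Return a list of problems found in one expected-answer entry."""
    problems = []
    for key in ("question", "answer", "quality_metrics"):
        if key not in entry:
            problems.append(f"missing key: {key}")
    answer = entry.get("answer")
    if answer is not None and not str(answer).strip():
        problems.append("answer is empty")
    metrics = entry.get("quality_metrics", {})
    if not isinstance(metrics, dict):
        problems.append("quality_metrics must be a mapping")
    return problems


print(validate_expectation(example_expectation))  # an empty list means no problems
```

Running a check like this over every entry before annotating traces catches malformed ground truth before it is attached to a trace.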

## Step 5 (continued): Retrieve Annotated Traces

Now let's retrieve the traces along with their expectations.

In [None]:
# Retrieve traces with expectations
annotated_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string="attributes.name = 'chat_completion'",
    max_results=100,
    return_type="list"
)

print(f"Retrieved {len(annotated_traces)} annotated traces\n")

# Display a sample annotated trace
if annotated_traces:
    sample_trace = annotated_traces[0]
    print("Sample Annotated Trace:")
    print(f"  Trace ID: {sample_trace.info.trace_id}")
    
    # Note: Expectations are stored separately and can be retrieved
    # They will be included when we build the dataset

## Step 6: Build Evaluation Dataset from Traces

Now we create an evaluation dataset by merging our annotated traces.
This dataset becomes a reusable test suite for systematic evaluation.
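
Traces are not the only possible source of records. MLflow's dataset documentation also describes merging plain dictionaries with `inputs` and `expectations` keys. Assuming that record shape, the sketch below builds such records by hand from question/answer pairs. `build_records` is a hypothetical helper, and the field name `user_question` simply mirrors the argument of the traced function in this notebook.

```python
def build_records(questions: list[str], expected_answers: list[str]) -> list[dict]:
    """Pair each question with its reference answer as a dataset-style record."""
    if len(questions) != len(expected_answers):
        raise ValueError("questions and expected_answers must be the same length")
    return [
        {
            "inputs": {"user_question": question},
            "expectations": {"expected_answer": answer},
        }
        for question, answer in zip(questions, expected_answers)
    ]


records = build_records(
    ["What is the capital of Spain?"],
    ["The capital of Spain is Madrid."],
)
print(records[0]["inputs"], records[0]["expectations"])
```

Records like these could then be passed to `merge_records` in place of traces. Check the record schema against your MLflow version before relying on it.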

In [None]:
# Create evaluation dataset using config
dataset_name = config['dataset']['name']
dataset_tags = config['dataset']['tags'].copy()
dataset_tags['created_date'] = datetime.now().isoformat()
dataset_tags['models_tested'] = ','.join(test_models)

print(f"Creating evaluation dataset: {dataset_name}...\n")

dataset = create_dataset(
    name=dataset_name,
    experiment_id=[experiment_id],
    tags=dataset_tags
)

# Merge annotated traces into the dataset
# Use only traces that have expectations added
traces_with_expectations = annotated_traces[:len(expectations_map)]
dataset.merge_records(traces_with_expectations)

print(f"✓ Created dataset '{dataset_name}'")
print(f"✓ Merged {len(traces_with_expectations)} traces into dataset")
print(f"\nDataset Info:")
print(f"  Name: {dataset.name}")
print(f"  Tags: {dataset.tags}")

## Step 7: Load and Inspect the Dataset

Let's load our dataset and see what it contains.

In [None]:
# Load the dataset by ID (open-source MLflow looks datasets up by ID, not by name)
loaded_dataset = mlflow.genai.datasets.get_dataset(dataset_id=dataset.dataset_id)

print(f"Dataset: {loaded_dataset.name}\n")

# Convert to pandas DataFrame for easier inspection
try:
    df = loaded_dataset.to_df()
    print(f"Dataset shape: {df.shape}")
    print(f"\nColumns: {df.columns.tolist()}\n")
    print("Sample records:")
    print(df.head())
except Exception as e:
    print(f"Note: {e}")
    print("Dataset structure may vary - inspect using dataset API methods")

## Step 8: Advanced Evaluation Metrics

We'll evaluate LLM responses using industry-standard metrics:

### Metrics Implemented:
1. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
   - ROUGE-1: Unigram overlap
   - ROUGE-2: Bigram overlap
   - ROUGE-L: Longest common subsequence
   - Common in summarization and text generation

2. **BLEU (Bilingual Evaluation Understudy)**
   - Measures n-gram precision
   - Originally for machine translation, now widely used in NLG

3. **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**
   - Considers synonyms and stemming
   - Combines precision and recall, whereas BLEU is precision-based

4. **Custom Metrics**
   - Lexical overlap via word-level Jaccard similarity (reported as `semantic_similarity`)
   - Exact match
   - Length appropriateness
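
To make the overlap metrics concrete, here is a hand-rolled sketch of unigram overlap (the idea behind ROUGE-1) and Jaccard similarity on a toy sentence pair. It is a simplification under stated assumptions: whitespace tokenization, lowercasing, and no stemming, so its numbers need not match the `rouge_score` library exactly.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over clipped unigram counts, the core idea of ROUGE-1."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(prediction: str, reference: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined one."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    union = pred_words | ref_words
    return len(pred_words & ref_words) / len(union) if union else 0.0


prediction = "the capital of spain is madrid"
reference = "madrid is the capital"

# 4 of 6 predicted tokens match (precision 2/3); all 4 reference tokens are
# covered (recall 1.0); F1 = 2 * (2/3) * 1 / (2/3 + 1) = 0.8.
print(round(unigram_f1(prediction, reference), 3))  # 0.8
# Shared words: 4; combined vocabulary: 6; Jaccard = 4/6.
print(round(jaccard(prediction, reference), 3))  # 0.667
```

Note that both scores ignore word order: the reference here is a reordering of part of the prediction and still reaches full recall.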

In [None]:
# Initialize evaluation scorers using config
rouge_config = config['evaluation']['metrics']['rouge']
rouge = rouge_scorer.RougeScorer(
    rouge_config['types'], 
    use_stemmer=rouge_config['use_stemmer']
) if rouge_config['enabled'] else None

bleu_config = config['evaluation']['metrics']['bleu']
smoothing = SmoothingFunction().method1 if bleu_config.get('smoothing', True) else None


def calculate_rouge_scores(prediction: str, reference: str) -> dict:
    """
    Calculate ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L).
    
    ROUGE measures recall-oriented overlap between prediction and reference.
    Higher scores indicate better overlap with the reference text.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        Dictionary with ROUGE-1, ROUGE-2, and ROUGE-L F1 scores
    """
    scores = rouge.score(reference, prediction)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }


def calculate_bleu_score(prediction: str, reference: str) -> float:
    """
    Calculate BLEU score.
    
    BLEU measures precision of n-grams in prediction against reference.
    Score ranges from 0 to 1, where 1 is perfect match.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        BLEU score (0.0 to 1.0)
    """
    reference_tokens = [reference.lower().split()]
    prediction_tokens = prediction.lower().split()
    
    try:
        score = sentence_bleu(
            reference_tokens, 
            prediction_tokens, 
            smoothing_function=smoothing
        )
        return score
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 0.0


def calculate_meteor_score(prediction: str, reference: str) -> float:
    """
    Calculate METEOR score.
    
    METEOR considers synonyms, stemming, and word order.
    More sophisticated than BLEU for semantic similarity.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        METEOR score (0.0 to 1.0)
    """
    try:
        # Tokenize
        reference_tokens = reference.lower().split()
        prediction_tokens = prediction.lower().split()
        score = meteor_score([reference_tokens], prediction_tokens)
        return score
    except Exception:  # e.g. WordNet data missing; avoid a bare except
        return 0.0


def calculate_semantic_similarity(prediction: str, reference: str) -> float:
    """
    Calculate Jaccard similarity (word-level overlap).
    
    Simple but effective measure of lexical similarity.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        Jaccard similarity (0.0 to 1.0)
    """
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    
    if not pred_words or not ref_words:
        return 0.0
    
    intersection = pred_words.intersection(ref_words)
    union = pred_words.union(ref_words)
    
    return len(intersection) / len(union) if union else 0.0


def exact_match_score(prediction: str, reference: str) -> float:
    """
    Check for exact match between prediction and reference.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        1.0 if exact match, 0.0 otherwise
    """
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def evaluate_response(prediction: str, reference: str) -> dict:
    """
    Comprehensive evaluation using all metrics.
    
    Args:
        prediction: Generated text from LLM
        reference: Ground truth/expected text
    
    Returns:
        Dictionary with all evaluation scores
    """
    rouge_scores = calculate_rouge_scores(prediction, reference)
    bleu = calculate_bleu_score(prediction, reference)
    meteor = calculate_meteor_score(prediction, reference)
    semantic_sim = calculate_semantic_similarity(prediction, reference)
    exact_match = exact_match_score(prediction, reference)
    
    return {
        'rouge1': rouge_scores['rouge1'],
        'rouge2': rouge_scores['rouge2'],
        'rougeL': rouge_scores['rougeL'],
        'bleu': bleu,
        'meteor': meteor,
        'semantic_similarity': semantic_sim,
        'exact_match': exact_match,
        'response_length': len(prediction)
    }

print("✓ Advanced evaluation metrics defined")
print("\nMetrics available:")
print("  - ROUGE-1, ROUGE-2, ROUGE-L")
print("  - BLEU")
print("  - METEOR")
print("  - Semantic Similarity (Jaccard)")
print("  - Exact Match")
print("  - Response Length")

## Step 9: Run Comprehensive Evaluation

Let's evaluate all conversations using our advanced metrics and compare models (A/B testing).

In [None]:
# Evaluate each conversation
print("Running comprehensive evaluation...\n")
print("="*80)

evaluation_results = []

for idx, conversation in enumerate(conversations[:len(expectations_map)]):
    expected = expectations_map[idx]
    
    # Get prediction and reference
    prediction = conversation['answer']
    reference = expected['expected_answer']
    
    # Calculate all evaluation metrics
    scores = evaluate_response(prediction, reference)
    
    # Add metadata
    result = {
        'conversation_id': conversation['conversation_id'],
        'question': conversation['question'],
        'model_key': conversation['model_key'],
        'model_id': conversation['model_id'],
        'tokens_used': conversation['tokens_used'],
        'prediction': prediction,
        'reference': reference,
        **scores  # Unpack all metric scores
    }
    
    evaluation_results.append(result)
    
    # Print summary for this conversation
    print(f"\nConversation {idx+1} [{conversation['model_key']}]:")
    print(f"Q: {conversation['question']}")
    print(f"\nMetrics:")
    print(f"  ROUGE-1: {scores['rouge1']:.3f}")
    print(f"  ROUGE-2: {scores['rouge2']:.3f}")
    print(f"  ROUGE-L: {scores['rougeL']:.3f}")
    print(f"  BLEU:    {scores['bleu']:.3f}")
    print(f"  METEOR:  {scores['meteor']:.3f}")
    print(f"  Semantic Similarity: {scores['semantic_similarity']:.3f}")
    print(f"  Tokens Used: {conversation['tokens_used']}")
    print("-" * 80)

# Convert to DataFrame for analysis
results_df = pd.DataFrame(evaluation_results)

print(f"\n✓ Evaluated {len(evaluation_results)} conversations")
print(f"\nResults DataFrame Shape: {results_df.shape}")
print(f"Columns: {results_df.columns.tolist()}")

## Step 10: Statistical Analysis & A/B Testing Comparison

Analyze evaluation results and compare model performance.
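
The comparison logic is simple enough to sketch without pandas: average each metric per model, then pick the model with the highest mean. The scores and model keys below are invented purely for illustration. With only a handful of samples per model, a "winner" is best read as a hint rather than a firm conclusion.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-conversation results; model keys and scores are invented.
toy_results = [
    {"model_key": "model_a", "rouge1": 0.50, "bleu": 0.20},
    {"model_key": "model_b", "rouge1": 0.40, "bleu": 0.30},
    {"model_key": "model_a", "rouge1": 0.70, "bleu": 0.10},
    {"model_key": "model_b", "rouge1": 0.60, "bleu": 0.40},
]


def winners_by_metric(results: list[dict], metrics: list[str]) -> dict[str, str]:
    """Return, for each metric, the model key with the highest mean score."""
    scores = defaultdict(lambda: defaultdict(list))
    for row in results:
        for metric in metrics:
            scores[metric][row["model_key"]].append(row[metric])
    return {
        metric: max(per_model, key=lambda m: mean(per_model[m]))
        for metric, per_model in scores.items()
    }


print(winners_by_metric(toy_results, ["rouge1", "bleu"]))
```

In this toy data the two metrics disagree about the better model, which is why the cell below reports a winner per metric alongside an overall average.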

In [None]:
# Calculate aggregate statistics
print("\n" + "="*80)
print("OVERALL EVALUATION RESULTS")
print("="*80)

# Metric columns for analysis
metric_cols = ['rouge1', 'rouge2', 'rougeL', 'bleu', 'meteor', 'semantic_similarity']

print("\n📊 Overall Performance Metrics:")
print("-" * 80)
for metric in metric_cols:
    mean_val = results_df[metric].mean()
    std_val = results_df[metric].std()
    min_val = results_df[metric].min()
    max_val = results_df[metric].max()
    print(f"{metric.upper():20s} | Mean: {mean_val:.3f} | Std: {std_val:.3f} | Min: {min_val:.3f} | Max: {max_val:.3f}")

# A/B Testing Analysis
print("\n" + "="*80)
print("A/B TESTING: MODEL COMPARISON")
print("="*80)

# Group by model
model_comparison = results_df.groupby('model_key')[metric_cols].agg(['mean', 'std'])

print("\n📈 Performance by Model:")
print(model_comparison.round(3))

# Determine winner for each metric
print("\n🏆 A/B Test Winners (by metric):")
print("-" * 80)
for metric in metric_cols:
    model_means = results_df.groupby('model_key')[metric].mean()
    winner = model_means.idxmax()
    winner_score = model_means.max()
    print(f"{metric.upper():20s} | Winner: {winner:15s} | Score: {winner_score:.3f}")

# Token efficiency comparison
print("\n💰 Token Efficiency:")
print("-" * 80)
token_stats = results_df.groupby('model_key')['tokens_used'].agg(['mean', 'sum', 'min', 'max'])
print(token_stats.round(0))

# Overall recommendation
print("\n💡 Recommendation:")
print("-" * 80)
avg_performance = results_df.groupby('model_key')[metric_cols].mean().mean(axis=1)
best_model = avg_performance.idxmax()
best_score = avg_performance.max()
print(f"Best Overall Model: {best_model}")
print(f"Average Score: {best_score:.3f}")
print(f"\nAll model averages:")
for model, score in avg_performance.items():
    print(f"  {model}: {score:.3f}")

## Step 11: Visualization of Evaluation Results

Create comprehensive visualizations to understand model performance.

In [None]:
# 1. Metric Comparison Across Models (Bar Chart)
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('LLM Evaluation Metrics: Model Comparison (A/B Testing)', fontsize=16, fontweight='bold')

metrics_to_plot = ['rouge1', 'rouge2', 'rougeL', 'bleu', 'meteor', 'semantic_similarity']
titles = ['ROUGE-1 Score', 'ROUGE-2 Score', 'ROUGE-L Score', 'BLEU Score', 'METEOR Score', 'Semantic Similarity']

for idx, (metric, title) in enumerate(zip(metrics_to_plot, titles)):
    ax = axes[idx // 3, idx % 3]
    
    # Group data by model
    model_data = results_df.groupby('model_key')[metric].agg(['mean', 'std']).reset_index()
    
    # Create bar plot
    bars = ax.bar(model_data['model_key'], model_data['mean'], 
                   yerr=model_data['std'], capsize=5, alpha=0.7,
                   color=['#3498db', '#e74c3c', '#2ecc71'][:len(model_data)])
    
    ax.set_title(title, fontweight='bold')
    ax.set_ylabel('Score', fontsize=10)
    ax.set_ylim([0, 1])
    ax.grid(axis='y', alpha=0.3)
    ax.tick_params(axis='x', rotation=15)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('evaluation_metrics_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: evaluation_metrics_comparison.png")

In [None]:
# 2. Radar Chart for Overall Model Comparison
from math import pi

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

# Prepare data
categories = ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BLEU', 'METEOR', 'Semantic Sim']
num_vars = len(categories)

# Compute angle for each axis
angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
angles += angles[:1]

# Plot for each model
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
for idx, model in enumerate(results_df['model_key'].unique()):
    model_data = results_df[results_df['model_key'] == model]
    values = [
        model_data['rouge1'].mean(),
        model_data['rouge2'].mean(),
        model_data['rougeL'].mean(),
        model_data['bleu'].mean(),
        model_data['meteor'].mean(),
        model_data['semantic_similarity'].mean()
    ]
    values += values[:1]  # Complete the circle
    
    ax.plot(angles, values, 'o-', linewidth=2, label=model, color=colors[idx % len(colors)])
    ax.fill(angles, values, alpha=0.15, color=colors[idx % len(colors)])

# Customize plot
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, size=11)
ax.set_ylim(0, 1)
ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_yticklabels(['0.2', '0.4', '0.6', '0.8', '1.0'], size=9)
ax.grid(True, linestyle='--', alpha=0.7)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1), fontsize=11)
ax.set_title('Model Performance Radar Chart\n(A/B Testing Comparison)', 
            size=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.savefig('model_performance_radar.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: model_performance_radar.png")

In [None]:
# 3. Heatmap of Metric Correlations
fig, ax = plt.subplots(figsize=(10, 8))

# Calculate correlation matrix
correlation_matrix = results_df[metric_cols].corr()

# Create heatmap
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            ax=ax)

ax.set_title('Correlation Between Evaluation Metrics', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('metrics_correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: metrics_correlation_heatmap.png")

In [None]:
# 4. Token Usage vs Performance Scatter Plot
fig, ax = plt.subplots(figsize=(12, 6))

# Calculate average performance score
results_df['avg_score'] = results_df[metric_cols].mean(axis=1)

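# Create scatter plot for each model (palette redefined so this cell runs on its own)
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']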
for idx, model in enumerate(results_df['model_key'].unique()):
    model_data = results_df[results_df['model_key'] == model]
    ax.scatter(model_data['tokens_used'], model_data['avg_score'], 
              s=150, alpha=0.7, label=model, color=colors[idx % len(colors)])

ax.set_xlabel('Tokens Used', fontsize=12, fontweight='bold')
ax.set_ylabel('Average Performance Score', fontsize=12, fontweight='bold')
ax.set_title('Token Efficiency vs Performance\n(Lower tokens + Higher score = Better)', 
            fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('token_efficiency_vs_performance.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: token_efficiency_vs_performance.png")

In [None]:
# 5. Box Plot for Metric Distribution by Model
fig, ax = plt.subplots(figsize=(14, 6))

# Prepare data in long format
plot_data = []
for model in results_df['model_key'].unique():
    model_data = results_df[results_df['model_key'] == model]
    for metric in metric_cols:
        for value in model_data[metric]:
            plot_data.append({
                'Model': model,
                'Metric': metric.upper(),
                'Score': value
            })

plot_df = pd.DataFrame(plot_data)

# Create box plot
sns.boxplot(data=plot_df, x='Metric', y='Score', hue='Model', ax=ax, palette='Set2')

ax.set_title('Distribution of Evaluation Metrics by Model', fontsize=14, fontweight='bold')
ax.set_xlabel('Metric', fontsize=12, fontweight='bold')
ax.set_ylabel('Score', fontsize=12, fontweight='bold')
ax.legend(title='Model', fontsize=10)
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=15)

plt.tight_layout()
plt.savefig('metrics_distribution_boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: metrics_distribution_boxplot.png")

print("\n✅ All visualizations created successfully!")
print("\nGenerated files:")
print("  1. evaluation_metrics_comparison.png")
print("  2. model_performance_radar.png")
print("  3. metrics_correlation_heatmap.png")
print("  4. token_efficiency_vs_performance.png")
print("  5. metrics_distribution_boxplot.png")

## Step 12: Continuous Evaluation Loop

Demonstrate how to continuously add new conversations and update the dataset.

In [None]:
def continuous_evaluation_workflow(new_question: str, conversation_id: str):
    """
    Demonstrate the continuous evaluation loop:
    1. User asks question
    2. LLM responds (trace captured)
    3. Add expectations
    4. Update dataset
    5. Re-evaluate
    
    Args:
        new_question: New user question
        conversation_id: Unique conversation identifier
    """
    print(f"\n{'='*60}")
    print("CONTINUOUS EVALUATION WORKFLOW")
    print(f"{'='*60}\n")
    
    # Step 1: User interaction
    print(f"1️⃣  User Question: {new_question}")
    
    # Step 2: Generate response and capture trace
    with mlflow.start_run(run_name=f"continuous_eval_{conversation_id}"):
        response = chat_with_llm(new_question, conversation_id)
        print(f"2️⃣  LLM Response: {response['answer']}")
        print("   ✓ Trace captured")
    
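    # Step 3: Add expectations (in a real scenario, this could be async/manual).
    # Note: this fetches the most recent 'chat_completion' trace in the experiment,
    # which is assumed to be the one produced by the call above.
    new_traces = mlflow.search_traces(
        experiment_ids=[experiment_id],
        filter_string="attributes.name = 'chat_completion'",
        max_results=1,
        return_type="list",
        order_by=["timestamp DESC"]
    )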
    
    if new_traces:
        new_trace = new_traces[0]
        # Simulate adding expectations
        add_expectations_to_trace(
            new_trace,
            expected_answer="Sample expected answer for new question",
            quality_metrics={
                "relevance": 0.9,
                "accuracy": 0.85,
                "completeness": 0.9
            }
        )
        print("3️⃣  Expectations added to trace")
        
        # Step 4: Update dataset
        dataset.merge_records([new_trace])
        print("4️⃣  Dataset updated with new trace")
    
    # Step 5: Ready for evaluation
    print("5️⃣  Ready for re-evaluation with updated dataset")
    print("\n✓ Continuous evaluation workflow complete\n")


# Example: Add a new conversation
continuous_evaluation_workflow(
    "What is the capital of Spain?",
    "conv_new_1"
)

## Step 13: View Evaluation in MLflow UI

You can view all traces, datasets, and evaluations in the MLflow UI.

In [None]:
print("\n" + "="*60)
print("VIEWING RESULTS IN MLFLOW UI")
print("="*60)
print("\nTo view your evaluation results in MLflow UI:")
print("\n1. Run in terminal:")
print("   mlflow ui --port 5000")
print("\n2. Open browser and navigate to:")
print("   http://localhost:5000")
print("\n3. Explore:")
print(f"   - Experiments → {experiment_name}")
print("   - Traces tab to view all chat interactions")
print("   - Datasets tab to view evaluation datasets")
print("   - Compare runs to see performance trends")
print("\n" + "="*60)

## Summary and Key Takeaways

### What We Accomplished:

1. ✅ **Groq API Integration**: Used fast, production-ready LLM inference
2. ✅ **Real Chat Interactions**: Created user-LLM conversations with multiple models
3. ✅ **MLflow Trace Capture**: Recorded each interaction automatically
4. ✅ **Ground Truth Annotations**: Added expectations (reference answers)
5. ✅ **Dataset from Traces**: Built reusable evaluation dataset
6. ✅ **Advanced Metrics**: Implemented ROUGE, BLEU, METEOR, Semantic Similarity
7. ✅ **A/B Testing**: Compared different Groq models systematically
8. ✅ **Comprehensive Visualizations**: 5 different charts for analysis
9. ✅ **Token Efficiency Analysis**: Cost-benefit evaluation
10. ✅ **Continuous Evaluation Loop**: Showed iterative improvement workflow

### Why This Approach Works:

- **Real-World Grounded**: Evaluation data from actual Groq API interactions
- **Established Metrics**: ROUGE, BLEU, and METEOR are widely used reference-based text metrics
- **Data-Driven A/B Testing**: Compare models objectively with multiple metrics
- **Continuous Improvement**: Dataset evolves with each conversation
- **Cost Optimization**: Track token usage and efficiency
- **Visual Insights**: Easy-to-understand charts for stakeholders
- **Version Tracking**: Compare models consistently over time
- **MLflow Integration**: Enterprise-ready tracking and deployment

### Next Steps:

1. **Scale Up**: Test with larger datasets (100+ conversations)
2. **More Models**: Add other Groq models (Mixtral, Gemma) for comparison
3. **Temperature Testing**: A/B test different temperature settings
4. **Custom Metrics**: Add domain-specific evaluation metrics
5. **Human Evaluation**: Integrate human feedback loop
6. **Production Pipeline**: Automate trace collection from live traffic
7. **Alerting**: Set up metric degradation alerts
8. **CI/CD Integration**: Automated testing before deployment
9. **Cost Analysis**: Track API costs vs performance
10. **Model Selection and Fine-tuning**: Use evaluation insights to guide model selection or fine-tuning
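
As a starting point for the alerting and CI/CD items above, a metric gate can compare the current run's mean scores against a stored baseline. The following is a minimal sketch, not part of the notebook's pipeline: the function name `check_metric_regressions`, the `tolerance` parameter, and the example numbers are all hypothetical.

```python
from statistics import mean


def check_metric_regressions(current_scores, baseline_means, tolerance=0.05):
    """Return metrics whose current mean fell more than `tolerance` below baseline.

    current_scores: dict mapping metric name -> list of per-example scores
    baseline_means: dict mapping metric name -> baseline mean score
    """
    regressions = {}
    for metric, baseline in baseline_means.items():
        values = current_scores.get(metric, [])
        if not values:
            continue  # no data for this metric in the current run
        current = mean(values)
        if current < baseline - tolerance:
            regressions[metric] = (baseline, current)
    return regressions


# Illustrative numbers only (assumed, not results from this notebook)
baseline = {"rouge1": 0.60, "bleu": 0.30}
current = {"rouge1": [0.50, 0.52, 0.48], "bleu": [0.29, 0.31, 0.30]}

failed = check_metric_regressions(current, baseline)
for metric, (base, cur) in failed.items():
    print(f"Regression in {metric}: baseline {base:.2f} -> current {cur:.2f}")
```

In a CI job, a non-empty `failed` dictionary could be turned into a non-zero exit code to block deployment. The tolerance would need tuning per metric, since score variance differs between metrics.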

### Resources:

- [MLflow GenAI Datasets](https://mlflow.org/docs/latest/genai/datasets/) - Dataset building from traces
- [MLflow Tracing Guide](https://mlflow.org/docs/latest/genai/tracing/) - Capture LLM interactions
- [MLflow Evaluation Framework](https://mlflow.org/docs/latest/genai/eval-monitor/) - Systematic evaluation
- [Groq API Documentation](https://console.groq.com/docs) - Fast LLM inference
- [ROUGE Score Paper](https://aclanthology.org/W04-1013/) - ROUGE metric details
- [BLEU Score Paper](https://aclanthology.org/P02-1040/) - BLEU metric details

## Additional: A/B Testing Strategies with Groq

### A/B Testing Approaches:

1. **Model Comparison** (implemented above):
   - Test different models (llama-3.1-8b vs llama-3.1-70b)
   - Compare performance, cost, and speed

2. **Temperature Testing**:
   - Same model, different temperatures
   - Evaluate creativity vs consistency

3. **Prompt Engineering**:
   - Test different system prompts
   - Compare instruction formats

4. **Context Window Testing**:
   - Test with varying context lengths
   - Evaluate performance degradation
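
The prompt engineering approach can be sketched the same way as model comparison: run each system prompt over the same test cases and average a score per variant. This is a rough sketch under stated assumptions. `ab_test_prompts`, `token_f1`, and `fake_generate` are hypothetical names; `fake_generate` stands in for a real LLM call, and token-overlap F1 is a simple placeholder for the ROUGE/BLEU metrics used earlier.

```python
from typing import Callable, Dict, List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    remaining = list(ref)
    overlap = 0
    for tok in pred:
        if tok in remaining:
            remaining.remove(tok)  # count each reference token at most once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ab_test_prompts(
    generate: Callable[[str, str], str],
    prompts: Dict[str, str],
    cases: List[dict],
) -> Dict[str, float]:
    """Return the mean token F1 per prompt variant over the same test cases."""
    scores = {}
    for name, system_prompt in prompts.items():
        per_case = [
            token_f1(generate(system_prompt, case["question"]), case["expected"])
            for case in cases
        ]
        scores[name] = sum(per_case) / len(per_case)
    return scores


def fake_generate(system_prompt: str, question: str) -> str:
    """Stand-in for a real LLM call (assumption: replace with your own client)."""
    if "concise" in system_prompt:
        return "Madrid"
    return "The capital of Spain is Madrid"


prompt_scores = ab_test_prompts(
    fake_generate,
    {"concise": "Be concise.", "verbose": "Answer in a full sentence."},
    [{"question": "What is the capital of Spain?", "expected": "Madrid"}],
)
for name, score in prompt_scores.items():
    print(f"{name}: mean token F1 = {score:.3f}")
```

Passing the generator as a callable keeps the comparison independent of any particular client, so the same harness could in principle wrap the notebook's Groq helper.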

In [None]:
# Example: A/B Test with Different Temperatures

def ab_test_temperatures():
    """
    Example of A/B testing with different temperature settings.
    Higher temperature = more creative/random
    Lower temperature = more focused/deterministic
    """
    temperatures = [0.3, 0.7, 1.0]  # Test different creativity levels
    test_question = "Explain artificial intelligence"
    
    results = []
    for temp in temperatures:
        response, model_id, tokens = groq_llm_response(
            test_question, 
            model_key="llama-3.1-8b",
            temperature=temp
        )
        results.append({
            'temperature': temp,
            'response': response,
            'tokens': tokens
        })
        print(f"\nTemperature {temp}:")
        print(response[:200] + "...")
    
    return results

# Uncomment to run temperature A/B test
# temp_results = ab_test_temperatures()

print("✓ A/B testing examples defined")
print("\nTo run temperature test, uncomment: temp_results = ab_test_temperatures()")