# Prompt Versioning and LLM Evaluation with Groq API

This notebook shows how to version prompt templates and evaluate them systematically using MLflow's GenAI evaluation framework, with the Groq API providing LLM inference. It builds a prompt engineering workflow in which every prompt change is measured with quantitative evaluation metrics.

## 🎯 What You'll Learn

This tutorial covers the complete prompt engineering lifecycle:

1. **Setup & Configuration** - MLflow tracking and Groq API integration
2. **Prompt Versioning** - Creating and managing multiple prompt versions
3. **Evaluation Datasets** - Building comprehensive test cases
4. **LLM Integration** - Using Groq API for fast, cost-effective inference
5. **Custom Scorers** - Building both LLM-based and heuristic evaluation metrics
6. **Systematic Evaluation** - Running evaluations and analyzing results
7. **Iterative Improvement** - Data-driven prompt optimization workflow

## 🚀 Key Benefits

- **Cost-Effective**: Uses Groq for inference, which is typically priced below comparable OpenAI API calls (check current pricing for the models you use)
- **Version Control**: Track prompt changes and their performance impact
- **Quantitative Metrics**: Measure improvements with concrete evaluation scores
- **Repeatable Workflow**: Register, evaluate, and compare prompts with the same steps every time
- **Reproducible**: All experiments tracked in MLflow for easy comparison

## 📋 Prerequisites

### Required Packages
```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install --upgrade "mlflow>=3.3" groq
```

### API Keys Required
- **Groq API Key**: Get one from [https://console.groq.com/keys](https://console.groq.com/keys)
- Set your API key as an environment variable:
  ```bash
  export GROQ_API_KEY="your-groq-api-key-here"
  ```

### MLflow Server
Start the MLflow tracking server:
```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
```

## 🔧 System Requirements

- Python 3.9+ (MLflow 3.x no longer supports Python 3.8)
- MLflow 3.3+
- Groq API access
- Internet connection for API calls


## 📚 Step 1: Setup and Configuration

In this step, we'll import the necessary libraries and configure our MLflow tracking server and Groq API client. This foundation enables us to track experiments and make LLM calls efficiently.


In [None]:
# Import required libraries for MLflow tracking and Groq API integration
import os
import json
import mlflow
from groq import Groq
from mlflow.entities import Feedback
from mlflow.genai import scorer
from mlflow.genai.scorers import Guidelines

# Configure MLflow tracking server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llm-evaluation")

# Retrieve and validate Groq API key
GROQ_API_KEY = os.getenv("GROQ_API_KEY", "your-groq-api-key-here")

if GROQ_API_KEY == "your-groq-api-key-here":
    print("⚠️  Please set your GROQ_API_KEY environment variable or update the key in this cell")
    print("Get your API key from: https://console.groq.com/keys")
    raise ValueError("Groq API key not configured. Please set GROQ_API_KEY environment variable.")
else:
    print("✅ Groq API key configured successfully")

# Initialize Groq client for LLM inference
client = Groq(api_key=GROQ_API_KEY)

# Display configuration status
print(f"MLflow tracking URI: {mlflow.get_tracking_uri()}")
print(f"Current experiment: llm-evaluation")
print("🚀 Setup complete! Ready to start prompt engineering workflow.")

## 📝 Step 2: Create Prompt Templates

Now we'll define and version prompt templates for our evaluation. This demonstrates the iterative improvement process where we can track changes and measure their impact on performance.

### 🔄 Prompt Version 1: Basic Q&A Template

Our first version is a simple, straightforward Q&A template without specific formatting instructions. This serves as our baseline for comparison.


In [None]:
# Define the first version of our prompt template
# This is a basic Q&A template without specific formatting instructions
# It serves as our baseline for comparison with improved versions
PROMPT_V1 = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the following question.",
    },
    {
        "role": "user",
        # Use double curly braces {{}} to indicate template variables
        # These will be filled in during evaluation with actual questions
        "content": "Question: {{question}}",
    },
]

# Register the prompt template in MLflow Prompt Registry for version control
# This enables tracking changes, loading specific versions, and comparing performance
mlflow.genai.register_prompt(
    name="qa_prompt",
    template=PROMPT_V1,
    commit_message="Initial basic Q&A prompt - baseline version",
)

print("✅ Prompt V1 registered successfully!")
print("\n📋 Prompt V1 structure:")
for i, msg in enumerate(PROMPT_V1):
    print(f"  {i+1}. {msg['role'].title()}: {msg['content']}")

print("\n💡 This version will serve as our baseline for measuring improvements.")
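
To make the `{{question}}` placeholder concrete, here is a minimal sketch of double-brace substitution over a chat-style template. It is a pure-Python illustration and not MLflow's implementation; `render_chat_template` is a hypothetical helper name introduced only for this example.

```python
import re

# Hypothetical helper: illustrates double-brace substitution only.
# MLflow prompt objects do this through their own format() method.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_chat_template(template, **variables):
    """Return a copy of a chat template with {{name}} placeholders filled in."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing template variable: {name}")
        return str(variables[name])

    return [
        {"role": msg["role"], "content": _PLACEHOLDER.sub(substitute, msg["content"])}
        for msg in template
    ]


example_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question."},
    {"role": "user", "content": "Question: {{question}}"},
]

rendered = render_chat_template(example_template, question="What causes rain?")
print(rendered[1]["content"])
```

Raising on a missing variable, rather than leaving the placeholder in place, makes a misspelled dataset key fail early instead of silently sending `{{question}}` to the model.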

### ⚡ Prompt Version 2: Enhanced Template with Formatting Instructions

Building on our baseline, this version adds specific formatting instructions to encourage more concise and professional responses. We'll measure how these changes impact our evaluation metrics.


In [None]:
# Define an improved version of our prompt template
# This version includes specific formatting instructions for better response quality
# Key improvements: length constraints and professional tone requirements
PROMPT_V2 = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the following question in three sentences or less. Be concise and professional.",
    },
    {
        "role": "user",
        "content": "Question: {{question}}",
    },
]

# Register the improved prompt as a new version in MLflow
# This creates version 2 while preserving version 1 for comparison
mlflow.genai.register_prompt(
    name="qa_prompt",
    template=PROMPT_V2,
    commit_message="Enhanced prompt with formatting instructions for conciseness and professionalism",
)

print("✅ Prompt V2 registered successfully!")
print("\n📋 Prompt V2 structure:")
for i, msg in enumerate(PROMPT_V2):
    print(f"  {i+1}. {msg['role'].title()}: {msg['content']}")

print("\n🔄 Version loading options:")
print("  - prompts:/qa_prompt@latest (always gets the latest version)")
print("  - prompts:/qa_prompt/1 (load specific version 1)")
print("  - prompts:/qa_prompt/2 (load specific version 2)")

print("\n🎯 Key improvements in V2:")
print("  - Added length constraint (3 sentences max)")
print("  - Specified professional tone requirement")
print("  - Enhanced clarity and directness")

## 🧪 Step 3: Create Evaluation Dataset

A robust evaluation dataset is crucial for measuring prompt performance. We'll create comprehensive test cases covering diverse topics and difficulty levels to ensure our prompts work well across different scenarios.

### Dataset Structure

Each test case in our evaluation dataset includes:
- **inputs**: The question to ask the LLM
- **expectations**: Expected key concepts that should be mentioned in the response
- **tags**: Metadata for filtering, analysis, and targeted improvements

### Evaluation Strategy

Our dataset covers:
- **Multiple domains**: Weather, technology, medicine, biology, environment
- **Difficulty levels**: Basic, intermediate, and advanced questions
- **Key concept validation**: Ensuring responses cover expected topics


In [None]:
# Define a comprehensive evaluation dataset with diverse test cases
# Each entry contains inputs (the question), expectations (key concepts), and metadata tags
eval_dataset = [
    {
        "inputs": {"question": "What causes rain?"},
        "expectations": {
            "key_concepts": ["evaporation", "condensation", "precipitation", "water cycle"]
        },
        "tags": {"topic": "weather", "difficulty": "basic"},
    },
    {
        "inputs": {"question": "Explain the difference between AI and ML"},
        "expectations": {
            "key_concepts": ["artificial intelligence", "machine learning", "subset", "algorithms"]
        },
        "tags": {"topic": "technology", "difficulty": "intermediate"},
    },
    {
        "inputs": {"question": "How do vaccines work?"},
        "expectations": {
            "key_concepts": ["immune system", "antibodies", "protection", "antigens"]
        },
        "tags": {"topic": "medicine", "difficulty": "intermediate"},
    },
    {
        "inputs": {"question": "What is photosynthesis?"},
        "expectations": {
            "key_concepts": ["sunlight", "chlorophyll", "carbon dioxide", "oxygen", "glucose"]
        },
        "tags": {"topic": "biology", "difficulty": "basic"},
    },
    {
        "inputs": {"question": "Describe quantum computing"},
        "expectations": {
            "key_concepts": ["quantum", "qubits", "superposition", "entanglement"]
        },
        "tags": {"topic": "technology", "difficulty": "advanced"},
    },
    {
        "inputs": {"question": "What is climate change?"},
        "expectations": {
            "key_concepts": ["greenhouse gases", "global warming", "carbon emissions", "temperature"]
        },
        "tags": {"topic": "environment", "difficulty": "intermediate"},
    },
]

# Display dataset statistics and composition
print(f"✅ Created evaluation dataset with {len(eval_dataset)} test cases")
print("\n📊 Dataset composition:")

# Calculate topic distribution
topic_counts = {}
difficulty_counts = {}

for item in eval_dataset:
    topic = item["tags"]["topic"]
    difficulty = item["tags"]["difficulty"]
    
    topic_counts[topic] = topic_counts.get(topic, 0) + 1
    difficulty_counts[difficulty] = difficulty_counts.get(difficulty, 0) + 1

print("\n🌍 Topics covered:")
for topic, count in topic_counts.items():
    print(f"  - {topic}: {count} cases")

print("\n📈 Difficulty distribution:")
for difficulty, count in difficulty_counts.items():
    print(f"  - {difficulty}: {count} cases")

print(f"\n💡 Total key concepts to evaluate: {sum(len(item['expectations']['key_concepts']) for item in eval_dataset)}")
print("🎯 This diverse dataset will help us measure prompt performance across different domains and complexity levels.")
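
Before running an evaluation, it can help to sanity-check records of the shape described above. The following is a minimal sketch under the assumption that every record uses the `inputs` / `expectations` / `tags` layout from this notebook; `validate_eval_record` and `filter_by_tag` are hypothetical helpers, and the second sample record is deliberately malformed.

```python
def validate_eval_record(record):
    """Return a list of problems found in one evaluation record (empty if valid)."""
    problems = []

    inputs = record.get("inputs")
    if not isinstance(inputs, dict) or not inputs:
        problems.append("'inputs' must be a non-empty dict")

    expectations = record.get("expectations", {})
    if not isinstance(expectations, dict):
        problems.append("'expectations' must be a dict")
    else:
        concepts = expectations.get("key_concepts", [])
        if not isinstance(concepts, list) or not all(isinstance(c, str) for c in concepts):
            problems.append("'expectations.key_concepts' must be a list of strings")

    if not isinstance(record.get("tags", {}), dict):
        problems.append("'tags' must be a dict")

    return problems


def filter_by_tag(records, key, value):
    """Select the records whose tags contain the given key/value pair."""
    return [r for r in records if r.get("tags", {}).get(key) == value]


sample_records = [
    {
        "inputs": {"question": "What causes rain?"},
        "expectations": {"key_concepts": ["evaporation", "condensation"]},
        "tags": {"topic": "weather", "difficulty": "basic"},
    },
    {
        # Deliberately malformed: empty inputs, key_concepts is not a list
        "inputs": {},
        "expectations": {"key_concepts": "qubits"},
        "tags": {"topic": "technology", "difficulty": "advanced"},
    },
]

for index, record in enumerate(sample_records):
    print(index, validate_eval_record(record))
print(len(filter_by_tag(sample_records, "difficulty", "basic")))
```

The same `filter_by_tag` helper can be used to run an evaluation on a single topic or difficulty level when investigating a weak score.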

## 🔮 Step 4: Create Prediction Function with Groq API

The prediction function is the bridge between our prompt templates and the LLM. This function demonstrates how to integrate Groq API with MLflow's prompt registry for efficient, cost-effective inference.

### Function Architecture

Our prediction function performs these key operations:
1. **Load Prompt Template**: Retrieves the latest version from MLflow registry
2. **Format Template**: Fills in template variables with actual questions
3. **Convert Format**: Transforms MLflow format to Groq API format
4. **Generate Response**: Calls Groq API with optimized parameters
5. **Return Result**: Provides response for evaluation

### Key Benefits of Groq Integration

- **⚡ Speed**: Groq's inference stack is built for low-latency responses
- **💰 Cost-Effective**: Typically cheaper per token than comparable OpenAI models (check current pricing)
- **🔄 Consistency**: The same client call and parameters are used for every test case
- **📊 Tracking**: The `@mlflow.trace` decorator records each prediction call as an MLflow trace

**Important**: The function parameter name must match the key in our dataset's `inputs` field (`question`).
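
The name-matching requirement follows from each row's `inputs` dict being handed to the prediction function as keyword arguments. Here is a minimal sketch of that calling convention, using a stand-in function rather than a real model call. `echo_predict` and `call_with_inputs` are hypothetical names, and the code mirrors the convention rather than reproducing MLflow's internals.

```python
def echo_predict(question: str) -> str:
    """Stand-in for a real prediction function; no API call is made."""
    return f"You asked: {question}"


def call_with_inputs(fn, row):
    """Unpack a row's inputs dict into keyword arguments for fn."""
    return fn(**row["inputs"])


good_row = {"inputs": {"question": "What causes rain?"}}
bad_row = {"inputs": {"query": "What causes rain?"}}  # key does not match the parameter name

print(call_with_inputs(echo_predict, good_row))

mismatch_error = None
try:
    call_with_inputs(echo_predict, bad_row)
except TypeError as exc:
    # Python rejects the unknown keyword argument before the function body runs
    mismatch_error = str(exc)
    print(f"Mismatch detected: {mismatch_error}")
```

Renaming either the dataset key or the function parameter so that the two agree resolves the error.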


In [None]:
@mlflow.trace
def predict_fn(question: str) -> str:
    """
    Prediction function that uses Groq API to generate responses.
    
    This function integrates MLflow prompt registry with Groq API for efficient
    LLM inference. It automatically loads the latest prompt version and handles
    format conversion between MLflow and Groq.
    
    Args:
        question (str): The question to ask (must match dataset inputs key)
        
    Returns:
        str: The generated response from Groq API. If prompt loading or the
            API call fails, the exception is caught (not re-raised) and a
            string starting with "Error:" is returned instead.
    """
    try:
        # Load the latest prompt template from MLflow registry
        # Using @latest syntax to always get the most recent version
        prompt = mlflow.genai.load_prompt("prompts:/qa_prompt@latest")
        
        # Format the prompt template with the actual question
        # This replaces {{question}} with the real question from our dataset
        rendered_prompt = prompt.format(question=question)
        
        # Convert MLflow prompt format to Groq API format
        # Groq expects a list of message dictionaries with 'role' and 'content'
        groq_messages = []
        for msg in rendered_prompt:
            groq_messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
        
        # Call Groq API to generate response
        # Using llama-3.3-70b-versatile for high-quality, fast inference
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",  # Fast, capable model
            messages=groq_messages,
            temperature=0.7,      # Balanced creativity vs consistency
            max_tokens=500,       # Reasonable response length
            top_p=1,             # Use full probability distribution
            stream=False,        # Get complete response
        )
        
        return response.choices[0].message.content
        
    except Exception as e:
        print(f"❌ Error in prediction function: {e}")
        return f"Error: {str(e)}"

# Test the prediction function with a sample question
print("🧪 Testing prediction function with sample question...")
test_question = "What causes rain?"
test_response = predict_fn(test_question)
print(f"Question: {test_question}")
print(f"Response: {test_response}")
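# predict_fn returns an "Error: ..." string on failure, so check before declaring success
if test_response.startswith("Error:"):
    print("❌ Prediction function returned an error; fix it before running the evaluation.")
else:
    print("✅ Prediction function working correctly!")
    print("🚀 Ready to run full evaluation on all test cases.")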


## 📊 Step 5: Define Evaluation Scorers

Evaluation scorers are the metrics that measure how well our prompts perform. We'll create both LLM-based and heuristic scorers to comprehensively assess response quality across multiple dimensions.

### 🧠 Custom LLM Scorers Using Groq API

These scorers use Groq models to evaluate qualitative aspects of responses. We've created custom scorers because MLflow's built-in `Guidelines` scorer uses an OpenAI judge model by default outside Databricks, which requires an OpenAI API key unless a different judge model is configured.

#### Scorer Categories

Our custom LLM scorers evaluate:
- **Conciseness**: Brevity and directness of responses
- **Professionalism**: Tone and formality appropriateness
- **Accuracy**: Factual correctness and reliability
- **Helpfulness**: Usefulness and relevance to the question


In [None]:
# Define custom LLM-based scorers using Groq API
# These scorers use Groq models to evaluate qualitative aspects of responses
# Each scorer returns a Feedback object with a score (0-1) and rationale

@scorer
def is_concise(outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate if the response is concise and to the point.
    
    This scorer assesses brevity and directness, penalizing unnecessary details
    while rewarding clear, focused responses that get to the point quickly.
    """
    try:
        evaluation_prompt = [
            {"role": "system", "content": "You are an expert evaluator. Rate responses on a scale of 0-1 where 1 is perfect."},
            {"role": "user", "content": f"""
Evaluate this response for conciseness (brevity and directness):

Response: "{outputs}"

Guidelines: The response should be concise and to the point. Avoid unnecessary details.
Score based on how well the response balances completeness with brevity.

Provide your evaluation in this exact JSON format:
{{"score": 0.85, "rationale": "Brief explanation of your score"}}
"""}
        ]
        
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=evaluation_prompt,
            temperature=0.1,  # Low temperature for consistent evaluation
            max_tokens=200,
        )
        
        result = response.choices[0].message.content
        # Parse the JSON response from the LLM evaluator
        # (json is already imported at the top of the notebook)
        try:
            eval_result = json.loads(result)
            return Feedback(value=eval_result["score"], rationale=eval_result["rationale"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Fallback if the judge did not return the requested JSON
            return Feedback(value=0.5, rationale=f"Could not parse evaluation: {result}")
            
    except Exception as e:
        return Feedback(value=0.0, rationale=f"Evaluation error: {str(e)}")

@scorer
def is_professional(outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate if the response maintains professional tone.
    
    This scorer assesses whether the response uses appropriate language,
    maintains formality, and presents information in a professional manner.
    """
    try:
        evaluation_prompt = [
            {"role": "system", "content": "You are an expert evaluator. Rate responses on a scale of 0-1 where 1 is perfect."},
            {"role": "user", "content": f"""
Evaluate this response for professional tone:

Response: "{outputs}"

Guidelines: The response should be written in a professional, clear, and appropriate tone.
Consider language choice, formality level, and overall presentation.

Provide your evaluation in this exact JSON format:
{{"score": 0.85, "rationale": "Brief explanation of your score"}}
"""}
        ]
        
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=evaluation_prompt,
            temperature=0.1,
            max_tokens=200,
        )
        
        result = response.choices[0].message.content
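        # Parse the judge's JSON reply (json is imported at the top of the notebook)
        try:
            eval_result = json.loads(result)
            return Feedback(value=eval_result["score"], rationale=eval_result["rationale"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return Feedback(value=0.5, rationale=f"Could not parse evaluation: {result}")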
            
    except Exception as e:
        return Feedback(value=0.0, rationale=f"Evaluation error: {str(e)}")

@scorer
def is_accurate(outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate if the response is factually accurate.
    
    This scorer assesses the factual correctness of the response based on
    established knowledge and verifiable information.
    """
    try:
        evaluation_prompt = [
            {"role": "system", "content": "You are an expert evaluator. Rate responses on a scale of 0-1 where 1 is perfect."},
            {"role": "user", "content": f"""
Evaluate this response for factual accuracy:

Response: "{outputs}"

Guidelines: The response should be factually accurate and based on established knowledge.
Consider whether the information presented is correct, verifiable, and reliable.

Provide your evaluation in this exact JSON format:
{{"score": 0.85, "rationale": "Brief explanation of your score"}}
"""}
        ]
        
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=evaluation_prompt,
            temperature=0.1,
            max_tokens=200,
        )
        
        result = response.choices[0].message.content
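        # Parse the judge's JSON reply (json is imported at the top of the notebook)
        try:
            eval_result = json.loads(result)
            return Feedback(value=eval_result["score"], rationale=eval_result["rationale"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return Feedback(value=0.5, rationale=f"Could not parse evaluation: {result}")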
            
    except Exception as e:
        return Feedback(value=0.0, rationale=f"Evaluation error: {str(e)}")

@scorer
def is_helpful(inputs: dict, outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate if the response is helpful and addresses the question.
    
    This scorer assesses whether the response is useful, informative,
    and directly addresses what the user is asking for. The original
    question is passed to the judge so it can check relevance.
    """
    try:
        # The judge needs the question to assess whether the response addresses it
        question = inputs.get("question", "")
        evaluation_prompt = [
            {"role": "system", "content": "You are an expert evaluator. Rate responses on a scale of 0-1 where 1 is perfect."},
            {"role": "user", "content": f"""
Evaluate this response for helpfulness:

Question: "{question}"

Response: "{outputs}"

Guidelines: The response should be helpful, informative, and directly address the user's question.
Consider relevance, completeness, and practical value.

Provide your evaluation in this exact JSON format:
{{"score": 0.85, "rationale": "Brief explanation of your score"}}
"""}
        ]
        
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=evaluation_prompt,
            temperature=0.1,
            max_tokens=200,
        )
        
        result = response.choices[0].message.content
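        # Parse the judge's JSON reply (json is imported at the top of the notebook)
        try:
            eval_result = json.loads(result)
            return Feedback(value=eval_result["score"], rationale=eval_result["rationale"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return Feedback(value=0.5, rationale=f"Could not parse evaluation: {result}")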
            
    except Exception as e:
        return Feedback(value=0.0, rationale=f"Evaluation error: {str(e)}")

# Display summary of custom LLM scorers
print("✅ Custom Groq-based LLM scorers defined:")
print("  - is_concise: Evaluates brevity and directness using Groq")
print("  - is_professional: Evaluates tone and professionalism using Groq") 
print("  - is_accurate: Evaluates factual correctness using Groq")
print("  - is_helpful: Evaluates usefulness and relevance using Groq")
print("\n💡 These scorers use Groq API instead of OpenAI, providing:")
print("  • Faster evaluation with lower latency")
print("  • Reduced costs compared to OpenAI API")
print("  • Consistent performance and reliability")
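
Judge models do not always return bare JSON; replies are sometimes wrapped in a code fence or surrounded by commentary, which sends a plain `json.loads` call straight to the fallback score. Below is a minimal sketch of a more tolerant parser. `parse_judge_output` is a hypothetical helper, not part of MLflow or the Groq client, and it assumes the reply contains at most one JSON object.

```python
import json
import re


def parse_judge_output(text):
    """Extract (score, rationale) from judge text; returns (None, reason) on failure."""
    # Grab everything from the first "{" to the last "}" so that code fences
    # or surrounding commentary do not break parsing
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None, f"No JSON object found in: {text!r}"
    try:
        payload = json.loads(match.group(0))
        score = float(payload["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return None, f"Unparseable judge output ({exc}): {text!r}"
    # Clamp to the 0-1 range the scorers promise
    score = min(1.0, max(0.0, score))
    return score, str(payload.get("rationale", ""))


fence = "`" * 3  # built programmatically so the example contains no literal code fence
fenced_reply = fence + 'json\n{"score": 0.9, "rationale": "Short and direct."}\n' + fence

print(parse_judge_output(fenced_reply))
print(parse_judge_output('{"score": 1.7, "rationale": "Out of range"}'))
print(parse_judge_output("Looks good to me."))
```

A scorer could call this helper and map a `None` score to whatever fallback it prefers, which keeps the parsing policy in one place instead of repeating it in four functions.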


### 🔧 Alternative: Using Guidelines with Custom Model (Optional)

MLflow's built-in `Guidelines` scorer accepts a `model` argument, which recent MLflow releases document in `<provider>:/<model-name>` form. The cell below passes a bare Groq model name to see how your installed version reacts. Whether a Groq-hosted judge can be configured this way depends on your MLflow version and the provider integrations installed.

**Note**: If this attempt fails, the custom scorers above remain a provider-independent fallback.


In [None]:
# Alternative approach: Try to use Guidelines with custom model
# This demonstrates why custom scorers are necessary for non-OpenAI APIs

try:
    # Attempt to create Guidelines with a custom judge model
    # Note: a bare model name may be rejected; MLflow documents judge models
    # in "<provider>:/<model-name>" form
    from mlflow.genai.scorers import Guidelines
    
    # Pass a Groq model name directly to see how this MLflow version reacts
    is_concise_alt = Guidelines(
        name="is_concise_alt", 
        guidelines="The response should be concise and to the point. Avoid unnecessary details.",
        model="llama-3.3-70b-versatile"  # Bare name, not the documented "<provider>:/<model-name>" form
    )
    print("✅ Alternative Guidelines approach worked!")
    print("💡 If this works, you could use Guidelines instead of custom scorers")
    
except Exception as e:
    print(f"❌ Alternative Guidelines approach failed: {e}")
    print("💡 This Guidelines configuration is not usable in the current environment")
    print("✅ Our custom Groq-based scorers remain a workable solution!")
    print("\n🔍 Why custom scorers are better:")
    print("  • Full control over evaluation prompts")
    print("  • Support for any LLM provider")
    print("  • Customizable scoring logic")
    print("  • Better error handling")

# Clean up the test scorer if it was created
try:
    del is_concise_alt
except NameError:
    # The scorer was never created, so there is nothing to remove
    pass

## 💡 Solution Summary: Using Groq Instead of OpenAI for LLM Scorers

### 🚨 The Problem
MLflow's built-in `Guidelines` scorer uses an OpenAI judge model by default outside Databricks, which requires an `OPENAI_API_KEY` environment variable unless another judge model is configured. Relying on that default creates a dependency on OpenAI's API and pricing structure.

### ✅ The Solution
We've created **custom LLM scorers** that use Groq API instead. These scorers provide:

1. **Fast Inference**: Use Groq's optimized infrastructure with `llama-3.3-70b-versatile` model
2. **Comparable Quality**: Evaluation quality depends on the judge model, so spot-check scores against your own judgment before relying on them
3. **Structured Output**: Provide consistent JSON responses for reliable scoring
4. **Robust Error Handling**: Include fallback mechanisms for evaluation failures
5. **Cost Efficiency**: Significantly lower costs than OpenAI API calls

### 🎯 Benefits of Our Custom Approach

| Feature | Custom Groq Scorers | MLflow Guidelines |
|---------|-------------------|-------------------|
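| **API Provider** | ✅ Groq (or any provider you call) | ⚠️ OpenAI judge by default |
| **Speed** | ✅ Groq inference latency | ⚠️ Depends on the judge provider |
| **Cost** | ✅ Groq pricing | ⚠️ Judge provider pricing |
| **Customization** | ✅ Full control over prompt and parsing | ⚠️ Guideline text only |
| **Dependencies** | ✅ `groq` client only | ⚠️ Judge provider credentials |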

### 🛠️ Implementation Advantages

- **🔧 No OpenAI Dependency**: Works entirely with Groq ecosystem
- **⚡ Faster Inference**: Groq's optimized infrastructure delivers sub-second responses
- **💰 Lower Costs**: Groq's competitive pricing reduces evaluation expenses
- **🎛️ Full Customization**: Easy to modify evaluation prompts and models
- **🔍 Better Control**: Fine-tune scoring logic for specific use cases


### 🔍 Custom Heuristic Scorers

These scorers use rule-based logic to evaluate specific aspects like concept coverage and response length. They provide fast, deterministic evaluations that complement our LLM-based scorers.

#### Scorer Types

Our heuristic scorers evaluate:
- **Concept Coverage**: Percentage of expected key concepts mentioned
- **Response Length**: Appropriate length based on specified criteria


In [None]:
# Custom heuristic scorer for concept coverage
# This scorer evaluates how many expected key concepts are mentioned in the response
@scorer
def concept_coverage(outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate the coverage of key concepts in the response.
    
    This scorer performs case-insensitive matching to find expected key concepts
    in the generated response and calculates a coverage percentage score.
    
    Args:
        outputs (str): The generated response text
        expectations (dict): Expected concepts and other criteria
            - key_concepts: List of concepts that should be mentioned
        
    Returns:
        Feedback: Score between 0-1 and detailed rationale
    """
    # Extract expected key concepts from the expectations
    concepts = set(expectations.get("key_concepts", []))
    
    if not concepts:
        return Feedback(
            value=1.0,
            rationale="No key concepts specified to evaluate - scoring as perfect"
        )
    
    # Convert response to lowercase for case-insensitive matching
    response_lower = outputs.lower()
    
    # Find which concepts are mentioned in the response
    included = set()
    for concept in concepts:
        if concept.lower() in response_lower:
            included.add(concept)
    
    # Calculate coverage score as a percentage (0-1)
    coverage_score = len(included) / len(concepts)
    
    # Generate detailed rationale for transparency
    missing_concepts = concepts - included
    rationale = f"Coverage: {len(included)}/{len(concepts)} concepts included"
    # Sort the sets so the rationale text is identical across runs
    if included:
        rationale += f" - Found: {sorted(included)}"
    if missing_concepts:
        rationale += f" - Missing: {sorted(missing_concepts)}"
    
    return Feedback(
        value=coverage_score,
        rationale=rationale
    )

# Custom scorer for response length (conciseness check)
@scorer  
def response_length_check(outputs: str, expectations: dict) -> Feedback:
    """
    Evaluate if the response length is appropriate (not too long or too short).
    
    This scorer checks if the response meets length requirements, penalizing
    responses that are too short (lack detail) or too long (verbose).
    
    Args:
        outputs (str): The generated response text
        expectations (dict): Expected criteria
            - max_length: Maximum acceptable word count (default: 100)
        
    Returns:
        Feedback: Score between 0-1 and rationale
    """
    word_count = len(outputs.split())
    
    # Define ideal length range (adjustable based on use case)
    min_words = 10  # Minimum for substantive response
    max_words = expectations.get("max_length", 100)  # Maximum to avoid verbosity
    
    if word_count < min_words:
        # Score decreases linearly for responses that are too short
        score = word_count / min_words
        rationale = f"Response too short ({word_count} words, minimum {min_words} required)"
    elif word_count > max_words:
        # Score decreases as response gets longer than maximum
        score = max_words / word_count
        rationale = f"Response too long ({word_count} words, maximum {max_words} recommended)"
    else:
        # Perfect score for responses within the ideal range
        score = 1.0
        rationale = f"Response length optimal ({word_count} words within {min_words}-{max_words} range)"
    
    return Feedback(
        value=score,
        rationale=rationale
    )

# Display summary of heuristic scorers
print("✅ Custom heuristic scorers defined:")
print("  - concept_coverage: Evaluates coverage of expected key concepts")
print("  - response_length_check: Evaluates appropriate response length")
print("\n💡 Heuristic scorers provide:")
print("  • Fast, deterministic evaluation")
print("  • Transparent scoring logic")
print("  • Consistent results across runs")
print("  • Complement to LLM-based scorers")
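
The arithmetic behind both heuristic scorers is easy to check by hand. The sketch below restates the same formulas as plain functions with no MLflow dependency; `coverage_fraction` and `length_score` are hypothetical names used only for this worked example.

```python
def coverage_fraction(response, concepts):
    """Fraction of concepts that appear as case-insensitive substrings of the response."""
    if not concepts:
        return 1.0
    text = response.lower()
    found = [c for c in concepts if c.lower() in text]
    return len(found) / len(concepts)


def length_score(word_count, min_words=10, max_words=100):
    """1.0 inside [min_words, max_words]; scaled down proportionally outside it."""
    if word_count < min_words:
        return word_count / min_words
    if word_count > max_words:
        return max_words / word_count
    return 1.0


answer = "Rain forms when evaporation lifts water vapor and condensation builds clouds."
concepts = ["evaporation", "condensation", "precipitation", "water cycle"]

# Two of the four concepts appear verbatim, so coverage is 2 / 4
print(coverage_fraction(answer, concepts))

# 5 words -> 5 / 10, 50 words -> in range, 150 words -> 100 / 150
print(length_score(5), length_score(50), round(length_score(150), 3))
```

Substring matching is strict: an answer that says "water evaporates" gets no credit for "evaporation", so low coverage scores are worth reading alongside the rationale before concluding that a prompt performs poorly.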

## 🚀 Step 6: Run Evaluation

Now we'll run the comprehensive evaluation using our prediction function and all scorers. This will test our current prompt (V2) against all test cases and provide detailed performance metrics.

### Evaluation Process

The evaluation will:
1. **Load Current Prompt**: Use the latest version from MLflow registry
2. **Process Test Cases**: Run each question through our prediction function
3. **Apply All Scorers**: Evaluate responses using both LLM-based and heuristic scorers
4. **Track Results**: Log everything to MLflow for analysis and comparison
5. **Generate Metrics**: Provide comprehensive performance scores


In [None]:
# Run the comprehensive evaluation with all our scorers
print("🚀 Starting comprehensive evaluation...")
print("This will evaluate our current prompt (V2) against all test cases.")
print("The evaluation may take a few minutes depending on the number of test cases.\n")

# Combine all scorers into a single list for comprehensive evaluation
all_scorers = [
    is_concise,           # LLM-based: evaluates brevity and directness
    is_professional,      # LLM-based: evaluates tone and professionalism
    is_accurate,          # LLM-based: evaluates factual correctness
    is_helpful,           # LLM-based: evaluates usefulness and relevance
    concept_coverage,     # Heuristic: evaluates coverage of key concepts
    response_length_check # Heuristic: evaluates appropriate response length
]

print(f"📊 Evaluation configuration:")
print(f"  • Test cases: {len(eval_dataset)}")
print(f"  • Scorers: {len(all_scorers)} (4 LLM-based + 2 heuristic)")
print(f"  • Prompt version: Latest from registry")
print(f"  • Model: llama-3.3-70b-versatile via Groq API")
print("\n🔄 Starting evaluation process...")

# Run the evaluation using MLflow's GenAI evaluation framework
# This will create a new run and log all results automatically
evaluation_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=all_scorers,
)

print("✅ Evaluation completed successfully!")
print(f"📊 Results logged to MLflow run: {evaluation_results.run_id}")
print("\n📈 To view detailed results:")
print("1. Open MLflow UI: http://localhost:5000")
print("2. Navigate to the latest experiment run")
print("3. Click on 'Evaluation Results' to see detailed scores")
print("4. Click on individual test cases to see traces and rationales")
print("5. Compare scores across different evaluation metrics")
print("\n🎯 Next: We'll create a third prompt version to demonstrate iterative improvement!")
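
Once more than one evaluation run exists, aggregate scores can also be compared in code rather than only in the UI. The sketch below works on plain dictionaries. It assumes the object returned by `mlflow.genai.evaluate` exposes aggregate scores as a `metrics` dictionary, which should be verified against your MLflow version. The metric names and numbers are placeholders for illustration, not results from this notebook, and `compare_metrics` is a hypothetical helper.

```python
def compare_metrics(baseline, candidate):
    """Return {metric: (baseline, candidate, delta)} for metrics present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        name: (baseline[name], candidate[name], candidate[name] - baseline[name])
        for name in shared
    }


# Placeholder values only; substitute the metrics dictionaries from two real runs
run_a_metrics = {"is_concise/mean": 0.70, "concept_coverage/mean": 0.60}
run_b_metrics = {"is_concise/mean": 0.80, "concept_coverage/mean": 0.55, "is_helpful/mean": 0.90}

comparison = compare_metrics(run_a_metrics, run_b_metrics)
for name, (old, new, delta) in comparison.items():
    print(f"{name}: {old:.2f} -> {new:.2f} ({delta:+.2f})")
```

Metrics that exist in only one run are skipped, so adding a scorer between runs does not produce misleading deltas.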

## 🔄 Step 7: Iterative Prompt Improvement

Now we'll create a third version of our prompt to demonstrate the iterative improvement process. This version will be even more specific about formatting requirements, and we'll compare the results to show how prompt engineering can systematically improve performance.

### Improvement Strategy

Our V3 prompt will focus on:
- **Precise Length Requirements**: "exactly 2-3 sentences" for consistency
- **Enhanced Clarity**: More specific instructions for response structure
- **Professional Standards**: Reinforced tone and quality expectations
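
A length requirement such as "2-3 sentences" can be checked with a simple heuristic. The sketch below is one possible approach, not part of the notebook's scorers: the names `count_sentences` and `within_sentence_limit` are hypothetical, and the splitting rule is naive (abbreviations such as "e.g." would be counted as sentence ends).

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def within_sentence_limit(text: str, low: int = 2, high: int = 3) -> bool:
    """Return True if the text contains between `low` and `high` sentences."""
    return low <= count_sentences(text) <= high


answer = "MLflow tracks experiments. It also stores prompt versions."
print(count_sentences(answer), within_sentence_limit(answer))
```

A function like this could be wrapped as a custom scorer next to `response_length_check`, for example with the `@scorer` decorator from `mlflow.genai.scorers`, if you want the constraint measured in every evaluation run.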


In [None]:
# Create an even more refined version of our prompt
# This version adds precise length requirements and enhanced clarity instructions
PROMPT_V3 = [
    {
        "role": "system",
        "content": "You are a knowledgeable assistant. Provide clear, concise answers in exactly 2-3 sentences. Focus on the most important information and use professional language.",
    },
    {
        "role": "user",
        "content": "Question: {{question}}",
    },
]

# Register the new version in MLflow prompt registry
# This creates version 3 while preserving versions 1 and 2 for comparison
mlflow.genai.register_prompt(
    name="qa_prompt",
    template=PROMPT_V3,
    commit_message="V3: More specific length and clarity requirements for improved consistency",
)

print("✅ Prompt V3 registered successfully!")
print("\n📋 Prompt V3 structure:")
for i, msg in enumerate(PROMPT_V3):
    print(f"  {i+1}. {msg['role'].title()}: {msg['content']}")

print("\n🎯 Key improvements in V3:")
print("  - Precise length constraint: 'exactly 2-3 sentences'")
print("  - Enhanced clarity: 'Focus on the most important information'")
print("  - Reinforced professionalism: 'knowledgeable assistant'")
print("  - Structured approach: Clear, concise, professional")

print("\n🔄 Running evaluation with V3 prompt...")

# Run evaluation with the new prompt (V3)
# The prediction function will automatically use the latest prompt version
evaluation_results_v3 = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,  # Automatically uses latest prompt (V3)
    scorers=all_scorers,
)

print("✅ V3 evaluation completed successfully!")
print(f"📊 V3 Results logged to MLflow run: {evaluation_results_v3.run_id}")
print("\n🔍 Now you can compare V2 and V3 results in the MLflow UI:")
print("1. Go to the experiment page in MLflow UI")
print("2. Select the V2 and V3 runs to compare")
print("3. View the comparison dashboard to see improvements/degradations")
print("4. Analyze which metrics improved and which may have declined")
print("\n💡 This demonstrates the iterative prompt improvement workflow!")
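
The V2 versus V3 comparison can also be done in code. The sketch below computes per-metric differences between two dictionaries of aggregate scores. It assumes each evaluation result exposes such a dictionary (for example `evaluation_results.metrics` and `evaluation_results_v3.metrics`); the numbers used here are placeholders, not measured results.

```python
from typing import Dict, Mapping


def metric_deltas(baseline: Mapping[str, float], candidate: Mapping[str, float]) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Placeholder numbers for illustration only.
v2_metrics = {"is_concise/mean": 0.5, "is_helpful/mean": 1.0}
v3_metrics = {"is_concise/mean": 0.75, "is_helpful/mean": 0.75}

for name, delta in metric_deltas(v2_metrics, v3_metrics).items():
    direction = "improved" if delta > 0 else "declined" if delta < 0 else "unchanged"
    print(f"{name}: {delta:+.2f} ({direction})")
```

Metrics that appear in only one of the two runs are skipped, so adding a scorer between versions does not break the comparison.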

## 🛠️ Step 8: Advanced Features and Tips

### 📚 Loading Specific Prompt Versions

One of the key benefits of MLflow's prompt registry is the ability to load specific versions for testing, comparison, or rollback scenarios. This enables precise control over which prompt version is used in different environments.
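
One way to get per-environment control is to pin each environment to a prompt URI. The sketch below is a minimal illustration: the `PROMPT_URIS` mapping, the `APP_ENV` variable, and the `prompt_uri_for` helper are all hypothetical names, and the version assigned to each environment is arbitrary.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from deployment environment to a pinned prompt URI.
PROMPT_URIS = {
    "dev": "prompts:/qa_prompt@latest",
    "staging": "prompts:/qa_prompt/3",
    "prod": "prompts:/qa_prompt/2",
}


def prompt_uri_for(env: Optional[str] = None, uris: Mapping[str, str] = PROMPT_URIS) -> str:
    """Return the prompt URI for an environment, defaulting to APP_ENV or 'dev'."""
    env = env or os.environ.get("APP_ENV", "dev")
    try:
        return uris[env]
    except KeyError:
        raise ValueError(f"Unknown environment {env!r}; expected one of {sorted(uris)}") from None


print(prompt_uri_for("prod"))  # prints prompts:/qa_prompt/2
```

The returned URI can then be passed to `mlflow.genai.load_prompt`, so a rollback amounts to changing one entry in the mapping.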


In [None]:
# Example: Load specific prompt versions for testing and comparison
print("📋 Loading available prompt versions:")

# Load different versions using MLflow's prompt registry
prompt_v1 = mlflow.genai.load_prompt("prompts:/qa_prompt/1")
prompt_v2 = mlflow.genai.load_prompt("prompts:/qa_prompt/2") 
prompt_latest = mlflow.genai.load_prompt("prompts:/qa_prompt@latest")

print("✅ Successfully loaded all prompt versions")
# For chat prompts, .template holds the list of message dictionaries
print(f"  - V1: {len(prompt_v1.template)} messages")
print(f"  - V2: {len(prompt_v2.template)} messages")
print(f"  - Latest (V3): {len(prompt_latest.template)} messages")

print("\n🔍 Version comparison capabilities:")
print("  • Load specific versions for A/B testing")
print("  • Create environment-specific prompts (dev/staging/prod)")
print("  • Implement rollback strategies")
print("  • Compare prompt performance across versions")

# Create a custom prediction function for a specific version
@mlflow.trace
def predict_fn_v1(question: str) -> str:
    """
    Prediction function using prompt V1 specifically.
    
    This demonstrates how to create version-specific prediction functions
    so that each prompt version can be evaluated in its own run and compared.
    """
    prompt = mlflow.genai.load_prompt("prompts:/qa_prompt/1")
    rendered_prompt = prompt.format(question=question)
    
    # Keep only the fields the Groq chat API expects
    groq_messages = [
        {"role": msg["role"], "content": msg["content"]}
        for msg in rendered_prompt
    ]
    
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=groq_messages,
        temperature=0.7,
        max_tokens=500,
    )
    
    return response.choices[0].message.content

print("\n🔄 Advanced comparison workflow:")
print("1. Create separate prediction functions for each version")
print("2. Run evaluations with different functions")
print("3. Compare results in MLflow UI side-by-side")
print("4. Identify which version performs best for specific metrics")
print("5. Implement the best-performing version in production")

print("\n💡 Pro tip: To compare several versions, run one evaluation per version")
print("   and compare the resulting runs side by side in the MLflow UI!")
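
Writing one prediction function per version leads to duplicated code. A small factory can build them instead. The sketch below is one possible design, not the notebook's method: `make_predict_fn` is a hypothetical helper that receives the prompt loader and the chat call as arguments, and `FakePrompt` and `fake_chat` are stand-ins so the example runs without MLflow or Groq.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, str]


def make_predict_fn(
    prompt_uri: str,
    load_prompt: Callable[[str], Any],
    chat: Callable[[List[Message]], str],
) -> Callable[[str], str]:
    """Build a prediction function bound to a single prompt URI."""

    def predict_fn(question: str) -> str:
        prompt = load_prompt(prompt_uri)             # e.g. mlflow.genai.load_prompt
        messages = prompt.format(question=question)  # list of role/content dictionaries
        return chat(messages)                        # e.g. a wrapper around the Groq client

    return predict_fn


# Stand-ins so the sketch runs on its own; replace them in the notebook.
class FakePrompt:
    def format(self, question: str) -> List[Message]:
        return [{"role": "user", "content": f"Question: {question}"}]


def fake_chat(messages: List[Message]) -> str:
    return f"echo: {messages[-1]['content']}"


predict_v1 = make_predict_fn("prompts:/qa_prompt/1", lambda uri: FakePrompt(), fake_chat)
print(predict_v1(question="What is MLflow?"))  # prints echo: Question: What is MLflow?
```

In the notebook, `load_prompt` would be `mlflow.genai.load_prompt` and `chat` a short function that calls `client.chat.completions.create(...)` and returns `response.choices[0].message.content`. The returned function can still be wrapped with `mlflow.trace` if tracing is needed.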

## 🎉 Summary and Next Steps

### 🏆 What We've Accomplished

This tutorial walked through a complete prompt engineering workflow:

1. **✅ Setup & Configuration**: MLflow tracking and Groq API integration
2. **✅ Prompt Versioning**: Created and managed multiple prompt versions (V1, V2, V3)
3. **✅ Evaluation Dataset**: Built comprehensive test cases across diverse domains
4. **✅ LLM Integration**: Implemented prediction functions using Groq API
5. **✅ Custom Scorers**: Defined both LLM-based and heuristic evaluation metrics
6. **✅ Systematic Evaluation**: Ran evaluations and demonstrated iterative improvement

### 🎯 Key Benefits of This Approach

| Benefit | Description |
|---------|-------------|
| **🔄 Systematic Iteration** | Track prompt changes and measure their impact on performance |
| **📊 Quantitative Evaluation** | Measure improvements with concrete, comparable metrics |
| **🎯 Targeted Improvements** | Identify specific areas for enhancement based on data |
| **📈 Performance Tracking** | Monitor trends and performance over time |
| **🔍 Detailed Analysis** | Deep dive into individual test cases and evaluation traces |
| **💰 Cost Efficiency** | Use Groq's fast, affordable API instead of expensive alternatives |
| **🛡️ Version Control** | Maintain prompt history and enable rollback capabilities |

### 🚀 Next Steps for Further Development

#### Immediate Improvements
1. **📊 Expand Evaluation Dataset**: Add more diverse questions across different domains
2. **🎯 Domain-Specific Scorers**: Create specialized evaluation metrics for your use case
3. **🤖 Model Experimentation**: Test different Groq models and parameters
4. **🔄 A/B Testing Framework**: Implement systematic prompt variation testing

#### Advanced Features
1. **⚡ Automated Pipelines**: Set up continuous evaluation and monitoring
2. **📈 Performance Dashboards**: Create real-time monitoring of prompt performance
3. **🎛️ Hyperparameter Tuning**: Optimize temperature, max_tokens, and other parameters
4. **🔗 Multi-Model Evaluation**: Compare performance across different LLM providers

#### Production Considerations
1. **🏭 Deployment Strategies**: Implement prompt versioning in production environments
2. **📊 Monitoring & Alerting**: Set up alerts for performance degradation
3. **🔄 Rollback Mechanisms**: Implement automated rollback for poor-performing prompts
4. **📝 Documentation**: Maintain comprehensive documentation of prompt evolution


### 🔧 Troubleshooting Tips

**Common Issues and Solutions:**

#### 1. **🔑 Groq API Key Issues**
```bash
# Set the API key in the shell that launches Jupyter
export GROQ_API_KEY="your-actual-api-key"
# Or set os.environ["GROQ_API_KEY"] in a notebook cell before creating the client
```

**Symptoms**: Authentication errors, "API key not configured" messages

**Solution**: Verify your API key is valid and properly set as an environment variable
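
The same check can be made from Python before any API call is attempted. This is a minimal sketch; `require_groq_api_key` is a hypothetical helper name, and it only verifies that the variable is present and non-empty, not that the key is valid.

```python
import os
from typing import Mapping


def require_groq_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Groq API key from the environment or raise a clear error."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Export it in your shell or set it "
            "in os.environ before creating the Groq client."
        )
    return key


try:
    require_groq_api_key()
    print("GROQ_API_KEY found")
except RuntimeError as err:
    print(err)
```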

#### 2. **🔗 MLflow Connection Issues**
```bash
# Start MLflow server if not running
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
```

**Symptoms**: Connection refused errors, "Failed to log span" warnings

**Solution**: Ensure the MLflow server is running on the specified port
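
A quick way to confirm that something is listening on the tracking port is a plain TCP connection test. The sketch below uses only the standard library; `is_server_reachable` is a hypothetical helper, and a successful connection shows that the port is open, not that the process behind it is MLflow.

```python
import socket


def is_server_reachable(host: str = "localhost", port: int = 5000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_server_reachable():
    print("A server is accepting connections on localhost:5000")
else:
    print("Nothing is listening on localhost:5000; start the MLflow server first")
```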

#### 3. **⏱️ Evaluation Taking Too Long**
```python
# Reduce dataset size for faster testing
small_dataset = eval_dataset[:3]  # Use only first 3 test cases
```

**Symptoms**: Long evaluation times, timeout errors

**Solution**: Start with smaller datasets for testing, then scale up
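
Taking the first few cases is the simplest option, but if the dataset is grouped by topic, a random subset may be more representative. The sketch below draws a reproducible sample; `sample_cases` is a hypothetical helper, and the example records are placeholders that only imitate the shape of an evaluation dataset.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_cases(cases: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k cases chosen at random, reproducibly for a given seed."""
    k = min(k, len(cases))
    return random.Random(seed).sample(list(cases), k)


# Placeholder records for illustration only.
example_dataset = [{"inputs": {"question": f"Question {i}"}} for i in range(10)]
small_sample = sample_cases(example_dataset, k=3)
print(len(small_sample))  # prints 3
```

Fixing the seed keeps the subset stable between runs, so scores from successive prompt versions remain comparable.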

#### 4. **💾 Memory Issues with Large Datasets**
```python
# Process evaluations in batches
# Note: each evaluate() call logs its own MLflow run, so results are split across runs
batch_size = 5
for i in range(0, len(eval_dataset), batch_size):
    batch = eval_dataset[i:i+batch_size]
    mlflow.genai.evaluate(data=batch, predict_fn=predict_fn, scorers=all_scorers)
```

**Symptoms**: Memory errors, system slowdown

**Solution**: Process evaluations in smaller batches
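
When evaluations run in batches, the aggregate scores arrive per batch and need to be recombined. A plain average of batch means is biased when the last batch is smaller, so the sketch below weights each mean by its batch size. It assumes you have collected a `(batch_size, mean_scores)` pair for every batch; `combine_batch_means` is a hypothetical helper and the numbers are placeholders.

```python
from typing import Dict, List, Mapping, Tuple


def combine_batch_means(batches: List[Tuple[int, Mapping[str, float]]]) -> Dict[str, float]:
    """Combine per-batch mean scores into overall means, weighted by batch size."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for size, means in batches:
        if size <= 0:
            continue  # an empty batch contributes nothing
        for name, mean in means.items():
            totals[name] = totals.get(name, 0.0) + mean * size
            counts[name] = counts.get(name, 0) + size
    return {name: totals[name] / counts[name] for name in totals}


# Placeholder values: one batch of 5 cases and a final batch of 3.
print(combine_batch_means([(5, {"is_concise/mean": 0.8}), (3, {"is_concise/mean": 1.0})]))
```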

#### 5. **🔄 Prompt Loading Issues**
```python
# Check if prompt exists before loading
try:
    prompt = mlflow.genai.load_prompt("prompts:/qa_prompt@latest")
except Exception as e:
    print(f"Prompt loading failed: {e}")
```

**Symptoms**: "Prompt not found" errors

**Solution**: Ensure prompts are registered before trying to load them
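
If a missing version should not stop the workflow, the loader can fall back to an earlier version. The sketch below tries a list of URIs in order; `load_first_available` is a hypothetical helper that takes the loading function as an argument, and `fake_load` is a stand-in so the example runs without a registry.

```python
from typing import Any, Callable, Iterable, List, Tuple


def load_first_available(load: Callable[[str], Any], uris: Iterable[str]) -> Tuple[str, Any]:
    """Try each URI in order and return (uri, prompt) for the first one that loads."""
    errors: List[str] = []
    for uri in uris:
        try:
            return uri, load(uri)
        except Exception as err:  # the exact exception type depends on the loader
            errors.append(f"{uri}: {err}")
    raise LookupError("No prompt could be loaded:\n" + "\n".join(errors))


# Stand-in loader: only version 2 "exists" in this illustration.
def fake_load(uri: str) -> str:
    if uri != "prompts:/qa_prompt/2":
        raise KeyError(uri)
    return "template for " + uri


uri, prompt = load_first_available(fake_load, ["prompts:/qa_prompt/3", "prompts:/qa_prompt/2"])
print(uri)  # prints prompts:/qa_prompt/2
```

In the notebook, `mlflow.genai.load_prompt` would be passed as `load`. Log which URI was actually used, since a silent fallback can hide a registration problem.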

### 📚 Resources and Documentation

| Resource | Description | URL |
|----------|-------------|-----|
| **MLflow GenAI Docs** | Official MLflow GenAI documentation | https://mlflow.org/docs/latest/genai/ |
| **Groq API Docs** | Groq API reference and guides | https://console.groq.com/docs/ |
| **Prompt Registry** | MLflow prompt registry documentation | https://mlflow.org/docs/latest/genai/prompt-registry/ |
| **Evaluation Framework** | MLflow evaluation framework guide | https://mlflow.org/docs/latest/genai/eval-monitor/ |
| **Groq Models** | Available models and capabilities | https://console.groq.com/docs/models |

### 🎯 Best Practices

1. **🔒 Security**: Never commit API keys to version control
2. **📊 Monitoring**: Set up alerts for evaluation failures
3. **🔄 Versioning**: Write a descriptive commit message for every prompt change, since the registry assigns sequential version numbers
4. **📝 Documentation**: Document prompt evolution and performance changes
5. **🧪 Testing**: Always test with small datasets before full evaluation
