# Advanced Evaluation Using Rubrics

## Overview

This notebook demonstrates **two powerful evaluation approaches** for production LLM systems:

1. **Rubric-Based Evaluation** - Multi-dimensional scoring (accuracy, completeness, relevance)
2. **Reference-Based Evaluation** - Compare against expert "gold standard" answers

## Why Multiple Evaluation Methods?

### Different Quality Dimensions

| Method | Measures | Use Case |
|--------|----------|----------|
| **Rubric** | Multiple criteria (accuracy, tone, completeness) | Holistic quality assessment |
| **Reference** | Similarity to expert answer | Consistency validation |
| **Binary (F)** | Pass/fail on specific criteria | Quality gates |

### Production Strategy: Layer Evaluations

```
Response → Binary Check (F) → Rubric Scoring (I) → Reference Comparison (I) → Send to User
```

Each layer catches different failure modes.
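
The layering above can be sketched as a short-circuiting pipeline. This is a minimal illustration, not the notebook's implementation: `run_layers`, `binary_check`, and `rubric_check` are hypothetical names, and the toy checks inspect strings where a real layer would call an LLM judge.

```python
from typing import Callable, List, Tuple

# A layer takes the response text and returns (passed, reason).
Layer = Callable[[str], Tuple[bool, str]]

def run_layers(response: str, layers: List[Tuple[str, Layer]]) -> Tuple[bool, List[str]]:
    """Run layers in order and stop at the first failure, so later layers are skipped."""
    trail = []
    for name, check in layers:
        passed, reason = check(response)
        trail.append(f"{name}: {'pass' if passed else 'fail'} ({reason})")
        if not passed:
            return False, trail
    return True, trail

# Toy stand-ins for the binary and rubric layers (hypothetical checks).
def binary_check(text: str) -> Tuple[bool, str]:
    return (len(text.strip()) > 0, "non-empty response")

def rubric_check(text: str) -> Tuple[bool, str]:
    return ("$" in text, "mentions a price")

ok, trail = run_layers(
    "The SmartX ProPhone costs $899.99.",
    [("binary", binary_check), ("rubric", rubric_check)],
)
print(ok)
for entry in trail:
    print(entry)
```

Ordering the layers from cheapest to most expensive means a response that fails early never incurs the cost of the later judges.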

## Key Concepts

### Rubric Evaluation
Ask an LLM to score the response on multiple dimensions:
- Uses context (product data) as source of truth
- Checks factual accuracy, completeness, hallucinations
- Outputs structured assessment

### Reference Evaluation
Compare the response to a pre-written "ideal" answer:
- Detects: subset, superset, exact match, conflict
- Useful when you have gold standard responses
- Validates consistency across agents/models

In [None]:
import os
import openai
import json
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [None]:
client = openai.OpenAI()

def get_completion_from_messages(
    messages,
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=500,
):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

In [None]:
# product information
products = {
    "TechPro Ultrabook": {
        "name": "TechPro Ultrabook",
        "category": "Computers and Laptops",
        "brand": "TechPro",
        "model_number": "TP-UB100",
        "warranty": "1 year",
        "rating": 4.5,
        "features": ["13.3-inch display", "8GB RAM", "256GB SSD", "Intel Core i5 processor"],
        "description": "A sleek and lightweight ultrabook for everyday use.",
        "price": 799.99
    },
    "BlueWave Gaming Laptop": {
        "name": "BlueWave Gaming Laptop",
        "category": "Computers and Laptops",
        "brand": "BlueWave",
        "model_number": "BW-GL200",
        "warranty": "2 years",
        "rating": 4.7,
        "features": ["15.6-inch display", "16GB RAM", "512GB SSD", "NVIDIA GeForce RTX 3060"],
        "description": "A high-performance gaming laptop for an immersive experience.",
        "price": 1199.99
    },
    "PowerLite Convertible": {
        "name": "PowerLite Convertible",
        "category": "Computers and Laptops",
        "brand": "PowerLite",
        "model_number": "PL-CV300",
        "warranty": "1 year",
        "rating": 4.3,
        "features": ["14-inch touchscreen", "8GB RAM", "256GB SSD", "360-degree hinge"],
        "description": "A versatile convertible laptop with a responsive touchscreen.",
        "price": 699.99
    },
    "TechPro Desktop": {
        "name": "TechPro Desktop",
        "category": "Computers and Laptops",
        "brand": "TechPro",
        "model_number": "TP-DT500",
        "warranty": "1 year",
        "rating": 4.4,
        "features": ["Intel Core i7 processor", "16GB RAM", "1TB HDD", "NVIDIA GeForce GTX 1660"],
        "description": "A powerful desktop computer for work and play.",
        "price": 999.99
    },
    "BlueWave Chromebook": {
        "name": "BlueWave Chromebook",
        "category": "Computers and Laptops",
        "brand": "BlueWave",
        "model_number": "BW-CB100",
        "warranty": "1 year",
        "rating": 4.1,
        "features": ["11.6-inch display", "4GB RAM", "32GB eMMC", "Chrome OS"],
        "description": "A compact and affordable Chromebook for everyday tasks.",
        "price": 249.99
    },
    "SmartX ProPhone": {
        "name": "SmartX ProPhone",
        "category": "Smartphones and Accessories",
        "brand": "SmartX",
        "model_number": "SX-PP10",
        "warranty": "1 year",
        "rating": 4.6,
        "features": ["6.1-inch display", "128GB storage", "12MP dual camera", "5G"],
        "description": "A powerful smartphone with advanced camera features.",
        "price": 899.99
    },
    "MobiTech PowerCase": {
        "name": "MobiTech PowerCase",
        "category": "Smartphones and Accessories",
        "brand": "MobiTech",
        "model_number": "MT-PC20",
        "warranty": "1 year",
        "rating": 4.3,
        "features": ["5000mAh battery", "Wireless charging", "Compatible with SmartX ProPhone"],
        "description": "A protective case with built-in battery for extended usage.",
        "price": 59.99
    },
    "SmartX MiniPhone": {
        "name": "SmartX MiniPhone",
        "category": "Smartphones and Accessories",
        "brand": "SmartX",
        "model_number": "SX-MP5",
        "warranty": "1 year",
        "rating": 4.2,
        "features": ["4.7-inch display", "64GB storage", "8MP camera", "4G"],
        "description": "A compact and affordable smartphone for basic tasks.",
        "price": 399.99
    },
    "MobiTech Wireless Charger": {
        "name": "MobiTech Wireless Charger",
        "category": "Smartphones and Accessories",
        "brand": "MobiTech",
        "model_number": "MT-WC10",
        "warranty": "1 year",
        "rating": 4.5,
        "features": ["10W fast charging", "Qi-compatible", "LED indicator", "Compact design"],
        "description": "A convenient wireless charger for a clutter-free workspace.",
        "price": 29.99
    },
    "SmartX EarBuds": {
        "name": "SmartX EarBuds",
        "category": "Smartphones and Accessories",
        "brand": "SmartX",
        "model_number": "SX-EB20",
        "warranty": "1 year",
        "rating": 4.4,
        "features": ["True wireless", "Bluetooth 5.0", "Touch controls", "24-hour battery life"],
        "description": "Experience true wireless freedom with these comfortable earbuds.",
        "price": 99.99
    },
    "CineView 4K TV": {
        "name": "CineView 4K TV",
        "category": "Televisions and Home Theater Systems",
        "brand": "CineView",
        "model_number": "CV-4K55",
        "warranty": "2 years",
        "rating": 4.8,
        "features": ["55-inch display", "4K resolution", "HDR", "Smart TV"],
        "description": "A stunning 4K TV with vibrant colors and smart features.",
        "price": 599.99
    },
    "SoundMax Home Theater": {
        "name": "SoundMax Home Theater",
        "category": "Televisions and Home Theater Systems",
        "brand": "SoundMax",
        "model_number": "SM-HT100",
        "warranty": "1 year",
        "rating": 4.4,
        "features": ["5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth"],
        "description": "A powerful home theater system for an immersive audio experience.",
        "price": 399.99
    },
    "CineView 8K TV": {
        "name": "CineView 8K TV",
        "category": "Televisions and Home Theater Systems",
        "brand": "CineView",
        "model_number": "CV-8K65",
        "warranty": "2 years",
        "rating": 4.9,
        "features": ["65-inch display", "8K resolution", "HDR", "Smart TV"],
        "description": "Experience the future of television with this stunning 8K TV.",
        "price": 2999.99
    },
    "SoundMax Soundbar": {
        "name": "SoundMax Soundbar",
        "category": "Televisions and Home Theater Systems",
        "brand": "SoundMax",
        "model_number": "SM-SB50",
        "warranty": "1 year",
        "rating": 4.3,
        "features": ["2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth"],
        "description": "Upgrade your TV's audio with this sleek and powerful soundbar.",
        "price": 199.99
    },
    "CineView OLED TV": {
        "name": "CineView OLED TV",
        "category": "Televisions and Home Theater Systems",
        "brand": "CineView",
        "model_number": "CV-OLED55",
        "warranty": "2 years",
        "rating": 4.7,
        "features": ["55-inch display", "4K resolution", "HDR", "Smart TV"],
        "description": "Experience true blacks and vibrant colors with this OLED TV.",
        "price": 1499.99
    },
    "GameSphere X": {
        "name": "GameSphere X",
        "category": "Gaming Consoles and Accessories",
        "brand": "GameSphere",
        "model_number": "GS-X",
        "warranty": "1 year",
        "rating": 4.9,
        "features": ["4K gaming", "1TB storage", "Backward compatibility", "Online multiplayer"],
        "description": "A next-generation gaming console for the ultimate gaming experience.",
        "price": 499.99
    },
    "ProGamer Controller": {
        "name": "ProGamer Controller",
        "category": "Gaming Consoles and Accessories",
        "brand": "ProGamer",
        "model_number": "PG-C100",
        "warranty": "1 year",
        "rating": 4.2,
        "features": ["Ergonomic design", "Customizable buttons", "Wireless", "Rechargeable battery"],
        "description": "A high-quality gaming controller for precision and comfort.",
        "price": 59.99
    },
    "GameSphere Y": {
        "name": "GameSphere Y",
        "category": "Gaming Consoles and Accessories",
        "brand": "GameSphere",
        "model_number": "GS-Y",
        "warranty": "1 year",
        "rating": 4.8,
        "features": ["4K gaming", "500GB storage", "Backward compatibility", "Online multiplayer"],
        "description": "A compact gaming console with powerful performance.",
        "price": 399.99
    },
    "ProGamer Racing Wheel": {
        "name": "ProGamer Racing Wheel",
        "category": "Gaming Consoles and Accessories",
        "brand": "ProGamer",
        "model_number": "PG-RW200",
        "warranty": "1 year",
        "rating": 4.5,
        "features": ["Force feedback", "Adjustable pedals", "Paddle shifters", "Compatible with GameSphere X"],
        "description": "Enhance your racing games with this realistic racing wheel.",
        "price": 249.99
    },
    "GameSphere VR Headset": {
        "name": "GameSphere VR Headset",
        "category": "Gaming Consoles and Accessories",
        "brand": "GameSphere",
        "model_number": "GS-VR",
        "warranty": "1 year",
        "rating": 4.6,
        "features": ["Immersive VR experience", "Built-in headphones", "Adjustable headband", "Compatible with GameSphere X"],
        "description": "Step into the world of virtual reality with this comfortable VR headset.",
        "price": 299.99
    },
    "AudioPhonic Noise-Canceling Headphones": {
        "name": "AudioPhonic Noise-Canceling Headphones",
        "category": "Audio Equipment",
        "brand": "AudioPhonic",
        "model_number": "AP-NC100",
        "warranty": "1 year",
        "rating": 4.6,
        "features": ["Active noise-canceling", "Bluetooth", "20-hour battery life", "Comfortable fit"],
        "description": "Experience immersive sound with these noise-canceling headphones.",
        "price": 199.99
    },
    "WaveSound Bluetooth Speaker": {
        "name": "WaveSound Bluetooth Speaker",
        "category": "Audio Equipment",
        "brand": "WaveSound",
        "model_number": "WS-BS50",
        "warranty": "1 year",
        "rating": 4.5,
        "features": ["Portable", "10-hour battery life", "Water-resistant", "Built-in microphone"],
        "description": "A compact and versatile Bluetooth speaker for music on the go.",
        "price": 49.99
    },
    "AudioPhonic True Wireless Earbuds": {
        "name": "AudioPhonic True Wireless Earbuds",
        "category": "Audio Equipment",
        "brand": "AudioPhonic",
        "model_number": "AP-TW20",
        "warranty": "1 year",
        "rating": 4.4,
        "features": ["True wireless", "Bluetooth 5.0", "Touch controls", "18-hour battery life"],
        "description": "Enjoy music without wires with these comfortable true wireless earbuds.",
        "price": 79.99
    },
    "WaveSound Soundbar": {
        "name": "WaveSound Soundbar",
        "category": "Audio Equipment",
        "brand": "WaveSound",
        "model_number": "WS-SB40",
        "warranty": "1 year",
        "rating": 4.3,
        "features": ["2.0 channel", "80W output", "Bluetooth", "Wall-mountable"],
        "description": "Upgrade your TV's audio with this slim and powerful soundbar.",
        "price": 99.99
    },
    "AudioPhonic Turntable": {
        "name": "AudioPhonic Turntable",
        "category": "Audio Equipment",
        "brand": "AudioPhonic",
        "model_number": "AP-TT10",
        "warranty": "1 year",
        "rating": 4.2,
        "features": ["3-speed", "Built-in speakers", "Bluetooth", "USB recording"],
        "description": "Rediscover your vinyl collection with this modern turntable.",
        "price": 149.99
    },
    "FotoSnap DSLR Camera": {
        "name": "FotoSnap DSLR Camera",
        "category": "Cameras and Camcorders",
        "brand": "FotoSnap",
        "model_number": "FS-DSLR200",
        "warranty": "1 year",
        "rating": 4.7,
        "features": ["24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses"],
        "description": "Capture stunning photos and videos with this versatile DSLR camera.",
        "price": 599.99
    },
    "ActionCam 4K": {
        "name": "ActionCam 4K",
        "category": "Cameras and Camcorders",
        "brand": "ActionCam",
        "model_number": "AC-4K",
        "warranty": "1 year",
        "rating": 4.4,
        "features": ["4K video", "Waterproof", "Image stabilization", "Wi-Fi"],
        "description": "Record your adventures with this rugged and compact 4K action camera.",
        "price": 299.99
    },
    "FotoSnap Mirrorless Camera": {
        "name": "FotoSnap Mirrorless Camera",
        "category": "Cameras and Camcorders",
        "brand": "FotoSnap",
        "model_number": "FS-ML100",
        "warranty": "1 year",
        "rating": 4.6,
        "features": ["20.1MP sensor", "4K video", "3-inch touchscreen", "Interchangeable lenses"],
        "description": "A compact and lightweight mirrorless camera with advanced features.",
        "price": 799.99
    },
    "ZoomMaster Camcorder": {
        "name": "ZoomMaster Camcorder",
        "category": "Cameras and Camcorders",
        "brand": "ZoomMaster",
        "model_number": "ZM-CM50",
        "warranty": "1 year",
        "rating": 4.3,
        "features": ["1080p video", "30x optical zoom", "3-inch LCD", "Image stabilization"],
        "description": "Capture life's moments with this easy-to-use camcorder.",
        "price": 249.99
    },
    "FotoSnap Instant Camera": {
        "name": "FotoSnap Instant Camera",
        "category": "Cameras and Camcorders",
        "brand": "FotoSnap",
        "model_number": "FS-IC10",
        "warranty": "1 year",
        "rating": 4.1,
        "features": ["Instant prints", "Built-in flash", "Selfie mirror", "Battery-powered"],
        "description": "Create instant memories with this fun and portable instant camera.",
        "price": 69.99
    }
}

def get_products_and_category():
    """Generate a dictionary where keys are categories and values are lists of product names."""
    products_and_category = {}
    for product_name, product_info in products.items():
        category = product_info["category"]
        if category not in products_and_category:
            products_and_category[category] = []
        products_and_category[category].append(product_name)
    return products_and_category

def get_product_by_name(name):
    return products.get(name, None)

def get_products_by_category(category):
    return [product for product in products.values() if product["category"] == category]

def read_string_to_list(input_string):
    if input_string is None:
        return None
    try:
        input_string = input_string.replace("'", "\"")  # Replace single quotes with double quotes for valid JSON
        data = json.loads(input_string)
        return data
    except json.JSONDecodeError:
        print("Error: Invalid JSON string")
        return None

def find_category_and_product_only(user_input, system_message):
    delimiter = "####"
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},
    ]
    return get_completion_from_messages(messages)

def get_products_from_query(user_input):
    """Extract products and categories from user query."""
    delimiter = "####"
    system_message = f"""
You will be provided with customer service queries. \
The customer service query will be delimited with \
{delimiter} characters.
Output a python list of objects, where each object has \
the following format:
    'category': <one of Computers and Laptops, \
    Smartphones and Accessories, \
    Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, 
    Audio Equipment, Cameras and Camcorders>,
OR
    'products': <a list of products that must \
    be found in the allowed products below>

Where the categories and products must be found in \
the customer service query.
If a product is mentioned, it must be associated with \
the correct category in the allowed products list below.
If no products or categories are found, output an \
empty list.

Allowed products: 

Computers and Laptops category:
TechPro Ultrabook
BlueWave Gaming Laptop
PowerLite Convertible
TechPro Desktop
BlueWave Chromebook

Smartphones and Accessories category:
SmartX ProPhone
MobiTech PowerCase
SmartX MiniPhone
MobiTech Wireless Charger
SmartX EarBuds

Televisions and Home Theater Systems category:
CineView 4K TV
SoundMax Home Theater
CineView 8K TV
SoundMax Soundbar
CineView OLED TV

Gaming Consoles and Accessories category:
GameSphere X
ProGamer Controller
GameSphere Y
ProGamer Racing Wheel
GameSphere VR Headset

Audio Equipment category:
AudioPhonic Noise-Canceling Headphones
WaveSound Bluetooth Speaker
AudioPhonic True Wireless Earbuds
WaveSound Soundbar
AudioPhonic Turntable

Cameras and Camcorders category:
FotoSnap DSLR Camera
ActionCam 4K
FotoSnap Mirrorless Camera
ZoomMaster Camcorder
FotoSnap Instant Camera

Only output the list of objects, with nothing else.
"""
    return find_category_and_product_only(user_input, system_message)

def generate_output_string(data_list):
    output_string = ""
    if data_list is None:
        return output_string
    for data in data_list:
        try:
            if "products" in data:
                products_list = data["products"]
                for product_name in products_list:
                    product = get_product_by_name(product_name)
                    if product:
                        output_string += json.dumps(product, indent=4) + "\n"
                    else:
                        print(f"Error: Product '{product_name}' not found")
            elif "category" in data:
                category_name = data["category"]
                category_products = get_products_by_category(category_name)
                for product in category_products:
                    output_string += json.dumps(product, indent=4) + "\n"
            else:
                print("Error: Invalid object format")
        except Exception as e:
            print(f"Error: {e}")
    return output_string

def get_mentioned_product_info(category_and_product_list):
    """Get product information for mentioned products."""
    return generate_output_string(category_and_product_list)

def answer_user_msg(user_msg, product_info):
    """Generate an answer to the user message using product information."""
    delimiter = "```"
    system_message = f"""
You are a customer service assistant for a large electronics store. \
Respond in a friendly and helpful tone, with concise answers. \
Make sure to ask the user relevant follow-up questions.
"""
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': f"{delimiter}{user_msg}{delimiter}"},
        {'role': 'assistant', 'content': f"Relevant product information:\n{product_info}"}
    ]
    return get_completion_from_messages(messages, max_tokens=500)

In [None]:
customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

products_by_category = get_products_from_query(customer_msg)
category_and_product_list = read_string_to_list(products_by_category)
product_info = get_mentioned_product_info(category_and_product_list)
assistant_answer = answer_user_msg(user_msg=customer_msg,
                                   product_info=product_info)

In [None]:
print(assistant_answer) 

---

## Method 1: Rubric-Based Evaluation

### Evaluation Criteria

This rubric checks:
1. **Context Grounding** - Response uses only provided product data
2. **No Hallucinations** - No invented information
3. **Factual Agreement** - No contradictions with source data
4. **Question Coverage** - All user questions answered
5. **Completeness Score** - How many questions addressed

### Output Format
```
- Is response based only on context? Y/N
- Does it include info not in context? Y/N
- Any disagreements with context? Y/N
- Question count: N
- Question 1 answered: Y/N
- Question 2 answered: Y/N
- ...
- Total questions addressed: N
```

This structured output enables programmatic quality gates.
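
As one way to turn that output into a gate, the sketch below parses the `- label? Y` and `- label: 3` lines into a dictionary. It assumes the judge emits exactly the labels listed in the format above, which the prompt does not guarantee; `parse_rubric_output` and `passes_gate` are hypothetical helpers.

```python
import re

# Matches "- <label>? Y", "- <label>: N", or "- <label>: 3".
LINE = re.compile(r"^-?\s*(?P<label>.+?)[?:]\s*(?P<value>Y|N|\d+)\s*$")

def parse_rubric_output(text: str) -> dict:
    """Map each label to a bool (Y/N) or an int (counts); skip lines that do not match."""
    parsed = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        value = match.group("value")
        parsed[match.group("label").strip()] = int(value) if value.isdigit() else value == "Y"
    return parsed

def passes_gate(parsed: dict) -> bool:
    """Pass only if grounded, free of extra info and conflicts, and every question is addressed.

    Missing fields default to the failing value, so an unparseable reply fails the gate.
    """
    asked = parsed.get("Question count", 0)
    return bool(
        parsed.get("Is response based only on context", False)
        and not parsed.get("Does it include info not in context", True)
        and not parsed.get("Any disagreements with context", True)
        and asked > 0
        and parsed.get("Total questions addressed", 0) == asked
    )

sample = """\
- Is response based only on context? Y
- Does it include info not in context? N
- Any disagreements with context? N
- Question count: 3
- Question 1 answered: Y
- Question 2 answered: Y
- Question 3 answered: Y
- Total questions addressed: 3
"""
parsed = parse_rubric_output(sample)
print(passes_gate(parsed))
```

Defaulting missing fields to the failing value is a deliberate choice here: a judge reply that drifts from the expected format is treated as a failure and not silently passed.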

In [None]:
cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}


**Expected Output** (for a good response):
```
- Based only on context? Y
- Includes extra info? N
- Disagreements? N
- Questions asked: 3
- Question 1 (SmartX ProPhone): Y
- Question 2 (FotoSnap DSLR): Y
- Question 3 (TVs): Y
- Questions addressed: 3/3
```

---

## Method 2: Reference-Based Evaluation

Instead of judging quality directly, **compare to an expert answer**.

### Test Set with Ideal Answer

Contains:
- User query
- **Ideal answer** (hand-crafted by expert)
- LLM's answer

### Evaluation Categories

- **(A) Subset** - LLM answer contains less info (incomplete but correct)
- **(B) Superset** - LLM answer contains more info (over-delivers)
- **(C) Exact Match** - Same details as ideal
- **(D) Conflict** - Factual disagreement with ideal
- **(E) Style Difference** - Different wording but same facts
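
A judge asked for a single letter may still reply with `(D)`, `D.`, or surrounding whitespace. The sketch below normalizes such replies before mapping them to pass or fail; `normalize_grade` and `is_acceptable` are hypothetical helpers, and treating an unparseable reply as a failure is an assumption, not something the categories above prescribe.

```python
import re
from typing import Optional

ACCEPTABLE = {"A", "B", "C", "E"}  # every category except (D) Conflict

def normalize_grade(raw: str) -> Optional[str]:
    """Return the first standalone uppercase A-E letter in the reply, or None."""
    match = re.search(r"\b([A-E])\b", raw)
    return match.group(1) if match else None

def is_acceptable(raw: str) -> bool:
    """True for A, B, C, or E; False for D or for a reply with no recognizable grade."""
    return normalize_grade(raw) in ACCEPTABLE

for reply in ["D", "(D)", " B\n", "E.", "no grade given"]:
    print(repr(reply), normalize_grade(reply), is_acceptable(reply))
```

Lowercase letters are ignored on purpose, so that an ordinary word such as "a" in a longer reply is not mistaken for grade A.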

In [None]:
def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

In [None]:
evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

In [None]:
test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs \


All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display \
and 4K resolution. It's priced at 599.99.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both of which include Bluetooth. \
The SoundMax Home Theater is a powerful home theater system for \
an immersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
Its features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Do you have any additional questions about the products \
mentioned here?
Or do you have other questions I can help you with?
    """
}

**Expected**: `A`, `B`, `C`, or `E` (all acceptable)  
**Failure**: `D` (factual conflict with expert answer)

---

## Test Case 2: Nonsensical Response

Testing with an obviously wrong answer to verify that the evaluator catches failures.

In [None]:
def eval_vs_ideal(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response.
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

**Expected**: `D` (complete disagreement)

"Life is like a box of chocolates" conflicts with the ideal product-focused answer.

---

## Production Implementation: Combined Evaluation

### Complete Quality Pipeline

```python
def comprehensive_evaluation(user_query, llm_response, product_context, ideal_answer=None):
    """
    Multi-method evaluation for production quality assurance.
    Returns: (quality_score: float, should_send: bool, reasons: list)
    """
    results = {
        "rubric_pass": False,
        "reference_grade": None,
        "issues": []
    }
    
    # Method 1: Rubric evaluation
    rubric_result = eval_with_rubric({
        'customer_msg': user_query,
        'context': product_context
    }, llm_response)
    
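    # Parse rubric (Y/N answers).
    # NOTE: these substring checks assume the judge echoes the short labels from
    # the "Expected Output" example. The prompt in eval_with_rubric does not
    # enforce that format, so adapt the markers to the output you actually observe.
    if "Based only on context? N" in rubric_result:
        results["issues"].append("Hallucination detected")
    if "Disagreements? Y" in rubric_result:
        results["issues"].append("Factual error")
    if "Questions addressed: 0" in rubric_result:
        results["issues"].append("Question not answered")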
    
    results["rubric_pass"] = len(results["issues"]) == 0
    
    # Method 2: Reference comparison (if available)
    if ideal_answer:
        grade = eval_vs_ideal({
            'customer_msg': user_query,
            'ideal_answer': ideal_answer
        }, llm_response)
        
        # Normalize replies such as " D", "(D)" or "D." to a bare letter
        grade = grade.strip().strip("().")
        results["reference_grade"] = grade
        
        # Grade D is failure (conflict with ideal)
        if grade == 'D':
            results["issues"].append("Conflicts with ideal answer")
    
    # Decision logic
    quality_score = 1.0 if results["rubric_pass"] else 0.0
    should_send = results["rubric_pass"] and (
        not ideal_answer or results["reference_grade"] != 'D'
    )
    
    return quality_score, should_send, results["issues"]

# Usage
score, should_send, issues = comprehensive_evaluation(
    user_query, 
    llm_response, 
    product_data, 
    ideal_answer=expert_answer
)

if should_send:
    send_to_user(llm_response)
else:
    log_quality_failure(issues)
    regenerate_or_fallback()
```

### When to Use Each Method

| Scenario | Rubric | Reference | Binary (F) |
|----------|--------|-----------|-----------|
| **No gold standard** | ✅ Yes | ❌ No | ✅ Yes |
| **Have expert answers** | ✅ Yes | ✅ Yes | ✅ Yes |
| **Quick pass/fail** | ❌ No | ❌ No | ✅ Yes |
| **Detailed scoring** | ✅ Yes | ❌ No | ❌ No |
| **Consistency check** | ❌ No | ✅ Yes | ❌ No |

### Cost Considerations

**Per Evaluation**:
- Binary (F): ~50 tokens = $0.00003
- Rubric (I): ~600 tokens = $0.00036
- Reference (I): ~700 tokens = $0.00042

**Strategy**: 
1. Run binary check first (cheap)
2. If passes, run rubric (detailed)
3. Optionally run reference (consistency)

**At 10K evals/day**:
- Binary only: $0.30/day
- Binary + Rubric: $3.90/day
- All three: $8.10/day
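
The daily figures follow from the per-evaluation token counts. The sketch below recomputes them, assuming the flat rate implied by the numbers above ($0.00003 for 50 tokens, i.e. $0.60 per million tokens) and assuming every response runs through every selected method; `daily_cost` is a hypothetical helper.

```python
# Rate implied by the per-evaluation figures: $0.00003 / 50 tokens.
PRICE_PER_TOKEN = 0.00003 / 50

# Approximate tokens per evaluation, taken from the list above.
TOKENS = {"binary": 50, "rubric": 600, "reference": 700}

def daily_cost(methods, evals_per_day=10_000):
    """Dollar cost per day when every evaluation runs all of the given methods."""
    tokens_per_eval = sum(TOKENS[m] for m in methods)
    return round(tokens_per_eval * PRICE_PER_TOKEN * evals_per_day, 2)

for combo in (["binary"], ["binary", "rubric"], ["binary", "rubric", "reference"]):
    print(" + ".join(combo), daily_cost(combo))
```

In a layered setup the real cost is lower, because responses that fail the binary check never reach the rubric or reference stages.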

---

## Advanced Rubric Patterns

### Weighted Scoring

```python
def weighted_rubric_evaluation(response, context):
    rubric = {
        "factual_accuracy": {"weight": 0.4, "description": "Uses only provided data"},
        "completeness": {"weight": 0.3, "description": "Answers all questions"},
        "relevance": {"weight": 0.2, "description": "On-topic, no tangents"},
        "tone": {"weight": 0.1, "description": "Friendly and helpful"}
    }
    
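    system_message = f"""
    Evaluate on 1-5 scale for each criterion:
    {json.dumps(rubric, indent=2)}
    
    Output JSON: {{"factual_accuracy": 5, "completeness": 4, ...}}
    """
    
    # Give the judge both the context and the response to be scored
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': f"[Context]: {context}\n[Submission]: {response}"}
    ]
    scores = get_completion_from_messages(messages)
    scores_dict = json.loads(scores)  # raises json.JSONDecodeError if the judge adds prose
    
    # Calculate weighted score (weights sum to 1.0, so the result stays on the 1-5 scale)
    total_score = sum(
        scores_dict[criterion] * rubric[criterion]["weight"]
        for criterion in rubric
    )
    
    return total_score / 5.0  # Scale to 0.2-1.0 (all 1s gives 0.2, all 5s gives 1.0)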
```

### Multi-LLM Consensus

```python
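import statistics

def consensus_evaluation(response, context):
    """Use multiple LLMs as judges, take average."""
    judges = ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"]
    scores = []
    
    for model in judges:
        # eval_with_rubric_model is not defined in this notebook; it stands for a
        # rubric evaluator that accepts a model name and returns a numeric score
        score = eval_with_rubric_model(response, context, model)
        scores.append(score)
    
    # Average or median
    return statistics.mean(scores)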
```
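
The choice between average and median matters when one judge disagrees sharply with the rest. The sketch below combines already-collected judge scores with the median and flags wide disagreement for human review; `consensus_score` and the `max_spread` threshold of 0.3 are illustrative assumptions, and scores are assumed to lie between 0 and 1.

```python
import statistics

def consensus_score(scores, max_spread=0.3):
    """Combine judge scores via the median and flag large disagreement between judges."""
    if not scores:
        raise ValueError("need at least one judge score")
    spread = max(scores) - min(scores)
    return {
        "score": statistics.median(scores),
        "spread": spread,
        "needs_review": spread > max_spread,
    }

print(consensus_score([0.8, 0.9, 0.2]))   # one outlier judge
print(consensus_score([0.8, 0.85, 0.9]))  # judges roughly agree
```

With three judges the median ignores a single outlier entirely, whereas the mean would be pulled toward it.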

---

## Summary: Evaluation Strategy

### Development Phase
1. **Test Set Creation** - 10-50 diverse examples
2. **Manual Review** - Human-validate first 100 responses
3. **Baseline Measurement** - Establish accuracy/quality scores
4. **Rubric Refinement** - Adjust criteria based on failures

### Production Phase
1. **Real-Time Gates** - Binary check (F) on every response
2. **Sampling** - Rubric eval on 10% of traffic
3. **Reference** - Compare to ideal when available
4. **Human Audit** - Weekly review of flagged responses
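
Sampling 10% of traffic can be made deterministic by hashing a response identifier, so the same response is always either in or out of the sample. A minimal sketch, assuming each response carries a stable string ID; `should_sample` is a hypothetical helper.

```python
import hashlib

def should_sample(response_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of IDs for rubric evaluation."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

sampled = sum(should_sample(f"resp-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 responses")
```

Hash-based selection avoids shared random state across workers and makes it possible to reproduce later which responses were evaluated.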

### CI/CD Integration
```yaml
# .github/workflows/test-prompts.yml
name: Evaluate Prompts
on: [pull_request]
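jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run test set
        run: pytest test_llm_responses.py
      - name: Check accuracy
        run: |
          # ACCURACY is assumed to be set in the job environment (e.g. via $GITHUB_ENV).
          # Bash's [ ... < ... ] cannot compare floats, so delegate the comparison to awk.
          if awk "BEGIN {exit !($ACCURACY < 0.90)}"; then
            echo "Accuracy below threshold"
            exit 1
          fi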
```
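
The contents of `test_llm_responses.py` are not shown in this notebook. One possible shape, sketched under the assumption that accuracy means the share of test cases graded A, B, C, or E by the reference evaluator, is below. The grades are stubbed so the example runs offline; a real test would obtain them from `eval_vs_ideal`.

```python
THRESHOLD = 0.90
ACCEPTABLE_GRADES = {"A", "B", "C", "E"}

def accuracy(grades):
    """Share of test cases whose reference grade is not a conflict (D)."""
    if not grades:
        raise ValueError("empty test set")
    return sum(g in ACCEPTABLE_GRADES for g in grades) / len(grades)

def test_accuracy_meets_threshold():
    # Stubbed grades for illustration; replace with real evaluator output.
    grades = ["A", "C", "B", "E", "C", "A", "B", "C", "E", "D"]
    assert accuracy(grades) >= THRESHOLD
```

Asserting the threshold inside the test lets `pytest` fail the job directly, without a separate shell comparison.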

### Key Metrics Dashboard
```python
{
    "last_24h": {
        "responses_evaluated": 10000,
        "rubric_pass_rate": 0.94,
        "reference_match_rate": 0.88,
        "avg_questions_per_query": 2.3,
        "avg_completeness": 0.96,
        "hallucination_rate": 0.02
    }
}
```
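
Metrics like these can be aggregated from per-response evaluation records. The sketch below is one possible aggregation; the record fields (`rubric_pass`, `reference_grade`, `questions_asked`, `questions_addressed`, `hallucinated`) are hypothetical, and counting any grade other than D as a reference match is an assumption.

```python
from statistics import mean

def summarize(records):
    """Aggregate per-response evaluation records into dashboard-style metrics."""
    n = len(records)
    if n == 0:
        return {"responses_evaluated": 0}
    # Only some responses have an ideal answer to compare against.
    graded = [r for r in records if r.get("reference_grade") is not None]
    completeness = [
        r["questions_addressed"] / r["questions_asked"]
        for r in records
        if r["questions_asked"] > 0
    ]
    return {
        "responses_evaluated": n,
        "rubric_pass_rate": sum(r["rubric_pass"] for r in records) / n,
        "reference_match_rate": (
            sum(r["reference_grade"] != "D" for r in graded) / len(graded) if graded else None
        ),
        "avg_questions_per_query": mean(r["questions_asked"] for r in records),
        "avg_completeness": mean(completeness) if completeness else None,
        "hallucination_rate": sum(r["hallucinated"] for r in records) / n,
    }

records = [
    {"rubric_pass": True, "reference_grade": "A", "questions_asked": 3, "questions_addressed": 3, "hallucinated": False},
    {"rubric_pass": True, "reference_grade": None, "questions_asked": 1, "questions_addressed": 1, "hallucinated": False},
    {"rubric_pass": False, "reference_grade": "D", "questions_asked": 2, "questions_addressed": 1, "hallucinated": True},
    {"rubric_pass": True, "reference_grade": "C", "questions_asked": 2, "questions_addressed": 2, "hallucinated": False},
]
summary = summarize(records)
print(summary)
```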

---

## Best Practices Summary

### Test Set Quality
✅ **Diverse** - Cover edge cases, common queries, failures  
✅ **Representative** - Match production distribution  
✅ **Updated** - Add new patterns as discovered  
✅ **Versioned** - Track test set changes with code

### Evaluation Approach
✅ **Multi-Method** - Combine binary, rubric, reference  
✅ **Cost-Aware** - Cheap checks first, expensive later  
✅ **Automated** - Run on every prompt change  
✅ **Human-in-Loop** - Sample validation weekly

### Continuous Improvement
✅ **Track Trends** - Monitor quality over time  
✅ **Root Cause** - Analyze failures systematically  
✅ **A/B Testing** - Compare prompt variants  
✅ **Feedback Loop** - User ratings improve test set

---

## Next Steps

With a complete evaluation framework in place:
1. **Deploy to Production** - Multi-layer quality gates active
2. **Monitor Continuously** - Dashboards for real-time quality
3. **Iterate Prompts** - A/B test improvements
4. **Scale Confidently** - Quality metrics de-risk growth

**Congratulations!** You now have a complete, production-ready LLM system with comprehensive evaluation.

In [None]:
print(assistant_answer)

In [None]:
eval_vs_ideal(test_set_ideal, assistant_answer)

In [None]:
assistant_answer_2 = "life is like a box of chocolates"

In [None]:
eval_vs_ideal(test_set_ideal, assistant_answer_2)