<a href="https://colab.research.google.com/github/Decoding-Data-Science/nov25/blob/main/1_evaluation_recipe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Chatbot And RAG Evaluation

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant external knowledge. It has become one of the most widely used approaches for building LLM applications.

This tutorial will show you how to evaluate your RAG applications using LangSmith. You'll learn:

1. How to create test datasets
2. How to run your RAG application on those datasets
3. How to measure your application's performance using different evaluation metrics

#### Overview
A typical RAG evaluation workflow consists of three main steps:

1. Creating a dataset with questions and their expected answers
2. Running your RAG application on those questions
3. Using evaluators to measure how well your application performed, looking at factors like:
 - Answer relevance
 - Answer accuracy
 - Retrieval quality

For this tutorial, we'll create and evaluate a bot that answers questions about a few of Lilian Weng's insightful blog posts.

### Chatbot Evaluation

In [None]:
# !pip install -q \
#   python-dotenv \
#   langsmith \
#   langchain \
#   langchain-openai \
#   langchain-community \
#   langchain-text-splitters \
#   openai \
#   pandas \
#   tiktoken


In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

# Read keys
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LANGSMITH_API_KEY = os.getenv("LANGSMITH_API_KEY")

if not OPENAI_API_KEY or not LANGSMITH_API_KEY:
    raise ValueError("Missing OPENAI_API_KEY or LANGSMITH_API_KEY")

# Required env vars
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["LANGSMITH_API_KEY"] = LANGSMITH_API_KEY
os.environ["LANGSMITH_TRACING"] = "true"

# ðŸ”´ THIS IS THE KEY FIX FOR EU
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"

# OPTIONAL but recommended if you have multiple workspaces
# os.environ["LANGSMITH_WORKSPACE_ID"] = "YOUR_WORKSPACE_UUID"


from langsmith import Client

client = Client()
client.list_datasets(limit=1)
print("LangSmith EU auth OK âœ…")



LangSmith EU auth OK âœ…


In [2]:
# =========================
# 1. Install & Imports
# =========================
# !pip install -qU langsmith

import os
from langsmith import Client

# === Set your API key ===
# Option A: set directly (for quick demo) â€“ replace with your key
# os.environ["LANGSMITH_API_KEY"] = "YOUR_LANGSMITH_API_KEY"

# Option B (recommended in Colab):
# Use "Secrets" in Colab (from the left sidebar) and then:
# os.environ["LANGSMITH_API_KEY"] = os.environ.get("LANGSMITH_API_KEY")

client = Client()

# =========================
# 2. Create Dataset
# =========================
dataset_name_new = "Recipe Bot Evaluation â€” Q/A (Beginner + FDA Set) â€” 16th Dec"
dataset = client.create_dataset(dataset_name_new)

# =========================
# 3. Original Examples (from your snippet)
# =========================
original_examples = [
    {
        "inputs": {
            "question": "How many teaspoons are in one tablespoon?",
            "context": "US kitchen measurement equivalents."
        },
        "outputs": {"answer": "3"}
    },
    {
        "inputs": {
            "question": "What is the safe internal temperature for cooked chicken (Â°C)?",
            "context": "Food safety guideline for poultry doneness."
        },
        "outputs": {"answer": "74Â°C"}
    },
    {
        "inputs": {
            "question": "Convert 2 US cups to milliliters.",
            "context": "Use the US legal cup for home cooking."
        },
        "outputs": {"answer": "480 ml"}
    },
    {
        "inputs": {
            "question": "What is the classic vinaigrette oil-to-acid ratio?",
            "context": "Standard salad dressing ratio."
        },
        "outputs": {"answer": "3:1"}
    },
    {
        "inputs": {
            "question": "Substitute for 1 cup light brown sugar using white sugar and molasses.",
            "context": "Common home-baking substitution."
        },
        "outputs": {"answer": "1 cup white sugar + 1 tbsp molasses"}
    },
    {
        "inputs": {
            "question": "Minimum internal temperature for medium-rare steak (Â°C).",
            "context": "Typical doneness temperature."
        },
        "outputs": {"answer": "57Â°C"}
    },
    {
        "inputs": {
            "question": "How many grams are in 1 ounce (oz)?",
            "context": "Kitchen weight conversion."
        },
        "outputs": {"answer": "28.35 g"}
    },
    {
        "inputs": {
            "question": "What gas is produced when baking soda reacts with an acid?",
            "context": "Leavening reaction in quick breads."
        },
        "outputs": {"answer": "Carbon dioxide"}
    },
    {
        "inputs": {
            "question": "Boiling time for a soft-boiled egg (runny yolk) after simmering starts.",
            "context": "Stovetop method, large eggs."
        },
        "outputs": {"answer": "6 minutes"}
    },
    {
        "inputs": {
            "question": "How many tablespoons are in 1/4 cup (US)?",
            "context": "US kitchen measurement equivalents."
        },
        "outputs": {"answer": "4 tbsp"}
    },
    {
        "inputs": {
            "question": "Q1",
            "context": "US Measurements"
        },
        "outputs": {"answer": "1 tbsp"}
    }
]

# =========================
# 4. FDA / USDA Facts + Edge Cases
# =========================
extra_examples = [
    # --- Food safety temperatures ---
    {
        "inputs": {
            "question": "What is the USDA safe internal temperature for ground beef (Â°C)?",
            "context": "US food safety temperature guidelines."
        },
        "outputs": {"answer": "71Â°C"}
    },
    {
        "inputs": {
            "question": "What is the bacterial 'Danger Zone' temperature range (Â°C)?",
            "context": "FDA food safety storage rules."
        },
        "outputs": {"answer": "5Â°C to 60Â°C"}
    },
    {
        "inputs": {
            "question": "How long can cooked food safely sit at room temperature before it should be discarded?",
            "context": "General US food safety rule for perishable foods."
        },
        "outputs": {"answer": "Maximum 2 hours"}
    },
    {
        "inputs": {
            "question": "To what internal temperature (Â°C) should leftovers be reheated for safety?",
            "context": "US food safety guideline for reheating leftovers."
        },
        "outputs": {"answer": "74Â°C"}
    },

    # --- Handling & cross-contamination ---
    {
        "inputs": {
            "question": "Should raw chicken be washed before cooking?",
            "context": "FDA guidance on cross-contamination in home kitchens."
        },
        "outputs": {"answer": "No â€” washing raw chicken can spread bacteria through splashing."}
    },
    {
        "inputs": {
            "question": "What is the minimum recommended time for proper handwashing?",
            "context": "Food safety guidance for handwashing."
        },
        "outputs": {"answer": "At least 20 seconds"}
    },
    {
        "inputs": {
            "question": "How long can raw chicken be safely stored in the refrigerator?",
            "context": "USDA guidance for refrigerated storage of raw poultry."
        },
        "outputs": {"answer": "1 to 2 days"}
    },
    {
        "inputs": {
            "question": "At what freezer temperature (Â°C) should food be stored to keep it safe long-term?",
            "context": "US food safety freezer storage guidelines."
        },
        "outputs": {"answer": "âˆ’18Â°C or lower"}
    },

    # --- Allergens ---
    {
        "inputs": {
            "question": "Name three allergens from the FDA Big Nine list.",
            "context": "US FDA major food allergen list."
        },
        "outputs": {"answer": "Examples include milk, eggs, and peanuts."}
    },
    {
        "inputs": {
            "question": "Is sesame one of the FDA-recognized major allergens?",
            "context": "FDA Big Nine allergen list, updated in recent years."
        },
        "outputs": {"answer": "Yes, sesame is one of the major allergens."}
    },

    # --- Measurement conversions & basics ---
    {
        "inputs": {
            "question": "How many milliliters are in one US tablespoon?",
            "context": "US kitchen measurement standards."
        },
        "outputs": {"answer": "About 14.79 ml (often rounded to 15 ml)."}
    },
    {
        "inputs": {
            "question": "How many fluid ounces are in one US cup?",
            "context": "US kitchen volume measurements."
        },
        "outputs": {"answer": "8 fluid ounces"}
    },
    {
        "inputs": {
            "question": "Approximately how many grams are in 1 cup of all-purpose flour?",
            "context": "Typical baking reference for US recipes."
        },
        "outputs": {"answer": "Around 120 g, though it can vary with measurement method."}
    },
    {
        "inputs": {
            "question": "How many teaspoons are in 1/2 tablespoon?",
            "context": "US kitchen measurement equivalents."
        },
        "outputs": {"answer": "1.5 teaspoons"}
    },

    # --- Edge case: regional/measurement ambiguity ---
    {
        "inputs": {
            "question": "Are US and UK pints the same size?",
            "context": "International measurement comparisons that can confuse recipes."
        },
        "outputs": {"answer": "No â€” a US pint is about 473 ml, while a UK pint is about 568 ml."}
    },
    {
        "inputs": {
            "question": "How many sticks of butter make 1 cup in US recipes?",
            "context": "Common US baking measurement for butter."
        },
        "outputs": {"answer": "2 sticks of butter equal 1 cup (about 226 g)."}
    },

    # --- Edge case: visual checks vs thermometer ---
    {
        "inputs": {
            "question": "Can you rely on the color of chicken meat alone to know if it is safely cooked?",
            "context": "FDA advice on checking doneness of poultry."
        },
        "outputs": {"answer": "No â€” color is not reliable; you must check that the internal temperature reaches 74Â°C."}
    },

    # --- Edge case: freezing and bacteria ---
    {
        "inputs": {
            "question": "Does freezing meat kill harmful bacteria?",
            "context": "Food preservation and safety guidance."
        },
        "outputs": {"answer": "No â€” freezing usually does not kill bacteria; it mainly stops them from growing."}
    },

    # --- Edge case: unit confusion (Fahrenheit vs Celsius) ---
    {
        "inputs": {
            "question": "Is 165Â°F the same as 65Â°C for cooked chicken?",
            "context": "Comparing common food safety temperatures between Fahrenheit and Celsius."
        },
        "outputs": {"answer": "No â€” 165Â°F is about 74Â°C, not 65Â°C."}
    },

    # --- Edge case: flour weight variability ---
    {
        "inputs": {
            "question": "Does 1 US cup of flour always weigh exactly 120 g?",
            "context": "Baking measurement variability."
        },
        "outputs": {"answer": "No â€” 120 g is a common reference, but actual weight can range roughly 100â€“130 g depending on how it is measured."}
    },

    # --- Edge case: rare steak & risk groups ---
    {
        "inputs": {
            "question": "Is rare steak safe for everyone to eat?",
            "context": "Food safety risk levels for different groups of people."
        },
        "outputs": {"answer": "Rare steak can be acceptable for healthy adults when properly handled, but higher-risk groups like pregnant people, older adults, and immunocompromised individuals are advised to avoid undercooked meat."}
    },

    # --- Edge case: hot holding temperature ---
    {
        "inputs": {
            "question": "What is the minimum hot-holding temperature (Â°C) recommended for cooked foods?",
            "context": "US food service guidance for keeping cooked food hot and safe."
        },
        "outputs": {"answer": "About 60Â°C or higher is recommended for hot holding."}
    },

    # --- Edge case: 'room temperature' ambiguity ---
    {
        "inputs": {
            "question": "Is there a single exact temperature for 'room temperature' in recipes?",
            "context": "Culinary terminology and approximate temperature ranges."
        },
        "outputs": {"answer": "No â€” it is not an exact standard; in cooking it usually means around 20â€“22Â°C."}
    }
]

# =========================
# 5. Upload all examples to LangSmith
# =========================
client.create_examples(
    dataset_id=dataset.id,
    examples=original_examples + extra_examples
)

print(f"Created dataset: {dataset_name_new}")
print(f"Total examples uploaded: {len(original_examples) + len(extra_examples)}")


Created dataset: Recipe Bot Evaluation â€” Q/A (Beginner + FDA Set) â€” 16th Dec
Total examples uploaded: 34


### Define Metrics (LLM As A Judge)


In [3]:
import openai
from langsmith import wrappers

openai_client=wrappers.wrap_openai(openai.OpenAI())

eval_instructions = " Strict grader for short recipe Q&A"

def correctness(inputs:dict,outputs:dict, reference_outputs:dict)->bool:
      user_content = f"""You are grading the following question:
    {inputs['question']}
    Here is the real answer:
    {reference_outputs['answer']}
    You are grading the following predicted answer:
    {outputs['response']}
    Respond with CORRECT or INCORRECT:
    Grade:
    """
      response=openai_client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                  {"role":"system","content":eval_instructions},
                  {"role":"user","content":user_content}
            ]
      ).choices[0].message.content

      return response == "CORRECT"

In [4]:
## Concisions- checks whether the actual output is less than 2x the length of the expected result.

def concision(outputs: dict, reference_outputs: dict) -> bool:
    return int(len(outputs["response"]) < 2 * len(reference_outputs["answer"]))

### Run Evaluations

In [5]:
default_instructions = "Respond to the users question in a short, concise manner (one or two word ) IF it is yes/no answer add essential info only is required for the answer . the answer crisp and no duplication"

def my_app(question: str, model: str = "gpt-4.1-nano-2025-04-14", instructions: str = default_instructions) -> str:
    return openai_client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

In [6]:
### Call my_app for every datapoints
def ls_target(inputs: str) -> dict:
    return {"response": my_app(inputs["question"])}

In [7]:
## Run our evaluation
experiment_results=client.evaluate(
    ls_target, ## Your AI system
    data=dataset_name_new,
    evaluators=[correctness,concision],
    experiment_prefix="gpt-4.1-nano-2025-04-14_2"
)



  from .autonotebook import tqdm as notebook_tqdm


View the evaluation results for experiment: 'gpt-4.1-nano-2025-04-14_2-bc846e9a' at:
https://eu.smith.langchain.com/o/dc7e084f-6549-42af-bfdd-cf03a464fc49/datasets/335cf5c0-1bae-4015-b501-aec19fafa363/compare?selectedSessions=59c80cff-2777-4889-b1f7-d99f8d7ce840




0it [00:00, ?it/s]Error running target function: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Traceback (most recent call last):
  File "c:\Users\almehairbi\Desktop\Dell AI\.venv\Lib\site-packages\langsmith\evaluation\_runner.py", line 1921, in _forward
    fn(*args, langsmith_extra=langsmith_extra)
    ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\almehairbi\Desktop\Dell AI\.venv\Lib\site-packages\langsmith\run_helpers.py", line 710, in wrapper
    function_result = run_container["context"].run(
        func, *args, **kwargs
    )
  File "C:\Users\almehairbi\AppData\Local\Temp\ipykernel_7164\3543952576.py", line 3, in ls_target
    return {"response": my_app(inputs["question"])}
                        ~~~~~~^

In [None]:
### Call my_app for every datapoints - change model
def ls_target(inputs: str) -> dict:
    return {"response": my_app(inputs["question"],model="gpt-3.5-turbo")}

In [None]:
## Run our evaluation
experiment_results=client.evaluate(
    ls_target, ## Your AI system
    data=dataset_name_new,
    evaluators=[correctness,concision],
    experiment_prefix="gpt-3.5"
)

View the evaluation results for experiment: 'gpt-3.5-d057bf78' at:
https://smith.langchain.com/o/c8f8810e-4941-552b-aef3-15ad938ead98/datasets/780675d4-642d-4404-b5e1-1d6b066d5bfc/compare?selectedSessions=6307657c-3f35-44f3-b858-f8e517d88467




0it [00:00, ?it/s]

In [None]:
def ls_target(inputs: str) -> dict:
    return {"response": my_app(inputs["question"],model="gpt-5.1-2025-11-13")}



In [None]:
#new dataset
experiment_results=client.evaluate(
    ls_target, ## Your AI system
    data=dataset_name_new,
    evaluators=[correctness,concision],
    experiment_prefix="gpt-5.1-2025-11-13"
)

View the evaluation results for experiment: 'gpt-5.1-2025-11-13-66e7acee' at:
https://smith.langchain.com/o/c8f8810e-4941-552b-aef3-15ad938ead98/datasets/780675d4-642d-4404-b5e1-1d6b066d5bfc/compare?selectedSessions=b3ccfd0b-17c5-46c6-a3b0-2c46ff2c47cf




0it [00:00, ?it/s]

In [None]:
#change the llm model name
def ls_target(inputs: str) -> dict:
    return {"response": my_app(inputs["question"],model="gpt-5.1-2025-11-13")}

In [None]:
#change the prefix
experiment_results=client.evaluate(
    ls_target, ## Your AI system
    data=dataset_name_new,
    evaluators=[correctness,concision],
    experiment_prefix="gpt-4.1-nano-2025-04-14_1"
)

##Summary

You tested 4 different GPT models using LangSmith.
The results show:

All models scored very high on correctness (Âµ â‰ˆ 1.00 or 0.85).

All models had 0% failure rate (no crashes, no empty outputs).

The speed of the models is different: nano is fastest, GPT-5.1 is slower but more accurate.

The dataset size used was 34 test examples.

This is a simple comparison of model accuracy and response time.

Breakdown of Each Column (Beginner Explanation)
1. Model Name

Examples:

gpt-5.1-2025-11-13-e7a64ff3

gpt-4.1-nano

gpt-3.5

This just tells you which model you tested.
(People use this to compare price, speed, and accuracy.)

2. base 1

This simply means it was the base evaluation run (not fine-tuned, no custom version).

3. Âµ 0.85 or Âµ 1.00

This Âµ (mu) value represents the average score of all test examples.

What it means:

1.00 â†’ Perfect. The model got every test case right.

0.85 â†’ The model got 85% of the test questions correct.

0.56 (second Âµ) â†’ This is another metric, often a secondary score, like reasoning clarity.

In your results:

GPT-5.1 scored 0.85 (meaning 85% accurate)

GPT-4.1-nano scored 1.00 (meaning perfect for this dataset)

GPT-3.5 scored 1.00

This only means:

These models were good enough to get all your simple recipe facts correct.

ðŸ”¹ Important:
These are simple questions (teaspoon conversions, temperatures, basic facts), so even small models score 1.00.

In a real FDA or safety context, large models usually outperform nano/small ones, especially on edge cases.

4. Time Columns â€” 0.25s, 2.15s, etc.

These show:

First number: time to respond

Second number: full round-trip latency

What it means for beginners:

Model	Response Speed
GPT-4.1 nano	Fastest (0.25 sec)
GPT-3.5	Fast (0.35 sec)
GPT-4.1 nano #2	Fast (0.34 sec)
GPT-5.1	Slowest (0.61 sec)

This is normal:

Smaller models â†’ faster

Bigger models â†’ slower but better at reasoning and edge cases

5. 34

This indicates the number of examples run in evaluation:
â†’ You tested 34 recipe questions.

6. Failure Rate â€” 0%

Good news:

No model failed

No empty results, no errors, no timeouts

7. Metadata (null values)

These JSON blocks:

{"tags":null,"dirty":null,"branch":null,"commit":null,"repo_name":null,"remote_url":null,"author_name":null,"commit_time":null,"author_email":null}


This is normal.

It just means:

You did not connect the run to a GitHub repo

There is no version tag associated with this evaluation

Beginners can ignore this completely.

What This Evaluation Means in Simple Words

All your models handled the simple recipe dataset extremely well.
Even the smallest "nano" model achieved a perfect score.

This tells us:

âœ” Your test questions are easy

Conversions, temperatures, food safety basics â€” models already know these.

âœ” The real challenge is edge cases, not core facts