# Chatbot and RAG Evaluation

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external sources.
It has become one of the most widely used approaches for building LLM applications.

This tutorial shows you how to evaluate your RAG applications using LangSmith. You will learn:

1. How to create test datasets
2. How to run a RAG application on those datasets
3. How to measure your application's performance using different evaluation metrics

A typical RAG evaluation workflow consists of three main steps:

1. Creating the dataset with questions and their expected answers
2. Running your RAG application on those questions
3. Using evaluators to measure how well your application performed, looking at factors like:
   * Answer relevance
   * Answer accuracy
   * Retrieval quality
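
The three steps above can be sketched without any external service. The following minimal example uses a toy dataset, a stand-in application (`toy_app`), and an exact-match evaluator (`exact_match`). All three are illustrative assumptions and are not part of LangSmith or of the chatbot built later in this tutorial.

```python
from typing import Callable

# Step 1: a tiny in-memory dataset of questions and expected answers (toy data).
dataset = [
    {"inputs": {"question": "What is 2 + 2?"}, "outputs": {"answer": "4"}},
    {"inputs": {"question": "What is the capital of France?"}, "outputs": {"answer": "Paris"}},
]

# Step 2: the application under test. A canned lookup stands in for a real RAG app.
def toy_app(inputs: dict) -> dict:
    canned = {"What is 2 + 2?": "4", "What is the capital of France?": "Lyon"}
    return {"response": canned[inputs["question"]]}

# Step 3: an evaluator compares the application's output with the reference answer.
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["response"].strip().lower() == reference_outputs["answer"].strip().lower()

def run_evaluation(target: Callable[[dict], dict], examples: list[dict]) -> float:
    """Run the target on every example and return the fraction judged correct."""
    scores = [exact_match(target(ex["inputs"]), ex["outputs"]) for ex in examples]
    return sum(scores) / len(scores)

accuracy = run_evaluation(toy_app, dataset)
print(f"accuracy = {accuracy:.2f}")
```

The stand-in application answers one of the two questions wrongly on purpose, so the evaluator has something to catch.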


# Chatbot evaluation

In [55]:
import os
from dotenv import load_dotenv
load_dotenv()

os.environ["LANGSMITH_TRACING"] = "true"

In [None]:
### Create the datapoints
from langsmith import Client

client = Client()

## Define the dataset
dataset_name = "Simple chatbot evaluation"
dataset = client.create_dataset(dataset_name)

client.create_examples(
    dataset_id=dataset.id,
    examples=[  
        {
            "inputs": {"question": "What are the most common symptoms of the flu?"},
            "outputs": {"answer": "The most common symptoms of the flu include fever, cough, sore throat, muscle aches, headache, fatigue, and sometimes nausea or vomiting. Symptoms usually appear suddenly."}
        },
        {
            "inputs": {"question": "How long is the incubation period for COVID-19?"},
            "outputs": {"answer": "The incubation period for COVID-19 is typically 2 to 14 days, but most commonly around 5 to 6 days after exposure."}
        },
        {
            "inputs": {"question": "Which vaccinations are recommended for adults over 60?"},
            "outputs": {"answer": "For adults over 60, the following vaccinations are recommended: annual flu shot, pneumococcal vaccine, herpes zoster (shingles) vaccine, and tetanus-diphtheria-pertussis booster every 10 years. Tick-borne encephalitis (TBE) vaccination may also be advised if there is a risk of exposure."}
        },
        {
            "inputs": {"question": "What is the difference between a cold and the flu?"},
            "outputs": {"answer": "A cold usually starts gradually with a sore throat, runny nose, and mild cough, while the flu begins suddenly with high fever, severe muscle aches, and pronounced fatigue. Flu symptoms are generally more severe."}
        },
        {
            "inputs": {"question": "How is high blood pressure diagnosed?"},
            "outputs": {"answer": "High blood pressure is diagnosed through repeated blood pressure measurements. A reading of 140/90 mmHg or higher is considered elevated. Confirmation often requires multiple measurements on different days, sometimes including 24-hour ambulatory monitoring."}
        },
        {
            "inputs": {"question": "What natural measures help lower blood pressure?"},
            "outputs": {"answer": "Natural ways to lower blood pressure include regular physical activity, a low-salt diet, stress reduction, weight loss, reducing alcohol consumption, and quitting smoking."}
        },
        {
            "inputs": {"question": "What are the first signs of a stroke?"},
            "outputs": {"answer": "The first signs of a stroke include sudden weakness or paralysis (often on one side), speech difficulties, vision problems, dizziness, severe headache, and confusion. If suspected, call emergency services immediately (remember FAST: Face, Arms, Speech, Time)."}
        },
        {
            "inputs": {"question": "How often should you go for a preventive check-up with your GP?"},
            "outputs": {"answer": "Adults should have a general health check-up every 3 years. In many countries, this is covered by health insurance every 3 years starting at age 35, and annually from age 50."}
        },        
    ],
)
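
Before uploading examples, it can help to check that each one has the shape the evaluators expect: a `question` under `inputs` and an `answer` under `outputs`. The helper below is a minimal sketch of such a check. `validate_examples` is a hypothetical name introduced here for illustration and is not a LangSmith function.

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the examples look fine."""
    problems = []
    for i, example in enumerate(examples):
        question = example.get("inputs", {}).get("question")
        answer = example.get("outputs", {}).get("answer")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing or empty inputs['question']")
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing or empty outputs['answer']")
    return problems

# Toy data: the second example is deliberately malformed.
sample = [
    {"inputs": {"question": "How is high blood pressure diagnosed?"},
     "outputs": {"answer": "Through repeated blood pressure measurements."}},
    {"inputs": {"question": ""}, "outputs": {}},
]
problems = validate_examples(sample)
for problem in problems:
    print(problem)
```

Running a check like this locally is cheaper than discovering a `KeyError` inside an evaluator halfway through an experiment.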

### Define the Metrics (LLM As a Judge)

In [82]:
import openai
from langsmith import wrappers

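# Wrap an OpenAI-compatible client so that LLM calls are traced in LangSmith.
# The base_url points to a local server (port 11434 is the Ollama default); the API key is a placeholder.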
openai_client = wrappers.wrap_openai(
    openai.OpenAI(base_url="http://localhost:11434/v1", api_key="lcl-123")
)

eval_instructions = "You are an expert professor specialized in grading students' answers to questions."

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    user_content = f"""
    You are grading the following question: {inputs['question']}
    
    Here is the real answer: {reference_outputs['answer']}
    
    You are grading the following predicted answer:
    {outputs['response']}
    
    Respond with CORRECT or INCORRECT:
    Grade: 
    """
    
    response = openai_client.chat.completions.create(
        model="qwen3:4b",
        temperature=0,
        messages=[
            {"role": "system", "content": eval_instructions},
            {"role": "user", "content": user_content}
        ]        
    ).choices[0].message.content
    
    return response.strip().upper() == "CORRECT"
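
The exact comparison with `"CORRECT"` only succeeds when the judge replies with that single word. Some models add punctuation, a short explanation, or a reasoning block before the verdict. The sketch below shows one more tolerant way to parse the reply. It assumes that any reasoning is wrapped in `<think>...</think>` tags, which may not match your model's output format, and `parse_verdict` is a hypothetical helper name.

```python
import re

def parse_verdict(raw: str) -> bool:
    """Return True only if the judge's final verdict is CORRECT."""
    # Drop any <think>...</think> block (assumed format for reasoning output).
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL | re.IGNORECASE)
    # Match whole words only, so "INCORRECT" is never mistaken for "CORRECT".
    verdicts = re.findall(r"\b(CORRECT|INCORRECT)\b", visible.upper())
    # Use the last verdict mentioned; a reply without any verdict counts as a failed grade.
    return bool(verdicts) and verdicts[-1] == "CORRECT"

print(parse_verdict("Grade: CORRECT."))
print(parse_verdict("<think>Looks incorrect at first.</think>\nCORRECT"))
print(parse_verdict("INCORRECT"))
```

Treating an unparseable reply as a failed grade is a design choice: it keeps the score conservative, but you may prefer to raise an error so that judge failures are not silently counted against the application.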
    

In [83]:
## Concision: the response must be shorter than twice the length of the reference answer
def concision(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])
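
The `concision` check compares character counts, so a response passes as long as it is less than twice as long as the reference answer. The following sketch makes that ratio visible on two made-up responses. `concision_ratio` is a hypothetical helper and the strings are illustrative, not taken from the dataset.

```python
def concision_ratio(response: str, reference: str) -> float:
    """Length of the response relative to the reference answer, in characters."""
    # Assumes a non-empty reference; an empty one would raise ZeroDivisionError.
    return len(response) / len(reference)

reference = "The incubation period is typically 2 to 14 days."
short_response = "Usually 2 to 14 days."
long_response = reference * 3  # three times as long as the reference

for name, response in [("short", short_response), ("long", long_response)]:
    ratio = concision_ratio(response, reference)
    print(f"{name}: ratio={ratio:.2f}, concise={ratio < 2}")
```

A character-based threshold is crude, since it ignores whether the extra text is useful, but it is cheap and deterministic, which makes it a reasonable companion to an LLM-based judge.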

### Run Evaluation

In [87]:
default_instruction = "Respond to the user's question in a short, concise manner (one short sentence)."

def my_app(question: str, model: str = "llama3.1:8b", instruction: str = default_instruction) -> str:
    return openai_client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": question}
        ],
    ).choices[0].message.content

In [88]:
### Call my_app for every datapoint
def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"])}

In [89]:
data = client.read_dataset(dataset_name=dataset_name)

## Run our evaluation
experiment_results = client.evaluate(
    ls_target,  # your AI system
    data=data,
    evaluators=[correctness, concision],
    experiment_prefix="llama3.1:8b-chatbot",
)

View the evaluation results for experiment: 'llama3.1:8b-chatbot-d7e216f1' at:
https://smith.langchain.com/o/ab019cb8-ab69-4090-af14-ca3701d306a3/datasets/7d5ca103-e823-41a5-a80c-120f9ba7f91e/compare?selectedSessions=dec71842-3f9b-4eb2-a855-062fed364617

5it [03:06, 37.35s/it]

KeyboardInterrupt: 
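
Once an experiment has finished, the per-example evaluator scores are usually summarized as one average per metric. The sketch below shows that aggregation on plain dictionaries. The `rows` values are hypothetical and purely illustrative; they are not results from the run above, and the dictionary shape is an assumption rather than the structure LangSmith returns.

```python
from collections import defaultdict

# Hypothetical per-example scores, one dict per example (illustration only).
rows = [
    {"correctness": True, "concision": True},
    {"correctness": False, "concision": True},
    {"correctness": True, "concision": False},
    {"correctness": True, "concision": True},
]

def mean_scores(rows: list[dict]) -> dict:
    """Average each metric over all rows; booleans count as 1.0 or 0.0."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        for metric, score in row.items():
            totals[metric] += float(score)
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

summary = mean_scores(rows)
print(summary)
```

Averages over a handful of examples are noisy, so treat them as a rough signal and inspect individual failures in the LangSmith UI as well.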

In [62]:
data

Dataset(name='Simple chatbot evaluation', description=None, data_type=<DataType.kv: 'kv'>, id=UUID('7d5ca103-e823-41a5-a80c-120f9ba7f91e'), created_at=datetime.datetime(2025, 12, 7, 7, 58, 11, 620887, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 12, 7, 7, 58, 11, 620887, tzinfo=datetime.timezone.utc), example_count=8, session_count=2, last_session_start_time=datetime.datetime(2025, 12, 7, 7, 59, 45, 755827), inputs_schema=None, outputs_schema=None, transformations=None, metadata=None)