# Lesson 9.2: Evaluation Methods

---

In Lesson 9.1, we covered why evaluating LLM applications matters, what makes it difficult, and which metrics to track. This lesson turns to specific **evaluation methods**, from setting up datasets to techniques like **model-based evaluation** and **A/B testing**, so that you can establish an effective evaluation process.

## 1. Criterion-based Evaluation

**Criterion-based evaluation** is a structured qualitative (manual) evaluation method where human evaluators use a predefined set of criteria to score or rank LLM responses.

* **Concept:** Instead of just judging "good" or "bad," evaluators examine each response and score it against specific criteria such as:
    * **Accuracy:** Is the response factually correct?
    * **Relevance:** Does the response directly answer the question?
    * **Fluency:** Are the grammar, spelling, and sentence structure correct?
    * **Safety:** Does it contain harmful or biased content?
    * **Groundedness:** Is the information in the response supported by the provided context (for RAG)?
    * **Helpfulness:** Does the response help the user solve their problem?
* **Process:**
    1.  **Define Criteria:** Clearly list criteria and scoring scales (e.g., 1-5 stars, Yes/No).
    2.  **Train Evaluators:** Ensure all evaluators understand and apply criteria consistently.
    3.  **Collect Responses:** Run the application with questions from the evaluation dataset and collect responses.
    4.  **Manual Scoring:** Evaluators read each (question, response) pair and score it against the criteria.
    5.  **Analyze Results:** Calculate the average score for each criterion and identify strengths and weaknesses.
* **Pros:** Captures nuances and subjective aspects, and provides detailed information for improvement.
* **Cons:** Costly, hard to scale, and potentially inconsistent across evaluators.
* **When to use:** Early development stages, testing edge cases, evaluating overall quality, verifying safety/ethical issues.
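
The "Analyze Results" step can be sketched in a few lines. The snippet below is a minimal illustration, assuming ratings are stored as `(evaluator, response_id, criterion, score)` tuples on a 1-5 scale; the evaluator names, response IDs, and scores are made up for the example. It averages scores per criterion and reports how often evaluators gave identical scores, which is a rough signal of the consistency problem listed under Cons.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (evaluator, response_id, criterion, score on a 1-5 scale)
ratings = [
    ("alice", "r1", "accuracy", 5), ("bob", "r1", "accuracy", 4),
    ("alice", "r1", "fluency", 4), ("bob", "r1", "fluency", 4),
    ("alice", "r2", "accuracy", 2), ("bob", "r2", "accuracy", 3),
    ("alice", "r2", "fluency", 5), ("bob", "r2", "fluency", 3),
]

def average_by_criterion(rows):
    """Average score per criterion across all evaluators and responses."""
    by_criterion = defaultdict(list)
    for _evaluator, _response_id, criterion, score in rows:
        by_criterion[criterion].append(score)
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}

def exact_agreement_rate(rows):
    """Fraction of (response, criterion) items where every evaluator gave the same score."""
    by_item = defaultdict(list)
    for _evaluator, response_id, criterion, score in rows:
        by_item[(response_id, criterion)].append(score)
    agreed = sum(1 for scores in by_item.values() if len(set(scores)) == 1)
    return agreed / len(by_item)

averages = average_by_criterion(ratings)
agreement = exact_agreement_rate(ratings)
print(averages)
print(f"Exact agreement rate: {agreement:.2f}")
```

Exact agreement is a deliberately simple measure. A chance-corrected statistic such as Cohen's kappa is a common next step when agreement needs to be reported more rigorously.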




---

## 2. Model-based Evaluation (LLM-as-a-Judge)

**Model-based evaluation** is an automated evaluation method that uses a more powerful LLM (often a larger or better-tuned model) to evaluate the output of another LLM. This technique is commonly known as **"LLM-as-a-Judge"**.

* **Concept:** Instead of humans, a "judge" LLM reads the question, context (if any), and the response from the LLM being evaluated, then provides a score or commentary based on criteria provided in the prompt.
* **Process:**
    1.  **Select Judge LLM:** Choose an LLM with strong reasoning capabilities and an understanding of evaluation criteria.
    2.  **Design Prompt for Judge LLM:** The prompt largely determines the quality of the evaluation. It must specify the judge LLM's role, the evaluation criteria, the scoring scale, and the desired output format.
    3.  **Run Evaluation:** Pass the (question, LLM response) pairs through the judge LLM.
    4.  **Analyze Results:** Collect scores and comments from the judge LLM.
* **Pros:**
    * **Automation and Scalability:** Can evaluate thousands or millions of responses quickly.
    * **Captures Nuance:** Better than traditional n-gram-based metrics (such as BLEU or ROUGE) at understanding semantics and context.
    * **Reduced Human Cost:** Significantly reduces the need for manual evaluators.
* **Cons:**
    * **LLM Cost:** Calling the judge LLM still incurs cost.
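    * **Judge LLM Bias:** The judge LLM has its own biases, such as favoring longer responses, favoring whichever response is listed first in a pairwise comparison, or favoring text it generated itself, and it may misread complex criteria.
    * **Consistency:** Scores can vary between runs or with small prompt changes, even for the same response.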
* **When to use:** Large-scale evaluation, regression testing, quick comparison of model versions.
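
One simple way to reduce the score variability mentioned under Cons is to query the judge several times and aggregate. The sketch below assumes a hypothetical `judge_fn(question, response)` callable that returns an integer score; here it is replaced by a stand-in that replays fixed scores so the example runs without an API call. The function name `repeated_judgement` and the choice of the median are illustrative, not a prescribed method.

```python
from itertools import cycle
from statistics import median
from typing import Callable, Dict, List

def repeated_judgement(
    judge_fn: Callable[[str, str], int],
    question: str,
    response: str,
    n_runs: int = 5,
) -> Dict[str, object]:
    """Calls the judge n_runs times and summarizes the scores."""
    scores: List[int] = [judge_fn(question, response) for _ in range(n_runs)]
    return {
        "scores": scores,
        "median": median(scores),             # robust to a single outlier run
        "spread": max(scores) - min(scores),  # a large spread suggests an unstable judgement
    }

# Stand-in for a real judge LLM call (assumption): replays a fixed score sequence.
_fake_scores = cycle([2, 2, 1, 2, 2])

def fake_judge(question: str, response: str) -> int:
    return next(_fake_scores)

result = repeated_judgement(fake_judge, "What is the capital of France?", "Paris")
print(result)
```

Repeating calls multiplies the judge cost by `n_runs`, so this trade-off is usually reserved for borderline or high-stakes cases.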




---

## 3. A/B Testing Evaluation

**A/B testing** is an evaluation method run in production: different versions of an application are served to separate user groups, and their real-world performance is compared.

* **Concept:** Divide users into two or more groups. Each group is exposed to a different version of the application (e.g., version A is the current version, version B is a new version with LLM/Agent improvements). User interaction metrics are collected and compared.
* **Metrics Measured:**
    * **Conversion Rate:** E.g., the percentage of users completing a task (placing an order, signing up).
    * **Engagement:** Time spent in the application, number of questions asked.
    * **User Satisfaction:** Through surveys, ratings.
    * **Churn Rate:** The percentage of users who stop using the application.
    * **Error Rate:** The share of interactions in which the Agent fails to respond or gives an undesirable response.
* **Pros:**
    * **Real-world Data:** Provides reliable information about performance in a real environment.
    * **Direct User Feedback:** Captures actual user experience.
* **Cons:**
    * **Traffic Requirements:** Needs enough users for statistically significant results.
    * **Time-Consuming:** Can take a long time to collect enough data.
    * **Risk:** The new version might perform worse, affecting user experience.
    * **Complex to Implement:** Requires A/B testing infrastructure.
* **When to use:** Validating improvements before widespread deployment, optimizing user experience.
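
To make "statistically significant" concrete, the sketch below compares conversion rates between two groups with a two-proportion z-test. This is one common choice rather than the only valid test, and the traffic and conversion numbers are invented for illustration. It uses only the standard library.

```python
import math
from typing import Tuple

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Returns (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: version A converts 100 of 1000 users, version B 130 of 1000
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference is unlikely to be due to chance alone. The sample size should be decided before the test starts; stopping as soon as the result looks significant inflates false positives.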




---

## 4. Setting up an Evaluation Dataset

A high-quality evaluation dataset is the foundation for any evaluation method.

* **Purpose:** Provides (input, expected_output/context) pairs to systematically test LLM performance.
* **Characteristics of a Good Dataset:**
    * **Diverse:** Includes various types of questions, topics, and complexities.
    * **Representative:** Reflects the types of questions real users will ask.
    * **Includes Edge Cases:** Difficult, ambiguous, or error-prone questions.
    * **Ground Truth:** For Q&A tasks, requires accurate answers or relevant context.
    * **Metadata:** Includes additional information like question type, difficulty, source.
* **Data Sources:**
    * **Historical Data:** Real conversations from existing systems.
    * **Synthetic Data:** Generate questions and answers using LLMs or other tools.
    * **Human-Annotated Data:** Humans create or annotate (question, answer) pairs.
* **Format:** Typically a CSV file, a JSON file, or a database table, with each row/object representing one evaluation case.
    * Example for Q&A: `{"question": "...", "context": "...", "ground_truth_answer": "..."}`
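
A small loader that validates each case before evaluation can catch dataset problems early. The sketch below assumes the dataset is stored as JSON Lines (one JSON object per line) with the field names from the example above. The choice of JSON Lines, the `REQUIRED_FIELDS` tuple, and the function name `load_eval_cases` are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterable, List

# Fields every case must have; "context" is optional (assumption for this sketch)
REQUIRED_FIELDS = ("question", "ground_truth_answer")

def load_eval_cases(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parses JSON Lines text into evaluation cases and checks required fields."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if not case.get(field)]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        case.setdefault("context", None)
        cases.append(case)
    return cases

sample = [
    '{"question": "What is the capital of France?", "context": "Paris is the capital of France.", "ground_truth_answer": "Paris"}',
    '',
    '{"question": "Calculate 15 plus 7.", "ground_truth_answer": "22"}',
]
cases = load_eval_cases(sample)
print(len(cases), cases[1]["context"])
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `load_eval_cases(open(path, encoding="utf-8"))` with a path of your own.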




---

## 5. Practical Example: Building a Simple Evaluation Pipeline for a Q&A System or Chatbot

We will build a simple evaluation pipeline using the **LLM-as-a-Judge** method to assess the relevance and factual consistency of responses from a simulated Q&A system.

**Preparation:**
* Ensure you have the necessary libraries installed: `langchain-openai`, `langsmith`.
* Set the `OPENAI_API_KEY`, `LANGCHAIN_API_KEY`, `LANGCHAIN_TRACING_V2="true"`, `LANGCHAIN_PROJECT="LLM Evaluation Example"` environment variables.

```python
# Install the libraries first if needed:
# pip install langchain-openai openai langsmith

import json
import os
from typing import Any, Dict, List, Optional

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import Client  # To interact with LangSmith

# Set the OpenAI API key if it is not already in your environment
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Set up LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"  # Replace with your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "LLM Evaluation Example"  # Set your project name

# Initialize LangSmith Client
langsmith_client = Client()

# --- 1. Simulated Q&A System (LLM being evaluated) ---
# This is the LLM whose response quality we want to evaluate.
target_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

def simple_qa_system(question: str, context: Optional[str] = None) -> str:
    """
    Simulated Q&A system that uses an LLM to answer questions based on context.
    """
    # {context} and {question} are template variables filled in by invoke(),
    # so braces inside the inputs cannot break the template.
    user_template = (
        "Context: {context}\n\nQuestion: {question}" if context else "Question: {question}"
    )
    qa_prompt_template = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful Q&A assistant. Answer the following question. If context is provided, use it to answer. Otherwise, answer based on your general knowledge."),
        ("user", user_template),
    ])
    qa_chain = qa_prompt_template | target_llm | StrOutputParser()
    inputs = {"question": question}
    if context:
        inputs["context"] = context
    return qa_chain.invoke(inputs)

# --- 2. Evaluation Dataset ---
# Each item includes: question, context (if any), and ground truth answer
evaluation_dataset = [
    {
        "question": "What is the capital of France?",
        "context": "Paris is the capital and most populous city of France.",
        "ground_truth_answer": "Paris"
    },
    {
        "question": "Who was the first person to set foot on the Moon?",
        "context": "Neil Armstrong, an American astronaut, was the first person to set foot on the Moon, in 1969.",
        "ground_truth_answer": "Neil Armstrong"
    },
    {
        "question": "Calculate 15 plus 7.",
        "context": None,  # No context for calculation question
        "ground_truth_answer": "22"
    },
    {
        "question": "What is the weather like in Da Nang today?",
        "context": "The weather in Da Nang today is sunny, with a temperature of 30 degrees Celsius.",
        "ground_truth_answer": "Sunny, 30 degrees Celsius"
    },
    {
        "question": "What color is the sky at night?",
        "context": None,
        "ground_truth_answer": "Black or dark blue"
    }
]

# --- 3. LLM Judge ---
# This LLM will evaluate the responses of the target_llm
judge_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)  # Use a more powerful model as judge

# Doubled braces ({{ }}) are literal braces; single braces are template variables.
judge_prompt_template = ChatPromptTemplate.from_messages([
    ("system", """You are an AI evaluation expert. Your task is to assess the quality of a response from another AI.
You will be provided with:
- The original user question.
- The context provided to the AI (if any).
- The human ground truth answer.
- The AI's response to be evaluated.

Evaluate the AI's response based on the following 2 criteria:
1.  **Relevance:** Does the AI's response directly answer the question? (0 = not relevant, 1 = partially relevant, 2 = fully relevant)
2.  **Factual Consistency / Groundedness:** Is the information in the AI's response supported by the provided context (if any) and/or reliable general knowledge? (0 = not consistent/hallucination, 1 = partially consistent, 2 = fully consistent)

Respond with only a JSON object in the following format, without a Markdown code fence:
{{
    "relevance_score": <relevance score>,
    "factual_consistency_score": <factual consistency score>,
    "reasoning": "<reasoning for scores>"
}}
"""),
    ("user", """
Original Question: {question}
Context: {context}
Ground Truth Answer: {ground_truth}
AI Response: {llm_response}
"""),
])
judge_chain = judge_prompt_template | judge_llm | StrOutputParser()

def evaluate_with_llm_judge(question: str, context: Optional[str], ground_truth: str, llm_response: str) -> Dict[str, Any]:
    """
    Uses a judge LLM to evaluate the relevance and factual consistency of a response.
    """
    raw_evaluation = ""
    try:
        raw_evaluation = judge_chain.invoke({
            "question": question,
            "context": context or "None",
            "ground_truth": ground_truth,
            "llm_response": llm_response
        })
        # The judge may still wrap its answer in a Markdown code fence; strip it before parsing.
        fence = "`" * 3
        cleaned = raw_evaluation.strip()
        if cleaned.startswith(fence):
            cleaned = cleaned.strip("`").strip()
            if cleaned.startswith("json"):
                cleaned = cleaned[len("json"):]
        return json.loads(cleaned)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON from judge LLM: {e}. Raw response: {raw_evaluation}")
        return {
            "relevance_score": -1,  # Error score
            "factual_consistency_score": -1,
            "reasoning": f"JSON parsing error: {e}. Raw response: {raw_evaluation}"
        }
    except Exception as e:
        print(f"Error calling judge LLM: {e}")
        return {
            "relevance_score": -1,
            "factual_consistency_score": -1,
            "reasoning": f"Error calling judge LLM: {e}"
        }

# --- 4. Main Evaluation Pipeline ---
def run_evaluation_process(dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Runs the evaluation pipeline for the entire dataset and returns the per-case results.
    """
    results = []
    for i, item in enumerate(dataset):
        print(f"\n--- Evaluating case {i+1}/{len(dataset)} ---")
        question = item["question"]
        context = item["context"]
        ground_truth = item["ground_truth_answer"]

        # Step 1: Get response from our Q&A system
        print(f"Question: {question}")
        print(f"Context: {context}")
        llm_response = simple_qa_system(question, context)
        print(f"AI Response: {llm_response}")

        # Step 2: Evaluate the response using the LLM judge
        evaluation = evaluate_with_llm_judge(question, context, ground_truth, llm_response)
        print(f"Evaluation Result: {evaluation}")
        results.append(evaluation)

    # --- Aggregate Results ---
    total_relevance_score = 0
    total_factual_consistency_score = 0
    valid_evaluations = 0

    for res in results:
        # Skip cases where the judge call or JSON parsing failed (score of -1)
        if res["relevance_score"] != -1 and res["factual_consistency_score"] != -1:
            total_relevance_score += res["relevance_score"]
            total_factual_consistency_score += res["factual_consistency_score"]
            valid_evaluations += 1

    if valid_evaluations > 0:
        avg_relevance = total_relevance_score / valid_evaluations
        avg_factual_consistency = total_factual_consistency_score / valid_evaluations
        print("\n--- Aggregated Results ---")
        print(f"Average Relevance Score: {avg_relevance:.2f}")
        print(f"Average Factual Consistency Score: {avg_factual_consistency:.2f}")
    else:
        print("\nNo valid evaluations were performed.")

    return results

# --- Execute the evaluation pipeline ---
print("Starting evaluation process...")
run_evaluation_process(evaluation_dataset)
print("\nEvaluation process completed.")
```
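
Beyond the averages, it is often useful to know how many judge calls failed and which cases deserve a manual look. The sketch below works on per-case dictionaries with the same keys the judge returns (`relevance_score`, `factual_consistency_score`, with `-1` marking an error). The function name `summarize_results`, the `pass_threshold` parameter, and the sample data are assumptions for illustration.

```python
from typing import Any, Dict, List

SCORE_KEYS = ("relevance_score", "factual_consistency_score")

def summarize_results(results: List[Dict[str, Any]], pass_threshold: int = 2) -> Dict[str, Any]:
    """Summarizes judge results: error count, per-criterion averages, cases to review."""
    # A case is valid when none of its scores carries the -1 error marker
    valid_indices = [
        i for i, res in enumerate(results)
        if all(res.get(key, -1) != -1 for key in SCORE_KEYS)
    ]
    summary: Dict[str, Any] = {
        "total_cases": len(results),
        "judge_errors": len(results) - len(valid_indices),
    }
    for key in SCORE_KEYS:
        scores = [results[i][key] for i in valid_indices]
        summary["avg_" + key] = sum(scores) / len(scores) if scores else None
    # Indices of valid cases where any criterion falls below the threshold
    summary["needs_review"] = [
        i for i in valid_indices
        if any(results[i][key] < pass_threshold for key in SCORE_KEYS)
    ]
    return summary

# Hypothetical results in the shape returned by the judge
sample_results = [
    {"relevance_score": 2, "factual_consistency_score": 2, "reasoning": "Correct."},
    {"relevance_score": 1, "factual_consistency_score": 2, "reasoning": "Partly off-topic."},
    {"relevance_score": -1, "factual_consistency_score": -1, "reasoning": "JSON parsing error"},
]
summary = summarize_results(sample_results)
print(summary)
```

Storing this summary for each application version gives a simple baseline for regression testing: a drop in an average or a growing `needs_review` list signals that a change deserves a closer look.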