## Evaluation of AI Agent's Responses

### Overview

In this demo, we’ll implement a framework for evaluating the responses of an AI agent. Whether you're building a chatbot, a knowledge assistant, or a task-specific agent, evaluation is key to ensuring trust, relevance, and continuous improvement.

### Scenario

Suppose you’ve deployed a knowledge-based agent that answers user questions using company documentation. Now, stakeholders want to know:

- How accurate are the responses?
- Are the answers grounded in the context provided?
- Do they follow the expected format?

To answer these questions, you need to implement a response evaluation pipeline that scores or classifies agent outputs based on defined criteria — either using another LLM (automatic evaluation) or a manual review process.


The workflow consists of:

- A RAG pipeline for information retrieval: Retrieve, augment, and generate answers.
- An LLM-based judge for evaluation.
- Quality assessment: Evaluate the answers using RAGAS.
- Observability: Log performance metrics in MLflow.

### 0. Import libraries

In [7]:
import mlflow
from mlflow import log_params, log_metrics
from typing import List, Dict
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langgraph.graph import START, END, StateGraph
from langgraph.graph.message import MessagesState
from langchain.prompts import ChatPromptTemplate
from ragas import evaluate
from datasets import Dataset
from IPython.display import Image, display

from dotenv import load_dotenv
load_dotenv()

True

### 1. **Multiple Model Setup**

Three models are initialized:

- `llm`: a standard OpenAI model used for generating answers.
- `llm_judge`: a more powerful model used to evaluate the generated answers.
- `embedding`: OpenAI embeddings used for vector search.

In [2]:
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,
)

# This will evaluate the responses
llm_judge = ChatOpenAI(
    model="gpt-4o",
    temperature=0.0,
)

embeddings_fn = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

### 2. **MLflow Experiment Configuration**

- An MLflow experiment is created and a run is started with a custom name.
- Run metadata, such as the LLM and embedding model names, is logged as parameters.

In [3]:
mlflow.set_experiment("evaluation-demo")

2025/05/08 15:36:15 INFO mlflow.tracking.fluent: Experiment with name 'evaluation-demo' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///Users/tim/Devs/Projects/agentic-ai-research/tutorials/knowledge-base-agent-with-langgraph/improving-reliability/evaluation/mlruns/350342372225944296', creation_time=1746689775119, experiment_id='350342372225944296', last_update_time=1746689775119, lifecycle_stage='active', name='evaluation-demo', tags={}>

In [5]:
with mlflow.start_run(run_name="llm-as-judge") as run:
    log_params(
        {
            "embeddings_model":embeddings_fn.model,
            "llm_model": llm.model_name,
            "llm_judge_model": llm_judge.model_name,
        }
    )
    print(run.info)

<RunInfo: artifact_uri='file:///Users/tim/Devs/Projects/agentic-ai-research/tutorials/knowledge-base-agent-with-langgraph/improving-reliability/evaluation/mlruns/350342372225944296/f0f466819d544d428a798380ffd2c930/artifacts', end_time=None, experiment_id='350342372225944296', lifecycle_stage='active', run_id='f0f466819d544d428a798380ffd2c930', run_name='llm-as-judge', run_uuid='f0f466819d544d428a798380ffd2c930', start_time=1746689803522, status='RUNNING', user_id='tim'>


In [6]:
mlflow_run_id = run.info.run_id
mflow_client = mlflow.tracking.MlflowClient()
mflow_client.get_run(mlflow_run_id)

<Run: data=<RunData: metrics={}, params={'embeddings_model': 'text-embedding-3-large',
 'llm_judge_model': 'gpt-4o',
 'llm_model': 'gpt-4o-mini'}, tags={'mlflow.runName': 'llm-as-judge',
 'mlflow.source.name': '/Users/tim/miniforge3/envs/agent/lib/python3.10/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'tim'}>, info=<RunInfo: artifact_uri='file:///Users/tim/Devs/Projects/agentic-ai-research/tutorials/knowledge-base-agent-with-langgraph/improving-reliability/evaluation/mlruns/350342372225944296/f0f466819d544d428a798380ffd2c930/artifacts', end_time=1746689803531, experiment_id='350342372225944296', lifecycle_stage='active', run_id='f0f466819d544d428a798380ffd2c930', run_name='llm-as-judge', run_uuid='f0f466819d544d428a798380ffd2c930', start_time=1746689803522, status='FINISHED', user_id='tim'>, inputs=<RunInputs: dataset_inputs=[]>>

### 3. **Document Processing**

- A local PDF document is loaded.
- Text is chunked and embedded into a vector store using Chroma.

In [8]:
# Initialize vector store
vector_store = Chroma(
    collection_name="evaluation-demo",
    embedding_function=embeddings_fn
)

# Load and process PDF documents
file_path = "the-era-of-experience.pdf"
loader = PyPDFLoader(file_path)

# loader.load() already returns a list with one Document per page
pages = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)
all_splits = text_splitter.split_documents(pages)

# Store document chunks in the vector database
_ = vector_store.add_documents(documents=all_splits)
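
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph and line breaks, whereas the hypothetical `chunk_with_overlap` helper below cuts at exact character offsets.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The notebook's settings (1000 and 200) apply the same idea at a larger scale.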


### 4. **State Schema Definition**

- A new session state class is defined, extending `MessagesState`.
- It includes fields for:
    - `run_id`
    - `question`
    - `ground_truth`
    - `documents` (retrieved context)
    - `answer` (LLM response)
    - `evaluation` (dictionary of evaluation metrics)

Field types: `run_id` (`str`), `question` (`str`), `ground_truth` (`str`), `documents` (`List[Document]`), `answer` (`str`), and `evaluation` (`Dict`).

In [9]:
class State(MessagesState):
    run_id: str
    question: str
    ground_truth: str
    documents: List[Document]
    answer: str
    evaluation: Dict

### 5. **RAG Node Pipeline**

Three functional nodes are reused from previous exercises:

- `retrieve`: Similarity search on the vector store.
- `augment`: Uses a chat prompt with `question` and `context`.
- `generate`: Uses the standard LLM to produce the response.

In [10]:
def retrieve(state: State):
    question = state["question"]
    retrieved_docs = vector_store.similarity_search(question)
    return {"documents": retrieved_docs}
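
Conceptually, a similarity search embeds the question, scores it against every stored chunk vector, and returns the closest chunks. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. `cosine` and `top_k` are hypothetical helpers written for this example, not Chroma's implementation, and the three-dimensional vectors are placeholders for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], store: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for chunk embeddings.
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]
results = top_k(query_vector, toy_store, k=2)
print([doc_id for doc_id, _ in results])
```

In the `retrieve` node, the number of returned chunks can be adjusted by passing `k` to `similarity_search`; fewer chunks shorten the prompt, while more chunks may improve recall.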

In [12]:
def augment(state: State):
    question = state["question"]
    documents = state["documents"]
    docs_content = "\n\n".join(doc.page_content for doc in documents)

    template = ChatPromptTemplate([
        ("system", "You are an assistant for question-answering tasks."),
        ("human", "Use the following pieces of retrieved context to answer the question. "
                "If you don't know the answer, just say that you don't know. " 
                "Use three sentences maximum and keep the answer concise. "
                "\n# Question: \n-> {question} "
                "\n# Context: \n-> {context} "
                "\n# Answer: "),
    ])

    messages = template.invoke(
        {"context": docs_content, "question": question}
    ).to_messages()

    return {"messages": messages}
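
To see what the model receives, the following sketch reproduces the prompt assembly with plain string formatting. `FakeDoc` and `build_prompt` are hypothetical stand-ins for `Document` and `ChatPromptTemplate`, and the template is shortened, so treat this as an illustration rather than the exact message that LangChain builds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str


# Shortened version of the human message used in the augment node.
HUMAN_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question. "
    "\n# Question: \n-> {question} "
    "\n# Context: \n-> {context} "
    "\n# Answer: "
)


def build_prompt(question: str, documents: List[FakeDoc]) -> str:
    # Same joining rule as the augment node: chunks separated by a blank line.
    context = "\n\n".join(doc.page_content for doc in documents)
    return HUMAN_TEMPLATE.format(question=question, context=context)


docs = [FakeDoc("Agents learn from experience."), FakeDoc("Human data is finite.")]
prompt = build_prompt("How do agents learn?", docs)
print(prompt)
```

Printing the assembled prompt during development is a quick way to confirm that the retrieved chunks actually reach the model in the expected order.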

In [13]:
def generate(state: State):
    ai_message = llm.invoke(state["messages"])
    return {"answer": ai_message.content, "messages": ai_message}

### 6. **Evaluation Node**

- A new `evaluate_rag` node is created.

- It constructs a dataset of:

    - `question`, `answer`, `contexts`, and `ground_truth`.

- Uses `llm_judge` and `evaluate()` from RAGAS to score:

    - `faithfulness`
    - `context_precision`
    - `context_recall`
    - `answer_relevancy`

- Each metric is logged to MLflow.

In [14]:
def evaluate_rag(state: State):
    question = state["question"]
    documents = state["documents"]
    answer = state["answer"]
    ground_truth = state["ground_truth"]
    dataset = Dataset.from_dict(
        {
            "question": [question],
            "answer": [answer],
            "contexts": [[doc.page_content for doc in documents]],
            "ground_truth": [ground_truth]
        }
    )

    evaluation_results = evaluate(
        dataset=dataset,
        llm=llm_judge
    )
    print(evaluation_results)

    # Log metrics in MLflow by resuming the run created earlier.
    # Each metric lookup returns a list of per-row scores; the dataset
    # has a single row, so index 0 is used,
    # e.g. evaluation_results["faithfulness"][0]
    with mlflow.start_run(run_id=state["run_id"]):
        log_metrics({
            "faithfulness": evaluation_results["faithfulness"][0],
            "context_precision": evaluation_results["context_precision"][0],
            "context_recall": evaluation_results["context_recall"][0],
            "answer_relevancy": evaluation_results["answer_relevancy"][0]
        })

    return {"evaluation": evaluation_results}
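
As an intuition for what the judge computes, faithfulness can be read as the share of claims in the answer that the retrieved context supports. The sketch below works through that ratio with hand-labelled verdicts. `faithfulness_score` is a hypothetical helper for illustration; in RAGAS the judge LLM extracts the claims and produces the verdicts, and the library's handling of edge cases may differ.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption for this sketch: an answer with no claims scores 0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer broken into four claims.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(score)
```

Three of the four claims are supported, so the score is 0.75. A score of 1.0, as in a fully grounded answer, means every claim could be traced back to the retrieved chunks.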

### 7. **Workflow Construction**

- A `StateGraph` is created with the following nodes and edges:

    - `start → retrieve → augment → generate → evaluate_rag → end`

In [15]:
workflow = StateGraph(State)

workflow.add_node("retrieve", retrieve)
workflow.add_node("augment", augment)
workflow.add_node("generate", generate)
workflow.add_node("evaluate_rag", evaluate_rag)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "augment")
workflow.add_edge("augment", "generate")
workflow.add_edge("generate", "evaluate_rag")
workflow.add_edge("evaluate_rag", END)

<langgraph.graph.state.StateGraph at 0x337c9a140>

In [17]:
graph = workflow.compile()

# display(
#     Image(
#         graph.get_graph().draw_mermaid_png()
#     )
# )
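
For a linear graph like this one, execution can be pictured as each node reading the current state and returning a partial update that is merged back in. The following plain-Python sketch approximates that behaviour. `run_linear` and the stub nodes are hypothetical, and the sketch ignores LangGraph features such as the reducer that `MessagesState` applies to `messages`, branching, and checkpointing.

```python
from typing import Any, Callable, Dict, List

PipelineState = Dict[str, Any]
Node = Callable[[PipelineState], PipelineState]


def run_linear(nodes: List[Node], initial: PipelineState) -> PipelineState:
    """Run nodes in order, merging each partial update into the state."""
    state = dict(initial)
    for node in nodes:
        state.update(node(state))
    return state


# Stub nodes that mimic the shape of the real ones without calling any model.
def retrieve_stub(state: PipelineState) -> PipelineState:
    return {"documents": [f"chunk about {state['question']}"]}


def generate_stub(state: PipelineState) -> PipelineState:
    return {"answer": f"Based on {len(state['documents'])} chunk(s)."}


def evaluate_stub(state: PipelineState) -> PipelineState:
    return {"evaluation": {"has_answer": bool(state["answer"])}}


final_state = run_linear(
    [retrieve_stub, generate_stub, evaluate_stub],
    {"question": "experience"},
)
print(final_state)
```

Because every node only returns the keys it changes, the evaluation node can be appended to the end of an existing RAG graph without touching the earlier nodes.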

### 8. **Execution and Evaluation**

- A test query is submitted:
    - *"What is the key difference between the era of human data and the era of experience?"*
    - A ground truth reference answer is provided.
- The pipeline completes and produces evaluation metrics on a 0-to-1 scale:
    - `answer_relevancy`, `faithfulness`, `context_precision`, and `context_recall`
    - In the run recorded below, all four scored 1.0.
- MLflow logs both the parameters and evaluation metrics for inspection.

In [18]:
reference = [
    {
        "question": "What is the key difference between the era of human data and the era of experience?",
        "ground_truth": "The key difference is that in the era of human data, AI systems learn mainly by imitating and " 
                        "fine-tuning on large amounts of human-generated data, which limits them to reproducing " 
                        "human knowledge and abilities. In contrast, the era of experience is defined by AI agents " 
                        "learning predominantly from their own interactions with the environment, allowing them to " 
                        "continually improve, adapt, and discover novel strategies beyond what is available in human data." 
    }
]

In [19]:
output = graph.invoke(
    {
        "question": reference[0]["question"],
        "ground_truth": reference[0]["ground_truth"],
        "run_id": mlflow_run_id
    }
)

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

{'answer_relevancy': 1.0000, 'context_precision': 1.0000, 'faithfulness': 1.0000, 'context_recall': 1.0000}
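
Scores from a single question say little about overall quality, so a natural next step is to run every entry in `reference` through the graph and average each metric. The sketch below shows only the aggregation step. `aggregate_scores` is a hypothetical helper, and the numbers are placeholders for illustration, not results produced by this notebook.

```python
from statistics import mean
from typing import Dict, List


def aggregate_scores(per_question: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all questions that report it."""
    metric_names = sorted({name for scores in per_question for name in scores})
    return {
        name: mean(scores[name] for scores in per_question if name in scores)
        for name in metric_names
    }


# Placeholder numbers for illustration only, not results from this notebook.
placeholder_scores = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": 0.5, "answer_relevancy": 1.0},
]
summary = aggregate_scores(placeholder_scores)
print(summary)
```

The averaged values could then be logged with `log_metrics` in the same way as the per-question scores, giving one summary figure per metric for each run.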


### 9. Inspect in MLflow

In [20]:
mflow_client.get_run(mlflow_run_id)

<Run: data=<RunData: metrics={'answer_relevancy': 0.9999999999999996,
 'context_precision': 0.999999999975,
 'context_recall': 1.0,
 'faithfulness': 1.0}, params={'embeddings_model': 'text-embedding-3-large',
 'llm_judge_model': 'gpt-4o',
 'llm_model': 'gpt-4o-mini'}, tags={'mlflow.runName': 'llm-as-judge',
 'mlflow.source.name': '/Users/tim/miniforge3/envs/agent/lib/python3.10/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'tim'}>, info=<RunInfo: artifact_uri='file:///Users/tim/Devs/Projects/agentic-ai-research/tutorials/knowledge-base-agent-with-langgraph/improving-reliability/evaluation/mlruns/350342372225944296/f0f466819d544d428a798380ffd2c930/artifacts', end_time=1746690595731, experiment_id='350342372225944296', lifecycle_stage='active', run_id='f0f466819d544d428a798380ffd2c930', run_name='llm-as-judge', run_uuid='f0f466819d544d428a798380ffd2c930', start_time=1746689803522, status='FINISHED', user_id='tim'>, inputs=<RunInputs: dataset_inputs