# Eval Driven Development with MLflow & LangChain

This notebook demonstrates how to perform **Evaluation Driven Development (EDD)** for GenAI applications using **MLflow 3.0+** and **LangChain**.

We will cover two main scenarios:
1.  **RAG Evaluation**: We evaluate the RAG agent with three metrics: `Correctness`, `Answer Relevancy`, and `Context Relevancy`.
2.  **Agent Evaluation**: We evaluate the agent's traces and efficiency with two metrics: `Task Completeness` and `Tool Trajectory Analysis`.

### Prerequisites
Ensure you have set your `OPENAI_API_KEY` in the environment or a `.env` file.

# Import Setup

In [1]:
# !uv sync

In [2]:
# Install dependencies if running in Colab or a fresh environment
#%pip install -q "mlflow>=3.0" deepeval langgraph langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu pandas openai python-dotenv bs4

In [3]:
# Standard library
import json
import os

# MLflow: tracing entities, scorer decorator, and built-in LLM judges
import mlflow
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
from mlflow.genai.scorers import (
    Correctness,
    Guidelines,
    RelevanceToQuery,
)

# LangChain: agent, chat model, tools, and vector store
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
from langchain.tools import tool
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# DeepEval metrics used inside the custom scorers
from deepeval.metrics import ContextualRelevancyMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

from dotenv import load_dotenv

print(f"MLflow version: {mlflow.__version__}")

MLflow version: 3.6.0


In [4]:
# Load API Key
load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = input("Enter your OpenAI API Key: ")

# Set a specific experiment for this notebook
mlflow.set_experiment("GenAI_Eval_Demo")


<Experiment: artifact_location='file:///Users/pedro.azevedo/dspt-mlflow/mlruns/413835162552422093', creation_time=1763905836235, experiment_id='413835162552422093', last_update_time=1763905836235, lifecycle_stage='active', name='GenAI_Eval_Demo', tags={}>

## Part 1: RAG Evaluation

We will build a simple RAG agent that answers questions about a fictional electronics product line (Orbit). We will then evaluate it with MLflow's built-in LLM judges (`Correctness`, `RelevanceToQuery`) and a custom trace-based scorer that inspects the documents actually retrieved to check their relevance.

In [5]:
# 1. Enable Autologging
mlflow.langchain.autolog()

### Helper Functions to Process Traces

In [6]:


def extract_source_nodes(json_input):
    """
    Parses a JSON string containing a message history and extracts source nodes
    from tool artifacts.
    """
    try:
        parsed_data = json.loads(json_input)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return []

    # Handle the structure: {"messages": [...]}
    messages = parsed_data.get("messages", []) if isinstance(parsed_data, dict) else []
    
    source_nodes = []
    
    for message in messages:
        # We are looking for messages where type is 'tool' and an 'artifact' list exists
        if message.get("type") == "tool" and "artifact" in message:
            artifacts = message["artifact"]
            
            # Ensure artifact is a list before extending our results
            if isinstance(artifacts, list):
                source_nodes.extend(artifacts)
                
    return source_nodes

def extract_final_response(json_input):
    """
    Parses a JSON string and extracts the content of the final AI response.
    """
    try:
        parsed_data = json.loads(json_input)
    except json.JSONDecodeError:
        return None

    messages = parsed_data.get("messages", []) if isinstance(parsed_data, dict) else []
    
    # Iterate backwards to find the most recent AI message with content
    for message in reversed(messages):
        if message.get("type") == "ai" and message.get("content"):
            return message["content"]
            
    return None

def _extract_deepeval_components(trace : Trace):
    """Extract input, output, and context from trace data"""
    request = str(trace.data.request)
    response = str(trace.data.response)

    # Extract the retrieved documents (source nodes), if any exist
    outputs = extract_source_nodes(response)
    retrieval_context = [node['page_content'] for node in outputs]

    # Fall back to an empty string so the test case never receives None
    actual_output = extract_final_response(response) or ""

    return {
        'input': request,
        'actual_output': actual_output,
        'retrieval_context': retrieval_context
    }
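
The helpers above assume the trace response is a JSON string shaped like `{"messages": [...]}`, where tool messages carry an `artifact` list and the final answer is the last `ai` message with content. The sketch below walks a hand-written payload of that shape with the same traversal logic. The payload itself is hypothetical and only illustrates the expected structure; real traces may carry additional fields.

```python
import json

# Hypothetical payload mimicking the {"messages": [...]} shape the helpers expect.
sample_response = json.dumps({
    "messages": [
        {"type": "human", "content": "What port does the X10 use?"},
        {
            "type": "tool",
            "content": "Source: {'id': 'doc_1'}\nContent: ...",
            "artifact": [
                {
                    "page_content": "Orbit Phone X10 Specs: uses USB-C charging port.",
                    "metadata": {"id": "doc_1"},
                }
            ],
        },
        {"type": "ai", "content": "The X10 uses a USB-C port."},
    ]
})

messages = json.loads(sample_response)["messages"]

# Same rule as extract_source_nodes: collect list-valued artifacts from tool messages.
retrieval_context = [
    node["page_content"]
    for message in messages
    if message.get("type") == "tool" and isinstance(message.get("artifact"), list)
    for node in message["artifact"]
]

# Same rule as extract_final_response: the last AI message with non-empty content.
final_answer = next(
    (m["content"] for m in reversed(messages) if m.get("type") == "ai" and m.get("content")),
    None,
)

print(retrieval_context)
print(final_answer)
```

Checking the helpers against a small fixture like this makes it easier to spot when a library upgrade changes the serialized message format.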




## Build the RAG Agent

In [7]:
# 1. Define content as variables to ensure 100% match between VectorStore and Eval Dataset
rag_content_phone = "Orbit Phone X10 Specs: Runs OrbitOS 4.0, uses USB-C charging port, supports 5G, release date Jan 2024."
rag_content_watch = "Orbit Watch Pro Specs: Requires phone running OrbitOS 4.0 or higher to sync. Battery life 24h."
rag_content_buds = "Orbit Buds Lite Specs: Connects via Bluetooth 5.0. Compatible with any device supporting Bluetooth."
rag_content_old_charger = "Legacy Charger Adapter: This adapter converts Micro-USB to USB-C. Max output 5W."
rag_content_new_charger = "Orbit FastCharger: Native USB-C charger. Output 30W. Required for fast charging on X10."

rag_docs = [
    Document(page_content=rag_content_phone, metadata={"id": "doc_1"}),
    Document(page_content=rag_content_watch, metadata={"id": "doc_2"}),
    Document(page_content=rag_content_buds, metadata={"id": "doc_3"}),
    Document(page_content=rag_content_old_charger, metadata={"id": "doc_4"}),
    Document(page_content=rag_content_new_charger, metadata={"id": "doc_5"}),
]

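# Create Vector Store & Retriever
vectorstore = FAISS.from_documents(rag_docs, OpenAIEmbeddings())
# Retrieval depth is set through search_kwargs; the tool below queries the vector store directly
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})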

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    retrieved_docs = vectorstore.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
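
With `response_format="content_and_artifact"`, the tool returns a two-part tuple: a text rendering for the model and the raw documents as an artifact on the tool message, which is what the trace helpers read later. The sketch below reproduces only the string formatting, using a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document` so that it runs without any dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def serialize(docs):
    # Same formatting as retrieve_context: one "Source/Content" block per document.
    return "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}" for doc in docs
    )


docs = [Doc("Orbit FastCharger: Native USB-C charger. Output 30W.", {"id": "doc_5"})]

# The tool returns both parts: text for the model, raw documents as the artifact.
content, artifact = serialize(docs), docs
print(content)
```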



In [8]:
rag_system_prompt = (
    "You are the Orbit Electronics Support Bot. "
    "For every user question, you must retrieve specifications for ALL devices mentioned. "
    "Synthesize the answer based ONLY on the retrieved text."
)

model = init_chat_model("gpt-4.1")
tools = [retrieve_context]


rag_agent = create_agent(model, tools, system_prompt=rag_system_prompt)

def qa_predict_rag_fn(query: str) -> str:
    response = rag_agent.invoke({
        "messages": [{"role": "user", "content": query}],
    })
    answer = response['messages'][-1].content
    return answer

In [9]:
qa_predict_rag_fn("What charger do I need for the Orbit Phone X10?")

'The Orbit Phone X10 uses a USB-C charging port. For fast charging, you need the Orbit FastCharger, which is a native USB-C charger with a 30W output.'

## Define Dataset

In [10]:
rag_eval_dataset = [
    # Case 1: Multi-hop Compatibility
    # Logic: User asks about Watch + Phone. 
    # Requirement: Must retrieve Phone Specs (OS version) AND Watch Specs (OS requirement).
    {
        "inputs": {"query": "Can I use the Orbit Watch Pro with the Orbit Phone X10?"},
        "expectations": {
            "expected_response": "Yes. The Orbit Phone X10 runs OrbitOS 4.0, which matches the Orbit Watch Pro's requirement.",
            # VERBATIM MATCHES:
            "expected_facts": [rag_content_phone, rag_content_watch],
            "retrieval_context": [rag_content_phone, rag_content_watch] 
        }
    },
    # Case 2: Multi-hop Power/Charging
    # Logic: User has Old Charger + New Phone. 
    # Requirement: Must retrieve Old Charger Specs (5W) AND New Charger Specs (Requirement for Fast Charge).
    {
        "inputs": {"query": "Will the Legacy Charger Adapter allow me to fast charge my Orbit Phone X10?"},
        "expectations": {
            "expected_response": "No. The Legacy Adapter output is 5W, but the X10 requires 30W (Orbit FastCharger) to fast charge.",
            # VERBATIM MATCHES:
            "expected_facts": [rag_content_old_charger, rag_content_new_charger],
            "retrieval_context": [rag_content_old_charger, rag_content_new_charger]
        }
    }
]
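
Each record in the dataset pairs an `inputs` dict, whose `query` key matches the `query` parameter of the prediction function, with an `expectations` dict read by the scorers. A quick structural check before running an evaluation can catch typos in keys early. The sketch below is one possible check; `validate_eval_records` and `fake_predict` are hypothetical names, and flagging empty expectations is a convention of this sketch rather than an MLflow requirement.

```python
import inspect


def validate_eval_records(records, predict_fn):
    """Return a list of problems; an empty list means the records look well-formed."""
    params = set(inspect.signature(predict_fn).parameters)
    problems = []
    for i, record in enumerate(records):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict) or not inputs:
            problems.append(f"record {i}: 'inputs' must be a non-empty dict")
        elif set(inputs) - params:
            unknown = sorted(set(inputs) - params)
            problems.append(f"record {i}: inputs {unknown} are not parameters of predict_fn")
        expectations = record.get("expectations")
        if not isinstance(expectations, dict) or not expectations:
            problems.append(f"record {i}: 'expectations' is missing or empty")
    return problems


def fake_predict(query: str) -> str:
    """Hypothetical stand-in for qa_predict_rag_fn (no model call)."""
    return "stub"


records = [
    {
        "inputs": {"query": "Can I use the Orbit Watch Pro with the Orbit Phone X10?"},
        "expectations": {"expected_response": "Yes."},
    },
    {"inputs": {"question": "key does not match predict_fn"}, "expectations": {}},
]

for problem in validate_eval_records(records, fake_predict):
    print(problem)
```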

## Define Scorers (Evaluators)

In [11]:
@scorer(name="Context Relevancy")
def contextual_relevancy(trace: Trace):
    """Evaluate the relevance of the retrieved context using DeepEval's ContextualRelevancyMetric."""
    data = _extract_deepeval_components(trace)

    # Build the DeepEval test case from the trace components
    test_case = LLMTestCase(
        input=data['input'],
        actual_output=data['actual_output'],
        retrieval_context=data['retrieval_context'],
    )

    # Define the DeepEval metric
    relevancy_metric = ContextualRelevancyMetric(
        threshold=0.7,
        model="gpt-4o-mini",
        include_reason=True,
    )

    # Run the metric evaluation
    relevancy_metric.measure(test_case)

    # Return the score and the judge's explanation as MLflow feedback
    return Feedback(
        value=relevancy_metric.score,
        rationale=relevancy_metric.reason,
    )

# Business-specific guidelines (defined for reference; not included in rag_scorers below)
business_guidelines = Guidelines(
    name="toxicity_clear",
    guidelines="""
    The response should avoid toxic language and adhere to community guidelines.
    """,
)

# Configure RAG-specific scorers
rag_scorers = [
    Correctness(
       # model="litellm_proxy:/amazon.nova-micro-v1:0",
    ),
    RelevanceToQuery(
        name="AnswerRelevance"
        #model="litellm_proxy:/amazon.nova-micro-v1:0",
    ),
    contextual_relevancy,
]


In [12]:
import mlflow

with mlflow.start_run(run_name="Simple Langgraph Agent"):

    eval_results = mlflow.genai.evaluate(
        data=rag_eval_dataset,
        predict_fn=qa_predict_rag_fn,
        scorers=rag_scorers,
    )

2025/11/24 21:14:12 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2025/11/24 21:14:13 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.
Evaluating:   0%|          | 0/2 [Elapsed: 00:00, Remaining: ?] 

Evaluating:  50%|█████     | 1/2 [Elapsed: 00:07, Remaining: 00:07] 

Evaluating: 100%|██████████| 2/2 [Elapsed: 00:09, Remaining: 00:00] 



✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: Simple Langgraph Agent
  Run ID: 0cb62d5b82ef485cb37048232864abc9

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.



eval_results.metrics

## Part 2: Agent Evaluation

We will build a small CMS agent backed by a mock database and evaluate its traces with the `Task Completeness` and `Tool Trajectory` scorers.

In [13]:
# Mock Database
cms_db = {
    "101": {"title": "AI Trends 2024", "status": "draft", "tags": ["tech"]},
    "102": {"title": "Summer Recipes", "status": "published", "tags": ["food"]},
}

# We define expected output strings for our test cases to verify against
EXPECTED_SEARCH_OUTPUT_102 = str([{"id": "102", "title": "Summer Recipes", "status": "published"}])
EXPECTED_DETAILS_OUTPUT_102 = str({"title": "Summer Recipes", "status": "published", "tags": ["food"]})


# Tools 
@tool
def search_articles(query: str):
    """Searches for articles by title. Returns a stringified list of matches."""
    # Simple logic to mimic a search engine
    results = [{"id": k, "title": v["title"], "status": v["status"]} 
               for k, v in cms_db.items() if query.lower() in v["title"].lower()]
    return str(results)

@tool
def get_article_details(article_id: str):
    """Gets full details for an ID."""
    return str(cms_db.get(article_id, "Article not found"))

@tool
def publish_article(article_id: str):
    """Publishes an article."""
    if article_id in cms_db:
        cms_db[article_id]["status"] = "published"
        return f"Success: Article {article_id} published."
    return "Error: ID not found."



In [14]:
# Setup Tools + LLM
cms_tools = [search_articles, get_article_details, publish_article]
chat_model = init_chat_model("gpt-4.1")

# Setup Prompt
agent_system_prompt = (
    "You are a CMS Manager. "
    "SOP: Always SEARCH for an article ID first. Never guess IDs. "
    "Before publishing, GET DETAILS to confirm the current status."
)

# Setup Agent
cms_agent = create_agent(model=chat_model, tools=cms_tools, system_prompt=agent_system_prompt)

# Prediction Function
def agent_predict_fn(query: str) -> str:
    response = cms_agent.invoke({
        "messages": [{"role": "user", "content": query}],
    })
    answer = response['messages'][-1].content
    return answer

## Setup Agent Eval Dataset

In [15]:
agent_eval_dataset = [
    # Case 1: Simple Retrieval
    {
        "inputs": {"query": "What is the status of the Summer Recipes post?"},
        "expectations": {
            "expected_response": "It is currently published.",
            "task_completion_threshold": 1.0,
            # The agent must call search, and the 'fact' it relies on is the tool output
            "expected_facts": [EXPECTED_SEARCH_OUTPUT_102], 
            "tool_call_trajectory": ["search_articles"]
        }
    },
    # Case 2: Complex Action (Search -> Check -> Publish)
    {
        "inputs": {"query": "Find the Summer Recipes article and ensure it is published."},
        "expectations": {
            "expected_response": "The article is already published.",
            # The agent should see the search result, AND the details showing it's published
            "expected_facts": [EXPECTED_SEARCH_OUTPUT_102, EXPECTED_DETAILS_OUTPUT_102],
            "tool_call_trajectory": ["search_articles", "get_article_details"] 
            # Note: It should NOT call publish_article because it sees it is already published
        }
    }
]

## Setup Scorers and Evaluators

In [16]:


@scorer(name="Task Completeness")
def task_completion_with_deepeval(trace: Trace, inputs: dict, outputs: str, expectations: dict) -> Feedback:
    """
    Custom scorer that uses DeepEval's TaskCompletionMetric to evaluate task completion
    based on trace analysis and tool calls
    """

    try:
        # Extract tool call information from the trace
        tool_call_spans = trace.search_spans(span_type=SpanType.TOOL)

        # Convert MLflow trace tool calls to DeepEval ToolCall format
        tools_called = []
        for span in tool_call_spans:
            tool_call = ToolCall(
                name=span.name,
                description=span.attributes.get("description", f"Tool call for {span.name}"),
                input_parameters=span.inputs or {},
                output=span.outputs or []
            )
            tools_called.append(tool_call)

        # Extract the actual response text from the complex output structure
        if isinstance(outputs, dict):
            # Handle complex response structure
            if 'response' in outputs and 'blocks' in outputs['response']:
                actual_output = outputs['response']['blocks'][0]['text']
            elif 'response' in outputs and isinstance(outputs['response'], str):
                actual_output = outputs['response']
            else:
                actual_output = str(outputs)
        elif isinstance(outputs, str):
            actual_output = outputs
        else:
            actual_output = str(outputs)

        # Create DeepEval test case
        test_case = LLMTestCase(
            input=inputs.get("query", ""),
            actual_output=actual_output,
            tools_called=tools_called
        )

        # Initialize TaskCompletionMetric
        threshold = expectations.get("task_completion_threshold", 0.7)
        metric = TaskCompletionMetric(
            threshold=threshold,
            model="gpt-4o",  # Use consistent model
            include_reason=True
        )

        # Run the metric evaluation
        metric.measure(test_case)

        # Extract results
        score = metric.score
        reason = metric.reason

        return Feedback(
            value=score,
            rationale=f"Task completion score: {score:.2f} (threshold: {threshold}). Tools used: {len(tools_called)}. {reason}",
        )

    except Exception as e:
        return Feedback(
            value=0.0,
            rationale=f"Error evaluating task completion: {str(e)}",
            error=e
        )


@scorer(name="Tool Trajectory")
def tool_call_trajectory_analysis(trace: Trace, expectations: dict) -> Feedback:
    """
    Analyze the tool call trajectory against expected sequence
    """
    try:
        # Search for tool call spans in the trace
        tool_call_spans = trace.search_spans(span_type=SpanType.TOOL)

        # Extract actual trajectory
        actual_trajectory = [span.name for span in tool_call_spans]
        expected_trajectory = expectations.get("tool_call_trajectory", [])

        # Calculate trajectory match
        trajectory_match = actual_trajectory == expected_trajectory

        # Calculate partial match score
        if not expected_trajectory:
            # No tool calls expected: only an empty actual trajectory is a full match
            partial_score = 0.0 if actual_trajectory else 1.0
        else:
            # Positional similarity: matching positions divided by the longer length
            max_len = max(len(actual_trajectory), len(expected_trajectory))
            # zip stops at the shorter sequence, so missing or extra calls lower the score
            matches = sum(
                1 for actual, expected in zip(actual_trajectory, expected_trajectory)
                if actual == expected
            )
            partial_score = matches / max_len

        return Feedback(
            value=partial_score,
            rationale=(
                f"Tool trajectory {'matches' if trajectory_match else 'differs from'} expectations. "
                f"Expected: {expected_trajectory}. Actual: {actual_trajectory}. "
                f"Match score: {partial_score:.2f}"
            )
        )

    except Exception as e:
        return Feedback(
            value=0.0,
            rationale=f"Error analyzing tool trajectory: {str(e)}",
            error=e
        )
    

agent_scorers = [
    task_completion_with_deepeval,
    tool_call_trajectory_analysis,
]
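
The `Tool Trajectory` scorer compares tool calls position by position and divides the number of matching positions by the length of the longer sequence. The standalone sketch below reproduces that arithmetic for a non-empty expected trajectory so the scoring behaviour is easy to inspect. `positional_match_score` is a hypothetical helper name, and the example trajectories are illustrative rather than taken from a recorded run.

```python
def positional_match_score(actual: list[str], expected: list[str]) -> float:
    """Fraction of positions that agree, normalized by the longer sequence.

    Assumes `expected` is non-empty.
    """
    matches = sum(a == e for a, e in zip(actual, expected))
    return matches / max(len(actual), len(expected))


expected = ["search_articles", "get_article_details"]

exact = positional_match_score(["search_articles", "get_article_details"], expected)
missing_step = positional_match_score(["search_articles"], expected)
swapped = positional_match_score(["get_article_details", "search_articles"], expected)
extra_step = positional_match_score(
    ["search_articles", "get_article_details", "publish_article"], expected
)

print(exact)         # 1.0
print(missing_step)  # 0.5
print(swapped)       # 0.0
print(extra_step)    # 2 matches out of 3 positions
```

Because the comparison is positional, calling the right tools in the wrong order scores zero, and an unnecessary extra call is penalized through the longer denominator. An order-insensitive metric such as set overlap would be more lenient, at the cost of no longer checking the SOP sequence.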


In [17]:
import mlflow

with mlflow.start_run(run_name="Simple Langgraph CMS Agent"):

    eval_results = mlflow.genai.evaluate(
        data=agent_eval_dataset,
        predict_fn=agent_predict_fn,
        scorers=agent_scorers,
    )

2025/11/24 21:15:07 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.
Evaluating:   0%|          | 0/2 [Elapsed: 00:00, Remaining: ?] 

Evaluating:  50%|█████     | 1/2 [Elapsed: 00:03, Remaining: 00:03] 

Evaluating: 100%|██████████| 2/2 [Elapsed: 00:05, Remaining: 00:00] 



✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: Simple Langgraph CMS Agent
  Run ID: a77d1491668b4133a076296f1a2f1f8d

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.



In [18]:
eval_results.metrics

{'Tool Trajectory/mean': np.float64(0.75),
 'Task Completeness/mean': np.float64(0.85)}
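
In an evaluation-driven workflow, aggregate metrics like these typically act as a gate: a change is accepted only if every metric stays above an agreed minimum. The sketch below shows one way to express that gate in plain Python. The metric values are copied from the output above, while `failing_metrics` and the 0.8 minimums are hypothetical choices made for illustration.

```python
def failing_metrics(metrics: dict, minimums: dict) -> dict:
    """Return the metrics that fall below their minimum (missing metrics count as 0.0)."""
    failures = {}
    for name, minimum in minimums.items():
        value = float(metrics.get(name, 0.0))  # float() also accepts np.float64
        if value < minimum:
            failures[name] = value
    return failures


# Values copied from the evaluation output above.
metrics = {"Tool Trajectory/mean": 0.75, "Task Completeness/mean": 0.85}

# Hypothetical acceptance thresholds; tune these per project.
minimums = {"Tool Trajectory/mean": 0.8, "Task Completeness/mean": 0.8}

failures = failing_metrics(metrics, minimums)
print(failures)  # {'Tool Trajectory/mean': 0.75}
```

With these assumed thresholds the run would be flagged on tool trajectory, pointing to the per-row rationales in the MLflow Traces tab as the next place to look.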