# Building the Evaluation Agent
For full documentation on building AI agents in code, [refer to this doc](https://docs.databricks.com/aws/en/generative-ai/agent-framework/author-agent?language=LangGraph).

This notebook was developed on DBR 16.4 ML LTS. At the time of writing, DBR 17.3 ML LTS has issues with Spark Connect and the default client mode, while 17.2 was tested and works fine. The reason is that, as of 11-Nov-2025, my region (AWS us-west-2) only has serverless runtime 17.2 installed, so running DBR 17.3 causes a runtime mismatch error. Until my serverless plane is updated to 17.3, I have to use a slightly older Databricks Runtime. This is separate from running a notebook on serverless compute; the mismatch comes from Databricks Connect in the backend.

## Dependencies
Creating an agent requires a recent version of MLflow (>=3.1.3), Python 3.10 or newer (the default on serverless), the Databricks agent framework, and the LangChain AI Bridge (`databricks-langchain`). Since `ChatAgent` has been deprecated, we will use MLflow's `ResponsesAgent` interface, which follows OpenAI's Responses schema, for an MLflow-compliant interface :)

In [0]:
#Use -qqq to show any errors or -qqqq for very quiet
%pip install -U -qqq langgraph uv databricks-agents databricks-langchain mlflow-skinny[databricks] 
dbutils.library.restartPython()
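
After the restart, it can be handy to confirm that the environment actually meets the minimums listed above. The following is a minimal sketch using only the standard library; `parse_version` and `check_environment` are hypothetical helper names, and the version parsing is deliberately simplified (it only reads the leading integers of each version component).

```python
import sys
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.1.3' or '3.4.0rc1' into a comparable tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_environment(min_python=(3, 10), min_mlflow=(3, 1, 3)) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {sys.version_info[:2]} is older than {min_python}")
    # The install cell uses mlflow-skinny, so look for either distribution name.
    for dist in ("mlflow", "mlflow-skinny"):
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            continue
        if parse_version(installed) < min_mlflow:
            problems.append(f"{dist} {installed} is older than {min_mlflow}")
        break
    else:
        problems.append("neither mlflow nor mlflow-skinny is installed")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

An empty result means the Python and MLflow minimums are met; it does not check the Databricks packages.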

## ResponsesAgent Overview
It's important to note that `ResponsesAgent` is a wrapper that lets a variety of different agents interface seamlessly in a common way. This means that we can author agents in Databricks, but have them interface with other platforms that use the same schema.
<br/>
<br/>
<img src="https://docs.databricks.com/aws/en/assets/images/responses-agent-overview-611d843718bf94974d277a365695043c.svg" width=1000 />

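To make the "common way" concrete, here is a rough sketch of the message shapes involved, written as plain dictionaries. The request matches what we pass to `AGENT.predict` later in this notebook. The response is a hand-written sample (not captured output), and its item layout is an assumption based on the Responses-style schema. `extract_text` is a hypothetical helper.

```python
from typing import Any

# Request: a list of input messages, same shape as used in the test cells below.
request = {"input": [{"role": "user", "content": "What is 6*7 in Python?"}]}

# Hand-written sample response (assumed layout): a tool call, the tool's
# output, and the final assistant message.
response = {
    "output": [
        {
            "type": "function_call",
            "call_id": "call_1",
            "name": "python_exec",  # illustrative tool name
            "arguments": '{"code": "print(6*7)"}',
        },
        {"type": "function_call_output", "call_id": "call_1", "output": "42\n"},
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "6*7 is 42."}],
        },
    ]
}


def extract_text(response: dict[str, Any]) -> str:
    """Concatenate the text parts of every assistant message in the output list."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip tool calls and tool outputs
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)


print(extract_text(response))
```

Because every agent emits the same item types, a caller can use one helper like this regardless of which framework built the agent.
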
In [0]:
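%%writefile agent.py
# This first cell overwrites agent.py; the later cells append with -a,
# so re-running the notebook from the top does not duplicate code in the file.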
from typing import Annotated, Any, Generator, Optional, Sequence, TypedDict, Union

import mlflow
from databricks_langchain import (
    ChatDatabricks,
    UCFunctionToolkit,
    VectorSearchRetrieverTool,
    # DatabricksFunctionClient
)
from langchain_core.messages import AIMessage, AIMessageChunk, AnyMessage
from langchain_core.runnables import RunnableConfig, RunnableLambda
from langchain_core.tools import BaseTool
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt.tool_node import ToolNode
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest,
    ResponsesAgentResponse,
    ResponsesAgentStreamEvent,
    output_to_responses_items_stream,
    to_chat_completions_input,
)
from unitycatalog.ai.core.databricks import DatabricksFunctionClient



### Defining the Foundation Model
As we build up our agent, we need to choose the foundation model that backs it. The model handles all of the text encoding and decoding, the routing logic (based on the tool descriptions), and instruction handling.

In [0]:
%%writefile -a agent.py
#Define the endpoint to use for the agent foundation and system prompt
LLM_ENDPOINT_NAME = "databricks-claude-3-7-sonnet"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME)
system_prompt = "You are a helpful assistant that can run Python code." #Give my agent a better description later.


### AI Tools
Tools that give the agent extra capabilities can be added here. We create a list named `tools` and extend it in the cells below.

In [0]:
%%writefile -a agent.py
tools = []

# You can use UDFs in Unity Catalog as agent tools
# Below, we add the `system.ai.python_exec` UDF, which provides
# a python code interpreter tool to our agent
# You can also add local LangChain python tools. See https://python.langchain.com/docs/concepts/tools

# TODO: Add additional tools
UC_TOOL_NAMES = ["system.ai.python_exec"]
client = DatabricksFunctionClient(execution_mode="serverless")
uc_toolkit = UCFunctionToolkit(function_names=UC_TOOL_NAMES, client=client)
tools.extend(uc_toolkit.tools)


### Vector Search Tools
Vector search indexes can supply context to the agent, adding RAG capabilities. Databricks vector search indexes can be used directly through `VectorSearchRetrieverTool`. Other vector databases can be added as external MCP endpoints. These are all appended to the `tools` list we defined above.

In [0]:
%%writefile -a agent.py
# Vector search tools let the agent retrieve unstructured text. Useful for a RAG agent.
VECTOR_SEARCH_TOOLS = []

# To add vector search retriever tools,
# create a VectorSearchRetrieverTool for each index,
# then append the result to VECTOR_SEARCH_TOOLS.
# Example:
# VECTOR_SEARCH_TOOLS.append(
#     VectorSearchRetrieverTool(
#         index_name="",
#         # filters="..."
#     )
# )
tools.extend(VECTOR_SEARCH_TOOLS)


### Agent State
`AgentState` is a `TypedDict` used to persist the conversation turns of the agent. LangGraph passes this state between the nodes of the graph to keep track of what the agent is doing while it runs.

In [0]:
%%writefile -a agent.py

class AgentState(TypedDict):
    messages: Annotated[Sequence[AnyMessage], add_messages]
    custom_inputs: Optional[dict[str, Any]]
    custom_outputs: Optional[dict[str, Any]]
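
The `Annotated[..., add_messages]` part is what makes `messages` accumulate: LangGraph reads the function attached to the annotation and uses it to merge each node's update into the existing value, while keys without one are simply overwritten. Below is a simplified standard-library sketch of that idea. `append_messages`, `SketchState`, and `apply_update` are hypothetical names, and the real `add_messages` does more than this (for example, it also handles messages that share an ID).

```python
from typing import Annotated, Any, Optional, Sequence, TypedDict, get_type_hints


def append_messages(existing: Sequence[dict], new: Sequence[dict]) -> list[dict]:
    """Simplified stand-in for a reducer: append new messages to the old ones."""
    return list(existing) + list(new)


class SketchState(TypedDict):
    messages: Annotated[Sequence[dict], append_messages]  # merged via reducer
    custom_inputs: Optional[dict[str, Any]]  # no reducer: overwritten


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, honoring any reducer."""
    hints = get_type_hints(SketchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers:
            merged[key] = reducers[0](state.get(key, []), value)
        else:
            merged[key] = value
    return merged


state = {"messages": [{"role": "user", "content": "hi"}], "custom_inputs": None}
state = apply_update(state, {"messages": [{"role": "assistant", "content": "hello"}]})
print(len(state["messages"]))
```

This is why `call_model` in the next cell can return only the new response in `{"messages": [response]}` without losing the earlier turns.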
    

### Agent Logic
This is where we define the actual logic for the agent and handle how the tools are called. We're also defining how the agent deals with conversation turns.

In [0]:
%%writefile -a agent.py
def create_tool_calling_agent(
    model: ChatDatabricks,
    tools: Union[ToolNode, Sequence[BaseTool]],
    system_prompt: Optional[str] = None,
):
    model = model.bind_tools(tools)

    # Define the function that determines which node to go to
    def should_continue(state: AgentState):
        messages = state["messages"]
        last_message = messages[-1]
        # If there are function calls, continue. else, end
        if isinstance(last_message, AIMessage) and last_message.tool_calls:
            return "continue"
        else:
            return "end"

    if system_prompt:
        preprocessor = RunnableLambda(
            lambda state: [{"role": "system", "content": system_prompt}] + state["messages"]
        )
    else:
        preprocessor = RunnableLambda(lambda state: state["messages"])
    model_runnable = preprocessor | model

    def call_model(
        state: AgentState,
        config: RunnableConfig,
    ):
        response = model_runnable.invoke(state, config)

        return {"messages": [response]}

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", RunnableLambda(call_model))
    workflow.add_node("tools", ToolNode(tools))

    workflow.set_entry_point("agent")
    workflow.add_conditional_edges(
        "agent",
        should_continue,
        {
            "continue": "tools",
            "end": END,
        },
    )
    workflow.add_edge("tools", "agent")

    return workflow.compile()
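
To see the control flow without any Databricks or LangGraph dependencies, here is a plain-Python sketch of the same loop: call the model, check for tool calls, run the tools, and go back to the model. Everything in it is a stand-in. `fake_model`, `TOOLS`, `run_tools`, and `run_agent` are hypothetical, messages are plain dicts, and the `max_steps` guard is an addition that the graph above does not have.

```python
from typing import Callable, Optional

# Stand-in tool registry: tools are looked up by name.
TOOLS: dict[str, Callable[..., object]] = {"multiply": lambda a, b: a * b}


def fake_model(messages: list[dict]) -> dict:
    """Hypothetical LLM: requests one tool call, then answers from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"The result is {last['content']}.", "tool_calls": []}
    call = {"id": "call_1", "name": "multiply", "args": {"a": 6, "b": 7}}
    return {"role": "assistant", "content": "", "tool_calls": [call]}


def should_continue(state: dict) -> str:
    """Same decision as in the graph: continue only if tools were requested."""
    return "continue" if state["messages"][-1].get("tool_calls") else "end"


def run_tools(state: dict) -> list[dict]:
    """Execute every tool call on the last message and return tool messages."""
    results = []
    for call in state["messages"][-1]["tool_calls"]:
        output = TOOLS[call["name"]](**call["args"])
        results.append({"role": "tool", "tool_call_id": call["id"], "content": str(output)})
    return results


def run_agent(user_text: str, system_prompt: Optional[str] = None, max_steps: int = 5) -> list[dict]:
    state = {"messages": [{"role": "user", "content": user_text}]}
    for _ in range(max_steps):
        # The system prompt is prepended per call and never stored in the state.
        prefix = [{"role": "system", "content": system_prompt}] if system_prompt else []
        state["messages"].append(fake_model(prefix + state["messages"]))  # "agent" node
        if should_continue(state) == "end":
            break
        state["messages"].extend(run_tools(state))  # "tools" node, then loop back
    return state["messages"]


for message in run_agent("What is 6*7?", "You are a helpful assistant."):
    print(message["role"], "->", message["content"] or message.get("tool_calls"))
```

The compiled graph does the same thing declaratively: the conditional edge plays the role of the `if`, and the `tools` to `agent` edge plays the role of the loop.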
    

### ResponsesAgent (OpenAI) Framework
Next, we create a new class called `LangGraphResponsesAgent`, which is a concrete implementation of the `ResponsesAgent` base class. `ResponsesAgent` follows the Responses schema defined by OpenAI, which is now used by many of the major agent platforms and is becoming the standard implementation. We expect future agent registries to require it as a protocol. `ResponsesAgent` is responsible for handling the conversation from human-to-agent, agent-to-agent and agent-to-human in a standard way.

**IMPORTANT!** `ResponsesAgent` is still classified as experimental within Databricks, while OpenAI now considers its Responses interface a stable release. Databricks _may_ move to a different implementation later. The specific implementation we're using here comes from MLflow. If you prefer, you can use OpenAI's native implementation; however, the MLflow version supports the full MLOps lifecycle, including conversation tracking for easy detection of hallucination.

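  - `predict()` is the boundary of the conversation to and from the agent: either the human-agent or the agent-agent interface. It returns only the completed output items.
  - `predict_stream()` exposes the internal conversation and discourse the agent has with itself (reasoning and tool calls) by yielding events as they are produced.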

In [0]:
%%writefile -a agent.py
class LangGraphResponsesAgent(ResponsesAgent):
    def __init__(self, agent):
        self.agent = agent

    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        outputs = [
            event.item
            for event in self.predict_stream(request)
            if event.type == "response.output_item.done"
        ]
        return ResponsesAgentResponse(output=outputs, custom_outputs=request.custom_inputs)

    def predict_stream(
        self,
        request: ResponsesAgentRequest,
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        cc_msgs = to_chat_completions_input([i.model_dump() for i in request.input])

        for event in self.agent.stream({"messages": cc_msgs}, stream_mode=["updates", "messages"]):
            if event[0] == "updates":
                for node_data in event[1].values():
                    if len(node_data.get("messages", [])) > 0:
                        yield from output_to_responses_items_stream(node_data["messages"])
            # filter the streamed messages to just the generated text messages
            elif event[0] == "messages":
                try:
                    chunk = event[1][0]
                    if isinstance(chunk, AIMessageChunk) and (content := chunk.content):
                        yield ResponsesAgentStreamEvent(
                            **self.create_text_delta(delta=content, item_id=chunk.id),
                        )
                except Exception as e:
                    print(e)
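
The relationship between the two methods is easier to see in isolation: `predict()` simply drains `predict_stream()` and keeps the items from the "done" events. The sketch below reproduces that pattern with plain dictionaries. `sketch_predict_stream` and `sketch_predict` are hypothetical, and the events are hand-written samples whose field names are assumed to mirror the ones used above.

```python
from typing import Iterator


def sketch_predict_stream(question: str) -> Iterator[dict]:
    """Hypothetical stream: partial text deltas, then one completed item."""
    pieces = ["6*7 ", "is ", "42."]
    for piece in pieces:
        # Incremental text, useful for showing progress in a UI.
        yield {"type": "response.output_text.delta", "item_id": "msg_1", "delta": piece}
    # The finished message arrives once, as a single "done" event.
    yield {
        "type": "response.output_item.done",
        "item": {
            "type": "message",
            "id": "msg_1",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "".join(pieces)}],
        },
    }


def sketch_predict(question: str) -> dict:
    """Collect only the completed items, discarding the partial deltas."""
    outputs = [
        event["item"]
        for event in sketch_predict_stream(question)
        if event["type"] == "response.output_item.done"
    ]
    return {"output": outputs}


print(sketch_predict("What is 6*7 in Python?"))
```

Writing `predict()` on top of `predict_stream()` keeps a single code path, so the streaming and non-streaming results cannot drift apart.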




In [0]:
%%writefile -a agent.py
mlflow.langchain.autolog()
agent = create_tool_calling_agent(llm, tools, system_prompt)
AGENT = LangGraphResponsesAgent(agent)
mlflow.models.set_model(AGENT)

### Testing the agent

In [0]:
#Restart the python interpreter to flush out any lingering instances of in-memory objects
dbutils.library.restartPython()

In [0]:
#Test the agent
from agent import AGENT

result = AGENT.predict({"input": [{"role": "user", "content": "What is 6*7 in Python?"}]})
print(result.model_dump(exclude_none=True))

### Viewing the Results as Chunks
Here we can expand the response and print each chunk as it's processed by the agent. We can also clearly see how the result is being reassembled. This is the type of conversation the agent has with itself, and we can follow its chain of reasoning to help us debug.

**NOTE!** The agent description has a big effect on the results. Often logical errors or fallacies can be remedied by fixing the descriptions and instructions.

In [0]:
#Test internal conversation turns
for chunk in AGENT.predict_stream(
    {"input": [{"role": "user", "content": "What is 6*7 in Python?"}]}
):
    print(chunk.model_dump(exclude_none=True))
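
If the raw chunks are too noisy to read, the text deltas can be stitched back together per output item. This is a small sketch that assumes each chunk has already been converted to a dict (as `chunk.model_dump(exclude_none=True)` does) and that text deltas carry `type`, `item_id`, and `delta` keys. `reassemble` is a hypothetical helper and the sample events are hand-written.

```python
from collections import defaultdict
from typing import Iterable


def reassemble(events: Iterable[dict]) -> dict[str, str]:
    """Join text deltas into one string per item_id, ignoring other events."""
    buffers: dict[str, list[str]] = defaultdict(list)
    for event in events:
        if event.get("type") == "response.output_text.delta":
            buffers[event["item_id"]].append(event["delta"])
    return {item_id: "".join(parts) for item_id, parts in buffers.items()}


# Hand-written sample events, not captured agent output.
sample_events = [
    {"type": "response.output_text.delta", "item_id": "msg_1", "delta": "Let me "},
    {"type": "response.output_text.delta", "item_id": "msg_1", "delta": "calculate."},
    {"type": "response.output_item.done", "item": {"type": "function_call"}},
    {"type": "response.output_text.delta", "item_id": "msg_2", "delta": "6*7 is 42."},
]

for item_id, text in reassemble(sample_events).items():
    print(item_id, "->", text)
```

With the real agent, the same helper could be fed `(c.model_dump(exclude_none=True) for c in AGENT.predict_stream(...))`, provided the assumed keys match.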

## Log the agent as an MLflow model
__This is taken straight from the custom agent boilerplate example__

Log the agent as code from the `agent.py` file (or whatever you called it in the writefile statements). See [MLflow - Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code).

If you are creating multiple agents, each one needs to be logged in MLflow so it can be used later in a multi-agent setup.

### Enable automatic authentication for Databricks resources
For the most common Databricks resource types, Databricks supports and recommends declaring resource dependencies for the agent upfront **during logging**. This enables automatic authentication passthrough when you deploy the agent. With automatic authentication passthrough, Databricks automatically provisions, rotates, and manages short-lived credentials to securely access these resource dependencies from within the agent endpoint.

To enable automatic authentication, specify the dependent Databricks resources when calling `mlflow.pyfunc.log_model()`.

  - **TODO**: If your Unity Catalog tool queries a [vector search index](docs link) or leverages [external functions](docs link), you need to include the dependent vector search index and UC connection objects, respectively, as resources. See docs ([AWS](https://docs.databricks.com/generative-ai/agent-framework/log-agent.html#specify-resources-for-automatic-authentication-passthrough) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-framework/log-agent#resources)).