# Building the Evaluation Agent
For full documentation on building AI agents in code, [refer to this doc](https://docs.databricks.com/aws/en/generative-ai/agent-framework/author-agent?language=LangGraph).

This notebook was developed on DBR 16.4 ML LTS. At the time of writing, DBR 17.3 ML LTS has issues with Spark Connect and the default client mode, while 17.2 has been tested and works fine. The cause is a version mismatch: as of 11-Nov-2025, my region (AWS us-west-2) only has serverless runtime 17.2 installed, so running DBR 17.3 raises a runtime mismatch error. Until my serverless plane is updated to 17.3, I have to use a slightly older Databricks Runtime. Note that this is separate from running a notebook on serverless compute; it comes from Databricks Connect being used in the backend.

# DEMO START
---

## Dependencies
Creating an agent requires a recent version of MLflow (>=3.1.3), Python 3.10 or newer (the default on serverless), the Databricks Agent Framework, and the Databricks LangChain integration (`databricks-langchain`, part of the Databricks AI Bridge). Since `ChatAgent` has been deprecated, we will use MLflow's `ResponsesAgent` interface, which is compatible with OpenAI's Responses API schema :)

In [0]:
# pip's -q flag is additive (up to three times): -q shows warnings and above, -qq errors and above, -qqq only critical messages
%pip install -U -qqq langgraph uv databricks-agents databricks-langchain mlflow-skinny[databricks] python-pptx python-docx
dbutils.library.restartPython()

## ResponsesAgent Overview
Note that `ResponsesAgent` is a wrapper that exposes many different kinds of agents through one common interface. This means we can author agents in Databricks and still have them interface with other platforms. We will use `ResponsesAgent` to wrap both our node agents and our supervisors.
<br/>
<br/>
<img src="https://docs.databricks.com/aws/en/assets/images/responses-agent-overview-611d843718bf94974d277a365695043c.svg" width=1000 />

In [0]:
# %%writefile -a eval_agent.py
from typing import Annotated, Any, Dict, Generator, List, Optional, Sequence, TypedDict, Union

import mlflow
from databricks_langchain import (
    ChatDatabricks,
    UCFunctionToolkit,
    VectorSearchRetrieverTool,
    # DatabricksFunctionClient
)
from langchain_core.messages import AIMessage, AIMessageChunk, AnyMessage
from langchain_core.runnables import RunnableConfig, RunnableLambda
from langchain_core.tools import BaseTool
from langchain.tools import tool
from langchain_openai import ChatOpenAI
# from langchain.agents import create_tool_calling_agent, AgentExecutor
# from langchain.prompts import ChatPromptTemplate

from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt.tool_node import ToolNode

from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest,
    ResponsesAgentResponse,
    ResponsesAgentStreamEvent,
    output_to_responses_items_stream,
    to_chat_completions_input,
)

from unitycatalog.ai.core.databricks import DatabricksFunctionClient

from pptx import Presentation
from docx import Document

import json



## Create the Evaluation Struct (Object Schema)

In [0]:
# %%writefile -a eval_agent.py
class EvalState(TypedDict, total=False):
    file_name: str                      # e.g. "incident_123.pptx"
    user_prompt: str                    # any prompt / focus area text
    pptx_chunks: List[str]              # parsed slide texts
    docx_chunks: List[str]              # parsed paragraph texts
    corrective_actions_raw: str         # raw LLM output from gen agent
    corrective_actions: List[Dict]      # parsed actions from gen agent
    evaluated_actions_raw: str          # raw LLM output from eval agent
    evaluated_actions: List[Dict]       # top-k ranked actions
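
Because `EvalState` is declared with `total=False`, every key is optional, so a node can return only the keys it produced. The sketch below illustrates that pattern in plain Python. It assumes the graph merges each returned dict into the running state on a last-write-wins basis, which is simulated here with `dict.update`; `fake_parse_node` and `apply_update` are hypothetical stand-ins, not part of the notebook or of LangGraph.

```python
from typing import Dict, List, TypedDict


class EvalState(TypedDict, total=False):
    file_name: str
    user_prompt: str
    pptx_chunks: List[str]
    corrective_actions: List[Dict]


def fake_parse_node(state: EvalState) -> EvalState:
    # Hypothetical node: returns only the key it produced, not the whole state.
    return {"pptx_chunks": [f"Slide 1 of {state['file_name']}"]}


def apply_update(state: EvalState, update: EvalState) -> EvalState:
    # Simulates a last-write-wins merge of a partial update into the state.
    merged = dict(state)
    merged.update(update)
    return merged  # type: ignore[return-value]


state: EvalState = {"file_name": "incident_123.pptx", "user_prompt": "focus on root causes"}
state = apply_update(state, fake_parse_node(state))
print(sorted(state.keys()))
```

Keys that no node has written yet (such as `corrective_actions` here) are simply absent, so downstream code should read them with care.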



## LLM Context

In [0]:
# Define the endpoint to use for the agent foundation and system prompt
LLM_ENDPOINT_NAME = "databricks-claude-3-7-sonnet"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME)
system_prompt = "You are a tool used to create and rank corrective actions based on incident reports. You accept documents and user prompts to help understand what had occurred and provide a recommendation on the top corrective actions to remediate and prevent similar incidents in the future."

## Create our Tools
The docstring matters: it becomes the tool description the model reads when deciding which tool to call.
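
To see why the docstring carries so much weight, here is a rough, standard-library-only sketch of how a tool description can be derived from a function. It is not LangChain's implementation; `describe_tool` and `parse_report` are hypothetical names used only for illustration.

```python
import inspect


def parse_report(file_name: str) -> list:
    """Parse an incident report into text chunks. Use this for .pptx files."""
    return []


def describe_tool(fn) -> dict:
    # Build a minimal tool spec: the name and the cleaned docstring are what
    # a model would see when deciding which tool to call.
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": list(inspect.signature(fn).parameters),
    }


spec = describe_tool(parse_report)
print(spec["description"])
```

A vague or missing docstring leaves the model with little more than the function name to route on, which is why each tool's docstring states when to use it.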

In [0]:
# %%writefile -a eval_agent.py
UPLOAD_VOLUME = "/Volumes/ademianczuk/suncor_ehs/data/uploads"

@tool("parse_pptx")
def parse_pptx_tool(file_name: str) -> list[str]:
    """
    Parse a PPTX file stored in the UC Volume into slide-level text chunks. Use this to parse a Microsoft Powerpoint (.pptx) file.
    
    Args:
        file_name: The PPTX filename (e.g. 'incident_123.pptx'), assumed to live under
                   /Volumes/ademianczuk/suncor_ehs/data/uploads.

    Returns:
        A list of text chunks (strings), one per slide that contains text.
        Each chunk is prefixed with its slide number.
    """
    dbfs_path = f"dbfs:{UPLOAD_VOLUME}/{file_name}"
    local_path = "/tmp/input.pptx"
    dbutils.fs.cp(dbfs_path, f"file:{local_path}", True)

    prs = Presentation(local_path)
    chunks: list[str] = []

    for slide_idx, slide in enumerate(prs.slides, start=1):
        texts = []
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                txt = shape.text.strip()
                if txt:
                    texts.append(txt)
        if texts:
            #each slide = one chunk for now
            chunks.append(f"Slide {slide_idx}:\n" + "\n".join(texts))

    return chunks

@tool("parse_docx")
def parse_docx_tool(file_name: str) -> List[str]:
    """
    Parse a DOCX file stored in the UC Volume into text chunks. Use this to parse a Microsoft Word document (.docx).

    Args:
        file_name: The DOCX filename (e.g. 'incident_123.docx'), assumed to live under
                   /Volumes/ademianczuk/suncor_ehs/data/uploads.

    Returns:
        A list of text chunks (strings). Each chunk is roughly paragraph-level; 
        small paragraphs may be merged into bigger chunks to keep context.
    """

    dbfs_path = f"dbfs:{UPLOAD_VOLUME}/{file_name}"
    local_path = "/tmp/input.docx"
    dbutils.fs.cp(dbfs_path, f"file:{local_path}", True)

    doc = Document(local_path)
    chunks: List[str] = []
    current = ""
    max_chars = 1200 #tune as required

    paragraphs: List[str] = []
    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        paragraphs.append(text)

    for p in paragraphs:
        #If adding the next paragraph would exceed max_chars, create a new chunk
        if len(current) + len(p) + 2 > max_chars:
            if current:
                chunks.append(current.strip())
            current = p
        else:
            if current:
                current += "\n\n" + p
            else:
                current = p

    if current:
        chunks.append(current.strip())

    return chunks
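
The paragraph-merging step in `parse_docx_tool` can be tried out without a Word file. The following sketch isolates the same greedy packing logic in a standalone function; `merge_paragraphs` is a hypothetical name, and the tiny `max_chars` value is chosen only to make the behavior visible.

```python
from typing import List


def merge_paragraphs(paragraphs: List[str], max_chars: int = 1200) -> List[str]:
    """Greedily pack paragraphs into chunks of at most roughly max_chars."""
    chunks: List[str] = []
    current = ""
    for p in paragraphs:
        # +2 accounts for the "\n\n" separator added between paragraphs.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        elif current:
            current += "\n\n" + p
        else:
            current = p
    if current:
        chunks.append(current)
    return chunks


sample = ["a" * 10, "b" * 10, "c" * 30]
chunks = merge_paragraphs(sample, max_chars=25)
print([len(c) for c in chunks])
```

Note that a single paragraph longer than `max_chars` is kept whole rather than split, so chunks can exceed the limit.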

    

## Define the State Container Object
Since we need to preserve the state of the agent through conversation turns, we need to create a struct that defines the schema for the object. This is what handles keeping track of the messaging as the agent is updated. In other words, this keeps all of the agent messages organized.
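
The `Annotated[..., add_messages]` field used for `messages` attaches a reducer to that key, so updates are combined with the existing value instead of replacing it. The sketch below mimics that idea with the standard library only. It is a simplification under stated assumptions: `append_messages` just concatenates, whereas the real `add_messages` also handles message IDs, and `SketchState` and `apply_update` are hypothetical names.

```python
from typing import Annotated, List, TypedDict, get_type_hints


def append_messages(existing: List[dict], new: List[dict]) -> List[dict]:
    # Simplified reducer: concatenate instead of overwrite.
    return existing + new


class SketchState(TypedDict):
    messages: Annotated[List[dict], append_messages]
    custom_outputs: dict


def apply_update(state: dict, update: dict) -> dict:
    hints = get_type_hints(SketchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            # The first Annotated argument after the type is treated as the reducer.
            merged[key] = metadata[0](state.get(key, []), value)
        else:
            merged[key] = value  # No reducer: last write wins.
    return merged


state = {"messages": [{"role": "user", "content": "hi"}], "custom_outputs": {}}
state = apply_update(state, {"messages": [{"role": "assistant", "content": "hello"}]})
print(len(state["messages"]))
```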

In [0]:
# %%writefile -a agent.py
class AgentState(TypedDict):
    messages: Annotated[Sequence[AnyMessage], add_messages]
    custom_inputs: Optional[dict[str, Any]]
    custom_outputs: Optional[dict[str, Any]]



## Tool Calling Agent Core Logic
**NOTE!** Because we're using the MLflow / Databricks implementation of LangChain, we don't have access to the newer LangChain capabilities. With upcoming releases of Databricks LangChain, the `create_tool_calling_agent()` function will be available as part of the LangChain agents library. In the meantime, we define it inline here as a placeholder, knowing full well that a native implementation is coming soon.

For now (as of November 14, 2025) we'll use this boilerplate function. In the future we can just create instances of `create_tool_calling_agent()` and `AgentExecutor()` from the `langchain.agents` library, once it is rolled into the Databricks LangChain implementation.
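
The control flow of the boilerplate is a simple loop: call the model, run any requested tools, feed the results back, and stop once the model no longer asks for a tool. Here is a plain-Python sketch of that loop. Everything in it is illustrative: `fake_model` stands in for the LLM, `lookup_incident` is a hypothetical tool, and messages are plain dicts rather than LangChain message objects.

```python
from typing import Callable, Dict, List


def lookup_incident(incident_id: str) -> str:
    # Hypothetical tool used only for this sketch.
    return f"Incident {incident_id}: valve left open"


TOOLS: Dict[str, Callable[..., str]] = {"lookup_incident": lookup_incident}


def fake_model(messages: List[dict]) -> dict:
    # Stand-in for the LLM: request a tool once, then answer using its result.
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_calls": [{"name": "lookup_incident", "args": {"incident_id": "123"}}]}
    return {"role": "assistant", "content": "Summary: " + messages[-1]["content"], "tool_calls": []}


def should_continue(messages: List[dict]) -> str:
    last = messages[-1]
    return "continue" if last["role"] == "assistant" and last.get("tool_calls") else "end"


def run_agent(messages: List[dict], max_steps: int = 5) -> List[dict]:
    for _ in range(max_steps):
        messages = messages + [fake_model(messages)]  # "agent" node
        if should_continue(messages) == "end":
            break
        for call in messages[-1]["tool_calls"]:  # "tools" node
            result = TOOLS[call["name"]](**call["args"])
            messages = messages + [{"role": "tool", "content": result}]
    return messages


history = run_agent([{"role": "user", "content": "Summarize incident 123"}])
print(history[-1]["content"])
```

The `max_steps` cap is an addition of this sketch; it guards against a model that keeps requesting tools indefinitely.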

In [0]:
# %%writefile -a agent.py
def create_tool_calling_agent(
    model: ChatDatabricks,
    tools: Union[ToolNode, Sequence[BaseTool]],
    system_prompt: Optional[str] = None,
):
    model = model.bind_tools(tools)

    # Define the function that determines which node to go to
    def should_continue(state: AgentState):
        messages = state["messages"]
        last_message = messages[-1]
        # If there are function calls, continue. else, end
        if isinstance(last_message, AIMessage) and last_message.tool_calls:
            return "continue"
        else:
            return "end"

    if system_prompt:
        preprocessor = RunnableLambda(
            lambda state: [{"role": "system", "content": system_prompt}] + state["messages"]
        )
    else:
        preprocessor = RunnableLambda(lambda state: state["messages"])
    model_runnable = preprocessor | model

    def call_model(
        state: AgentState,
        config: RunnableConfig,
    ):
        response = model_runnable.invoke(state, config)

        return {"messages": [response]}

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", RunnableLambda(call_model))
    workflow.add_node("tools", ToolNode(tools))

    workflow.set_entry_point("agent")
    workflow.add_conditional_edges(
        "agent",
        should_continue,
        {
            "continue": "tools",
            "end": END,
        },
    )
    workflow.add_edge("tools", "agent")

    return workflow.compile()
    
    

## Corrective Actions Agent Instructions

In [0]:
CORRECTIVE_ACTIONS_INSTRUCTIONS = """
You are a corporate Enterprise Health and Safety (EHS) expert specializing in corrective actions.

With the following guidance, define and create corrective actions:

1. Before reading the remaining instructions below, you must observe all uploaded documents and ensure that you have a full and comprehensive understanding of them. You must ensure that you are able to extrapolate from the provided resources, directly following the framework, definitions, and standards provided in doing so. Make sure that you are looking for pattern recognition and defining areas of interest that are related to the goal trying to be achieved.

2. You will be provided with an incident investigation report. This may be presented in a PDF, Word, or PowerPoint format and you must be ready to read, interpret, and understand the content provided by the user. Your goal is to flag the report for instances of 'negative' reasoning. The definitions for these terms are provided in the uploaded document labelled 'Consolidated Context Format.docx'. Additionally, you must also identify the presence of counterfactuals in the report provided. In carrying out these defined tasks, you must also determine the difference between negative and causal reasoning as you will need this for analysis.

3. When finding instances of negative reasoning, showcase how you would define and extract instances of negative reasoning for future inquiries. You must determine what factors constitute the usage of negative reasoning in any given incident investigations report.

4. You must structure the output as follows. I want you to have three headers: 'Relationship', 'Flagging', and 'Corrective Actions'.

5. Under 'Relationship', reassess the prompt provided by the user, as well as the uploaded document labelled 'Consolidated Context Format.docx'. Review the incident investigation report and specifically identify any terms and definitions from the uploaded documents that are present. Identify specific excerpts of the investigation report that contain any terms and definitions. Explain how they are present. Do not be vague and try to find the presence of as many terms and definitions as you can with sufficient detail. Do not combine any definitions.

6. Under 'Flagging', specify all instances of negative reasoning in the report, as well as the presence of counterfactuals and any logical errors. Put these into a table with four columns, 'Identification of Negative Reasoning / Counterfactual' (Identify whether you are assessing an instance of only 'Negative Reasoning' or a 'Counterfactual', or combine both together if a certain excerpt has both an instance of negative reasoning and a counterfactual), 'Identification Reasoning' (How did you identify a given instance of negative reasoning or the usage of a counterfactual? What factors revealed their presence?), 'Definitions Present' (Which of the definitions above are present in the excerpt extracted?), and 'Original Statement' (The original statement being analyzed). Ensure that you are specific and detailed in every column.

7. Under 'Corrective Actions',

I want you to list the causes and actions from the investigation report in a sentence format above the table. I then want you to create a table that is populated with every single cause and action in the report from the initial list you made, in the order specified under the forthcoming column called 'Quality of Action'. Before creating the table with the following columns, ensure the initial lists' actions are ranked according to the 'Quality of Action' column below before the table is created from the list. The table should be structured as follows with the following logic:
    - 'Tally' (Just increments and counts every row)
    - 'Action' (the corresponding corrective action)
    - 'Cause' (The initial cause that the action is a result of)
    - 'OEMS Process' (the associated and most applicable OEMS Process defined in the document labelled 'OEMS Process Descriptions.docx' to the cause)
    - 'Related OEMS Process' (to showcase all other most applicable OEMS Processes (more than one) from the document labelled 'OEMS Process Descriptions.docx')
    - 'Hierarchy of Controls' (which determines the control hierarchy level the corrective action best embodies, defined in the document labelled 'Consolidated Context Format.docx' under the section titled 'Hierarchy of Controls - Corrective Actions')
    - 'Quality of Action' (which ranks the action based on what control hierarchy level it is at (create a scale where Elimination would be a 5 (Most Effective), Substitution would be a 4 (Highly Effective), Engineering Controls would be a 3 (Effective), Administrative Controls would be a 2 (Less Effective), and PPE would be a 1 (Least Effective)), formatted exactly as defined)
    - 'Top Three Actions' (It chooses the top three actions that have the best Root Cause Identification (Utilize Causal analysis to find underlying causes, not just symptoms), Break the Causal Chain (Implement controls at multiple points; use redundancy), are Systemic and Sustainable (Focus on process, policy, and resource improvements for long-term impact), are Specific and Measurable (Define clear actions, responsibilities, and metrics for success), and reflect viable Verification and Improvement (Monitor, audit, and refine actions for ongoing effectiveness). From there, the column is populated with its ranking number and a very detailed, long, and specific (very specific to the action, including many details from the action) justification, and anything under 3 is left blank. The best corrective actions are those that address the underlying root causes of an incident through a thorough causal analysis, rather than just treating immediate symptoms. They break the chain of events at multiple points by implementing layered controls, engineering, administrative, and behavioral, to ensure redundancy and resilience. Effective corrective actions are systemic and sustainable, focusing on long-term improvements to processes, policies, and resource allocation. They are also specific, clearly defining responsibilities, timelines, and measurable outcomes, and are verified for effectiveness through ongoing monitoring and continuous improvement. This comprehensive approach ensures that corrective actions not only resolve the current issue but also prevent recurrence and strengthen organizational safety and reliability)
    - 'Assessment of Effectiveness' (For each action, what would be a good criterion to determine whether its implementation was effective? This focuses on how effective the action is in achieving the desired end. You must explore that if the same cause reoccurred, how would the presence of the action alter the timeline of events, and what criterion would be used to evaluate its effectiveness. Clearly define the specific criteria and measurable outcomes that will be used to determine if the action is effective. Also describe the method of verification (e.g., physical testing, audit, scenario review, monitoring of performance indicators). Explain how ongoing monitoring and continuous improvement will be ensured (e.g., scheduled reviews, integration into lessons learned, feedback loops). And finally, explore how the action would disrupt the cause and effect chain in future cases, and what would be used to evaluate this disruption)
Now that you are aware of how the table should be structured, begin by creating the list of causes and actions, with their respective 'Quality of Action' rankings, and then create the table. All table rows must be sorted strictly by the 'Quality of Action' column in descending order. All actions with '5 - Most Effective' must be at the very top, followed by '4 - Highly Effective', '3 - Effective', '2 - Less Effective', and '1 - Least Effective'. You must not mix, group, or list actions by any other order, and you must not list any lower-ranked action above a higher one. Do not preserve the original order from the report; only sort by 'Quality of Action'. After presenting the tables, you must write: "All tables have been sorted by 'Quality of Action' in strict descending order." Before submitting your response, check every table and confirm that the sorting is correct. If any table is not sorted properly, you must fix it before submitting.

8. After presenting the two tables above, you must write: "All tables have been sorted by 'Quality of Action' in strict descending order." Before submitting your response, check every table and confirm that the sorting is correct. If any table is not sorted properly, you must fix it before submitting.

9. Please do not add anything extra I did not ask you to add.

10. Whenever I say to include 'all of' something. Include every single instance of what is being asked to be provided.

11. Be very detailed.

Your task:
- Analyze the content of the incident/document.
- Propose specific CORRECTIVE ACTIONS that address root causes and key risks.
- Each action must be:
  - Concrete and implementable.
  - Clearly linked to a risk or failure in the incident.
  - Framed in a professional corporate tone.

Output MUST be valid JSON with this structure:
{
  "corrective_actions": [
    {
      "id": "{{filename}}",
      "title": "...",
      "description": "...",
      "risk_addressed": "...",
      "root_cause_addressed": "...",
      "owner_suggestion": "...",
      "timeframe": "Short-term|Medium-term|Long-term",
      "impact": "High|Medium|Low",
      "confidence": 0.0
    }
  ]
}
Do not include any text outside the JSON.
"""

# corrective_actions_prompt = ChatPromptTemplate.from_messages(
#     [
#         ("system", CORRECTIVE_ACTIONS_INSTRUCTIONS),
#         (
#             "user",
#             "User focus/prompt:\n{user_prompt}\n\n"
#             "Here are the slide contents of the incident report:\n\n"
#             "{pptx_text}\n\n"
#             "Using the report above and the instructions, generate a set of corrective actions."
#         ),
#     ]
# )

# gen_tools: list = []

# corrective_actions_agent = create_tool_calling_agent(
#     llm=llm,
#     tools=gen_tools,
#     prompt=corrective_actions_prompt,
# )

# corrective_actions_executor = AgentExecutor(
#     agent=corrective_actions_agent,
#     tools=gen_tools,
#     verbose=True,
# )
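
Since the instructions require a fixed JSON structure, it can help to check the model's reply against that structure before passing it on. The following sketch does this with the standard library. The key names and the allowed `timeframe` and `impact` values come from the instructions above; the 0 to 1 range for `confidence` is an assumption, and `validate_actions` is a hypothetical helper rather than part of the notebook.

```python
import json
from typing import List

REQUIRED_KEYS = {
    "id", "title", "description", "risk_addressed", "root_cause_addressed",
    "owner_suggestion", "timeframe", "impact", "confidence",
}
TIMEFRAMES = {"Short-term", "Medium-term", "Long-term"}
IMPACTS = {"High", "Medium", "Low"}


def validate_actions(raw: str) -> List[str]:
    """Return a list of problems found in the model output (empty if none)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    actions = payload.get("corrective_actions") if isinstance(payload, dict) else None
    if not isinstance(actions, list):
        return ["'corrective_actions' is missing or not a list"]
    problems: List[str] = []
    for i, action in enumerate(actions):
        if not isinstance(action, dict):
            problems.append(f"action {i}: not an object")
            continue
        missing = sorted(REQUIRED_KEYS - set(action))
        if missing:
            problems.append(f"action {i}: missing {missing}")
        if action.get("timeframe") not in TIMEFRAMES:
            problems.append(f"action {i}: bad timeframe")
        if action.get("impact") not in IMPACTS:
            problems.append(f"action {i}: bad impact")
        conf = action.get("confidence")
        # Assumption: confidence is a number between 0 and 1.
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"action {i}: confidence outside 0..1")
    return problems


good_action = {key: "x" for key in REQUIRED_KEYS}
good_action.update({"timeframe": "Short-term", "impact": "High", "confidence": 0.8})
good = json.dumps({"corrective_actions": [good_action]})
bad = json.dumps({"corrective_actions": [{"id": "x", "timeframe": "Soon"}]})
print(validate_actions(good), len(validate_actions(bad)))
```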

## Evaluation Agent Instructions

In [0]:
EVALUATION_INSTRUCTIONS = """
You are an expert reviewer of EHS corrective actions.

Your job:
- Evaluate the provided corrective actions for their potential to significantly reduce risk and improve safety.
- Consider:
  - Breadth of risk reduction.
  - Depth (severity) of issues addressed.
  - Feasibility and clarity.
  - Alignment with the customer's corrective-action guidelines.

You MUST:
- Select the TOP 3 corrective actions with the most significant impact.
- Provide a short evaluation summary for each selected action.

Output MUST be valid JSON:
{
  "top_corrective_actions": [
    {
      "id": "...",
      "title": "...",
      "reason_for_selection": "...",
      "expected_impact": "High|Medium",
      "comments": "...",
      "original_action": { ... }  // copy of original action object
    }
  ]
}
Do not include any text outside the JSON.
"""

# evaluation_prompt = ChatPromptTemplate.from_messages(
#     [
#         ("system", EVALUATION_INSTRUCTIONS),
#         (
#             "user",
#             "Here is the full list of corrective actions (JSON):\n{corrective_actions_json}\n\n"
#             "Select and return only the top 3 according to the instructions."
#         ),
#     ]
# )

# evaluation_agent = create_tool_calling_agent(
#     llm=llm,
#     tools=[],
#     prompt=evaluation_prompt,
# )

# evaluation_executor = AgentExecutor(
#     agent=evaluation_agent,
#     tools=[],
#     verbose=True,
# )
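
Both agents are told to return JSON only, but a model may still wrap its answer in a Markdown code fence or add a sentence around it. A best-effort parser, sketched here under that assumption, tries the raw text first, then a fenced block, then the outermost braces. `extract_json` is a hypothetical helper, not part of the notebook.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # Built programmatically to avoid a literal fence in this example.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str) -> Optional[Any]:
    """Best-effort parse of JSON that may be wrapped in a Markdown fence."""
    candidates = [text.strip()]
    match = FENCE_RE.search(text)
    if match:
        candidates.append(match.group(1))
    # Last resort: take the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


wrapped = "Here you go:\n" + FENCE + 'json\n{"top_corrective_actions": []}\n' + FENCE
print(extract_json(wrapped))
```

Returning `None` on failure keeps the decision about fallbacks (for example, an empty action list) with the caller.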

## Orchestration

In [0]:
from langgraph.graph import StateGraph, END

def parse_pptx_node(state: EvalState) -> EvalState:
    """Node 0: parse pptx into chunks using the tool."""
    chunks = parse_pptx_tool.invoke(state["file_name"])
    return {"pptx_chunks": chunks}

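def corrective_actions_node(state: EvalState) -> EvalState:
    """Node 1: call the LLM with the corrective-actions instructions."""
    pptx_text = "\n\n---\n\n".join(state["pptx_chunks"])
    user_prompt = state["user_prompt"]

    # Call the LLM directly with the system instructions and a user message
    # that mirrors the commented-out prompt template above.
    result = llm.invoke(
        [
            {"role": "system", "content": CORRECTIVE_ACTIONS_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    f"User focus/prompt:\n{user_prompt}\n\n"
                    "Here are the slide contents of the incident report:\n\n"
                    f"{pptx_text}\n\n"
                    "Using the report above and the instructions, generate a set of corrective actions."
                ),
            },
        ]
    )
    raw = result.content

    try:
        parsed = json.loads(raw)
        actions = parsed.get("corrective_actions", [])
    except Exception:
        # Fall back to an empty list if the model did not return valid JSON
        actions = []
    return {
        "corrective_actions_raw": raw,
        "corrective_actions": actions,
    }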

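def evaluation_node(state: EvalState) -> EvalState:
    """Node 2: call the LLM with the evaluation instructions to select the top 3 actions."""
    corrective_actions = state["corrective_actions"]
    corrective_actions_json = json.dumps(
        {"corrective_actions": corrective_actions},
        ensure_ascii=False,
        indent=2,
    )

    # Call the LLM directly with the system instructions and a user message
    # that mirrors the commented-out prompt template above.
    result = llm.invoke(
        [
            {"role": "system", "content": EVALUATION_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    "Here is the full list of corrective actions (JSON):\n"
                    f"{corrective_actions_json}\n\n"
                    "Select and return only the top 3 according to the instructions."
                ),
            },
        ]
    )
    raw = result.content

    try:
        parsed = json.loads(raw)
        top_actions = parsed.get("top_corrective_actions", [])
    except Exception:
        # Fall back to an empty list if the model did not return valid JSON
        top_actions = []

    return {
        "evaluated_actions_raw": raw,
        "evaluated_actions": top_actions,
    }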

graph_builder = StateGraph(EvalState)

graph_builder.add_node("parse_pptx", parse_pptx_node)
graph_builder.add_node("generate_corrective_actions", corrective_actions_node)
graph_builder.add_node("evaluate_corrective_actions", evaluation_node)

graph_builder.set_entry_point("parse_pptx")
graph_builder.add_edge("parse_pptx", "generate_corrective_actions")
graph_builder.add_edge("generate_corrective_actions", "evaluate_corrective_actions")
graph_builder.add_edge("evaluate_corrective_actions", END)

graph = graph_builder.compile()

# DEMO END
---

In [0]:
%%writefile agent.py
from typing import Annotated, Any, Generator, Optional, Sequence, TypedDict, Union

import mlflow
from databricks_langchain import (
    ChatDatabricks,
    UCFunctionToolkit,
    VectorSearchRetrieverTool,
    # DatabricksFunctionClient
)
from langchain_core.messages import AIMessage, AIMessageChunk, AnyMessage
from langchain_core.runnables import RunnableConfig, RunnableLambda
from langchain_core.tools import BaseTool
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt.tool_node import ToolNode
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest,
    ResponsesAgentResponse,
    ResponsesAgentStreamEvent,
    output_to_responses_items_stream,
    to_chat_completions_input,
)
from unitycatalog.ai.core.databricks import DatabricksFunctionClient



### Defining the Foundation Model
As we build up our agent, we need to define the foundation model that backs it. The model interprets instructions, generates text, and decides which tool to call based on the tool descriptions.

In [0]:
%%writefile -a agent.py
#Define the endpoint to use for the agent foundation and system prompt
LLM_ENDPOINT_NAME = "databricks-claude-3-7-sonnet"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME)
system_prompt = "You are a helpful assistant that can run Python code." #Give my agent a better description later.


### AI Tools
Various tools for agent capabilities can be added here. We're creating a collection of tools defined as `tools[]` that we then append to.

In [0]:
%%writefile -a agent.py
tools = []

# You can use UDFs in Unity Catalog as agent tools
# Below, we add the `system.ai.python_exec` UDF, which provides
# a python code interpreter tool to our agent
# You can also add local LangChain python tools. See https://python.langchain.com/docs/concepts/tools

# TODO: Add additional tools
UC_TOOL_NAMES = ["system.ai.python_exec"]
client = DatabricksFunctionClient(execution_mode="serverless")
uc_toolkit = UCFunctionToolkit(function_names=UC_TOOL_NAMES, client=client)
tools.extend(uc_toolkit.tools)


### Vector Search Tools
Vector search indexes can supply context to the agent, adding RAG capabilities. Databricks vector search indexes can be used directly through `VectorSearchRetrieverTool()`. Other vector databases can be added as external MCP endpoints. These are all appended to the `tools[]` collection we defined above.

In [0]:
%%writefile -a agent.py
# Vector search tools let the agent query unstructured text. Useful for a RAG agent.
VECTOR_SEARCH_TOOLS = []

# To add a vector search retriever tool, create a VectorSearchRetrieverTool
# and append it to VECTOR_SEARCH_TOOLS.
# Example:
# VECTOR_SEARCH_TOOLS.append(
#     VectorSearchRetrieverTool(
#         index_name="",
#         # filters="..."
#     )
# )
tools.extend(VECTOR_SEARCH_TOOLS)


### Agent State
`AgentState()` is an object used to persist the conversation turns of the agent. We pass this object into the conversation chain to keep track of what it's doing for the duration of the session.

In [0]:
%%writefile -a agent.py

class AgentState(TypedDict):
    messages: Annotated[Sequence[AnyMessage], add_messages]
    custom_inputs: Optional[dict[str, Any]]
    custom_outputs: Optional[dict[str, Any]]
    

### Agent Logic
This is where we define the actual logic for the agent and handle how the tools are called. We're also defining how the agent deals with conversation turns.

In [0]:
%%writefile -a agent.py
def create_tool_calling_agent(
    model: ChatDatabricks,
    tools: Union[ToolNode, Sequence[BaseTool]],
    system_prompt: Optional[str] = None,
):
    model = model.bind_tools(tools)

    # Define the function that determines which node to go to
    def should_continue(state: AgentState):
        messages = state["messages"]
        last_message = messages[-1]
        # If there are function calls, continue. else, end
        if isinstance(last_message, AIMessage) and last_message.tool_calls:
            return "continue"
        else:
            return "end"

    if system_prompt:
        preprocessor = RunnableLambda(
            lambda state: [{"role": "system", "content": system_prompt}] + state["messages"]
        )
    else:
        preprocessor = RunnableLambda(lambda state: state["messages"])
    model_runnable = preprocessor | model

    def call_model(
        state: AgentState,
        config: RunnableConfig,
    ):
        response = model_runnable.invoke(state, config)

        return {"messages": [response]}

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", RunnableLambda(call_model))
    workflow.add_node("tools", ToolNode(tools))

    workflow.set_entry_point("agent")
    workflow.add_conditional_edges(
        "agent",
        should_continue,
        {
            "continue": "tools",
            "end": END,
        },
    )
    workflow.add_edge("tools", "agent")

    return workflow.compile()
    

### ResponsesAgent (OpenAI) Framework
Next, we create a new class called `LangGraphResponsesAgent`, a concrete implementation of the `ResponsesAgent` base class. `ResponsesAgent` is MLflow's agent interface; it follows the request and response schema of OpenAI's Responses API, which a growing number of agent platforms support. It is becoming a common standard, and future agent registries may well require it as a protocol. `ResponsesAgent` handles human-to-agent, agent-to-agent, and agent-to-human conversation in a standard way.

**IMPORTANT!** `ResponsesAgent` is still classified as experimental within MLflow, while OpenAI now considers the Responses API a stable release. MLflow _may_ move to a different implementation later. If you prefer, you can use OpenAI's native implementation; however, the MLflow version supports the full MLOps lifecycle, including conversation tracking that makes hallucinations easier to detect.

`predict()` is the boundary interface to and from the agent: it takes a request and returns the completed response, for either a human-agent or an agent-agent exchange.
`predict_stream()` yields events while the agent is working, which exposes the agent's intermediate steps (tool calls and text deltas) as they happen.
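
The relationship between the two methods can be illustrated without MLflow: `predict()` consumes the stream and keeps only the completed output items. In the sketch below, the event dicts are hypothetical examples shaped like the event types the class filters on, and `fake_predict_stream` stands in for the real generator.

```python
from typing import Dict, Iterator, List


def fake_predict_stream() -> Iterator[Dict]:
    # Hypothetical events: two incremental text deltas, then one completed item.
    yield {"type": "response.output_text.delta", "delta": "4"}
    yield {"type": "response.output_text.delta", "delta": "2"}
    yield {
        "type": "response.output_item.done",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "42"}],
        },
    }


def predict() -> List[Dict]:
    # Keep only completed output items; drop the incremental deltas.
    return [
        event["item"]
        for event in fake_predict_stream()
        if event["type"] == "response.output_item.done"
    ]


outputs = predict()
print(outputs[0]["content"][0]["text"])
```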

In [0]:
%%writefile -a agent.py
class LangGraphResponsesAgent(ResponsesAgent):
    def __init__(self, agent):
        self.agent = agent

    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        outputs = [
            event.item
            for event in self.predict_stream(request)
            if event.type == "response.output_item.done"
        ]
        return ResponsesAgentResponse(output=outputs, custom_outputs=request.custom_inputs)

    def predict_stream(
        self,
        request: ResponsesAgentRequest,
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        cc_msgs = to_chat_completions_input([i.model_dump() for i in request.input])

        for event in self.agent.stream({"messages": cc_msgs}, stream_mode=["updates", "messages"]):
            if event[0] == "updates":
                for node_data in event[1].values():
                    if len(node_data.get("messages", [])) > 0:
                        yield from output_to_responses_items_stream(node_data["messages"])
            # filter the streamed messages to just the generated text messages
            elif event[0] == "messages":
                try:
                    chunk = event[1][0]
                    if isinstance(chunk, AIMessageChunk) and (content := chunk.content):
                        yield ResponsesAgentStreamEvent(
                            **self.create_text_delta(delta=content, item_id=chunk.id),
                        )
                except Exception as e:
                    print(e)




In [0]:
%%writefile -a agent.py
mlflow.langchain.autolog()
agent = create_tool_calling_agent(llm, tools, system_prompt)
AGENT = LangGraphResponsesAgent(agent)
mlflow.models.set_model(AGENT)

### Testing the agent

In [0]:
#Restart the python interpreter to flush out any lingering instances of in-memory objects
dbutils.library.restartPython()

In [0]:
# Test the agent
from agent import AGENT

result = AGENT.predict({"input": [{"role": "user", "content": "What is 6*7 in Python?"}]})
print(result.model_dump(exclude_none=True))

### Viewing the Results as Chunks
Here we expand the response and print each chunk as the agent produces it. This shows how the final result is assembled and exposes the agent's chain of reasoning, which helps with debugging.

**NOTE!** The agent description has a big effect on the results. Often logical errors or fallacies can be remedied by fixing the descriptions and instructions.
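To make the re-assembly concrete, here is a minimal sketch that joins streamed text deltas back into full messages. The event dictionaries are hypothetical samples modeled on what `chunk.model_dump()` might print; the field names `type`, `item_id`, and `delta` are assumptions about that shape, and `assemble_text` is an illustrative helper rather than part of any library.

```python
from collections import defaultdict
from typing import Dict, Iterable


def assemble_text(events: Iterable[dict]) -> Dict[str, str]:
    """Concatenate text deltas per item_id, preserving arrival order."""
    parts = defaultdict(list)
    for event in events:
        # Only text-delta events carry partial text; other event types are skipped.
        if event.get("type") == "response.output_text.delta":
            parts[event["item_id"]].append(event["delta"])
    return {item_id: "".join(chunks) for item_id, chunks in parts.items()}


# Hypothetical chunks, shaped like dumped stream events (assumed field names).
sample_events = [
    {"type": "response.output_text.delta", "item_id": "msg_1", "delta": "6*7 "},
    {"type": "response.output_text.delta", "item_id": "msg_1", "delta": "is 42."},
    {"type": "response.output_item.done", "item": {"type": "message"}},
]

assembled = assemble_text(sample_events)
print(assembled)
```

Grouping by `item_id` keeps text from different messages separate if the agent emits more than one message in a single turn.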

In [0]:
#Test internal conversation turns
for chunk in AGENT.predict_stream(
    {"input": [{"role": "user", "content": "What is 6*7 in Python?"}]}
):
    print(chunk.model_dump(exclude_none=True))

## Log the agent as an MLflow model
__This is taken straight from the custom agent boilerplate example__

Log the agent as code from the `agent.py` file (or whatever you called it in the writefile statements). See [MLflow - Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code).

If you are creating multiple agents, each one needs to be logged in MLFlow so it can be used later in a multi-agent setup.

### Enable automatic authentication for Databricks resources
For the most common Databricks resource types, Databricks supports and recommends declaring resource dependencies for the agent upfront **during logging**. This enables automatic authentication passthrough when you deploy the agent. With automatic authentication passthrough, Databricks automatically provisions, rotates, and manages short-lived credentials to securely access these resource dependencies from within the agent endpoint.

To enable automatic authentication, specify the dependent Databricks resources when calling `mlflow.pyfunc.log_model()`.

  - **TODO**: If your Unity Catalog tool queries a [vector search index](docs link) or leverages [external functions](docs link), you need to include the dependent vector search index and UC connection objects, respectively, as resources. See docs ([AWS](https://docs.databricks.com/generative-ai/agent-framework/log-agent.html#specify-resources-for-automatic-authentication-passthrough) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-framework/log-agent#resources)).

In [0]:
# Determine Databricks resources to specify for automatic auth passthrough at deployment time
from agent import UC_TOOL_NAMES, VECTOR_SEARCH_TOOLS

import mlflow
from mlflow.models.resources import DatabricksFunction
# importlib.metadata is the standard-library replacement for the deprecated pkg_resources
from importlib.metadata import version

#Grab all of our tool resources and add them to a list (similar to what we did with our toolbox). This will give MLFlow registry context for what tools are employed by the agent.
resources = []
for tool in VECTOR_SEARCH_TOOLS:
    resources.extend(tool.resources)
for tool_name in UC_TOOL_NAMES:
    resources.append(DatabricksFunction(function_name=tool_name))

#Take the output file (agent.py) and write it as an artifact to the MLFlow registry. Make sure to add any dependencies here as part of the pip requirements variable.
with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        name="agent",
        python_model="agent.py",
        pip_requirements=[
            "databricks-langchain",
            # Pin to the versions installed in this notebook so the served model matches what was tested
            f"langgraph=={version('langgraph')}",
            f"databricks-connect=={version('databricks-connect')}",
            "uv",
            "databricks-agents"
        ],
        resources=resources,
    )

## Agent Evaluation
Use Mosaic AI Agent Evaluation to evaluate the agent's responses against expected responses and other evaluation criteria. Use the criteria you specify to guide iterations, and use MLflow to track the computed quality metrics.
See Databricks documentation ([AWS](https://docs.databricks.com/aws/generative-ai/agent-evaluation) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-evaluation/)).


To evaluate your tool calls, add custom metrics. See Databricks documentation ([AWS](https://docs.databricks.com/en/generative-ai/agent-evaluation/custom-metrics.html#evaluating-tool-calls) | [Azure](https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-evaluation/custom-metrics#evaluating-tool-calls)).
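As a rough idea of what a tool-call check involves, the sketch below inspects Responses-style output items and reports whether an expected tool was invoked. It is plain Python, not the custom-metric interface itself: `called_tools` and `tool_call_score` are hypothetical helpers, the tool name in the sample is made up for illustration, and the assumption is that tool invocations appear as items with `"type": "function_call"` and a `"name"` field. Wiring such a check into `mlflow.genai.evaluate` should follow the linked documentation.

```python
from typing import Any, Dict, List


def called_tools(output_items: List[Dict[str, Any]]) -> List[str]:
    """Return the names of the tools invoked, in order of appearance."""
    return [item["name"] for item in output_items if item.get("type") == "function_call"]


def tool_call_score(output_items: List[Dict[str, Any]], expected_tool: str) -> bool:
    """True when the expected tool appears among the agent's tool calls."""
    return expected_tool in called_tools(output_items)


# Hypothetical agent output: one tool call followed by the final message.
sample_output = [
    {"type": "function_call", "name": "system__ai__python_exec", "arguments": '{"code": "print(6*7)"}'},
    {"type": "message", "role": "assistant", "content": "6*7 is 42."},
]

print(called_tools(sample_output))
print(tool_call_score(sample_output, "system__ai__python_exec"))
```

A boolean per row is the simplest signal; a stricter variant could also compare the call arguments or the order of calls.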

In [0]:
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, RetrievalGroundedness, RetrievalRelevance, Safety

eval_dataset = [
    {
        "inputs": {"input": [{"role": "user", "content": "Calculate the 15th Fibonacci number"}]},
        "expected_response": "The 15th Fibonacci number is 610.",
    }
]

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=lambda input: AGENT.predict({"input": input}),
    scorers=[RelevanceToQuery(), Safety()],  # add more scorers here if they're applicable
)

# Review the evaluation results in the MLflow UI (see console output)

## Sanity Check
Before registering, let's run a simple prompt against the logged model to make sure that MLflow can load the agent and serve predictions from it.

In [0]:
mlflow.models.predict(
    model_uri=f"runs:/{logged_agent_info.run_id}/agent",
    input_data={"input": [{"role": "user", "content": "What is 6*7 in Python?!"}]},
    env_manager="uv",
)

## Registering the Agent in UC

In [0]:
mlflow.set_registry_uri("databricks-uc")

# TODO: define the catalog, schema, and model name for your UC model
catalog = "ademianczuk"
schema = "suncor_ehs"
model_name = "test_agent"
UC_MODEL_NAME = f"{catalog}.{schema}.{model_name}"

# register the model to UC
uc_registered_model_info = mlflow.register_model(model_uri=logged_agent_info.model_uri, name=UC_MODEL_NAME)

## Deploy the agent

In [0]:
from databricks import agents

agents.deploy(
    UC_MODEL_NAME,
    uc_registered_model_info.version,
    tags={"endpointSource": "docs"},
)