# SageMaker AI Managed MLflow Agent Introduction
Welcome to the hands-on SageMaker AI Managed MLflow workshop lab! This notebook guides you through creating an agent with the LangGraph framework and using MLflow to log traces. We'll build an intelligent ETL error resolution agent that can analyze error tickets and provide step-by-step solutions. Our agent will be equipped with two tools (`log_identifier`, `information_retriever`) to:
1. Identify errors from ticket IDs 
2. Retrieve relevant solutions from a solution book.

#### Prerequisites

**Important:** You need the SageMaker AI Managed MLflow tracking server ARN from the workshop prerequisites section. Replace the placeholder below with your actual tracking server ARN.

### Building Intelligent IT support Agents with MLflow 

In this lab, you will develop an IT support agent that takes users' IT support questions, performs automated ETL (Extract, Transform, Load) on the support tickets, and provides actionable resolutions. The support tickets describe issues with an IT system, and the agent helps users resolve them. The agent is implemented with the LangGraph framework and instrumented with MLflow tracing, demonstrating how agentic architectures can improve IT incident response and operational reliability.

![Sample_Architect](./static/05-ITSupportAgentTracing.png) 


### Scenario Overview
The agent is built to assist data engineers or support staff by analyzing helpdesk tickets reporting ETL pipeline errors. Upon receiving a ticket ID, it will diagnose the underlying issue and deliver targeted, step-by-step solutions—reducing downtime and improving user satisfaction.

**Agent Tools**
- `log_identifier`: Analyzes supplied ticket IDs and extracts error type, diagnostics, and metadata from pipeline logs.
- `information_retriever`: Looks up solutions in a curated solution book and compiles actionable instructions for error resolution and mitigation.

Here are the ideal resolution steps:
1. The user requests assistance by providing a `ticket_id`.
2. The agent calls the `log_identifier` tool to classify the error from backend logs associated with the ticket.
3. The agent invokes the `information_retriever` tool to fetch detailed solution steps based on error type and context.
4. The user receives clear, actionable step-by-step instructions for resolving the ETL error.
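
The four steps above can be sketched in plain Python, without any agent framework. The data here is hypothetical: the second ticket, the solution text, and the names `sample_log_data`, `sample_solution_book`, and `resolve_ticket` are invented for illustration, and the workshop's real `log_data` and `solution_book` may be shaped differently. The point is only the two-stage lookup the agent is expected to perform.

```python
# Hypothetical stand-ins for the workshop's log_data and solution_book.
sample_log_data = [
    {"id": "TICKET-001", "error_name": "Connection Timeout"},
    {"id": "TICKET-002", "error_name": "Schema Mismatch"},
]
sample_solution_book = {
    "Connection Timeout": "1. Check network connectivity\n2. Increase the timeout setting\n3. Retry the job",
}


def resolve_ticket(ticket_id: str) -> str:
    """Mimic the ideal flow: ticket ID -> error type -> solution steps."""
    # Step 2: classify the error from the ticket (the role of log_identifier).
    error_name = next(
        (row["error_name"] for row in sample_log_data if row["id"] == ticket_id),
        None,
    )
    if error_name is None:
        return "ticket id not found"
    # Step 3: look up the solution (the role of information_retriever).
    return sample_solution_book.get(error_name, "no solution recorded for this error")


print(resolve_ticket("TICKET-001"))
```

In the lab itself, the LLM decides when to make each of these calls; this sketch hard-codes the order only to make the data flow visible.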

## Load data and libraries

Install the required libraries. You might see some dependency conflict errors, but they shouldn't affect the rest of this notebook.

In [None]:
# Install following dependencies. Ignore any warnings and residual dependency errors.
!pip install -r requirements-langgraph.txt -q

> Note: At the time of this workshop release (November 2025), the latest MLflow version supported by the SageMaker AI Managed MLflow tracking server is MLflow 3.0.0, with Python 3.9 or later. You can find more information [here](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html).
> To run this notebook successfully, you need compatible versions, i.e. `sagemaker_mlflow==0.1.0` and `mlflow>=3.0.0`. These are already provided by the requirements file installed above. The MLflow version of the SageMaker AI Managed MLflow tracking server can differ from the MLflow SDK version used in Python, as long as the tracking server's MLflow version supports the APIs you call.

In [None]:
import boto3
import os

## Import Data 
from data.data import log_data_set, log_data
from data.solution_book import solution_book

## Import MLflow Libs
import sagemaker_mlflow
import mlflow

## Check AWS Credentials 
try:
    boto3.client('bedrock-runtime')
except Exception as e:
    print(f"Error configuring AWS credentials: {e}")
    print("Please set your AWS credentials before proceeding.")

First, let's examine our synthetic sample data, which represents a typical support ticket system. `log_data` is a list of dictionaries, each containing a ticket ID and an error name; for this workshop, we use a simplified lookup that extracts the `error_name` for a given `ticket_id`. `solution_book` is a dictionary where each key is an error name and each value is the solution steps. The dictionary stands in for a real-world setup, which would normally retrieve relevant solutions from a vector store knowledge base.

In [None]:
log_data

In [None]:
solution_book

## SageMaker AI MLflow
We will configure MLflow integration with Amazon SageMaker AI for tracking and tracing agent-based workflows. First, we establish a connection to a specific SageMaker AI MLflow Tracking Server using its ARN (Amazon Resource Name). 
The `mlflow.set_tracking_uri()` method directs all subsequent MLflow logging operations to this server, ensuring experiment metadata is stored in the secure backend maintained by SageMaker AI.

You can organize all your MLflow observability data by calling `mlflow.set_experiment("<YOUR_MLFLOW_EXPERIMENT_NAME>")`. MLflow lets you group runs under experiments, which is useful for comparing runs intended to tackle a particular task.

We will retrieve the stored notebook values containing the SageMaker AI MLflow Tracking Server ARN. If the stored value is empty, assign your tracking server ARN to the `TRACKING_SERVER_ARN` variable in place of the placeholder string. You can also choose any MLflow experiment name; in this workshop, we use `agent-mlflow-demo`.
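
Before pointing MLflow at the server, it can help to confirm that the ARN is present and plausibly shaped. The following is a minimal sketch, not part of the workshop code: the helper name `looks_like_tracking_server_arn` is hypothetical, and the pattern assumes the ARN has the form `arn:<partition>:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>`. Adjust it if your ARN looks different.

```python
import re

# Assumed ARN shape: arn:<partition>:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>
_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/.+$"
)


def looks_like_tracking_server_arn(value) -> bool:
    """Return True when value is a string shaped like a tracking server ARN."""
    return isinstance(value, str) and bool(_ARN_PATTERN.match(value.strip()))


# globals().get avoids a NameError when %store -r restored nothing.
candidate = globals().get("TRACKING_SERVER_ARN", "")
if not looks_like_tracking_server_arn(candidate):
    print("TRACKING_SERVER_ARN is missing or malformed; set it manually before continuing.")
```

This only checks the string's shape. It does not confirm that the server exists or that your credentials can reach it.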

In [None]:
# Retrieve values stored from previous labs
# If the stored value is empty (or you get a NameError), set your SageMaker AI Managed MLflow tracking server ARN copied from the prerequisites
%store -r 

%store
if TRACKING_SERVER_ARN == "":
    print("ENTER YOUR MLFLOW TRACKING SERVER ARN")
TRACKING_SERVER_ARN

In [None]:
# If the stored value is empty, set your SageMaker AI Managed MLflow tracking server ARN by uncommenting the line below
#TRACKING_SERVER_ARN = "ENTER YOUR MLFLOW TRACKING SERVER ARN HERE" 

In [None]:
experiment_name = "agent-mlflow-demo"

# Set MLflow SDK to your configured tracking server 
mlflow.set_tracking_uri(TRACKING_SERVER_ARN) 
# Create or select an MLflow experiment
mlflow.set_experiment(experiment_name)

Since we will use a LangGraph agent, let's set up automatic tracing. LangGraph tracing is enabled through MLflow's LangChain integration.

> `mlflow.langchain.autolog()` is a function within the MLflow LangChain flavor that enables automatic logging of key details about LangChain models and their execution. It simplifies experiment tracking and analysis by eliminating the need for explicit logging statements. By default, `mlflow.langchain.autolog()` automatically logs traces of your LangChain components, providing a visual representation of data flow through chains, agents, and retrievers. This includes invocations of methods such as `invoke`, `batch`, `stream`, `ainvoke`, `abatch`, `astream`, `get_relevant_documents` (for retrievers), and `__call__` (for Chains and AgentExecutors).

> Note: You can use the generic high-level autologging function `mlflow.autolog()` to capture traces. However, when `mlflow.autolog()` is called, it attempts to enable autologging for every supported ML library you have installed, including the agentic libraries, which can make agentic behavior difficult to debug and shifts focus away from agentic framework tracing. When you use a framework-specific autolog such as `mlflow.langchain.autolog()`, traces (detailed execution flow) are the primary logged item, and the scope is limited to that framework integration, here LangChain.

In [None]:
print(mlflow.__version__)

In [None]:
mlflow.langchain.autolog()

#### Initialize sample data for the agent tool 
To build a practical ETL error resolution agent, we first need to provide sample lookup data for each tool operation. In this demonstration:
- `log_data` mimics a support ticket backend, mapping ticket IDs to error types.
- `solution_book` acts as a simple knowledge base, mapping error names to resolution steps.

In production, these would typically be external databases, knowledge stores, or APIs. Here, we simplify with in-notebook Python objects for demonstration clarity.

> Note: This sample data demonstrates a typical support ticket system. `log_data` is a list of dictionaries, each containing a ticket ID and an error name; for this workshop lab, we use a simplified lookup that extracts the `error_name` for a given `ticket_id`. `solution_book` is a Python dictionary where each key is an error name and each value is the resolution steps. The dictionary stands in for a real-world setup, which would normally retrieve relevant solutions from a data store such as a database or a vector store knowledge base.

In [None]:
# Initialize sample log data
# Each entry in log_data is a ticket with a unique ID and its error type
log_data

In [None]:
# Initialize sample solution book
# Each error type maps to a list of step-by-step instructions for resolution
solution_book

These staged data structures form the data foundation for the agent tools you’ll implement and invoke within your LangGraph workflow:
- The `log_identifier` tool will look up a ticket ID in `log_data` to extract the relevant error type.
- The `information_retriever` tool will fetch detailed, actionable remediation steps from `solution_book`, given that error type.

> In a real-world scenario, these would likely be microservices, API calls, or vector search queries to production data stores. For this workshop, the simple data structures let you focus on agent logic, tool call orchestration, and MLflow-powered trace monitoring.

You’re now ready to define, register, and test your agent tools in the LangGraph agent framework.

### Defining Agent Tools in LangGraph
In LangGraph, agent tools let your LLM-powered agent interact with the world beyond language—retrieving information, performing searches, or calling APIs. Tools are simply Python functions, but with the @tool decorator they become part of the agent’s action space: the agent can decide when and how to use them.
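
To make the idea of a decorator that exposes a function to the agent more concrete, here is a framework-free sketch. It is not how LangChain implements `@tool`; the names `simple_tool`, `TOOL_REGISTRY`, and `lookup_error` are hypothetical. It only illustrates the principle that a decorator can capture a function's name, docstring, and parameters so that a model has something to choose from.

```python
import inspect

# Hypothetical registry; LangChain's real @tool builds a richer tool object instead.
TOOL_REGISTRY = {}


def simple_tool(func):
    """Record a function's name, docstring, and parameters so a model could choose it."""
    TOOL_REGISTRY[func.__name__] = {
        "description": inspect.getdoc(func) or "",
        "parameters": list(inspect.signature(func).parameters),
        "callable": func,
    }
    return func


@simple_tool
def lookup_error(ticket_id: str) -> str:
    """Get error type from ticket number."""
    return "Connection Timeout" if ticket_id == "TICKET-001" else "unknown"


# The registry entry is roughly what a model would "see" when deciding which tool to call.
entry = TOOL_REGISTRY["lookup_error"]
print(entry["description"], entry["parameters"])
print(entry["callable"]("TICKET-001"))
```

The docstring travels with the tool in this sketch, which is one way to see why tool descriptions matter so much in practice.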

First, let's import the libraries we need from LangGraph and LangChain:
1. `create_react_agent`: Creates an agent that uses ReAct prompting.
2. `init_chat_model`: Initializes a chat model in a single line using the model’s name and provider.
3. `tool` (from `langchain_core.tools`): A decorator that turns a function or coroutine into an agent tool.

In [None]:
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

#### The log_identifier Tool
This function allows the agent to look up the error type associated with a given support ticket ID. This is your agent’s primary hook into your simulated support system.
- The `@tool` decorator registers this function as an agent-available tool: it annotates the function so that your agent can see (and use) it as part of its toolkit.
- The agent will call this function when it needs to map a ticket ID to a specific error type.
- A clear docstring is critical: the LLM references these descriptions to decide when the tool is appropriate to use.
- The function loops through log_data to find the corresponding error, allowing easy simulation and interpretation.

In [None]:
@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
    if ticket_id not in log_data_set:
        return "ticket id not found in the database"
    
    for item in log_data:
        if item["id"] == ticket_id:
            return item['error_name']


#### The information_retriever Tool
After an error type is found, the agent can fetch the appropriate, step-by-step solution using this retrieval tool:
- The @tool(return_direct=True) tells LangGraph to return this tool’s output to the user directly—useful for final answers (e.g., resolution steps).
- Like before, the docstring describes the tool’s intent and usage.
- This tool looks up solution steps as a string, making it perfect for the agent to hand off clear resolution instructions directly to the end-user.
- If the error type isn't found, it provides a safe fallback message.

In [None]:
@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve error solution based on error type

    Args:
        error_type: user input error type
    
    Returns:
        a str of steps 
    """

    if error_type not in solution_book.keys():
        return "error type not found in the knowledge base, please use your own knowledge"
    
    return solution_book[error_type]



Agent Workflow:
- Agent receives a user input (e.g., help request with ticket id)
- Calls log_identifier to transform the ticket id to an error type
- Uses information_retriever to fetch actionable guidance for the error
- Returns steps directly to the user (since return_direct=True)

You can now proceed to wire these tools into your LangGraph agent with create_react_agent. Each tool call will be traced in MLflow for transparency and assessment.

#### Initialize Language Model with Amazon Bedrock
We use the `init_chat_model` function, which provides a consistent interface for initializing various LLM providers. In this workshop, we'll use an Anthropic Claude model on Amazon Bedrock as the LLM that powers the agent's reasoning and decision-making.

> Note: You need IAM permissions to access the Amazon Bedrock model with cross-Region inference. If you completed the prerequisites section, you already have access to the Bedrock model.

In [None]:
llm = init_chat_model(
    model= "global.anthropic.claude-haiku-4-5-20251001-v1:0",
    model_provider="bedrock_converse",
)

- model: Specifies the exact model identifier for the agent, here an Anthropic Claude model on Amazon Bedrock.
- model_provider: Indicates the provider service, here bedrock_converse.

This LLM instance will be the brain of the agent, generating responses, deciding tool invocations, and following the reasoning workflow.

#### Define the System Prompt for Agent Behavior
The system prompt acts as the agent’s personality and instruction manual. We define it clearly to instruct the agent that it is an expert in resolving ETL errors, equipped with two tools, and specify the expected workflow and output formatting.

In [None]:
system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools: 
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool. 
Then you should search the database for information on how to resolve the error using the information_retriever tool

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

- This prompt ensures the agent uses the right tools in sequence.
- It emphasizes output format consistency, making it easier for end users to follow.
- It provides context to the LLM for optimized, goal-directed behavior.
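
Because the prompt asks for a strict output format, a small check can confirm whether a response follows it. The function below is an illustrative sketch (`is_numbered_step_list` is a hypothetical helper, not part of the workshop code). It assumes each step fits on a single line, so multi-line steps would be rejected.

```python
import re

# One numbered step per line, e.g. "1. Restart the connector"
_STEP_LINE = re.compile(r"^(\d+)\.\s+\S")


def is_numbered_step_list(text: str) -> bool:
    """Check that every non-empty line is a numbered step and numbers run 1, 2, 3, ..."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    for expected, line in enumerate(lines, start=1):
        match = _STEP_LINE.match(line)
        if match is None or int(match.group(1)) != expected:
            return False
    return True


print(is_numbered_step_list("1. Check the network\n2. Retry the job"))
```

A check like this could be applied to the agent's final answer to flag responses that add an introduction or skip a step number.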

We next utilize the create_react_agent function, which builds a ReAct agent: a Reasoning and Acting agent that can iteratively decide which tool to call and what responses to generate.

In [None]:
agent = create_react_agent(
    model=llm,
    tools= [log_identifier, information_retriever], 
    prompt=system_prompt
)

- model: The LLM instance driving the agent.
- tools: List of pre-defined tools (functions marked with @tool) the agent can call during reasoning.
- prompt: The system prompt that sets agent instructions and behavior.

`create_react_agent` abstracts away the complexity of integrating multiple tools and manages the reasoning cycle per the ReAct paper.
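
To build intuition for the reasoning cycle that `create_react_agent` manages, here is a toy loop in plain Python. Everything in it is hypothetical: `scripted_model` replaces the LLM with fixed rules, the tools return invented values, and the message format is simplified. It is a sketch of the reason-then-act pattern, not LangGraph's actual implementation.

```python
def scripted_model(messages):
    """Hypothetical stand-in for an LLM: picks the next action from the conversation so far."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"action": "log_identifier", "input": "TICKET-001"}
    if len(tool_results) == 1:
        return {"action": "information_retriever", "input": tool_results[0]["content"]}
    return {"action": "final", "input": tool_results[-1]["content"]}


# Hypothetical tools with invented return values.
TOOLS = {
    "log_identifier": lambda ticket_id: "Connection Timeout",
    "information_retriever": lambda error: "1. Check connectivity\n2. Retry the job",
}


def run_react_loop(user_prompt, max_turns=5):
    """Alternate between model decisions (reason) and tool calls (act) until a final answer."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        decision = scripted_model(messages)
        if decision["action"] == "final":
            return decision["input"]
        # Act: run the chosen tool and feed the observation back to the model.
        observation = TOOLS[decision["action"]](decision["input"])
        messages.append({"role": "tool", "content": observation})
    return "stopped: turn limit reached"


print(run_react_loop("Can you help me with this ticket_id : TICKET-001?"))
```

The `max_turns` guard reflects a common design choice in agent loops: without a turn limit, a model that never produces a final answer would loop indefinitely.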

In [None]:
# (Optional) Draw agent graph to local file 
# agent.get_graph(xray=True).draw_mermaid_png(
#     output_file_path="graph_diagram.png",  # Specify where to save the PNG image
#     background_color="white", 
#     padding=10
# )

Now let's define a clean interface for interacting with our agent. It formats the input as a message (following the chat format), invokes the agent, and extracts the final response content. The agent will automatically use the tools in the correct sequence to resolve the ticket.

In [None]:
def get_langGraph_agent_response(user_prompt):
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": user_prompt}]}
    response = agent.invoke(agent_input)
    return response['messages'][-1].content

Finally, we test our agent with a sample user prompt including the ticket ID. The agent should:

1. Use the log_identifier tool to find that TICKET-001 corresponds to "Connection Timeout"
2. Use the information_retriever tool to get the solution steps
3. Return the formatted solution steps to the user

When you run the next code cell, you will see the agent's response to the user prompt: a numbered list of steps for resolving the ticket issue, demonstrating that your agent successfully chained the tools together to provide a complete solution.

In [None]:
langGraph_agent_response = get_langGraph_agent_response(user_prompt = 'Can you help me with this ticket_id : TICKET-001?')

In [None]:
print(langGraph_agent_response)

You can now view the traces in MLflow.

1. Open your SageMaker AI Managed MLflow UI
2. In the MLflow UI navigate to the `Traces` tab to see traces of the agent’s reasoning steps and tool calls, enabling rich observability for analysis and debugging.
3. If you used the default values in this notebook, your MLflow experiment will be named `agent-mlflow-demo`.