# Agent Online Evaluation with MLflow

Online evaluation is a real-time assessment method for AI agents that occurs during actual deployment and interaction with users or environments. This approach provides immediate, real-world feedback on how the agent performs under genuine conditions, capturing nuances and edge cases that might not be apparent in historical data. Unlike offline evaluation, online evaluation offers insights into how the system adapts to changing user behaviors, environmental conditions, and emerging patterns in real-time.

MLflow's tracking capabilities can record live metrics (such as user feedback and system behavior), allowing developers to compare actual performance against expectations and against benchmark results from offline testing. The platform can also help manage model updates and versions based on real-world performance data. In this section, we demonstrate how to store online evaluation results, specifically user feedback, in MLflow.
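
As a rough illustration of what recording a live metric can look like, the sketch below turns a stream of thumbs-up/thumbs-down votes into a running helpful rate and logs each value with `mlflow.log_metric`. The helper names and the metric key `helpful_rate` are hypothetical and not part of this notebook, and the logging step assumes a tracking URI and experiment have already been configured.

```python
def running_helpful_rate(votes):
    """Return the cumulative share of helpful votes after each vote."""
    rates = []
    helpful = 0
    for count, is_helpful in enumerate(votes, start=1):
        helpful += 1 if is_helpful else 0
        rates.append(helpful / count)
    return rates


def log_helpful_rate(votes, metric_key="helpful_rate"):
    """Log the running helpful rate to MLflow (hypothetical helper)."""
    # Imported lazily so the pure-Python part works without MLflow installed.
    import mlflow

    for step, rate in enumerate(running_helpful_rate(votes)):
        mlflow.log_metric(metric_key, rate, step=step)


# Three votes: helpful, not helpful, helpful
print(running_helpful_rate([True, False, True]))
```

Keeping the aggregation separate from the logging call makes the rate easy to check on its own before anything is written to the tracking server.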

In [None]:
import boto3
import os

from data.data import log_data_set, log_data
from data.solution_book import knowledge_base

import sagemaker_mlflow
import mlflow

print(sagemaker_mlflow.__version__)
print(mlflow.__version__)

tracking_server_arn = ""  # set this to the ARN of your SageMaker MLflow tracking server
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

mlflow.langchain.autolog()

0.1.0
3.0.0


In [2]:
mlflow.langchain.autolog()

### LangGraph Agent implementation

In [3]:
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
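    if ticket_id not in log_data_set:
        return "ticket id not found in the database"
    
    for item in log_data:
        if item["id"] == ticket_id:
            return item['error_name']

    # Fallback so the tool never returns None if the two data sources disagree
    return "ticket id not found in the database"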

@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve error solution based on error type

    Args:
        error_type: user input error type
    
    Returns:
        a str of steps 
    """

    if error_type not in knowledge_base.keys():
        return "error type not found in the knowledge base, please use your own knowledge"
    
    return knowledge_base[error_type]

llm = init_chat_model(
    model= "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_provider="bedrock_converse",
)

system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools: 
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool. 
Then you should search the database for information on how to resolve the error using the information_retriever tool.

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

agent = create_react_agent(
    model=llm,
    tools= [log_identifier, information_retriever], 
    prompt=system_prompt
)

def get_langGraph_agent_response(ticket_id = 'TICKET-001'):
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": ticket_id}]}
    response = agent.invoke(agent_input)
    return response

langGraph_agent_response = get_langGraph_agent_response(ticket_id = 'TICKET-001')
print(langGraph_agent_response['messages'][-1].content)


1. Check network connectivity between client and server
2. Verify if the server is running and accessible
3. Increase the connection timeout settings
4. Check for firewall rules blocking the connection
5. Monitor network latency and bandwidth
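
Depending on the model provider, the `content` of a LangChain message can be a plain string or a list of typed parts such as `text` and `tool_use`. The sketch below shows one way to flatten either shape into display text. `message_text` is a hypothetical helper that is not part of this notebook, and it assumes each part is either a string or a dictionary with a `type` key.

```python
def message_text(content):
    """Flatten a chat message `content` field into plain text."""
    if isinstance(content, str):
        return content

    parts = []
    for part in content:
        if isinstance(part, str):
            parts.append(part)
        elif isinstance(part, dict) and part.get("type") == "text":
            parts.append(part.get("text", ""))
        # Other part types (for example tool_use) carry no display text.
    return "".join(parts)


# Example with a mixed list of parts
print(message_text([
    {"type": "text", "text": "1. Check network connectivity"},
    {"type": "tool_use", "name": "log_identifier"},
]))
```

With the agent above, this could be applied as `message_text(langGraph_agent_response['messages'][-1].content)`.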
    


## Online Evaluation for user feedback 

This example shows how to create and store user feedback in MLflow via trace tags, and how to retrieve relevant traces based on those tags.

> Note: If you have MLflow 3.2.0 or later installed, MLflow Feedback provides a comprehensive system for capturing quality evaluations from multiple sources, whether automated AI judges, programmatic rules, or human reviewers. This systematic approach to feedback collection enables you to understand and improve your GenAI application's performance at scale. For complete API documentation and implementation details, see the [mlflow.log_feedback()](https://mlflow.org/docs/3.4.0/api_reference/python_api/mlflow.html#mlflow.log_feedback) reference.
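
For readers on a version that ships `mlflow.log_feedback`, a minimal sketch of recording the same thumbs-up/down signal as a human assessment might look like the following. The helper name `log_thumbs_feedback` and the assessment name `user_feedback` are assumptions for illustration. The rest of this notebook uses trace tags instead.

```python
def log_thumbs_feedback(trace_id, is_helpful, user_id="user-dev", rationale=None):
    """Attach a human thumbs-up/down assessment to a trace (hypothetical helper)."""
    # Imported lazily; requires an MLflow version that provides mlflow.log_feedback.
    import mlflow
    from mlflow.entities import AssessmentSource, AssessmentSourceType

    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_feedback",  # assumed assessment name
        value=bool(is_helpful),
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id=user_id,
        ),
        rationale=rationale,
    )
```

A call such as `log_thumbs_feedback(trace_id, True, user_id="online-demo-1")` would then attach the assessment to the given trace, assuming the tracking server supports assessments.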

Let's first define some helper functions for user feedback: 
- `collect_user_feedback`: automatically uses the latest trace and lets the user mark it as helpful or not. A helpful trace is tagged 👍 in the UI, otherwise 👎. Users can pass **any** keyword arguments as additional information to store in the MLflow tags.
- `collect_user_feedback_with_traceID`: adds or edits feedback for a specific trace. It works like `collect_user_feedback`, but once you specify the `trace_id` you can overwrite the values you entered before.
- `delete_feedback`: deletes a feedback tag based on the trace ID and key.

In [4]:
from mlflow.entities import AssessmentSource, AssessmentSourceType
import mlflow

In [5]:
def collect_user_feedback(is_helpful, user_id = 'user-dev', **kwargs):
    '''Collect user feedback for the most recent trace and store it as MLflow trace tags

    Args: 
        is_helpful: whether the output was helpful
        user_id: ID of the user who logs the feedback
        kwargs: any additional key-value pairs to store as tags

    Returns: 
        None 
    '''
    trace_id = mlflow.get_last_active_trace_id()
    if trace_id is None:
        raise RuntimeError("No trace found; invoke the agent before collecting feedback")

    feedback_value = '👍' if is_helpful else '👎'
    mlflow.set_trace_tag(trace_id=trace_id, key="User_ID", value=user_id)
    mlflow.set_trace_tag(trace_id=trace_id, key="User_Feedback", value=feedback_value)

    # Store any extra key-value pairs; tag values are written as strings
    for key, value in kwargs.items():
        mlflow.set_trace_tag(trace_id=trace_id, key=key, value=str(value))

    print(f"✓ Feedback recorded: {feedback_value}")

def collect_user_feedback_with_traceID(trace_id, is_helpful, user_id = 'user-dev', **kwargs):
    '''Collect user feedback for a specific trace and store it as MLflow trace tags

    Args: 
        trace_id: ID of the trace to tag
        is_helpful: whether the output was helpful
        user_id: ID of the user who logs the feedback
        kwargs: any additional key-value pairs to store as tags

    Returns: 
        None 
    '''

    feedback_value = '👍' if is_helpful else '👎'
    mlflow.set_trace_tag(trace_id=trace_id, key="User_ID", value=user_id)
    mlflow.set_trace_tag(trace_id=trace_id, key="User_Feedback", value=feedback_value)

    # Store any extra key-value pairs; setting an existing key overwrites its value
    for key, value in kwargs.items():
        mlflow.set_trace_tag(trace_id=trace_id, key=key, value=str(value))

    print(f"✓ Feedback recorded: {feedback_value}")

def delete_feedback(trace_id, key):
    '''Delete a feedback tag from a trace

    Args:
        trace_id: ID of the trace that holds the tag
        key: name of the tag to delete
    
    Returns:
        None
    '''
    mlflow.delete_trace_tag(trace_id=trace_id, key=key)
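
Since the feedback is stored as string tags, it can help to assemble and inspect the full tag dictionary before writing it. The sketch below is one possible way to do that. `build_feedback_tags` is a hypothetical helper, not part of the functions above, and rejecting extra keys that collide with `User_ID` or `User_Feedback` is a design assumption.

```python
def build_feedback_tags(is_helpful, user_id="user-dev", **extra):
    """Build the string-only tag dictionary for one piece of user feedback."""
    tags = {
        "User_ID": str(user_id),
        "User_Feedback": "👍" if is_helpful else "👎",
    }
    for key, value in extra.items():
        if key in tags:
            raise ValueError(f"'{key}' is reserved for the feedback tags")
        # Convert every extra value to a string before it is stored as a tag.
        tags[key] = str(value)
    return tags


print(build_feedback_tags(True, "online-demo-1", latency_ms=3866))
```

Each entry of the returned dictionary can then be passed to `mlflow.set_trace_tag(trace_id=..., key=..., value=...)` in a loop.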


In [6]:
# Get the most recent trace
trace_id = mlflow.get_last_active_trace_id()

# For example, a config dictionary can hold all the key-value pairs to be stored in MLflow.
# Here we mark the trace as helpful, set user_id to online-demo-1, and pass the extra key-value pairs.

config = {
    'tag_anything': 'text_anything'
}

collect_user_feedback(is_helpful = True, user_id = 'online-demo-1', **config)

✓ Feedback recorded: 👍


You can search for relevant traces based on the tags. For example, to find all traces with a User ID equal to online-demo-1:

In [7]:
user_traces = mlflow.search_traces(
    filter_string="tags.User_ID = 'online-demo-1'"
)
user_traces

| | trace_id | trace | client_request_id | state | request_time | execution_duration | request | response | trace_metadata | tags | spans | assessments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cc248d650ac2410191f3027f35cf0c5b | Trace(trace_id=cc248d650ac2410191f3027f35cf0c5b) | | TraceState.OK | 1758758909633 | 3866 | {'messages': [{'role': 'user', 'content': 'TIC... | {'messages': [{'content': 'TICKET-001', 'addit... | {'mlflow.trace_schema.version': '3', 'mlflow.u... | {'mlflow.traceName': 'LangGraph', 'mlflow.arti... | [{'trace_id': 'z9IlPmYtYgTp/TYFoX624A==', 'spa... | [] |
| 1 | a097eb615ed142a3832d9094dc1b9f90 | Trace(trace_id=a097eb615ed142a3832d9094dc1b9f90) | | TraceState.OK | 1758550274070 | 3717 | {'messages': [{'role': 'user', 'content': 'TIC... | {'messages': [{'content': 'TICKET-001', 'addit... | {'mlflow.trace_schema.version': '3', 'mlflow.u... | {'mlflow.traceName': 'LangGraph', 'mlflow.arti... | [{'trace_id': 'wl3fjxY6wepWhxlXeU9BuA==', 'spa... | [] |
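
When several tags are involved, the filter string and a quick summary of the results can be built with small helpers. The sketch below is illustrative only: `build_tag_filter` and `count_feedback` are hypothetical names, it assumes tag clauses can be combined with `AND` in the search syntax, and it refuses values containing single quotes rather than guessing at escaping rules.

```python
from collections import Counter


def build_tag_filter(**tags):
    """Build a filter string that matches traces carrying all given tags."""
    clauses = []
    for key, value in tags.items():
        if "'" in str(value):
            # Quote escaping is not covered here, so refuse instead of guessing.
            raise ValueError("tag values containing single quotes are not supported")
        clauses.append(f"tags.{key} = '{value}'")
    return " AND ".join(clauses)


def count_feedback(tag_dicts, key="User_Feedback"):
    """Count feedback values across an iterable of trace tag dictionaries."""
    return Counter(tags[key] for tags in tag_dicts if key in tags)


print(build_tag_filter(User_ID="online-demo-1"))
```

The result can be passed as `filter_string` to `mlflow.search_traces`, and `count_feedback(user_traces["tags"])` would then tally the 👍 and 👎 values in the returned `tags` column.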
