# Evaluating agent workflows: Customer Support

A more advanced use case of LLM evaluation is measuring the response of conversational systems.

This short cookbook sets up a customer support agent system and then shows a practical way to evaluate it using Braintrust.


In this case, we'll have an LLM play both the customer and the support agent.

First, let's put together a customer support operating procedure and prompt, with relevant actions:


In [97]:
from textwrap import dedent

CUSTOMER_SUPPORT_AGENT_PROMPT = dedent(
    """
    You are a reliable and friendly customer support agent named Viola, speaking to a customer.
    You will attempt to assist the customer while complying with the company policies.

    Company Policies:
    Items can be exchanged within 60 days of purchase if they are defective. 
    Items can be returned within 30 days of purchase so long as they are in the same condition as when they were purchased.
    Offer an exchange before offering a refund, but if the customer insists on a refund, you can process it.
    If a situation is outside of the above, you can refuse a return or exchange.

    You don't need any information from the customer to process refunds or exchanges, but you do need their permission.

    Your task:
    Read the customer's situation and respond to them with the appropriate message and action.
    The available actions are: TALK, EXCHANGE, REFUND, or REFUSE.
    Select TALK if you do not have enough information to make a final decision. Make sure to ask for the information you need in your MESSAGE!
    You must get the customer's permission before processing a REFUND or EXCHANGE.
    If you have enough information to make a final decision, provide the customer with a message explaining your decision.
    Your response MUST be formatted exactly like the following, replacing the lowercase placeholders with your own content:

    ACTION:action
    MESSAGE:your message to the customer
    """
)
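The prompt above fixes a two-line `ACTION:` / `MESSAGE:` reply format, so the agent's reply has to be parsed before it can be scored. The following is a minimal sketch of a tolerant parser, not the parsing used later in this cookbook. The names `parse_agent_response` and `VALID_ACTIONS` are hypothetical, and returning `None` for a malformed reply is an assumption about how you might want to handle a model that ignores the format.

```python
import re

# The four actions listed in the agent prompt.
VALID_ACTIONS = {"TALK", "EXCHANGE", "REFUND", "REFUSE"}


def parse_agent_response(text):
    """Return (action, message); action is None if the reply is malformed."""
    action_match = re.search(r"^\s*ACTION:\s*(\w+)", text, flags=re.MULTILINE)
    message_match = re.search(r"MESSAGE:\s*(.*)", text, flags=re.DOTALL)
    if action_match is None or message_match is None:
        # The model ignored the format: keep the raw text as the message.
        return None, text.strip()
    action = action_match.group(1).upper()
    message = message_match.group(1).strip()
    if action not in VALID_ACTIONS:
        # The reply named an action that the prompt never offered.
        return None, message
    return action, message
```

A caller could treat a `None` action as a failed turn and either retry or end the conversation, rather than letting an `IndexError` from string splitting abort the whole run.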

We'll also create a few example customer situations to simulate:


In [98]:
customer_scenarios = {
    "Ankur": "Ankur bought a violin but it arrived broken yesterday. He is very frustrated because he has a concert coming up",
    "Tara": "Tara bought a violin 4 months ago and didn't use it as much as she wanted to",
    "Albert": "Albert bought a violin and it arrived in perfect condition, but after a week, he decided he didn't want it anymore. He hasn't used it much",
    "Hendricks": "Hendricks bought a violin and it arrived in perfect condition, but after a week, he decided he didn't want it anymore. He broke a string while playing it",
}

By comparing each customer scenario with our return and exchange policy, we can come up with the expected final action for our agent:


In [99]:
expected_responses = {
    "Ankur": "EXCHANGE",
    "Tara": "REFUSE",
    "Albert": "REFUND",
    "Hendricks": "REFUSE",
}
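Because scoring is an exact string comparison against these labels, a misspelled label can never match. One way to guard against that is a small check run before the eval. This is a sketch under assumptions: `FINAL_ACTIONS` and `find_label_problems` are hypothetical names, and the sample data at the bottom is made up for illustration.

```python
# Actions that end a conversation; TALK is excluded because it is never a final outcome.
FINAL_ACTIONS = {"EXCHANGE", "REFUND", "REFUSE"}


def find_label_problems(scenarios, expected):
    """Return a list of human-readable problems with the expected labels."""
    problems = []
    for name in sorted(set(scenarios) | set(expected)):
        if name not in expected:
            problems.append(f"{name}: no expected action")
        elif name not in scenarios:
            problems.append(f"{name}: no scenario")
        elif expected[name] not in FINAL_ACTIONS:
            problems.append(f"{name}: unknown action {expected[name]!r}")
    return problems


# Hypothetical data: one misspelled label and one scenario without a label.
sample_scenarios = {"A": "bought a cello", "B": "bought a flute"}
sample_expected = {"A": "REFND"}
print(find_label_problems(sample_scenarios, sample_expected))
```

An empty list means every scenario has a label and every label is one of the final actions.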

In [100]:
def produce_customer_prompt(name, persona):
    return dedent(
        f"""
    You are playing the role of {name}, who is a customer with the following situation:
    <situation>{persona}</situation>
    You are chatting with a customer support agent. Use the above details to carry out the conversation. 
    When you feel your issue has been resolved, please include the upper-case word DONE in your response.
    """
    )
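The customer prompt asks the simulated customer to include the upper-case word DONE once the issue is resolved. The simulation loop in this cookbook ends on the agent's action instead, so the helper below is only a sketch of how that signal could be detected if you wanted to use it. The name `customer_is_done` is hypothetical.

```python
import re


def customer_is_done(message):
    """True if the message contains DONE as a standalone upper-case word."""
    # \b keeps words such as "ABANDONED" from matching, and the pattern is
    # case-sensitive, so an ordinary lower-case "done" does not count.
    return re.search(r"\bDONE\b", message) is not None
```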

In [102]:
import os
import braintrust
import openai

braintrust.login(
    api_key=os.environ.get("BRAINTRUST_API_KEY", "Your BRAINTRUST_API_KEY here")
)

openai_client = braintrust.wrap_openai(
    openai.AsyncOpenAI(
        base_url="https://braintrustproxy.com/v1",
        default_headers={"x-bt-use-cache": "always"},
        api_key=os.environ.get("OPENAI_API_KEY", "Your OPENAI_API_KEY here"),
    )
)

In [134]:
# prevent runaway conversations from chatty AI :)
N_CONVERSATION_LIMIT = 10


def switch_roles(message):
    return {
        "role": "user" if message["role"] == "assistant" else "assistant",
        "content": message["content"],
    }


def reformat_messages_for_final(message, customer_name):
    # since we've been reformatting the messages for openai, display "true" format to end user
    # The last message will always be the agent
    return {
        "role": "Viola" if message["role"] == "user" else customer_name,
        "content": message["content"],
    }


async def simulate_conversation(input):
    customer_name = input["customer_name"]
    customer_scenario = input["customer_scenario"]
    messages = [
        {"role": "system", "content": None},  # Will be dynamically filled in
        {"role": "user", "content": "Hi, my name is Viola, how can I help you?"},
    ]
    action = None  # stays None if the turn limit is hit before the agent replies
    action_taken = False
    whose_turn = "customer"

    while not action_taken and len(messages) < N_CONVERSATION_LIMIT:
        if whose_turn == "customer":
            messages[0]["content"] = produce_customer_prompt(
                customer_name, customer_scenario
            )
        else:
            messages[0]["content"] = CUSTOMER_SUPPORT_AGENT_PROMPT
        response = await openai_client.chat.completions.create(
            model="gpt-4o", messages=messages, max_tokens=150
        )
        response_message = response.choices[0].message.content

        if whose_turn == "agent":
            action = response_message.split("ACTION:")[1].split("\n")[0].strip()
            response_message = response_message.split("MESSAGE:")[1].strip()
            if action != "TALK":
                action_taken = True
        messages.append({"role": "assistant", "content": response_message})
        whose_turn = "customer" if whose_turn == "agent" else "agent"
        # flip user/assistant roles for every message:
        if not action_taken:
            messages = [messages[0]] + [switch_roles(m) for m in messages[1:]]
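    # turn the message list back into a readable transcript, keeping the system prompt
    # first and converting only the conversation turns; the transcript is not returned,
    # because the eval scores only the agent's final action
    messages = [messages[0]] + [
        reformat_messages_for_final(m, customer_name) for m in messages[1:]
    ]
    return action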
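The loop above uses a single message list for both speakers: before each turn, the `user` and `assistant` roles are swapped so that whichever model speaks next sees its own earlier lines as `assistant` and the other party's lines as `user`. The sketch below isolates that step on a made-up history. `flip_conversation` is a hypothetical helper name, and `switch_roles` is repeated here only so the block runs on its own.

```python
def switch_roles(message):
    # Same logic as the cookbook's helper, repeated so this block is self-contained.
    return {
        "role": "user" if message["role"] == "assistant" else "assistant",
        "content": message["content"],
    }


def flip_conversation(messages):
    """Swap user/assistant on every turn, leaving the system message untouched."""
    return [messages[0]] + [switch_roles(m) for m in messages[1:]]


# Made-up history as seen by the customer model: the agent's greeting is "user".
history = [
    {"role": "system", "content": "<prompt for whoever speaks next>"},
    {"role": "user", "content": "Hi, my name is Viola, how can I help you?"},
    {"role": "assistant", "content": "My violin arrived broken."},
]

# After flipping, the agent model sees its greeting as "assistant"
# and the customer's complaint as "user".
flipped = flip_conversation(history)
for m in flipped:
    print(m["role"], "|", m["content"])
```

Flipping twice restores the original roles, which is why the loop can keep alternating without tracking who originally said what.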

In [136]:
# Example customer details
customer_name = "John"
customer_scenario = "John bought a phone from your store 40 days ago and it stopped working. John would like to return it or get a refund."

# Run one simulated conversation; the function returns the agent's final action
final_action = await simulate_conversation(
    dict(customer_name=customer_name, customer_scenario=customer_scenario)
)

In [137]:
from braintrust import Eval

In [141]:
from autoevals import Score


async def score_func(output, expected, metadata, hooks, **kwargs):
    # Exact match: 1.0 if the agent's final action equals the expected action, else 0.0
    return float(output == expected)


eval_result = await Eval(
    name="Agent Evaluation: Customer Support",
    experiment_name="async hell",
    data=[
        dict(
            input=dict(customer_name=name, customer_scenario=scenario),
            expected=expected_responses[name],
        )
        for name, scenario in customer_scenarios.items()
    ],
    task=simulate_conversation,
    scores=[score_func],
)

Experiment async hell-4cb0e0cd is running at https://www.braintrust.dev/app/braintrustdata.com/p/Agent%20Evaluation%3A%20Customer%20Support/experiments/async%20hell-4cb0e0cd
Agent Evaluation: Customer Support [experiment_name=async hell] (data): 4it [00:00, 11530.73it/s]


Agent Evaluation: Customer Support [experiment_name=async hell] (tasks):   0%|          | 0/4 [00:00<?, ?it/s]


async hell-4cb0e0cd compared to async hell:
75.00% (+75.00%) 'score_func' score	(3 improvements, 0 regressions)

0.28s (-29.77%) 'duration'	(4 improvements, 0 regressions)

See results for async hell-4cb0e0cd at https://www.braintrust.dev/app/braintrustdata.com/p/Agent%20Evaluation%3A%20Customer%20Support/experiments/async%20hell-4cb0e0cd
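The reported `score_func` percentage is the mean of the per-scenario exact-match scores. The sketch below reproduces that arithmetic offline on made-up output/expected pairs. The pairs are hypothetical and are not the results of the run above, and `exact_match` and `mean_score` are illustrative names rather than Braintrust APIs.

```python
def exact_match(output, expected):
    # Same comparison as score_func, without the extra eval arguments.
    return float(output == expected)


def mean_score(pairs):
    """Average exact-match score over (output, expected) pairs."""
    scores = [exact_match(output, expected) for output, expected in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical results: three matches and one miss.
hypothetical_pairs = [
    ("EXCHANGE", "EXCHANGE"),
    ("REFUSE", "REFUSE"),
    ("REFUND", "REFUND"),
    ("REFUND", "REFUSE"),
]
print(f"{mean_score(hypothetical_pairs):.2%}")
```

With only four scenarios, each one moves the score by 25 percentage points, so a single mismatch is worth inspecting individually in the experiment view.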
