# Evaluating a Chat Bot

This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant.

Chat bots like this have become important parts of many applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As the owner of such an application, you need to be sure the bot provides value to the user.

We will expand on this below, but the history and context of a conversation are usually crucial to producing a good response. If you received a request to "Make a dinner reservation at 7pm" and you knew where, on what date, and for how many people, you could help; otherwise, you'd need to ask for more information.

Before starting, please make sure that you have a Braintrust account. If you do not have one, you can [sign up here](https://www.braintrustdata.com).

## Installing dependencies

Begin by installing the necessary dependencies if you have not done so already.

In [None]:
npm install autoevals braintrust openai

## Inspecting the data

Let's take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying [dataset.ts](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/EvaluatingCustomerSupportAgent/assets/dataset.ts) file.

Below is an example of a data point.
- `chat_history` contains the history of the conversation between the user and the assistant
- `input` is the last `user` turn that will be sent to the completion
- `expected` is the output expected given the input

From looking at this one example, it is clear why we need the history to be able to provide a helpful response.

If you were asked "who won the men's trophy that year?", you would wonder *What trophy? Which year?* But if you read the `chat_history`, you'd be able to answer the question (maybe after some quick research).

In [None]:
const dataset = [
  {
    chat_history: [
      { role: "user", content: "when was the ballon d'or first awarded for female players?" },
      {
        role: "assistant",
        content:
          `The Ballon d'Or Féminin, awarded to the best female football player in the world, was first introduced in 2018.
          The inaugural winner was Ada Hegerberg of Norway, who played for Olympique Lyonnais at the time.`,
      },
    ],
    input: "who won the men's trophy that year?",
    expected: "The men's Ballon d'Or trophy in 2018 was won by Luka Modrić of Croatia, who played for Real Madrid."
  },
  // ... rest of dataset
]
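The `runTask` and `Factuality` functions later in this cookbook read both `input` and `chat_history` from the eval's input. One way to get there from a flat record like the one above is a small reshaping step. This is only a sketch and not necessarily how the accompanying `dataset.ts` is organized; `FlatTurn`, `EvalCase`, and `toEvalCase` are hypothetical names introduced for illustration.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Flat record, shaped like the example data point above.
type FlatTurn = {
  chat_history: ChatMessage[];
  input: string;
  expected: string;
};

// Hypothetical eval-shaped record: everything the task needs sits under `input`.
type EvalCase = {
  input: { input: string; chat_history: ChatMessage[] };
  expected: string;
};

// Move the last user turn and its history under a single `input` object.
function toEvalCase(turn: FlatTurn): EvalCase {
  return {
    input: { input: turn.input, chat_history: turn.chat_history },
    expected: turn.expected,
  };
}

const exampleTurn: FlatTurn = {
  chat_history: [
    { role: "user", content: "when was the ballon d'or first awarded for female players?" },
    { role: "assistant", content: "The Ballon d'Or Féminin was first introduced in 2018." },
  ],
  input: "who won the men's trophy that year?",
  expected: "The men's Ballon d'Or trophy in 2018 was won by Luka Modrić of Croatia.",
};

const exampleCase = toEvalCase(exampleTurn);
```

With records in this shape, the task receives the question and its history together, and the scorer can see the same context.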

## Running evals

The key to running evals on a multi-turn conversation is to include the history of the chat in the chat completion request.

To do so, we simply add the history between the `system` message and the final `user` message in the `messages` argument as seen below.

In [None]:
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

type ChatHistory = {
  role: "user" | "assistant",
  content: string
}

async function runTask({
  input,
  chat_history
}: {
  input: string,
  chat_history: ChatHistory[]
}) {
  const client = wrapOpenAI(
    new OpenAI({
      baseURL: "https://braintrustproxy.com/v1",
      apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral etc. API keys here
    })
  )

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "You are a helpful and polite assistant who knows about sports."
      },
      ...chat_history,
      {
        role: "user",
        content: input,
      }
    ]
  });
  return response.choices[0].message.content || "";
}
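The ordering of the `messages` array is the part that matters here, so it can help to isolate it in a pure function that is easy to check without calling the model. The sketch below assumes nothing beyond the message shape used above; `buildMessages` is a hypothetical helper, not part of the Braintrust or OpenAI SDKs.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Place the prior turns between the system prompt and the newest user turn.
function buildMessages(
  systemPrompt: string,
  chatHistory: Message[],
  input: string
): Message[] {
  return [
    { role: "system", content: systemPrompt },
    ...chatHistory,
    { role: "user", content: input },
  ];
}

const sampleMessages = buildMessages(
  "You are a helpful and polite assistant who knows about sports.",
  [
    { role: "user", content: "when was the ballon d'or first awarded for female players?" },
    { role: "assistant", content: "It was first introduced in 2018." },
  ],
  "who won the men's trophy that year?"
);

// Print the role sequence to confirm the history sits in the middle.
console.log(sampleMessages.map((m) => m.role).join(" -> "));
```

The result could be passed as the `messages` argument of the completion request in place of the inline array.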

### Scoring the eval

We'll use a `Factuality` scoring function to check how the output of the completion compares to the expected value.

For this cookbook, we will alter the [built-in factuality prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) slightly.

The reason is that the built-in Factuality scorer does not score the eval significantly lower when the chat history is removed. If the model lacks the context to answer factually and instead asks the user to be more specific, the built-in scorer judges that response as having no factual inconsistencies with the expert answer. Because this cookbook illustrates how chat history helps produce a good response, we tweak the built-in prompt to account for this.

In the [altered factuality spec](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/EvaluatingCustomerSupportAgent/assets/factuality.ts), you can see that we have added two more score choices:
- (F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
- (G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.

These will score (F) = 0.2 and (G) = 0 so the model gets some credit if there was any context it was able to gather from the user's input.
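To make the scoring concrete, the mapping from the classifier's letter choice to a numeric score can be sketched as a lookup table. Only the (F) and (G) values come from this cookbook; the (A) through (E) values are an assumption about the built-in template and should be checked against the linked `factuality.yaml`. `choiceScores` and `scoreForChoice` are hypothetical names.

```typescript
// Letter choice -> numeric score.
const choiceScores: Record<string, number> = {
  A: 0.4, // assumed built-in value
  B: 0.6, // assumed built-in value
  C: 1, // assumed built-in value
  D: 0, // assumed built-in value
  E: 1, // assumed built-in value
  F: 0.2, // from this cookbook: asks for clarification, but consistent facts
  G: 0, // from this cookbook: asks for clarification, no consistent facts
};

// Look up the score for a choice, failing loudly on an unknown letter.
function scoreForChoice(choice: string): number {
  const score = choiceScores[choice];
  if (score === undefined) {
    throw new Error(`Unknown choice: ${choice}`);
  }
  return score;
}

console.log(scoreForChoice("F"), scoreForChoice("G"));
```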

In [None]:
import { Eval } from "braintrust";
import dataset from "./assets/dataset";
import factualitySpec from "./assets/factuality";
import { LLMClassifierFromSpec } from "autoevals";

function Factuality(args: {
  input: {
    input: string,
    chat_history: ChatHistory[]
  },
  output: string,
  expected: string
}) {
  const scorer = LLMClassifierFromSpec(
    "Factuality",
    {
      ...factualitySpec,
      model: "gpt-4o"
    }
  );
  return scorer(args);
}

Eval("Chat assistant", {
  experimentName: "gpt-4o assistant",
  data: () => dataset,
  task: runTask,
  scores: [Factuality],
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful and polite assistant who knows about sports."
  }
});

In [None]:
Experiment gpt-4o assistant is running at http://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant
 ████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 5/5 datapoints

=========================SUMMARY=========================
64.00% 'Factuality' score       (0 improvements, 0 regressions)

4.28s 'duration'        (0 improvements, 0 regressions)

See results for gpt-4o assistant at http://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant

## Visualizing

Now that we've run the eval, we can look at the results on [braintrust.dev](https://www.braintrust.dev).

Open the newly created `Chat assistant` project and you can see the experiment we just ran:

![dist-chart](./assets/assistant-dist.png)

The model scored 64% on Factuality, but the chart shows that many responses earned partial credit, rather than the score being split between a few perfect responses and a few completely incorrect ones. The responses either matched the expected answer exactly or were a subset or superset of it while remaining fully factually consistent with it.

### Comparing to a model with no context

To illustrate how much the model needs the `chat_history` to produce a good response, let's remove the history from the completions request.

Comment out or delete the line `...chat_history` from the `client.chat.completions.create(...)` request in the task function.

In [None]:
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: "You are a helpful and polite assistant who knows about sports."
    },
    // ...chat_history,
    {
      role: "user",
      content: input,
    }
  ]
});
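An alternative to editing the task is to wrap it so that it always sees an empty history, which keeps both variants available side by side. This is a sketch under the assumption that the task takes `{ input, chat_history }` as above; `withoutHistory` is a hypothetical helper, and `fakeTask` only stands in for a real task to show the effect.

```typescript
type ChatTurn = { role: "user" | "assistant"; content: string };
type TaskInput = { input: string; chat_history: ChatTurn[] };

// Wrap a task so that it is always called with an empty chat history.
// Spreading an empty array into `messages` is equivalent to omitting the history.
function withoutHistory<T>(
  task: (args: TaskInput) => Promise<T>
): (args: TaskInput) => Promise<T> {
  return (args) => task({ ...args, chat_history: [] });
}

// Stand-in task that records how many history turns it received.
let turnsSeen = -1;
const fakeTask = async (args: TaskInput): Promise<string> => {
  turnsSeen = args.chat_history.length;
  return args.input;
};

void withoutHistory(fakeTask)({
  input: "who won the men's trophy that year?",
  chat_history: [{ role: "user", content: "an earlier question" }],
});
```

In the no-history experiment, the wrapped task could then be passed as `task: withoutHistory(runTask)`.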

Run the eval again in a new `gpt-4o assistant - no history` experiment.

In [None]:
Eval("Chat assistant", {
  experimentName: "gpt-4o assistant - no history",
  data: () => dataset,
  task: runTask,
  scores: [Factuality],
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful and polite assistant who knows about sports."
  }
});

In [None]:
Experiment gpt-4o assistant - no history is running at http://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history
 ████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 5/5 datapoints

=========================SUMMARY=========================
gpt-4o assistant - no history compared to gpt-4o assistant:
4.00% (-60.00%) 'Factuality' score      (0 improvements, 5 regressions)

4.14s 'duration'        (2 improvements, 3 regressions)

See results for gpt-4o assistant - no history at http://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history

As expected, the Factuality score dropped sharply, from 64% to just 4%.

If we open [the dashboard](https://www.braintrust.dev), we can compare the results of the two experiments.

![dist-chart](./assets/assistant-dist-comp.png)

For one datapoint, the model was able to glean some context from the user's input and provide information that was factually consistent with the expected answer. For the rest, the model needed the context and specificity that the `chat_history` provides.
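The 4% figure follows from simple averaging over the 5 datapoints. The per-datapoint values below are inferred from the summary and the description above (one response earning the 0.2 partial credit, four scoring 0), not read from the experiment itself; `meanScore` is a hypothetical helper.

```typescript
// Average a list of per-datapoint scores.
function meanScore(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("Cannot average an empty list of scores");
  }
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Inferred: one (F) response at 0.2 and four responses scoring 0.
const noHistoryScores = [0.2, 0, 0, 0, 0];
const noHistoryMean = meanScore(noHistoryScores);

console.log(`${(noHistoryMean * 100).toFixed(2)}%`); // prints "4.00%"
```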

## Conclusion

Hopefully it's clear at this point why a model needs context to produce helpful responses and why providing the chat history to the completions request is the strategy for running evals on chat assistant models.

The experiment still only scores 64% on Factuality, so see what you can do with the model parameters or prompt to improve the results!