# Comparing AI models with Braintrust

This tutorial shows how to use Braintrust to run the same prompts against different AI models and parameters, so you can decide which model to use for your AI apps.

Before starting, please make sure that you have a Braintrust account. If you do not, please [sign up](https://www.braintrustdata.com). After this tutorial, feel free to dig deeper by visiting [the docs](http://www.braintrustdata.com/docs).


## Installing dependencies

To see a list of dependencies, you can view the accompanying [package.json](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/ModelComparison/package.json) file. Feel free to copy/paste snippets of this code to run in your environment, or use [tslab](https://github.com/yunabe/tslab) to run the tutorial in a Jupyter notebook.

## Setting up the data

For this example, we will use a small subset of data taken from the [google/boolq](https://huggingface.co/datasets/google/boolq) dataset. If you'd like, you can try datasets and prompts from any of the other [cookbooks](https://www.braintrustdata.com/docs/cookbook/) at Braintrust.


In [1]:
// curl -X GET "https://datasets-server.huggingface.co/rows?dataset=google%2Fboolq&config=default&split=train&offset=600&length=5" > ./assets/dataset.json
import dataset from "./assets/dataset.json";

// label these 1-3 so that they will be easier to recognize in the app
const prompts = [
  "(1) - true or false",
  "(2) - Answer using true or false only",
  "(3) - Answer the following question as accurately as possble with the words 'true' or 'false' in lowercase only. Do not include any other words in the response",
];

// extract question/answers from rows into input/expected
const evalData = dataset.rows.map(({ row: { question, answer } }) => ({
  input: question,
  expected: `${answer}`,
}));
console.log(evalData.slice(0, 2));


[
  {
    input: 'is there a season 2 of hunted on cinemax',
    expected: 'false'
  },
  {
    input: 'would a change in price shift the demand curve',
    expected: 'false'
  }
]
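The extraction step above can also be written as a typed helper, which makes the expected shape of the downloaded file explicit. This is a minimal sketch: the `BoolqRow`, `DatasetResponse`, and `EvalCase` interfaces and the `toEvalData` name are illustrative assumptions, reduced to the fields this tutorial reads, and the `sample` object stands in for `./assets/dataset.json`.

```typescript
// Assumed shape of the downloaded rows, reduced to the fields used in this tutorial.
interface BoolqRow {
  question: string;
  answer: boolean;
}

interface DatasetResponse {
  rows: { row: BoolqRow }[];
}

interface EvalCase {
  input: string;
  expected: string;
}

// Convert dataset rows into the { input, expected } pairs used by the evals.
// The boolean answer is stringified so it can be compared with the model's text output.
function toEvalData(dataset: DatasetResponse): EvalCase[] {
  return dataset.rows.map(({ row }) => ({
    input: row.question,
    expected: String(row.answer),
  }));
}

// Hypothetical stand-in for ./assets/dataset.json, containing a single row.
const sample: DatasetResponse = {
  rows: [
    { row: { question: "is there a season 2 of hunted on cinemax", answer: false } },
  ],
};

console.log(toEvalData(sample));
```

Stringifying the boolean matters because the exact match comparison later in the tutorial is between strings.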


## Running comparison evals across multiple models

Let's set up some code to compare these prompts and inputs across 3 different models and different temperature values. For this cookbook we will be using different models from OpenAI, but you can use any [LLM client](https://www.braintrustdata.com/docs/guides/tracing#wrapping-a-custom-llm-client).

We'll start by initializing an OpenAI client and wrapping it with some Braintrust instrumentation. `wrapOpenAI` is initially a no-op, but later on, when we use Braintrust, it will capture helpful debugging information about each model's performance. Then we will set up a simple eval that generates data for each combination of model, prompt, and temperature.


In [2]:
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY || "Your OpenAI API Key",
  })
);


In [3]:
async function callModel(
  input: string,
  {
    model,
    temperature,
    systemPrompt,
  }: { model: string; temperature: number; systemPrompt: string }
) {
  const response = await client.chat.completions.create({
    model: model,
    messages: [
      {
        role: "system",
        content: systemPrompt,
      },
      {
        role: "user",
        content: input,
      },
    ],
    temperature,
    seed: 123,
  });
  return response.choices[0].message.content || "";
}


In [4]:
const combinations: { model: string; temperature: number; prompt: string }[] =
  [];
for (const model of ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]) {
  for (const temperature of [
    0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1,
  ]) {
    for (const prompt of prompts) {
      combinations.push({
        model,
        temperature,
        prompt,
      });
    }
  }
}
console.log(combinations.slice(0, 5));


[
  {
    model: 'gpt-3.5-turbo',
    temperature: 0,
    prompt: '(1) - true or false'
  },
  {
    model: 'gpt-3.5-turbo',
    temperature: 0,
    prompt: '(2) - Answer using true or false only'
  },
  {
    model: 'gpt-3.5-turbo',
    temperature: 0,
    prompt: "(3) - Answer the following question as accurately as possble with the words 'true' or 'false' in lowercase only. Do not include any other words in the response"
  },
  {
    model: 'gpt-3.5-turbo',
    temperature: 0.1,
    prompt: '(1) - true or false'
  },
  {
    model: 'gpt-3.5-turbo',
    temperature: 0.1,
    prompt: '(2) - Answer using true or false only'
  }
]
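As a quick sanity check on the size of the grid, the nested loops above amount to a three-way Cartesian product. The sketch below uses a hypothetical `cartesian` helper (not part of Braintrust) and short prompt labels in place of the full prompt strings to confirm the count and the iteration order.

```typescript
// Hypothetical helper: every (a, b, c) triple, with the last list varying fastest.
function cartesian<A, B, C>(as: A[], bs: B[], cs: C[]): [A, B, C][] {
  const out: [A, B, C][] = [];
  for (const a of as) {
    for (const b of bs) {
      for (const c of cs) {
        out.push([a, b, c]);
      }
    }
  }
  return out;
}

const models = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"];
const temperatures = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1];
// Short labels standing in for the three full prompt strings.
const promptLabels = ["(1)", "(2)", "(3)"];

const grid = cartesian(models, temperatures, promptLabels);

// 3 models * 11 temperatures * 3 prompts
console.log(grid.length); // 99
```

Because the prompt varies fastest and the model slowest, all 33 runs for one model finish before the next model starts.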


Let's use the functions and data that we have set up to run some evals on Braintrust! We will be using two scorers for this eval:

1. A simple exact match scorer that checks whether the output from the LLM is identical to the expected value
2. A Levenshtein scorer that scores the output by its Levenshtein (edit) distance from the expected value

We are also adding the model, temperature, and prompt to the metadata so that we can use those fields in our visualizations inside the Braintrust app after the evals have finished running.
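To build intuition for how the two scorers differ, here is a self-contained sketch of both. It does not use the `autoevals` implementation: the function names are illustrative, and normalizing the edit distance by the longer string's length (so that 1 means identical) is an assumption made for this example.

```typescript
// Classic dynamic-programming edit distance: insertions, deletions,
// and substitutions each cost 1.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] is the distance between the processed prefix of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumption for this sketch: normalize by the longer string so the score lies in [0, 1].
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  return maxLen === 0 ? 1 : 1 - levenshteinDistance(output, expected) / maxLen;
}

function exactMatchScore(output: string, expected: string): number {
  return output === expected ? 1 : 0;
}

// A capitalized answer fails exact match but is still close in edit distance.
console.log(exactMatchScore("True", "true"), levenshteinScore("True", "true")); // 0 0.75
console.log(exactMatchScore("true", "true"), levenshteinScore("true", "true")); // 1 1
```

This is why an experiment can report an exact match score of 0 alongside a nonzero Levenshtein score: answers such as `True` or `False.` are near misses rather than entirely wrong strings.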


In [None]:
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

const exactMatch = (args: { input: string; output: string; expected?: string }) => {
  return {
    name: "ExactMatch",
    score: args.output === args.expected ? 1 : 0,
  };
};

// feel free to loop differently instead depending on your rate limits
for (const { model, temperature, prompt } of combinations) {
  await Eval("Model comparison", {
    data: () =>
      evalData.map(({ input, expected }) => ({
        input,
        expected,
      })),
    task: async (input) => {
      return await callModel(input, {
        model: model,
        temperature: temperature,
        systemPrompt: prompt,
      });
    },
    scores: [exactMatch, Levenshtein],
    metadata: {
      model,
      temperature,
      prompt,
    },
  });
}


{
  projectName: 'Model comparison',
  experimentName: 'main-1715969037',
  projectUrl: 'https://www.braintrustdata.com/app/braintrustdata.com/p/Model%20comparison',
  experimentUrl: 'https://www.braintrustdata.com/app/braintrustdata.com/p/Model%20comparison/experiments/main-1715969037',
  comparisonExperimentName: undefined,
  scores: {
    ExactMatch: { name: 'ExactMatch', score: 0, improvements: 0, regressions: 0 },
    Levenshtein: {
      name: 'Levenshtein',
      score: 0.5133333333333333,
      improvements: 0,
      regressions: 0
    }
  },
  metrics: {
    duration: {
      name: 'duration',
      metric: 0.3899999141693115,
      unit: 's',
      improvements: 0,
      regressions: 0
    }
  }
}
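The comment in the eval loop notes that you may want to loop differently depending on your rate limits. One option is to run the combinations in small concurrent batches instead of strictly one at a time. The `chunk` and `runInBatches` helpers below are hypothetical, not part of Braintrust, and the batch size is something you would tune to your own limits.

```typescript
// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) {
    throw new Error("size must be at least 1");
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Run `run` over all items, at most `batchSize` at a time.
// Each batch must finish before the next one starts; results keep the input order.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  run: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, batchSize)) {
    const batchResults = await Promise.all(batch.map((item) => run(item)));
    results.push(...batchResults);
  }
  return results;
}
```

With this in place, the `for` loop could be replaced by something like `await runInBatches(combinations, 3, (combo) => Eval("Model comparison", { ... }))`, where the second argument to `Eval` is the same object as in the loop body. A batch size of 1 reproduces the original sequential behavior.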

## Visualizing

Now that we have successfully run our evals, let's log in to [braintrust.dev](https://braintrust.dev) and take a look at the results.

Click into the newly generated project called `Model comparison`, and check it out! You should notice a few things:

![initial-chart](assets/initial-chart.png)

- Each line represents a score over time, and each data point represents an experiment that was run.
- From the code, we ran 99 experiments (11 temperature values * 3 models * 3 prompts), so each line should consist of 99 dots, each with a different combination of temperature, model, and prompt.
- Metadata fields with numeric values are automatically available as X and Y axis values.
- Plotting temperature on the Y axis also shows the order in which the code ran each experiment.

![initial-chart-temperature](assets/initial-chart-temperature.png)



## Diving in

This chart also lets us group the data so that we can compare experiment runs by model, prompt, and temperature.

By selecting `one color per model`, `one symbol per prompt`, and `X Axis temperature`, we can easily visualize the data across these dimensions.

![grouped-chart](assets/grouped-chart.png)

Looking at this view of the data allows us to see that gpt-4 performed better than gpt-4o at lower temperatures!
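The grouping the chart performs can be approximated in code if you want to summarize experiment results yourself. The sketch below averages a score per group key. The `ExperimentSummary` shape and the `averageScoreBy` helper are assumptions made for illustration, and the numbers are placeholders, not results from this tutorial.

```typescript
// Assumed minimal summary of one experiment: its metadata plus a single score.
interface ExperimentSummary {
  model: string;
  temperature: number;
  score: number;
}

// Average the score of all experiments that share the same key.
function averageScoreBy(
  experiments: ExperimentSummary[],
  keyOf: (experiment: ExperimentSummary) => string
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const experiment of experiments) {
    const key = keyOf(experiment);
    const entry = sums[key] || { total: 0, count: 0 };
    entry.total += experiment.score;
    entry.count += 1;
    sums[key] = entry;
  }
  const averages: Record<string, number> = {};
  for (const key of Object.keys(sums)) {
    averages[key] = sums[key].total / sums[key].count;
  }
  return averages;
}

// Placeholder values for illustration only; these are NOT results from this tutorial.
const placeholder: ExperimentSummary[] = [
  { model: "model-a", temperature: 0, score: 1 },
  { model: "model-a", temperature: 1, score: 0.5 },
  { model: "model-b", temperature: 0, score: 0.5 },
  { model: "model-b", temperature: 1, score: 0.5 },
];

// Group by model: model-a averages 0.75 and model-b averages 0.5.
console.log(averageScoreBy(placeholder, (experiment) => experiment.model));
```

Passing a different key function, for example one that combines model and temperature, gives the finer-grained comparison that the grouped chart shows visually.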


With just a few lines of code and a few clicks we were able to find meaningful insights into how different models perform across our data!

## Parting thoughts

This is just the start of evaluating and improving your AI applications. From here, you should run more experiments with larger datasets and try out different prompts. Once you have run another set of experiments, come back to the chart and play with the different views and groupings. You can also add filters to find experiments with specific scores and metadata and uncover even more insights.

Happy evaluating!