# Mistral benchmark

Before starting, make sure that you have a Braintrust account. If you do not, [sign up](https://braintrust.dev). Then, within the account,
add your OpenAI and Mistral API keys to the [AI secrets](https://www.braintrust.dev/app/settings?subroute=secrets) configuration.
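
The evaluation code in this notebook reads the Braintrust API key from the `BRAINTRUST_API_KEY` environment variable. The following is a minimal sketch for checking that it is set before running anything. The helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# BRAINTRUST_API_KEY is the variable the eval cell reads via os.environ.
missing = missing_env_vars(["BRAINTRUST_API_KEY"])
if missing:
    print(f"Set these environment variables before running the evals: {missing}")
```

Checking up front gives a readable message instead of a `KeyError` in the middle of an experiment.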


## Setting up the environment

The next few commands install the required libraries and set up the code used in this benchmark. Feel free to copy, tweak, and reuse this code in your own tools.


In [2]:
%pip install -U autoevals braintrust openai --quiet

Note: you may need to restart the kernel to use updated packages.


### Loading the data

We'll use the CoQA dataset created for the [LLaMa 3.1 Tools](https://www.braintrust.dev/docs/cookbook/recipes/LLaMa-3_1-Tools) guide.


In [3]:
import json

data_path = "../LLaMa-3_1-Tools/coqa-factuality.json"
# Use a context manager so the file handle is closed after loading.
with open(data_path) as f:
    data = json.load(f)

print(data[0])

{'input': {'input': 'What color was Cotton?', 'output': 'white', 'expected': 'white'}, 'expected': 1, 'metadata': {'source': 'mctest', 'story': 'Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer\'s horses slept. But Cotton wasn\'t alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton\'s mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer\'s orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n\n"What are you doing, Cotton?!" \n\n"I only wanted to be more like you
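
Each record pairs a nested `input` dictionary (the question, a candidate `output`, and the reference `expected` answer) with a top-level numeric `expected` label and some `metadata`. As a rough sketch, assuming every record has the same shape as the one printed above, the label distribution could be inspected like this. The helper `label_counts` and the shortened `sample` record are illustrative stand-ins, not part of the dataset tooling.

```python
from collections import Counter


def label_counts(records):
    """Count how often each top-level `expected` label appears."""
    return Counter(record["expected"] for record in records)


# A shortened stand-in record with the same shape as the printed example.
sample = [
    {
        "input": {
            "input": "What color was Cotton?",
            "output": "white",
            "expected": "white",
        },
        "expected": 1,
        "metadata": {"source": "mctest"},
    }
]

print(label_counts(sample))
```

In the notebook, calling `label_counts(data[:20])` would show how the labels are spread across the slice used for the experiments.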

In [7]:
import os
from autoevals import Factuality, NumericDiff
from braintrust import current_span, Eval

models = ["gpt-4o", "gpt-4o-mini", "mistral-large-latest", "open-mistral-nemo"]

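for model in models:
    # Each task call asks `model` to judge one record with the Factuality scorer.
    async def task(input):
        result = await Factuality(
            api_key=os.environ["BRAINTRUST_API_KEY"],
            model=model,
        ).eval_async(**input)
        # Log the full scorer result on the current span.
        current_span().log(output=result)
        return result.score

    # One experiment per model; NumericDiff compares the returned score with `expected`.
    await Eval(
        "coqa-factuality",
        data=data[:20],
        task=task,
        scores=[NumericDiff()],
        trial_count=1,
        experiment_name=model,
        metadata={"model": model},
    )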

Experiment gpt-4o is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o
coqa-factuality [experiment_name=gpt-4o] (data): 20it [00:00, 87655.26it/s]
coqa-factuality [experiment_name=gpt-4o] (tasks): 100%|██████████| 20/20 [00:00<00:00, 22.72it/s]



85.00% 'NumericDiff' score

See results for gpt-4o at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o


Experiment gpt-4o-mini is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o-mini
coqa-factuality [experiment_name=gpt-4o-mini] (data): 20it [00:00, 111848.11it/s]
coqa-factuality [experiment_name=gpt-4o-mini] (tasks): 100%|██████████| 20/20 [00:00<00:00, 27.50it/s]



gpt-4o-mini compared to gpt-4o:
85.00% (-) 'NumericDiff' score	(0 improvements, 0 regressions)

See results for gpt-4o-mini at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o-mini


Experiment mistral-large-latest is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/mistral-large-latest
coqa-factuality [experiment_name=mistral-large-latest] (data): 20it [00:00, 78179.01it/s]
coqa-factuality [experiment_name=mistral-large-latest] (tasks): 100%|██████████| 20/20 [00:00<00:00, 31.59it/s]



mistral-large-latest compared to gpt-4o-mini:
85.00% (-) 'NumericDiff' score	(0 improvements, 0 regressions)

See results for mistral-large-latest at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/mistral-large-latest


Experiment open-mistral-nemo is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/open-mistral-nemo
coqa-factuality [experiment_name=open-mistral-nemo] (data): 20it [00:00, 62695.13it/s]
coqa-factuality [experiment_name=open-mistral-nemo] (tasks): 100%|██████████| 20/20 [00:00<00:00, 27.29it/s]



open-mistral-nemo compared to mistral-large-latest:
85.00% (-) 'NumericDiff' score	(0 improvements, 0 regressions)

See results for open-mistral-nemo at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/open-mistral-nemo
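
The experiments above score each task output (the factuality score returned by the model) against the dataset's numeric label using `NumericDiff`. To build intuition for what such a score means, here is a sketch of one common normalized numeric difference. The function `normalized_numeric_diff` is a hypothetical illustration, and the formula is an assumption that may differ from what `autoevals` actually computes.

```python
def normalized_numeric_diff(output, expected):
    """Return 1.0 for identical values, falling toward 0.0 as they diverge.

    Assumed formula (illustration only):
        1 - |expected - output| / (|expected| + |output|)
    """
    if output == 0 and expected == 0:
        # Both values are zero: treat as a perfect match and avoid dividing by zero.
        return 1.0
    return 1.0 - abs(expected - output) / (abs(expected) + abs(output))


# A task output that matches the label exactly, one that misses entirely,
# and one that is partially off.
for output, expected in [(1, 1), (0, 1), (0.6, 1)]:
    print(output, expected, normalized_numeric_diff(output, expected))
```

Under this assumed formula, an experiment's aggregate score is the mean of these per-record values, so a model only gets full credit on a record when its factuality score equals the label.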
