# Using Braintrust to evaluate the precision and recall of an emotion classifier
In this cookbook, we'll learn how to evaluate a custom LLM classifier in Braintrust using custom scoring functions that measure precision and recall. We'll use the [go_emotions dataset](https://huggingface.co/datasets/google-research-datasets/go_emotions), which contains Reddit comments labeled with 28 different emotions. What makes this dataset particularly interesting is that each comment can be labeled with multiple emotions. For example, a single message might express both "excitement" and "anger".

We'll build two classifiers: a random baseline and an LLM-based approach using OpenAI's GPT-4o. By comparing their performance using custom scorers, we'll demonstrate how to measure, and then improve, your LLM's accuracy on complex classification tasks.

## Getting started

You will need a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/). We'll use the [AI proxy](https://www.braintrust.dev/docs/guides/proxy) to access GPT-4o, which lets us quickly change the model or provider without writing model-specific code. Make sure to plug the OpenAI key into your Braintrust account's [AI providers](https://www.braintrust.dev/app/settings?subroute=secrets) configuration and acquire a [BRAINTRUST_API_KEY](https://www.braintrust.dev/app/settings?subroute=api-keys). Lastly, add your `BRAINTRUST_API_KEY` to your Python environment, or just hardcode it into the code below.
   

   


![aiproviders.png](assets/aiproviders.png)



## Setting up the environment

Let's start by installing the required Python packages:

In [29]:
pip install braintrust openai datasets autoevals


In [30]:
import json
import os
import random
from pydantic import BaseModel, Field
from typing import List, Literal, Union

import openai
import braintrust
from datasets import load_dataset
from autoevals import *

Next, configure your Braintrust client:

If you haven't already, export your API key as an environment variable:
```bash
export BRAINTRUST_API_KEY='YOUR_BRAINTRUST_API_KEY'
```

<Callout type="info">
 Best practice is to export your API key as an environment variable. However, to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
</Callout>


In [31]:

# Uses the exported BRAINTRUST_API_KEY if present; otherwise falls back to this placeholder.
os.environ.setdefault("BRAINTRUST_API_KEY", "YOUR_BRAINTRUST_API_KEY")

openai_client = braintrust.wrap_openai(
    openai.OpenAI(
        base_url="https://api.braintrust.dev/v1/proxy",
        api_key=os.environ["BRAINTRUST_API_KEY"],  # Or hardcode your API key here
    )
)



## Type definitions and data models

We'll need to set up our core types and data structures. To do this, we use Python's `Literal` type to define a strict set of allowed emotions, which gives us type safety and IDE support. The `EmotionType` definition lets static type checkers flag invalid emotions before the code runs and lets Pydantic reject them during validation, while the `EMOTIONS` list lets us work with these values at runtime.

Our Pydantic model, `EmotionClassification`, defines the structure for emotion analysis results. For each piece of text, it captures:

   * The detected emotions (must be from our predefined set)
   * A confidence score between 0 and 1
   * A rationale explaining why these emotions were detected

The `load_data` function processes our raw dataset and yields one sample at a time, so we never build a second copy of the examples in memory. Each sample carries the comment text, its labeled emotions, and metadata (subreddit and author), which makes it convenient for iteration and evaluation.

In [32]:
EmotionType = Literal[
    "admiration", "amusement", "anger", "annoyance", "approval",
    "caring", "confusion", "curiosity", "desire", "disappointment",
    "disapproval", "disgust", "embarrassment", "excitement", "fear",
    "gratitude", "grief", "joy", "love", "nervousness", "optimism",
    "pride", "realization", "relief", "remorse", "sadness", "surprise",
    "neutral"
]

# Runtime list of the same labels, in the same order as EmotionType
EMOTIONS: List[str] = [
    "admiration", "amusement", "anger", "annoyance", "approval",
    "caring", "confusion", "curiosity", "desire", "disappointment",
    "disapproval", "disgust", "embarrassment", "excitement", "fear",
    "gratitude", "grief", "joy", "love", "nervousness", "optimism",
    "pride", "realization", "relief", "remorse", "sadness", "surprise",
    "neutral"
]

class EmotionClassification(BaseModel):
    emotions: List[EmotionType]
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)
    rationale: str = ""


def load_data(limit: int = 100):
    ds = load_dataset("google-research-datasets/go_emotions", "raw")
    for i, item in enumerate(ds["train"]):
        if i >= limit:
            break

        actual_emotions = [emotion for emotion in EMOTIONS if item.get(emotion, 0) == 1]

        yield {
            "input": item["text"],
            "expected": actual_emotions,
            "metadata": {"subreddit": item["subreddit"], "author": item["author"]},
        }
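
The two ideas above can be tried without downloading the dataset. The following is a minimal sketch, not the cookbook's own code: `DemoEmotion`, `DEMO_EMOTIONS`, `labels_from_row`, and `fake_row` are hypothetical names, and the label set is shortened for illustration. It shows one way to derive the runtime list from the `Literal` with `typing.get_args` so the two cannot drift apart, and how a row with one 0/1 column per label turns into a list of labels.

```python
from typing import Dict, List, Literal, get_args

# Shortened, hypothetical label set; the cookbook uses all 28 labels.
DemoEmotion = Literal["joy", "anger", "sadness", "neutral"]

# Derive the runtime list from the Literal instead of typing it out twice.
DEMO_EMOTIONS: List[str] = list(get_args(DemoEmotion))


def labels_from_row(row: Dict[str, object], labels: List[str]) -> List[str]:
    """Return the labels whose one-hot column equals 1 in a dataset row."""
    return [label for label in labels if row.get(label, 0) == 1]


# Made-up row with one 0/1 column per label, as load_data assumes.
fake_row = {
    "text": "I can't believe this worked!",
    "joy": 1,
    "anger": 0,
    "sadness": 0,
    "neutral": 0,
}

print(DEMO_EMOTIONS)
print(labels_from_row(fake_row, DEMO_EMOTIONS))
```

Deriving the list this way is a design choice rather than a requirement. Keeping two hand-written copies, as the cookbook does, works as long as they are edited together.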

## Creating the classifiers

We implement two different approaches to emotion classification:

1. An LLM-based classifier using GPT-4o that:
   * Uses structured prompts to ensure valid emotion labels
   * Returns standardized JSON with emotions, confidence, and rationale
   * Includes robust validation and error handling

2. A random baseline classifier that assigns 1-3 emotions randomly from our predefined list. This helps establish a minimum performance threshold and validates our evaluation metrics.

In [33]:
def llm_classifier(text: str) -> EmotionClassification:
    prompt = (
        f"Analyze the emotional content in this text. Classify using ONLY the following emotion labels:\n"
        f"{', '.join(EMOTIONS)}\n\n"
        f"Text: {text}\n\n"
        f"Respond with a JSON object containing:\n"
        f"- emotions: array of emotions (use only the exact labels listed above)\n"
        f"- confidence: number from 0 to 1\n"
        f"- rationale: brief explanation"
    )

    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        result = response.choices[0].message.content
        if not result:
            raise ValueError("Empty response from OpenAI")
            
        # First validate that we got valid emotions
        data = json.loads(result)
        emotions = data.get('emotions', [])
        
        # Validate each emotion is in our allowed set
        validated_emotions = [e for e in emotions if e in EMOTIONS]
        
        # Create new validated json
        validated_data = {
            "emotions": validated_emotions,
            "confidence": min(max(float(data.get('confidence', 1.0)), 0), 1),
            "rationale": str(data.get('rationale', ""))
        }
        
        return EmotionClassification.model_validate(validated_data)

    except Exception as e:
        print(f"[llm_classifier] Exception: {e}")
        return EmotionClassification(
            emotions=[],
            confidence=0.0,
            rationale="Error occurred during classification"
        )

def random_classifier(text: str) -> EmotionClassification:
    num_emotions = random.randint(1, 3)
    selected_emotions = random.sample(EMOTIONS, num_emotions)
    return EmotionClassification(
        emotions=selected_emotions,
        confidence=random.random(),
        rationale="Random selection"
    )
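
To see what the validation step in `llm_classifier` does without calling the API, here is a small offline sketch. It assumes a shortened label set, and `ALLOWED`, `clean_response`, and the sample string `raw` are hypothetical names made up for illustration. It mirrors the same two guards: unknown labels are dropped, and the confidence is clamped to the range 0 to 1.

```python
import json
from typing import Any, Dict, List

# Shortened, hypothetical label set for illustration.
ALLOWED: List[str] = ["joy", "anger", "sadness", "neutral"]


def clean_response(raw: str, allowed: List[str]) -> Dict[str, Any]:
    """Parse a JSON reply, keep only allowed labels, and clamp the confidence."""
    data = json.loads(raw)
    emotions = [e for e in data.get("emotions", []) if e in allowed]
    confidence = min(max(float(data.get("confidence", 1.0)), 0.0), 1.0)
    return {
        "emotions": emotions,
        "confidence": confidence,
        "rationale": str(data.get("rationale", "")),
    }


# Made-up model reply with an out-of-vocabulary label and an out-of-range score.
raw = '{"emotions": ["joy", "happiness"], "confidence": 1.7, "rationale": "Upbeat tone"}'
cleaned = clean_response(raw, ALLOWED)
print(cleaned)
```

Silently dropping unknown labels keeps the evaluation running, but it can hide prompt problems. Logging the dropped labels is one possible refinement.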

## Implementing evaluation metrics 

For multi-label classification tasks like emotion detection, we need specialized metrics that can handle multiple correct answers. We implement three key metrics:

`Precision`: Measures how many of the identified emotions are actually correct. For example, if our system predicts "joy" and "anger" for a comment that only expresses "joy", we achieve a precision of 0.5 because only "joy" is correct.
   - High precision means we rarely predict emotions that aren't present
   - Important for applications where false positives are costly

`Recall`: Measures whether we captured all the emotions that were truly present in the Reddit comments. For instance, if a comment expresses both "sadness" and "fear" but we only identify "sadness," we get a recall of 0.5. This is fundamentally different from precision because it tells us if we missed emotions that should have been identified.
   - High recall means we rarely miss emotions that are present
   - Important for applications where false negatives are costly

`F1 Score`: The F1 Score balances precision and recall by calculating their harmonic mean, which is necessary because optimizing for one often comes at the expense of the other. If our LLM achieves high precision, it might be too conservative and miss emotions (low recall). Conversely, if it achieves high recall, it might overidentify emotions (low precision). The F1 score gives us a single metric to evaluate our LLM's overall performance in identifying emotions in Reddit comments.
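
To make the definitions concrete, the sketch below recomputes the two examples from the text with plain Python sets. The helper name `precision_recall_f1` is hypothetical and is not part of Braintrust or autoevals.

```python
from typing import Iterable, Tuple


def precision_recall_f1(
    expected: Iterable[str], predicted: Iterable[str]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one multi-label example."""
    expected_set, predicted_set = set(expected), set(predicted)
    true_positives = len(expected_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Precision example: predicted "joy" and "anger", only "joy" is correct.
print([round(x, 3) for x in precision_recall_f1(["joy"], ["joy", "anger"])])
# [0.5, 1.0, 0.667]

# Recall example: "sadness" and "fear" are present, only "sadness" is found.
print([round(x, 3) for x in precision_recall_f1(["sadness", "fear"], ["sadness"])])
# [1.0, 0.5, 0.667]
```

Both examples end up with the same F1 score, which illustrates that F1 penalizes an extra label and a missed label symmetrically.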


In [34]:
def emotion_precision(_: str, expected: List[EmotionType], output: EmotionClassification) -> float:
    expected_set = set(expected)
    output_set = set(output.emotions)
    true_positives = len(expected_set & output_set)
    false_positives = len(output_set - expected_set)
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator > 0 else 0.0

def emotion_recall(_: str, expected: List[EmotionType], output: EmotionClassification) -> float:
    expected_set = set(expected)
    output_set = set(output.emotions)
    true_positives = len(expected_set & output_set)
    false_negatives = len(expected_set - output_set)
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator > 0 else 0.0

def emotion_f1(_: str, expected: List[EmotionType], output: EmotionClassification) -> float:
    prec = emotion_precision("", expected, output)
    rec = emotion_recall("", expected, output)
    return 2 * (prec * rec) / (prec + rec) if (prec + rec) > 0 else 0.0

## Running evaluations

Finally, let's set up our evaluation pipeline using Braintrust:

In [35]:
def run_evaluations(num_samples: int = 100):
    # Random classifier experiment
    braintrust.Eval(
        "emotion-classification-cookbook-pydantic",  # The name of your project in Braintrust
        data=lambda: load_data(limit=num_samples),
        task=random_classifier,
        scores=[emotion_precision, emotion_recall, emotion_f1],
        metadata={"classifier_type": "random"},
        experiment_name="random-classifier",
    )

    # LLM classifier experiment
    braintrust.Eval(
        "emotion-classification-cookbook-pydantic",
        data=lambda: load_data(limit=num_samples),
        task=llm_classifier,
        scores=[emotion_precision, emotion_recall, emotion_f1],
        metadata={"classifier_type": "llm", "model": "gpt-4o"},
        experiment_name="llm-classifier",
    )


if __name__ == "__main__":
    run_evaluations(num_samples=100) # Adjust the number of samples as needed


random-classifier-0518e57d compared to llm-classifier-01a6f55e:
06.33% (-23.97%) 'emotion_f1'        score	(8 improvements, 41 regressions)
05.50% (-20.83%) 'emotion_precision' score	(8 improvements, 41 regressions)
08.33% (-31.17%) 'emotion_recall'    score	(7 improvements, 40 regressions)

1736828297.81s start
1736828303.81s end
0.99s (-1537.40%) 'duration'	(100 improvements, 0 regressions)

See results for random-classifier-0518e57d at https://www.braintrust.dev/app/braintrustdata.com/p/emotion-classification-cookbook-pydantic/experiments/random-classifier-0518e57d






llm-classifier-23de7340 compared to random-classifier-0518e57d:
26.33% (+20.83%) 'emotion_precision' score	(41 improvements, 8 regressions)
30.30% (+23.97%) 'emotion_f1'        score	(41 improvements, 8 regressions)
39.50% (+31.17%) 'emotion_recall'    score	(40 improvements, 7 regressions)

1736828298.48s start
1736828305.68s end
139.08tok prompt_tokens
88.53tok completion_tokens
227.61tok total_tokens

See results for llm-classifier-23de7340 at https://www.braintrust.dev/app/braintrustdata.com/p/emotion-classification-cookbook-pydantic/experiments/llm-classifier-23de7340


## Analyzing the results

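Once you run the evaluations, you'll see the results in your Braintrust dashboard. The LLM classifier should outperform the random baseline across all metrics. In the run above, it scored 26.33% precision, 39.50% recall, and 30.30% F1, compared with 5.50%, 8.33%, and 6.33% for the random classifier.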

![results.png](assets/results.png)

Key features to examine:
- Compare precision and recall scores between our runs
- Look at specific examples where the LLM fails 
- Analyze cases where multiple emotions are present
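
One way to dig into the failures is to count, per label, how often an emotion was missed and how often it was predicted when absent. The sketch below assumes you have collected `(expected, predicted)` label lists from an experiment. The function name `per_label_errors` and the sample `pairs` are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def per_label_errors(
    pairs: Iterable[Tuple[List[str], List[str]]]
) -> Tuple[Counter, Counter]:
    """Count missed labels (false negatives) and spurious labels (false positives)."""
    missed: Counter = Counter()
    spurious: Counter = Counter()
    for expected, predicted in pairs:
        expected_set, predicted_set = set(expected), set(predicted)
        missed.update(expected_set - predicted_set)
        spurious.update(predicted_set - expected_set)
    return missed, spurious


# Hypothetical (expected, predicted) pairs, not real experiment output.
pairs = [
    (["joy", "surprise"], ["joy"]),
    (["anger"], ["anger", "annoyance"]),
    (["surprise"], ["neutral"]),
]

missed, spurious = per_label_errors(pairs)
print("Most often missed:", missed.most_common(3))
print("Most often spurious:", spurious.most_common(3))
```

Labels that dominate the missed counter hurt recall, while labels that dominate the spurious counter hurt precision.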




## Where to go from here

There are several ways to improve this emotion classifier:
- Experiment with different prompts and instructions, or even a series of prompts.
- Try other models like Grok 2 or OpenAI's o1.
- Add more sophisticated scoring functions.
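
As one possible starting point for additional scoring functions, the sketch below shows two common multi-label metrics: Jaccard similarity and exact match. The names `emotion_jaccard` and `emotion_exact_match` are hypothetical. They take plain label lists, so they would need to be adapted to the same argument shape as the scorers defined earlier before being passed to `braintrust.Eval`.

```python
from typing import Iterable


def emotion_jaccard(expected: Iterable[str], predicted: Iterable[str]) -> float:
    """Size of the label intersection divided by the size of the union."""
    expected_set, predicted_set = set(expected), set(predicted)
    union = expected_set | predicted_set
    # Assumption: two empty label sets count as full agreement.
    return len(expected_set & predicted_set) / len(union) if union else 1.0


def emotion_exact_match(expected: Iterable[str], predicted: Iterable[str]) -> float:
    """1.0 only when the predicted label set equals the expected one."""
    return 1.0 if set(expected) == set(predicted) else 0.0


print(emotion_jaccard(["joy", "anger"], ["joy"]))      # 0.5
print(emotion_exact_match(["joy", "anger"], ["joy"]))  # 0.0
print(emotion_exact_match(["joy"], ["joy"]))           # 1.0
```

Exact match is much stricter than F1, so it is most informative once the classifier already gets most labels right.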