# BrainTrust Classification Tutorial (Article Titles)

<a target="_blank" href="https://colab.research.google.com/github/braintrustdata/braintrust-examples/blob/main/classify/BrainTrust-Classify-Tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Welcome to [BrainTrust](https://www.braintrustdata.com/)! This tutorial will teach you the basics of working with BrainTrust to evaluate a text classification use case, including creating a project, running experiments, and analyzing their results.

Before starting, please make sure that you have a BrainTrust account. If you do not, please [sign up](https://www.braintrustdata.com) or [get in touch](mailto:info@braintrustdata.com). After this tutorial, feel free to dig deeper by visiting [the docs](http://www.braintrustdata.com/docs).

In [None]:
# NOTE: Replace YOUR_OPENAI_KEY with your OpenAI API key and YOUR_BRAINTRUST_API_KEY with your BrainTrust API key. Do not put either key in quotes.
%env OPENAI_API_KEY=YOUR_OPENAI_KEY
%env BRAINTRUST_API_KEY=YOUR_BRAINTRUST_API_KEY

We'll start by installing some dependencies, setting up the environment, and downloading the dataset.

In [None]:
%pip install braintrust guidance openai datasets

In [None]:
import asyncio
import braintrust
import guidance
import time

from datasets import load_dataset

MODEL = "gpt-3.5-turbo"
guidance.llm = guidance.llms.OpenAI(MODEL)

# Load the dataset from Hugging Face.
dataset = load_dataset("ag_news", split="train")

# Shuffle and trim to 20 datapoints. Restructure our dataset
# slightly so that each item in the list contains the title
# itself ("text") and the expected category index ("label").
trimmed_dataset = dataset.shuffle()[:20]
articles = [{
    "text": trimmed_dataset["text"][i],
    "label": trimmed_dataset["label"][i],
    } for i in range(len(trimmed_dataset["text"]))]

# Extract category names from the dataset and build a map from index to
# category name. We will use this to compare the expected categories to
# those produced by the model.
category_names = dataset.features['label'].names
category_map = dict(enumerate(category_names))
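The index-to-name lookup built above can be tried without downloading anything. This is a minimal sketch that assumes the four category names in the order this tutorial lists them; in the notebook they come from `dataset.features['label'].names`, and the `example` record here is hypothetical.

```python
# Category names as listed in this tutorial. In the notebook they are read
# from dataset.features["label"].names rather than hard-coded.
category_names = ["World", "Sports", "Business", "Sci/Tech"]

# enumerate() yields (index, name) pairs, which dict() turns into a lookup table.
category_map = dict(enumerate(category_names))

# Hypothetical record in the same shape as the items in `articles`.
example = {"text": "Example headline about a chip maker", "label": 3}

print(category_map)
print(category_map[example["label"]])
```

The dataset stores labels as integers, so this map is what lets the model's text answer be compared against the expected category name.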

## Writing the initial prompts

Let's analyze the first example and build up a prompt for categorizing a title. We're using a library called [Guidance](https://github.com/microsoft/guidance), which makes it easy to template prompts and cache results. With BrainTrust, you can use any library you'd like -- Guidance, LangChain, or even just direct calls to an LLM.

The prompt provides the article's title to the model, and asks it to generate a category.

In [None]:
one_article = articles[0]
print(one_article["text"])
print(one_article["label"])

In [None]:
classify_article = guidance('''
{{#system~}}
You are an editor in a newspaper who helps writers identify the right category for their news articles,
by reading the article's title. The category should be one of the following: World, Sports, Business
or Sci-Tech. Reply with one word corresponding to the category.
{{~/system}}

{{#user~}}
Article title: {{article_title}}
{{~/user}}

{{#assistant~}}
{{gen 'category' max_tokens=500}}
{{~/assistant}}''')

out = classify_article(article_title=one_article["text"])

## Running across the dataset

Now that we have automated the process of classifying titles, we can test the full set of articles. This block uses Python's async features to generate and grade in parallel, effectively making your OpenAI account's rate limit the limiting factor.

As it runs, it compares the generated category to the expected one from the dataset. Once this loop completes, you can view the results in BrainTrust.

In [None]:
async def evaluate_article(article):
  full_output = classify_article(article_title=article["text"])
  category = full_output["category"].strip().lower()
  expected = category_map[article["label"]].strip().lower()
  return (article, full_output, expected, category)


async def run_on_all_articles():
    start = time.time()
    tasks = [asyncio.create_task(evaluate_article(article)) for article in articles]
    category_grades = [await t for t in tasks]
    end = time.time()
    print("Took", end - start, "seconds")
    return category_grades

valid_categories = set(x.strip().lower() for x in category_map.values())
def analyze_experiment(data):
  # Log one row per article to the global `experiment`, then print a summary.
  for (article, full_output, expected, category) in data:
    experiment.log(
        inputs={"title": article["text"]},
        output=category,
        expected=expected,
        scores={
            "match": 1 if category == expected else 0,
            "valid": 1 if category in valid_categories else 0,
        },
        metadata={
           "prompt": str(full_output)
        }
    )

  print(experiment.summarize())

# This line assumes there is an async event loop initialized but that the
# notebook is not running inside of an async task. This is true on Google Colab,
# but other notebook environments may vary. If you see errors while trying this
# code in a different tool, try changing this to `await run_on_all_articles()`, or
# initializing an event loop above with `asyncio.get_event_loop()`.
categories = asyncio.run(run_on_all_articles())
experiment = braintrust.init(
  project="classify-article-titles",
  experiment="original-prompt"
)
analyze_experiment(categories)
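One caveat about the parallelism described above: tasks created with `asyncio.create_task` only overlap if the work inside them yields to the event loop. If the underlying model call is blocking, the calls run one after another. The sketch below shows one possible way to overlap blocking calls and cap concurrency. It is not the tutorial's method: `fake_classify` is a hypothetical stand-in for the model call, and `max_concurrency` is an assumed parameter.

```python
import asyncio
import time


def fake_classify(title):
    # Hypothetical stand-in for a blocking model call; sleeps instead of
    # contacting an API.
    time.sleep(0.05)
    return "sports" if "cup" in title.lower() else "world"


async def evaluate(title, limiter):
    # The semaphore caps how many calls are in flight at once.
    async with limiter:
        # to_thread (Python 3.9+) runs the blocking call in a worker thread
        # so the event loop can start other calls in the meantime.
        category = await asyncio.to_thread(fake_classify, title)
    return title, category


async def run_all(titles, max_concurrency=4):
    limiter = asyncio.Semaphore(max_concurrency)
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(evaluate(t, limiter) for t in titles))


titles = ["World Cup final tonight", "Election results announced"]
results = asyncio.run(run_all(titles))
print(results)
```

In a notebook that already has a running event loop, `await run_all(titles)` would replace the `asyncio.run` call. Lowering `max_concurrency` is one way to stay under a provider's rate limit.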

## Pause and analyze the results in BrainTrust!

The cell above will print a link to the BrainTrust experiment. Go check it out (NOTE: it may take up to a minute to synchronize the data for viewing).

## Reproducing an example

Now, let's pull up an article from the "Sci/Tech" category and reproduce the evaluation:

In [None]:
sci_tech_category_index = category_names.index("Sci/Tech")
sci_tech_article = [a for a in articles if a["label"] == sci_tech_category_index][0]
print(sci_tech_article["text"])
print(sci_tech_article["label"])

out = classify_article(article_title=sci_tech_article["text"])
print(out)

### Fixing the prompt

Have you spotted the issue yet? It looks like we misspelled one of the categories in our prompt. The dataset's categories are "World, Sports, Business and *Sci/Tech*" - but we are using *Sci-Tech* in our prompt. Let's fix it:

In [None]:
classify_article = guidance('''
{{#system~}}
You are an editor in a newspaper who helps writers identify the right category for their news articles,
by reading the article's title. The category should be one of the following: World, Sports, Business
or Sci/Tech. Reply with one word corresponding to the category.
{{~/system}}

{{#user~}}
Article title: {{article_title}}
{{~/user}}

{{#assistant~}}
{{gen 'category' max_tokens=500}}
{{~/assistant}}''')

out = classify_article(article_title=sci_tech_article["text"])
print(out)
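To see why the original spelling hurt the scores, the grading rule from `analyze_experiment` can be replayed offline. This is a minimal sketch: `score` is a hypothetical helper, not part of BrainTrust, and the two model replies are assumed for illustration rather than taken from a real run.

```python
# Category names as listed in this tutorial.
category_names = ["World", "Sports", "Business", "Sci/Tech"]
valid_categories = {name.strip().lower() for name in category_names}


def score(output, expected):
    # Same normalization as evaluate_article: trim whitespace, then lowercase.
    category = output.strip().lower()
    target = expected.strip().lower()
    return {
        "match": 1 if category == target else 0,
        "valid": 1 if category in valid_categories else 0,
    }


# Assumed model replies, chosen to illustrate the two spellings.
old_prompt_scores = score("Sci-Tech", "Sci/Tech")
new_prompt_scores = score("Sci/Tech", "Sci/Tech")
print(old_prompt_scores)
print(new_prompt_scores)
```

Because the comparison is an exact string match after normalization, a reply of "Sci-Tech" fails both checks even though a human reader would accept it.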

### Assessing the change

This time around, the model generated the expected category. But how do we know how the change affects the overall dataset? Let's run the new prompt on our full set of titles and take a look at the experiment.

In [None]:
categories = asyncio.run(run_on_all_articles())
experiment = braintrust.init(
  project="classify-article-titles",
  experiment="fixed-categories"
)
analyze_experiment(categories)

## Summary

Click into the new experiment, and check it out! You should notice a few things:

* BrainTrust will automatically compare the new experiment to your previous one.
* You should see an increase in scores, and can click around to look at exactly which values improved.
* You can also filter down to the examples that still have a `match` score of 0, and further iterate on the prompt to try to improve them.
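The same kind of filtering can be approximated locally before opening the UI. The sketch below uses invented toy records, not real experiment output, and the helper `match_score` is hypothetical; it only mirrors the `match` rule used earlier in this tutorial.

```python
from statistics import mean

# Toy records invented for illustration; real values come from your experiment.
records = [
    {"title": "headline A", "output": "sports", "expected": "sports"},
    {"title": "headline B", "output": "sci-tech", "expected": "sci/tech"},
    {"title": "headline C", "output": "world", "expected": "business"},
]


def match_score(record):
    # 1 when the model's category equals the expected one, else 0.
    return 1 if record["output"] == record["expected"] else 0


# Average match score across all records.
accuracy = mean(match_score(r) for r in records)

# Titles of the examples that still need prompt work.
failures = [r["title"] for r in records if match_score(r) == 0]

print(f"accuracy: {accuracy:.2f}")
print(failures)
```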

Last but not least... This tutorial is designed to teach you about BrainTrust, but in practice, you probably will have to iterate with more than just one prompt change to improve results.

Happy evaling!