# BrainTrust Summarization Tutorial (GitHub Issues)

<a target="_blank" href="https://colab.research.google.com/github/braintrustdata/braintrust-examples/blob/main/github-issues/BrainTrust-GH-Summarization-Tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Welcome to [BrainTrust](https://www.braintrustdata.com/)! This tutorial will teach you the basics of working with BrainTrust to evaluate a text summarization use case, including creating a project, running experiments, and analyzing their results.

In this notebook, we'll build an application that suggests better titles for issues in a GitHub repo, using their page content. We'll use a technique called **model graded evaluation** to automatically evaluate these titles against the original titles, and improve our prompt based on what we find.

Before starting, please make sure that you have a BrainTrust account. If you do not, please [sign up](https://www.braintrustdata.com) or [get in touch](mailto:info@braintrustdata.com). After this tutorial, feel free to dig deeper by visiting [the docs](http://www.braintrustdata.com/docs).

In [None]:
# NOTE: Replace YOUR_OPENAI_API_KEY with your OpenAI API key and YOUR_BRAINTRUST_API_KEY with your BrainTrust API key. Do not put them in quotes.
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY
%env BRAINTRUST_API_KEY=sk-YOUR_BRAINTRUST_API_KEY

By default, this notebook uses the [Supabase](https://github.com/supabase/supabase) repository. You can tweak these `GITHUB_` environment variables below if you'd like to use a different one (including your own!).

In [None]:
# Optional: feel free to tweak this to a repo of your choice! If you set it to
# a new repo, then make sure to set the GITHUB_PERSONAL_ACCESS_TOKEN
%env GITHUB_REPO=supabase/supabase
%env GITHUB_PERSONAL_ACCESS_TOKEN=

We'll start by installing some dependencies, setting up the environment, and downloading the GitHub issues.

In [None]:
%pip install autoevals braintrust guidance langchain requests openai tiktoken

In [None]:
import guidance
import gzip
import json
import os
from pathlib import Path
import requests
import tiktoken


REPO = os.environ["GITHUB_REPO"]

MODEL = "gpt-3.5-turbo"
guidance.llm = guidance.llms.OpenAI(MODEL)


CACHE_PATH = Path("cache")
os.makedirs(CACHE_PATH, exist_ok=True)
repo_fname = REPO.replace("/", "-") + ".json"
repo_cache = CACHE_PATH / repo_fname
repo_url = f"https://braintrust-public.s3.amazonaws.com/{repo_fname}.gz"
if repo_cache.exists():
    with open(repo_cache, "r") as f:
        issues = [json.loads(l) for l in f]
elif requests.head(repo_url).ok:
    with open(repo_cache, "wb") as f:
        f.write(gzip.decompress(requests.get(repo_url).content))
    with open(repo_cache, "r") as f:
        issues = [json.loads(l) for l in f]
elif os.environ.get("GITHUB_PERSONAL_ACCESS_TOKEN"):  # an empty value counts as unset
    # Use langchain to load the issues if they do not already exist
    # NOTE: This loader appears to get the issue, and its summary, but not the comments.
    from langchain.document_loaders import GitHubIssuesLoader
    loader = GitHubIssuesLoader(repo=REPO)
    issues = [d.__dict__ for d in loader.load()]
    with open(repo_cache, "w") as f:
        for d in issues:
            print(json.dumps(d), file=f)
else:
    raise Exception("Please set GITHUB_PERSONAL_ACCESS_TOKEN to explore a new repo")

print(f"Loaded {len(issues)} issues from {REPO}")

N_ISSUES = 20
issues.sort(key=lambda d: d["metadata"]["created_at"])

# Remove PRs. Technically you can do this in the "loader", but we might as well download them so we can play with them later
issues = [d for d in issues if not d["metadata"]["is_pull_request"]]
issues = issues[:N_ISSUES]

# This notebook currently does not split/chunk/etc. the documents, so we need to make sure they are not too long
tokenizer = tiktoken.encoding_for_model(MODEL)
max_tokens = max(len(tokenizer.encode(d["page_content"])) for d in issues)
assert max_tokens < 3000, max_tokens

## Writing the initial prompts

Let's analyze the first example, and build up a prompt for generating a new title. We're using a library called [Guidance](https://github.com/microsoft/guidance), which makes it easy to template prompts and cache results. With BrainTrust, you can use any library you'd like -- Guidance, LangChain, or even just direct calls to an LLM.

The prompt provides the issue's description to the model, and asks it to generate a title. Note that it does _not_ provide the original title.

In [None]:
issue = issues[0]

print(issue["metadata"]["url"], "\n")
print(issue["metadata"]["title"])
print("-"*len(issue["metadata"]["title"]))
print(issue["page_content"])

In [None]:
create_title = guidance('''
{{#system~}}
You are a technical project manager who helps software engineers generate better titles for their GitHub issues,
by looking at their issue descriptions. The titles should be clear and concise one-line statements.
{{~/system}}

{{#user~}}
GitHub issue: {{input}}
{{~/user}}

{{#assistant~}}
{{gen 'title' max_tokens=500}}
{{~/assistant}}''', async_mode=True)

out = await create_title(input=issue["page_content"])

## Grading the new title

Ok cool! The new title looks pretty good. But how do we consistently and automatically evaluate whether the new titles are better than the old ones?

With subjective problems, like summarization, one great technique is to use an LLM to grade the outputs. This is known as model graded evaluation. Below, we'll use a [summarization prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/summary.yaml)
from Braintrust's open source [autoevals](https://github.com/braintrustdata/autoevals) library. We encourage you to use these prompts, but also to copy/paste them, modify them, and create your own!

The prompt uses [Chain of Thought](https://arxiv.org/abs/2201.11903), which can substantially improve a model's performance on grading tasks. Later, we'll see how it helps us debug the model's outputs too.
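
To make the idea concrete, here is a minimal sketch of how a model-graded scorer can work. It is not the `autoevals` implementation: the prompt text, `grade_titles`, `parse_choice`, and the stubbed `fake_llm` are hypothetical names made up for illustration, and a real grader would call an LLM where the stub returns a canned response.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the real autoevals template differs.
GRADER_PROMPT = """You are comparing two titles for the same GitHub issue.

Issue: {input}
Title A: {expected}
Title B: {output}

Think step by step about which title summarizes the issue better,
then finish with a line of the form "Answer: A" or "Answer: B"."""

# Map the grader's choice to a score: 1 if the generated title (B) wins.
CHOICE_SCORES = {"A": 0, "B": 1}


def parse_choice(response: str) -> Optional[str]:
    """Return the last 'Answer: X' choice found in the response, if any."""
    matches = re.findall(r"Answer:\s*([AB])\b", response)
    return matches[-1] if matches else None


def grade_titles(llm: Callable[[str], str], input: str, output: str, expected: str) -> dict:
    """Ask `llm` to compare the two titles; return the score and its reasoning."""
    prompt = GRADER_PROMPT.format(input=input, output=output, expected=expected)
    response = llm(prompt)
    choice = parse_choice(response)
    # The full response doubles as the rationale, which is what makes
    # chain-of-thought grading useful for debugging.
    return {"score": CHOICE_SCORES.get(choice), "rationale": response, "choice": choice}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Title B names the failing component, Title A does not.\nAnswer: B"


result = grade_titles(fake_llm, "Login fails with 500 ...", "Login returns HTTP 500", "bug")
print(result["score"], result["choice"])
```

Because the reasoning is written before the final answer, the rationale can be read later to understand why a given title scored 0.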

In [None]:
from autoevals import Summary
scorer = Summary()

score = scorer(input=issue["page_content"], output=out["title"], expected=issue["metadata"]["title"])
print(score.metadata['rationale'])
print(f"Score={score.score}")

## Running across the dataset

Now that we have automated the process of generating titles and grading them, we can test the full set of `N_ISSUES` issues. This block uses the `Eval` framework in Braintrust to run the evaluation.

Braintrust will automatically use Python's asynchronous runtime to evaluate the examples in parallel to optimize performance. Once the experiment is run, we can view the results in the Braintrust UI.
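
For intuition, parallel evaluation can be pictured roughly as follows. This is a sketch under stated assumptions, not Braintrust's implementation: `run_eval`, `fake_task`, `exact_match`, and the `max_concurrency` parameter are hypothetical, and the task is a stub that sleeps instead of calling a model.

```python
import asyncio


async def fake_task(input: str) -> str:
    """Hypothetical stand-in for an LLM call: waits briefly, then returns a 'title'."""
    await asyncio.sleep(0.01)
    return input.upper()


def exact_match(output: str, expected: str) -> int:
    """Toy scorer: 1 if the output equals the expected value, else 0."""
    return int(output == expected)


async def run_eval(data, task, scorer, max_concurrency=5):
    """Run `task` over every example concurrently and score each output."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(example):
        async with semaphore:  # cap the number of in-flight task calls
            output = await task(example["input"])
        return {**example, "output": output, "score": scorer(output, example["expected"])}

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(ex) for ex in data))


# Toy examples for illustration only
toy_data = [
    {"input": "a", "expected": "A"},
    {"input": "b", "expected": "x"},
]
results = asyncio.run(run_eval(toy_data, fake_task, exact_match))
print([r["score"] for r in results])
```

Inside a notebook an event loop is already running, so there you would write `await run_eval(...)` rather than `asyncio.run(...)`.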

In [None]:
from braintrust import Eval

data = [
    {"input": issue["page_content"], "expected": issue["metadata"]["title"], "metadata": {"doc": issue["metadata"]}}
    for issue in issues
]

async def wrap_create_title(input, hooks):
    out = await create_title(input=input)
    return out["title"]

await Eval(
    "gh-issues",
    data = data,
    task=wrap_create_title,
    scores=[Summary],
)

## Pause and analyze the results in BrainTrust!

The cell above will print a link to the BrainTrust experiment. Go check it out (NOTE: it may take up to a minute to synchronize the data for viewing).

Click around, and specifically look at the `Rationale` field for some of the records with a `summary` score of 0. Can you spot a trend? Look at the task with `metadata.number` 4833. We're going to explore it below, and see if we can improve the prompt for it and similar examples.

## Reproducing an example

Now, let's pull down issue #4833 and reproduce the evaluation.

In [None]:
issue = [x for x in issues if x["metadata"]["number"] == 4833][0]

print(issue["metadata"]["url"], "\n")
print(issue["metadata"]["title"])
print(issue["page_content"])

In [None]:
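# Regenerate the title for this issue; `out` still holds the title for issues[0]
out = await create_title(input=issue["page_content"])
score = scorer(input=issue["page_content"], output=out["title"], expected=issue["metadata"]["title"])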
print(score.metadata['rationale'])
print(f"Score={score.score}")

### Fixing the prompt

Have you spotted the issue yet? It seems that some of our new titles are missing key details, perhaps because the prompt optimizes for brevity over accuracy. Let's tweak the prompt and see if we can improve. Note the last sentence in this prompt: _"Make sure the title is accurate and comprehensive, over being concise."_

In [None]:
create_title = guidance('''
{{#system~}}
You are a technical project manager who helps software engineers generate better titles for their GitHub issues,
by looking at their issue descriptions. Make sure the title is accurate and comprehensive, over being concise.
{{~/system}}

{{#user~}}
GitHub issue: {{input}}
{{~/user}}

{{#assistant~}}
{{gen 'title' max_tokens=500}}
{{~/assistant}}''', async_mode=True)  # keep async so wrap_create_title can still await it

out = await create_title(input=issue["page_content"])

score = scorer(input=issue["page_content"], output=out["title"], expected=issue["metadata"]["title"])
print(score.metadata['rationale'])
print(f"Score={score.score}")

## Assessing the change

Awesome! With the new prompt, the grader picked the generated title. But how do we know how the change affects the overall dataset? Let's run the new prompt on our full set of issues, and take a look at the experiment.

Once this finishes, we'll get a new link that allows us to compare the two experiments. Let's take a look at the results.

In [None]:
await Eval(
    "gh-issues",
    data = data,
    task=wrap_create_title,
    scores=[Summary],
)

## Summary

Click into the new experiment, and check it out! You should notice a few things:

* BrainTrust will automatically compare the new experiment to your previous one.
* You should see an increase in scores, and can click around to look at exactly which values improved.
* You can also filter down to the examples that still have a `summary` score of 0, and further iterate on the prompt to try and improve them.
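
The same kind of comparison can be sketched locally. The snippet below is an illustration only: `compare_scores` is a hypothetical helper, and the two score dictionaries are made-up placeholders keyed by issue number, not results from this tutorial.

```python
def compare_scores(before: dict, after: dict) -> dict:
    """Bucket examples present in both runs by how their score changed."""
    report = {"improved": [], "regressed": [], "unchanged": [], "still_zero": []}
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        if new > old:
            report["improved"].append(key)
        elif new < old:
            report["regressed"].append(key)
        else:
            report["unchanged"].append(key)
        if new == 0:
            # Candidates for the next round of prompt iteration
            report["still_zero"].append(key)
    return report


# Placeholder scores (NOT real experiment results), keyed by issue number.
baseline = {101: 0, 102: 1, 103: 0, 104: 1}
candidate = {101: 1, 102: 1, 103: 0, 104: 0}

report = compare_scores(baseline, candidate)
print(report)
```

Tracking regressions alongside improvements matters because a prompt change that fixes one example can break another.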

Last but not least... This tutorial is designed to teach you about BrainTrust, but in practice, you probably will have to iterate with more than just one prompt change to improve results. In fact, I changed the prompt 4 or 5 times and iterated with BrainTrust before nailing down this solution.

Happy evaling!