# BrainTrust TriviaBot

<a target="_blank" href="https://colab.research.google.com/github/braintrustdata/braintrust-examples/blob/main/classify/python/BrainTrust-Classify-Tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Welcome to [BrainTrust](https://www.braintrustdata.com/)! This is a quick tutorial on how to build and evaluate a Trivia bot using BrainTrust.

Before starting, make sure that you have a BrainTrust account. If you do not, please [sign up](https://www.braintrustdata.com) or [get in touch](mailto:info@braintrustdata.com). After this tutorial, learn more by visiting [the docs](http://www.braintrustdata.com/docs).

In [6]:
import csv
import braintrust
import openai
from autoevals.string import *


BT_API_KEY=""
OPENAI_API_KEY=""

openai.api_key = OPENAI_API_KEY

# 1. Load our evaluation data
First, we will load part of the `trivia_qa` [dataset on Hugging Face](https://huggingface.co/datasets/trivia_qa).

This includes question, answer pairs that look like:
```
Question: Which President was in office when the first Peanuts cartoon was published?
Answer: ['Presidency of Harry S. Truman', 'Hary truman', 'Harry Shipp Truman', \"Harry Truman's\", 'Harry S. Truman', 'Harry S.Truman', 'Harry S Truman', 'H. S. Truman', 'President Harry Truman', 'Truman administration', 'Presidency of Harry Truman', 'Mr. Citizen', 'HST (president)', 'H.S. Truman', 'Mary Jane Truman', 'Harry Shippe Truman', 'S truman', 'Harry Truman', 'President Truman', '33rd President of the United States', 'Truman Administration', 'Harry Solomon Truman', 'Harold Truman', 'Harry truman', 'H. Truman']"
```

In [7]:
evaluation_set = []
with open('evaluation_set.csv', 'r') as file:
    reader = csv.reader(file)
    evaluation_set = list(reader)

[['question', 'answer'], ['Which President was in office when the first Peanuts cartoon was published?', '[\'Presidency of Harry S. Truman\', \'Hary truman\', \'Harry Shipp Truman\', "Harry Truman\'s", \'Harry S. Truman\', \'Harry S.Truman\', \'Harry S Truman\', \'H. S. Truman\', \'President Harry Truman\', \'Truman administration\', \'Presidency of Harry Truman\', \'Mr. Citizen\', \'HST (president)\', \'H.S. Truman\', \'Mary Jane Truman\', \'Harry Shippe Truman\', \'S truman\', \'Harry Truman\', \'President Truman\', \'33rd President of the United States\', \'Truman Administration\', \'Harry Solomon Truman\', \'Harold Truman\', \'Harry truman\', \'H. Truman\']'], ['In 1930, which American-born Sinclair was awarded the Nobel Prize for Literature?', "['(Harry) Sinclair Lewis', 'Harry Sinclair Lewis', 'Lewis, (Harry) Sinclair', 'Grace Hegger', 'Sinclair Lewis']"], ['In what part of England was Dame Judi Dench born?', "['Park Grove (1895)', 'York UA', 'Yorkish', 'UN/LOCODE:GBYRK', 'York, 

## 2. Define an evaluation function

Next, we will define an evaluation function that makes it easy to evaluate different prompts for a TriviaBot.

We will use a mix of scoring functions we define and scoring functions that come included with BrainTrust's [autoevals library](https://github.com/braintrustdata/autoevals).

In [8]:
# Run evaluation function
def runEvaluation(generationFn, experimentName):
    #Initialize a BrainTrust experiment
    experiment = braintrust.init(project="Chroma-Trivia-Retrieval", api_key=BT_API_KEY, experiment=experimentName)
    for testCase in evaluation_set:
        question, expected = testCase
        # get the cheat sheet for the input
        generatedAnswer, prompt = generationFn(question)
        
        # 1. Use a custom eval
        def evalAppears(expected, output):
            result = output in expected
            return (1 if result else 0)
        answerAppears = evalAppears(expected, generatedAnswer)

        #2. use BrainTrust's Levenshtein scorer
        levenEvaluator = LevenshteinScorer()
        levenEvaluations = map(lambda x: levenEvaluator(x, generatedAnswer).score, expected)
        maxLevenEvaluation = max(list(levenEvaluations))

        scores = {
            "answerAppears": answerAppears,
            "levenshteinScore": maxLevenEvaluation,
        }

        # Log to BrainTrust
        experiment.log(
            input=question,
            output=generatedAnswer,
            expected=expected,
            scores=scores,
            metadata={"prompt": prompt},
        )

    return experiment.summarize()

## 3. Compare prompts

First, we will try a very simple prompt and see how much we can improve it.

```
Input: {question}
Output:
```
After we run this block, we will get a link to the BrainTrust web UI so we can dig into where our app can be improved.

In [10]:
def simplePrompt(question):
    try:
        prompt = """
        Input: {question}
        
        Output:
        """.format(question=question)
        messages = [{"role": "system", "content": prompt}]
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0,
            max_tokens=500,
        )

        result = response["choices"][0]["message"]["content"]
        return result, messages
    except Exception as e:
        print(e)
        return []
results = runEvaluation(simplePrompt, "simple")
print(results)


simple-020 compared to simple-019:
See results for all experiments in Chroma-Trivia-Retrieval at https://www.braintrustdata.com/app/braintrustdata.com/p/Chroma-Trivia-Retrieval
See results for simple-020 at https://www.braintrustdata.com/app/braintrustdata.com/p/Chroma-Trivia-Retrieval/simple-020


![simple.png](simple.png)
Our initial score is not great. Let's make some changes to our prompt and make it a bit more sophisticated and re-run again.

In [11]:
def complexPrompt(question):
    try:
        prompt = """
        ---
        You are the world's best trivia player. You will be given a question - respond to it as factually correct as possible.
        Respond with just the answer as concisely as possible.
        Input: {question}
        
        Output:
        """.format(question=question)
        messages = [{"role": "system", "content": prompt}]
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0,
            max_tokens=500,
        )

        result = response["choices"][0]["message"]["content"]
        return result, messages
    except Exception as e:
        print(e)
        return []
results = runEvaluation(complexPrompt, "complex")
print(results)


simple-021 compared to simple-020:
58.82% (+49.02%) 'answerAppears'    score	(25 improvements, 0 regressions)
12.13% (+09.77%) 'levenshteinScore' score	(43 improvements, 1 regressions)

See results for all experiments in Chroma-Trivia-Retrieval at https://www.braintrustdata.com/app/braintrustdata.com/p/Chroma-Trivia-Retrieval
See results for simple-021 at https://www.braintrustdata.com/app/braintrustdata.com/p/Chroma-Trivia-Retrieval/simple-021


![final.png](complex.png)
Our new prompt performs much better!

We can verify that our pipeline changes actually improved our performance using BrainTrust. Next, you can continue to make prompt and pipeline changes to improve the score even more.

Now, you are on your journey of building reliable AI apps with BrainTrust.
Learn more on our docs @ [https://www.braintrustdata.com/docs](https://www.braintrustdata.com/docs).