# BrainTrust QA Chat Tutorial

<a target="_blank" href="https://colab.research.google.com/github/braintrustdata/braintrust-examples/blob/main/">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Welcome to [BrainTrust](https://www.braintrustdata.com/)! This quick tutorial shows how to build and evaluate an AI question-answering chat assistant. The assistant answers questions using facts about the user that are stored in a vector DB (Chroma).

Before starting, make sure that you have a BrainTrust account. If you do not, please [sign up](https://www.braintrustdata.com) or [get in touch](mailto:info@braintrustdata.com). After this tutorial, learn more by visiting [the docs](http://www.braintrustdata.com/docs).

In [8]:
import json
import braintrust
import chromadb
import openai
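from autoevals.string import LevenshteinScorer  # string-distance scorer used in step 3
from autoevals.llm import Factuality  # LLM-based factuality scorer used in step 3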

PROJECTNAME = "QAchatbot"
OPENAI_API_KEY = "YOURAPIKEY"
BT_API_KEY = "YOURAPIKEY"

openai.api_key = OPENAI_API_KEY
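Hard-coding keys in a notebook makes them easy to leak. As a minimal sketch, the keys could instead be read from environment variables. The helper `read_api_key` and the variable names `OPENAI_API_KEY` and `BRAINTRUST_API_KEY` are assumptions chosen for illustration, not names this tutorial requires.

```python
import os

def read_api_key(env_var, placeholder="YOURAPIKEY"):
    """Return the key stored in env_var, or the tutorial placeholder if it is unset."""
    return os.environ.get(env_var, placeholder)

# Hypothetical environment variable names; use whatever your setup defines.
OPENAI_API_KEY = read_api_key("OPENAI_API_KEY")
BT_API_KEY = read_api_key("BRAINTRUST_API_KEY")
```

With this in place, the rest of the notebook can keep using `OPENAI_API_KEY` and `BT_API_KEY` unchanged.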

## 1. Load in datasets

First, we'll load two datasets:
1. An evaluation dataset to test our pipeline. This contains input/output pairs like:
```
    {"input": "What is my full name?", "output": "John Smith. -BT"}
```
2. A user context dataset that gives the AI assistant facts about the user. This will be stored in a vector DB and contains rows like:
```
    {"category": "address", "detail": "123 Main Street, Anytown, USA"}
```
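Both files are in JSON Lines format: one JSON object per line. If you want to try the tutorial with your own data, the sketch below writes and reads files of that shape. The helpers `write_jsonl` and `read_jsonl` are hypothetical names, and the files go into a temporary directory so nothing in your working directory is overwritten.

```python
import json
import os
import tempfile

def write_jsonl(path, rows):
    """Write each row as one JSON object per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example rows copied from the two dataset descriptions above.
eval_rows = [{"input": "What is my full name?", "output": "John Smith. -BT"}]
context_rows = [{"category": "address", "detail": "123 Main Street, Anytown, USA"}]

tmp_dir = tempfile.mkdtemp()
eval_path = os.path.join(tmp_dir, "evaluationDataset.jsonl")
context_path = os.path.join(tmp_dir, "userContextDataset.jsonl")

write_jsonl(eval_path, eval_rows)
write_jsonl(context_path, context_rows)

loaded_eval = read_jsonl(eval_path)
loaded_context = read_jsonl(context_path)
```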

In [2]:

# Input, output pairs to evaluate our QAChatbot.
# Upload these to a dataset in BrainTrust so your teammates can also easily use and manage this dataset.
dataset = braintrust.init_dataset(PROJECTNAME, name="Basic Evaluation", api_key=BT_API_KEY)
with open('evaluationDataset.jsonl') as f:
    for line in f:
        testCase = json.loads(line)
        dataset.insert(
            input=testCase['input'],
            output=testCase['output'],
        )

# List of facts about a fake user. These will be given as context to the QAChatbot.
userContextDataset = []
with open('userContextDataset.jsonl') as f:
    for line in f:
        userContextDataset.append(json.loads(line))

## 2. Store the user context dataset
Next, we will embed the user context dataset and store it in a vector DB. We will use [Chroma](https://www.trychroma.com/) to embed, store, and retrieve the relevant user context.

In [4]:
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="user-context")

# Add each fact as its own document, using its row index as the ID
for i, c in enumerate(userContextDataset):
    fact = c["detail"]
    collection.add(
        documents=[fact],
        ids=[str(i)],
    )
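To build intuition for what retrieval from a vector DB does, here is a toy sketch that ranks facts by cosine similarity to a question. It is not how Chroma works internally: the `embed` function below is a simple word-count stand-in for a learned embedding model, and the two facts are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding model)."""
    for ch in "?,.":
        text = text.replace(ch, "")
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def query(facts, question, n_results=1):
    """Return the n_results facts most similar to the question."""
    q = embed(question)
    ranked = sorted(facts, key=lambda fact: cosine(embed(fact), q), reverse=True)
    return ranked[:n_results]

# Hypothetical facts, not taken from the tutorial's dataset.
facts = [
    "My address is 123 Main Street, Anytown, USA",
    "My full name is John Smith",
]
top = query(facts, "What is my full name?")
```

Here `top` holds the fact that shares the most words with the question. A real embedding model captures meaning rather than exact word overlap, which is why the tutorial relies on Chroma instead.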

## 3. Write an evaluation function

Now, we will define a general evaluation function to test different prompts and pipelines using BrainTrust. This makes it easy to iterate and improve our AI app.

In [5]:

# This is an evaluation script that tests a prompt and pipeline combination and logs the results to BrainTrust.
def runEvaluation(generationFn, experimentName):
    # Initialize a BrainTrust experiment
    experiment = braintrust.init(project=PROJECTNAME, api_key=BT_API_KEY, experiment=experimentName, dataset=dataset)
 
    for testCase in dataset:
        input_data = str(testCase["input"])
        expected = str(testCase["output"])
    
        output, prompt = generationFn(input_data)  # Generate an answer using the given pipeline
 
        # Score factuality using BrainTrust's factuality scorer
        # Learn more here: https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml
        factuality = Factuality()
        factualityScore = factuality(output, expected, input=input_data)

        # Score the output using BrainTrust's Levenshtein scorer
        levenEvaluator = LevenshteinScorer()
        levenScore = levenEvaluator(output, expected, input=input_data)
        
        # Define two simple custom scorers that check the wording of the answer
        def BadWordScorer(output):
            # Score 0 if the answer contains any phrase we want the assistant to avoid
            bannedPhrases = ["ai language model", "sorry", "user"]
            if any(phrase in output.lower() for phrase in bannedPhrases):
                return 0 # Bad LLM :(
            return 1

        def GoodWordScorer(output):
            # Score 1 if the answer contains the "-BT" sign-off requested in the prompt
            if "-BT" in output:
                return 1 # Good LLM :)
            return 0
        
        # Log the results to BrainTrust
        experiment.log(
            input_data,
            output,
            expected,
            scores={
                factualityScore.name: factualityScore.score,
                levenScore.name: levenScore.score,
                "saysBadWords": BadWordScorer(output),
                "saysThankYou": GoodWordScorer(output)
                
            }, # Scores dictionary
            metadata={
                factualityScore.name: factualityScore.metadata,
                "prompt": prompt,
            }, # Metadata dictionary
            dataset_record_id=testCase["id"],
        )
 
    # Summarize the results
    summary = experiment.summarize(summarize_scores=True)
    return summary
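To make the non-LLM scores easier to interpret, the sketch below re-implements them as standalone functions. The two keyword scorers mirror the logic inside `runEvaluation`. The Levenshtein similarity is an assumption about how a normalized edit-distance score is commonly computed (1 minus distance divided by the longer string's length), not a copy of the autoevals implementation. All function names here are hypothetical.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def levenshtein_similarity(output, expected):
    """Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(output, expected) / longest

BANNED_PHRASES = ("ai language model", "sorry", "user")

def bad_word_score(output):
    """Return 0 if the answer contains a banned phrase, else 1."""
    lowered = output.lower()
    return 0 if any(phrase in lowered for phrase in BANNED_PHRASES) else 1

def sign_off_score(output):
    """Return 1 if the answer contains the -BT sign-off, else 0."""
    return 1 if "-BT" in output else 0
```

Keeping scorers as small pure functions like these makes them easy to unit-test before logging their results to an experiment.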

## 4. Evaluate and improve :)!

Finally, we will test two different prompts and pipelines.

Pipeline A uses a simple prompt and 1 relevant fact.

In [6]:
# 4. Run against different prompts and pipelines

# A very simple pipeline that uses 1 fact as context
def llmPipelineA(question):
    # Get a relevant fact from the user context dataset
    relevant = collection.query(
        query_texts=[question],
        n_results=1
    )
    relevant_text = ','.join(relevant["documents"][0])
    prompt = """
        You are an assistant called BT. Help the user. Do not say you are an AI language model. Follow up with -BT.
        Relevant information: {relevant}
        Question: {question}
        Answer:
        """.format(question=question, relevant=relevant_text)
    messages = [{"role": "system", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        max_tokens=100,
    )

    result = response["choices"][0]["message"]["content"]
    return result, prompt

resultsA = runEvaluation(llmPipelineA, "pipelineA")
print(resultsA)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

After running the block above, click the link in the cell output to open the BrainTrust web UI and see how the pipeline performs.

![web-ui.png](1.png)

Several test cases fail because the assistant apologizes and says "I'm sorry" too often. A prompt change should fix this. Let's also see whether increasing the number of relevant facts improves the score.
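The two changes just described (a stricter prompt and more retrieved facts) only affect the text sent to the model. As a rough sketch, the prompt construction can be isolated in a helper so the effect of each change is easy to inspect. `build_prompt` is a hypothetical helper, the template text is copied from the pipelines in this tutorial with indentation removed, and the facts are made up for illustration.

```python
PROMPT_A = """You are an assistant called BT. Help the user. Do not say you are an AI language model. Follow up with -BT.
Relevant information: {relevant}
Question: {question}
Answer:"""

PROMPT_B = """You are a very helpful assistant called BT. Respond concisely. Do not say you are an AI language model and do not apologize. End your answer with -BT.
Relevant information: {relevant}
Question: {question}
Your answer:"""

def build_prompt(template, question, facts):
    """Fill a template with the question and a comma-joined list of retrieved facts."""
    return template.format(question=question, relevant=",".join(facts))

# Hypothetical retrieved facts: one fact for the simple prompt, several for the updated one.
prompt_a = build_prompt(PROMPT_A, "What is my full name?", ["My full name is John Smith"])
prompt_b = build_prompt(
    PROMPT_B,
    "What is my full name?",
    ["My full name is John Smith", "My address is 123 Main Street, Anytown, USA"],
)
print(prompt_b)
```

Printing the prompt before calling the model is a cheap way to confirm that the retrieved facts actually contain the answer.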

Let's define pipeline B below, which uses 5 relevant facts as context and includes an updated prompt.

In [7]:
# A pipeline that uses 5 facts as context
def llmPipelineB(question):
    # Get relevant facts from the user context dataset
    relevant = collection.query(
        query_texts=[question],
        n_results=5
    )
    relevant_text = ','.join(relevant["documents"][0])
    prompt = """
        You are a very helpful assistant called BT. Respond concisely. Do not say you are an AI language model and do not apologize. End your answer with -BT.
        Relevant information: {relevant}
        Question: {question}
        Your answer:
        """.format(question=question, relevant=relevant_text)
    messages = [{"role": "system", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        max_tokens=100,
    )

    result = response["choices"][0]["message"]["content"]
    
    return result, prompt

resultsB = runEvaluation(llmPipelineB, "pipelineB")
print(resultsB)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

After running the block above, click the link in the cell output to open the BrainTrust web UI and compare pipeline B with pipeline A.

![web-ui-final.png](2.png)

We can verify that our pipeline changes actually improved our performance! Next, you can continue to make prompt and pipeline changes to improve the score even more.

You are now on your way to building reliable AI apps with BrainTrust.

Learn more in our docs at [https://www.braintrustdata.com/docs](https://www.braintrustdata.com/docs).