# Prompt versioning

Moving an AI project from prototype to production requires systematically testing different prompts. Much of that iteration goes into making sure responses are both accurate and aligned with your brand's guidelines. The process involves creating variations, measuring their effectiveness, and sometimes reverting to an earlier version that performed better.

In this cookbook, we'll build a support chatbot and walk through the complete cycle of prompt development. Starting with a basic implementation, we'll create increasingly sophisticated prompts, evaluate their performance, and demonstrate how to revert to previous versions when needed.

## Getting started

Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/signup). Add the OpenAI key to your Braintrust account's [AI provider configuration](https://www.braintrust.dev/app/settings?subroute=secrets).

Once you have your Braintrust account set up with an OpenAI API key, install the following dependencies:

In [None]:
!pip install braintrust autoevals openai nest_asyncio


Next, we'll import the libraries we need and authenticate with Braintrust. To authenticate, export your `BRAINTRUST_API_KEY` as an environment variable:
```bash
export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"
```
<Callout type="info">
Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
</Callout>
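
A missing or placeholder key tends to surface later as a confusing authentication error, so it can help to check the environment up front. The following is a minimal sketch; `require_env` and the `PLACEHOLDERS` set are hypothetical helpers written for this cookbook, not part of the Braintrust or OpenAI SDKs.

```python
import os

# Placeholder strings used in this cookbook; treated the same as "unset".
PLACEHOLDERS = {"", "YOUR_API_KEY_HERE", "YOUR_OPENAI_API_KEY_HERE"}


def require_env(name):
    """Return the value of an environment variable.

    Raises RuntimeError if the variable is unset, empty, or still holds
    one of the placeholder strings above.
    """
    value = os.environ.get(name, "").strip()
    if value in PLACEHOLDERS:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value
```

Calling `require_env("BRAINTRUST_API_KEY")` before creating the client would fail fast with a readable message instead of a failed request.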

Once the API key is set, we initialize the OpenAI client using the AI proxy:

In [None]:
import os
import subprocess
import asyncio
import nest_asyncio
from openai import OpenAI
from braintrust import Eval, wrap_openai, invoke
from autoevals import LLMClassifier

# Apply nest_asyncio to allow asyncio.run() in a notebook
nest_asyncio.apply()


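# Fall back to the placeholders only if the keys are not already exported,
# so a key set in your shell is not overwritten
os.environ.setdefault("BRAINTRUST_API_KEY", "YOUR_API_KEY_HERE")
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY_HERE")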

# Initialize OpenAI client with Braintrust wrapper
client = wrap_openai(
    OpenAI(
        base_url="https://api.braintrust.dev/v1/proxy",
        api_key=os.environ["BRAINTRUST_API_KEY"],
    )
)

# Project name for consistency
project_name = "SupportChatbot"

## Creating a dataset

We'll create a dataset of customer inquiries to evaluate our prompts. In a production application, you'll want to use real customer interactions.


In [3]:
dataset = [
    {
        "input": "Why did my package disappear after tracking showed it was delivered?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "Your product smells like burnt rubber - what’s wrong with it?",
        "metadata": {"category": "product"},
    },
    {
        "input": "I ordered 3 items but only got 1, where’s the rest?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "Why does your app crash every time I try to check out?",
        "metadata": {"category": "tech"},
    },
    {
        "input": "My refund was supposed to be here 2 weeks ago - what’s the holdup?",
        "metadata": {"category": "returns"},
    },
    {
        "input": "Your instructions say ‘easy setup’ but it took me 3 hours!",
        "metadata": {"category": "product"},
    },
    {
        "input": "Why does your delivery guy keep leaving packages at the wrong house?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "The discount code you sent me doesn’t work - fix it!",
        "metadata": {"category": "sales"},
    },
    {
        "input": "Your support line hung up on me twice - what’s going on?",
        "metadata": {"category": "support"},
    },
    {
        "input": "Why is your website saying my account doesn’t exist when I just made it?",
        "metadata": {"category": "tech"},
    },
]
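
Before running evaluations, it can be worth checking that every record has the shape the task expects and seeing how the categories are distributed. This is a minimal sketch; `summarize_dataset` is a hypothetical helper, and the two sample records are copied from the dataset above so the block runs on its own.

```python
from collections import Counter


def summarize_dataset(records):
    """Validate each record and return a Counter of examples per category."""
    counts = Counter()
    for i, record in enumerate(records):
        text = record.get("input")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"record {i} has no non-empty 'input' string")
        # Records without a category are grouped under "uncategorized".
        category = record.get("metadata", {}).get("category", "uncategorized")
        counts[category] += 1
    return counts


# Two records copied from the dataset above.
sample = [
    {
        "input": "Why did my package disappear after tracking showed it was delivered?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "Why does your app crash every time I try to check out?",
        "metadata": {"category": "tech"},
    },
]
print(summarize_dataset(sample))
```

Running `summarize_dataset(dataset)` on the full list shows, for example, that three of the ten inquiries are about shipping, which is useful context when a prompt scores poorly on one category.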

## Creating a scoring function

When evaluating support responses, we need to measure more than just accuracy. To do this, we use an [LLMClassifier](https://github.com/braintrustdata/autoevals?tab=readme-ov-file#python-3) to assess whether responses maintain the right tone while solving customer problems:

In [4]:
brand_alignment_scorer = LLMClassifier(
    name="BrandAlignment",
    prompt_template="""
    Evaluate if the response aligns with our brand guidelines (Y/N):
    1. **Positive Tone**: Uses upbeat language, avoids negativity (e.g., "We’re thrilled to help!" vs. "That’s your problem").
    2. **Proactive Approach**: Offers a clear next step or solution (e.g., "We’ll track it now!" vs. vague promises).
    3. **Apologetic When Appropriate**: Acknowledges issues with empathy (e.g., "So sorry for the mix-up!" vs. ignoring the complaint).
    4. **Solution-Oriented**: Focuses on fixing the issue for the customer (e.g., "Here’s how we’ll make it right!" vs. excuses).
    5. **Professionalism**: Contains no profanity or emojis.
    
    Response: {{output}}


    Only give a Y if all the criteria are met. If one is missing and it should be there, give an N.
    """,
    choice_scores={"Y": 1, "N": 0},
    use_cot=True,
)

Brand alignment scorer defined with specific guidelines.
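
Some criteria can also be checked deterministically, which is cheaper and more repeatable than asking a model. The sketch below covers only the emoji part of the professionalism guideline. `no_emoji` is a hypothetical scorer, its regular expression covers common emoji blocks rather than every emoji, and it assumes that `Eval` accepts a plain function taking `input`, `output`, and `expected` and returning a number between 0 and 1.

```python
import re

# Rough coverage: common pictograph, emoticon, and symbol blocks only.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def no_emoji(input, output, expected=None):
    """Return 1 if the response contains no emoji from the blocks above, else 0."""
    return 0 if EMOJI_PATTERN.search(str(output)) else 1
```

Under that assumption, it could run alongside the classifier with `scores=[brand_alignment_scorer, no_emoji]`, giving a second column to compare against the LLM-based judgment.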


## Creating a prompt

Let's start with a basic prompt that focuses on providing direct responses to customer inquiries:

In [None]:
prompt_v1_content = """
import braintrust

project = braintrust.projects.create(name="SupportChatbot")

prompt_v1 = project.prompts.create(
    name="Brand Support V1",
    slug="brand-support-v1",
    description="Simple support prompt",
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "{{{input}}}"}
    ]
)
"""

# Save to file
with open("prompt_v1.py", "w") as f:
    f.write(prompt_v1_content)

# Push to Braintrust with overwrite and verbosity
push_command = "braintrust push prompt_v1.py --if-exists replace --verbose"
result = subprocess.run(push_command, shell=True, capture_output=True, text=True)
print("Pushing prompt_v1.py to Braintrust:")
print(f"CLI Output: {result.stdout}")
if result.stderr:
    print(f"CLI Error: {result.stderr}")
if result.returncode != 0:
    print(f"Push failed with return code {result.returncode}")
else:
    print("Push succeeded.")

After pushing the prompt, you'll see it in the Braintrust UI.

![prompts](./assets/prompts.png)
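
The push step above can be wrapped in a small helper so that later prompt versions reuse it. This is a sketch, not part of the Braintrust CLI or SDK: `build_push_command` and `push_prompt` are hypothetical names, and the flags are the same ones used in the cell above. Passing the command as a list avoids `shell=True`.

```python
import subprocess


def build_push_command(path):
    """Build the CLI invocation used above, as an argument list."""
    return ["braintrust", "push", path, "--if-exists", "replace", "--verbose"]


def push_prompt(path):
    """Push one prompt file and return True if the CLI exited with code 0."""
    result = subprocess.run(build_push_command(path), capture_output=True, text=True)
    print(f"CLI output: {result.stdout}")
    if result.stderr:
        print(f"CLI error: {result.stderr}")
    return result.returncode == 0
```

With this in place, each later push reduces to a call such as `push_prompt("prompt_v1.py")`.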

## Running our first evaluation

With our prompt ready, we'll create a task function and run our first evaluation:

In [None]:
# Define task using invoke with correct input
def task_v1(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v1",
        input={"input": input},  # Matches {{{input}}} in our prompt
    )
    return result


eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v1,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v1",
)

## Improving our prompt

Our initial evaluation showed that there is room for improvement. Let's create a more sophisticated prompt that incorporates our brand guidelines:

In [None]:
prompt_v2_content = """
import braintrust

project = braintrust.projects.create(name="SupportChatbot")

prompt_v2 = project.prompts.create(
    name="Brand Support V2",
    slug="brand-support-v2",
    description="Brand-aligned support prompt",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You’re a cheerful, proactive assistant for Sunshine Co. Always use a positive tone, apologize for issues with empathy, and offer clear solutions to delight customers!"},
        {"role": "user", "content": "{{{input}}}"}
    ]
)
"""

with open("prompt_v2.py", "w") as f:
    f.write(prompt_v2_content)

push_command = "braintrust push prompt_v2.py --if-exists replace --verbose"
result = subprocess.run(push_command, shell=True, capture_output=True, text=True)
print("Pushing prompt_v2.py to Braintrust:")
print(f"CLI Output: {result.stdout}")
if result.stderr:
    print(f"CLI Error: {result.stderr}")
if result.returncode != 0:
    print(f"Push failed with return code {result.returncode}")
else:
    print("Push succeeded.")

Note that the task function now points to the slug of the new prompt, since that's the version we're evaluating.

In [None]:
def task_v2(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v2",
        input={"input": input},
    )
    return result


eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v2,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v2",
)

## Experimenting with tone

Building on these improvements, we'll push the brand voice in a more energetic direction, testing how increased enthusiasm affects response quality:

In [9]:
prompt_v3_content = """
import braintrust

project = braintrust.projects.create(name="SupportChatbot")

prompt_v3 = project.prompts.create(
    name="Brand Support V3",
    slug="brand-support-v3",
    description="Over-enthusiastic support prompt with middling performance",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You’re a SUPER EXCITED Sunshine Co. assistant! SHOUT IN ALL CAPS WITH LOTS OF EXCLAMATIONS!!!! SAY SORRY IF SOMETHING’S WRONG BUT KEEP IT VAGUE AND FUN!!! Make customers HAPPY with BIG ENERGY, even if solutions are UNCLEAR!!!!"},
        {"role": "user", "content": "{{{input}}}"}
    ]
)
"""


with open("prompt_v3.py", "w") as f:
    f.write(prompt_v3_content)


push_command = "braintrust push prompt_v3.py --if-exists replace --verbose"
result = subprocess.run(push_command, shell=True, capture_output=True, text=True)
print("Pushing prompt_v3.py to Braintrust:")
print(f"CLI Output: {result.stdout}")
if result.stderr:
    print(f"CLI Error: {result.stderr}")
if result.returncode != 0:
    print(f"Push failed with return code {result.returncode}")
else:
    print("Push succeeded.")


def task_v3(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v3",
        input={"input": input},
    )
    return result


eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v3,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v3",
)

Unfortunately, the new prompt doesn't perform as well as the previous one. In a production application, this is where you'll want to run more evaluations to find an optimal prompt.
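
When several experiments are in play, picking the winner comes down to comparing an aggregate of the per-example scores. The sketch below uses the mean of 0/1 scores. `best_experiment` is a hypothetical helper, and the numbers are placeholders chosen for illustration, not results from the experiments in this cookbook.

```python
from statistics import mean


def best_experiment(scores_by_experiment):
    """Return (name, mean_score) for the experiment with the highest mean score."""
    averages = {
        name: mean(scores)
        for name, scores in scores_by_experiment.items()
        if scores  # skip experiments with no scores
    }
    if not averages:
        raise ValueError("no scores to compare")
    best = max(averages, key=averages.get)
    return best, averages[best]


# Placeholder 0/1 scores, NOT results from this cookbook's experiments.
placeholder_scores = {
    "experiment_a": [0, 1, 0, 0],
    "experiment_b": [1, 1, 1, 0],
    "experiment_c": [1, 0, 0, 1],
}
print(best_experiment(placeholder_scores))
```

In practice the Braintrust UI shows these aggregates for each experiment, so a helper like this is mainly useful for scripted comparisons.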

## Managing prompt versions

After running evaluations on all three versions, we found that our second prompt achieved the highest score. Although we've kept iterating since then, Braintrust makes it simple to revert to this high-performing version by pointing the task back at its slug:

In [None]:
def task_reverted(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v2",
        input={"input": input},
    )
    return result


eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_reverted,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v2_reverted",
)

If you edit a prompt without changing its slug, don't worry. Braintrust tracks each change, so you can still revert to any previous version.

![versions](./assets/versions.png)
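
The task functions in this cookbook differ only in the slug they call, so they can be generated by a small factory that also pins a specific version when needed. This is a sketch under stated assumptions: `make_task` and its `invoke_fn` parameter are hypothetical, and it assumes `invoke` accepts a `version` argument identifying a saved prompt version, so check the SDK reference before relying on it.

```python
def make_task(slug, project_name="SupportChatbot", version=None, invoke_fn=None):
    """Return a task function that calls the prompt stored under `slug`.

    `invoke_fn` exists only so the factory can be exercised without the SDK;
    when it is None, braintrust.invoke is imported and used.
    """

    def task(input):
        fn = invoke_fn
        if fn is None:
            # Imported lazily so defining the factory does not require the SDK.
            from braintrust import invoke as fn
        kwargs = {
            "project_name": project_name,
            "slug": slug,
            "input": {"input": input},  # Matches {{{input}}} in the prompts
        }
        if version is not None:
            # Assumption: `version` pins one saved version of the prompt.
            kwargs["version"] = version
        return fn(**kwargs)

    return task
```

For example, `make_task("brand-support-v2")` would stand in for `task_reverted`, and passing `version=` with an ID copied from the version history would pin the task to that revision even if the prompt is edited later.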

## Next steps

- Now that you have some prompts saved, you can rapidly test them with new models in our [prompt playground](/docs/guides/playground).
- Learn more about [evaluating a chat assistant](/docs/cookbook/recipes/EvaluatingChatAssistant).
- Think about how you might add more sophisticated [scoring functions](/docs/guides/evals/write#scorers) to your evals.