# Basic Promptfoo Evaluation Setup and Execution

## Welcome to GlobalMart's AI Innovation Team!

Congratulations on joining GlobalMart's AI Innovation Team! As a leading e-commerce platform, GlobalMart receives thousands of customer emails daily, covering a wide range of topics from order inquiries to technical support requests. Your first major project is to develop and evaluate an AI-powered email classification system that will revolutionize our customer support operations.

### Your Mission

As the newest data scientist on the team, you've been tasked with setting up and running the initial evaluation for our email classification system using Promptfoo and Amazon Bedrock. This system aims to automatically categorize incoming customer emails into four main categories:

1. Order Issues
2. Product Inquiries
3. Technical Support
4. Returns/Refunds

Your evaluation will help us understand the current performance of our AI model and identify areas for improvement. The insights you gather will be crucial in enhancing our customer service efficiency and response times.

### What's at Stake

The success of this project could lead to:
- Faster response times to customer inquiries
- More efficient allocation of customer support resources
- Improved customer satisfaction rates
- Potential cost savings in our customer service department

Your work will directly impact GlobalMart's operational efficiency and customer experience. Are you ready to dive in and make a real difference?

## Introduction

In this lab, we'll set up and run our first evaluation for the GlobalMart Email Classification System using Promptfoo and Amazon Bedrock. We'll go through the process of defining our evaluation objectives, creating prompts and test cases, and analyzing the results.

Let's get started on this exciting journey to optimize GlobalMart's customer support!

## 1. Preparing for Evaluation

### Review of Promptfoo configuration

Before we begin, make sure you have Promptfoo installed and configured to work with Amazon Bedrock. 

### Defining evaluation objectives

Our goal is to create an email classification system for GlobalMart that can accurately categorize customer emails into the following categories:
1. Order Issues
2. Product Inquiries
3. Technical Support
4. Returns/Refunds

We'll evaluate our system's ability to correctly classify emails into these categories.

## 2. Creating a Focused Evaluation

### A. Designing a prompt template

We'll create a prompt template in a separate Python file called `prompts.py`. This approach offers several advantages:
1. Modularity: Keeps our prompts separate from configuration, making them easier to manage and update.
2. Reusability: Allows us to easily reuse prompts across different evaluations or projects.
3. Version control: Makes it easier to track changes to our prompts over time.

First, install Promptfoo:

In [None]:
!npm install -g promptfoo --loglevel=error --no-fund

Next, create the `prompts.py` file with the `classify_email` function:

In [None]:
%%writefile prompts.py
# prompts.py

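def classify_email(context):
    # Promptfoo calls Python prompt functions with a context dict;
    # the test-case variables (CSV columns) are under context["vars"].
    email_content = context["vars"]["email_content"]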
    return f"""You are an AI assistant for GlobalMart's customer support team. Your task is to classify the following email into one of these categories: Order Issues, Product Inquiries, Technical Support, or Returns/Refunds.

Email content: {email_content}

Provide your classification as a single word or phrase, choosing from the categories mentioned above. Do not include any explanation or additional text.

Classification:"""

# You can add more prompt functions here in the future

### B. Developing test cases

We'll create our test cases in a CSV file named `dataset.csv`. Using a CSV file for test cases offers several benefits:
1. Easy to read and edit: CSV files are human-readable and can be edited with various tools, including spreadsheet software.
2. Scalability: CSV files can handle large numbers of test cases efficiently.
3. Separation of concerns: Keeps test data separate from code and configuration.

Let's update our `dataset.csv` file with the content below:

In [None]:
%%writefile dataset.csv
email_content,__expected
"Hi, I ordered a laptop last week, but I haven't received any shipping update. Can you help?",Order Issues
"I'm having trouble logging into my account. It keeps saying my password is incorrect even though I'm sure it's right.",Technical Support
"Do you have the latest iPhone model in stock? I couldn't find it on your website.",Product Inquiries
"I received my order yesterday, but the shirt is the wrong size. I'd like to return it for a refund.",Returns/Refunds
"Can you tell me when the next sale is? I'm looking to buy a new TV.",Product Inquiries
"My order arrived damaged. What should I do?",Returns/Refunds
"How do I track my recent order?",Order Issues
"I bought a blender from your store, but it's not working. Is there a warranty?",Technical Support
"I want to change the shipping address for my recent order. Is that possible?",Order Issues
"What's your return policy for electronics?",Returns/Refunds

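Because every row of the CSV becomes a test case, a mistyped label would show up later as a failed test rather than as a data error. The sketch below is one way to check the file before running an evaluation. `validate_rows` and `CATEGORIES` are hypothetical names introduced here, not part of Promptfoo, and the sample embeds three rows from the dataset so the snippet runs on its own.

```python
import csv
import io
from collections import Counter

# The four labels the classifier is allowed to return.
CATEGORIES = {"Order Issues", "Product Inquiries", "Technical Support", "Returns/Refunds"}

# Three rows copied from dataset.csv. In the lab you could instead pass
# open("dataset.csv", newline="", encoding="utf-8") to validate_rows.
SAMPLE_CSV = '''email_content,__expected
"Hi, I ordered a laptop last week, but I haven't received any shipping update. Can you help?",Order Issues
"My order arrived damaged. What should I do?",Returns/Refunds
"How do I track my recent order?",Order Issues
'''


def validate_rows(handle):
    """Return (label_counts, problems) for a CSV with email_content and __expected columns."""
    counts = Counter()
    problems = []
    # Data rows are numbered from 1; the header row is not counted.
    for row_no, row in enumerate(csv.DictReader(handle), start=1):
        email = (row.get("email_content") or "").strip()
        label = (row.get("__expected") or "").strip()
        if not email:
            problems.append(f"row {row_no}: empty email_content")
        if label not in CATEGORIES:
            problems.append(f"row {row_no}: unexpected label {label!r}")
        counts[label] += 1
    return counts, problems


label_counts, problems = validate_rows(io.StringIO(SAMPLE_CSV))
print(dict(label_counts))
print(problems or "no problems found")
```

The label counts also give a quick view of how balanced the test set is across the four categories.
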
### C. Creating the `promptfooconfig.yaml` config file
Your configuration file brings everything together, specifying how the evaluation should run.

The Promptfoo config file below has four key items:
1. A clear description of what we're testing
2. A reference to the `classify_email` prompt function in `prompts.py`
3. A provider configuration (in this case Amazon's Nova Lite and Anthropic's Haiku 3.5, both served through Amazon Bedrock)
4. Test cases: here, the CSV file above, which supplies the `email_content` variable and the expected output (`__expected`) for each test


In [None]:
%%writefile promptfooconfig.yaml
description: "GlobalMart Email Classification Evaluation"

prompts:
  - prompts.py:classify_email

providers:
  - id: bedrock:us.amazon.nova-lite-v1:0
    label: "Nova Lite"
    config:
      region: us-west-2 # change to us-east-1 depending on your deployment region

  - id: bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0
    label: "Haiku 3.5"
    config:
      region: us-west-2 # change to us-east-1 depending on your deployment region

tests:
  - file://dataset.csv

## 3. Running Your First Evaluation
Now that we have set up all the components of our email classification system, it's time to run our first evaluation. We'll use the PromptFoo command line interface with some specific settings for our lab environment.
Enter the following command to run the evaluation:

`promptfoo eval --no-progress-bar --no-cache`

Let's understand what these command flags do and why we're using them:

**--no-progress-bar:** We're disabling the progress bar because we're running this in a Jupyter notebook environment. While progress bars are helpful when running evaluations in a terminal, they can create visual clutter in our notebook output.

**--no-cache:** This flag tells PromptFoo to generate fresh results for every evaluation rather than using any cached responses. While caching is valuable in production environments to save time and reduce API costs, for learning purposes we want to see new results each time we run our evaluation. This helps us:

- Observe how the model's responses might vary
- Ensure we're not looking at stored results from previous runs
- Get a true sense of the model's performance

**NOTE:** In a real-world scenario, you might omit these flags to take advantage of PromptFoo's progress tracking and caching features. However, for this learning exercise, getting fresh, clear results helps us better understand how our classification system performs.

After running this command, PromptFoo will:
1. Process each email in our test dataset
2. Send them to our configured Bedrock models
3. Compare each response against the `__expected` value from `dataset.csv`
4. Generate a detailed report of the results
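
The scoring step can be pictured as a comparison between each model output and its `__expected` label. The sketch below is a conceptual stand-in, not Promptfoo's implementation: `exact_match` and `lenient_match` are hypothetical helpers, the outputs are made up, and it assumes that a plain `__expected` value is checked with a strict string comparison. It illustrates why the prompt asks for the category name with no explanation or additional text.

```python
def exact_match(output: str, expected: str) -> bool:
    """Strict comparison: any extra character makes the test fail."""
    return output == expected


def lenient_match(output: str, expected: str) -> bool:
    """Ignore case, surrounding whitespace, and a trailing period."""
    def normalize(text: str) -> str:
        return text.strip().rstrip(".").strip().casefold()

    return normalize(output) == normalize(expected)


# Hypothetical model outputs paired with the expected label.
examples = [
    ("Order Issues", "Order Issues"),
    (" Order Issues\n", "Order Issues"),
    ("Returns/Refunds.", "Returns/Refunds"),
    ("Technical Support", "Product Inquiries"),
]

for output, expected in examples:
    strict = exact_match(output, expected)
    lenient = lenient_match(output, expected)
    print(f"{output!r:22} strict={strict} lenient={lenient}")
```

Under a strict comparison, only the first example passes. A lenient comparison also accepts the second and third, while the genuinely wrong label in the last example fails either way.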

Let's run the evaluation and examine the results...

In [None]:
!promptfoo eval --no-progress-bar --no-cache

**IMPORTANT**: To share the results of your evaluation and see a more in-depth analysis of the results, you'll need to create a **free** account with [Promptfoo](https://www.promptfoo.app/welcome). You only need to run the login command below once for this workshop.

The command is prepared below; you just need to insert your API key. Be sure to remove the `<` and `>` when you paste in your API key.

In [None]:
!promptfoo auth login --host https://api.promptfoo.app --api-key <insert your api key here>

### Sharing Your Evaluation Results
Once you've run your evaluation, and have completed the `promptfoo auth` authentication command above, you can share the results with teammates or save them for later reference. PromptFoo provides a convenient way to do this using the share command:

`promptfoo share`

When you run this command, PromptFoo will:

- Generate a unique, publicly accessible URL for your evaluation results
- Display the URL in your terminal like this:
```plaintext
View results: https://app.promptfoo.dev/eval/f:91b9ea8a-174c-4129-9c52-774c34c96ea4
```
When you visit this URL, you'll see the same detailed evaluation results that you would see in the local web viewer, but now they're accessible to anyone with the link. This is especially useful for:

- Collaborating with team members on improving the classification system
- Documenting your evaluation results for future reference
- Comparing results across different iterations of your prompts
- Creating snapshots of your system's performance at different stages

Keep in mind that these shared results will remain accessible via the URL, so you can always refer back to them even after making changes to your evaluation setup.

In [None]:
!promptfoo share

Open the URL printed by this command in your browser to explore the results in depth.

## 4. Identifying strengths and weaknesses

Examine the results in the web viewer. Pay attention to:
1. Overall accuracy: What percentage of emails were correctly classified?
2. Performance by category: Are some categories more accurately classified than others?
3. Misclassifications: Look at emails that were incorrectly categorized. Can you identify any patterns?
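
The three questions above can also be answered numerically once you have the expected and predicted labels side by side. The sketch below is a minimal example with hypothetical data: the `results` pairs are invented for illustration and are not output from a real evaluation run, and `summarize` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical (expected, predicted) pairs for illustration only;
# these are not results from a real evaluation run.
results = [
    ("Order Issues", "Order Issues"),
    ("Technical Support", "Technical Support"),
    ("Product Inquiries", "Product Inquiries"),
    ("Returns/Refunds", "Returns/Refunds"),
    ("Returns/Refunds", "Order Issues"),
    ("Technical Support", "Returns/Refunds"),
]


def summarize(pairs):
    """Return overall accuracy, per-category accuracy, and misclassification counts."""
    totals = Counter()
    correct = Counter()
    confusions = Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
        else:
            confusions[(expected, predicted)] += 1
    overall = sum(correct.values()) / len(pairs) if pairs else 0.0
    per_category = {label: correct[label] / totals[label] for label in totals}
    return overall, per_category, confusions


overall, per_category, confusions = summarize(results)
print(f"Overall accuracy: {overall:.0%}")
for label, accuracy in per_category.items():
    print(f"  {label}: {accuracy:.0%}")
for (expected, predicted), count in confusions.most_common():
    print(f"  {expected} -> {predicted}: {count}")
```

The `confusions` counter is the part that helps most with spotting patterns: a pair that appears repeatedly, such as damaged-order emails landing in Order Issues instead of Returns/Refunds, suggests that the category boundaries in the prompt need to be spelled out more clearly.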

## Conclusion

In this lab, we've set up and run our first evaluation of the GlobalMart Email Classification System using Promptfoo and Amazon Bedrock. We've learned how to create prompts, define test cases, and analyze the results of our evaluation. 