# Prompt Optimization through A/B Testing

## 🎯 Learning Objectives
By the end of this notebook, you will be able to:
1. Design and conduct A/B tests to compare prompt variants
2. Implement a manual evaluation workflow for LLM responses
3. Analyze feedback data to determine which prompt performs better
4. Apply statistical thinking to prompt engineering decisions

## 💡 Key Concept: A/B Testing for Prompts

**What is A/B Testing?**  
A/B testing (also called split testing) is a method of comparing two versions of something to determine which performs better. In prompt engineering, we:
- Create two prompt variants (A and B)
- Generate multiple responses from each
- Collect human feedback on quality
- Analyze which variant produces better results

**Why Manual Evaluation?**  
While automated metrics exist, human judgment is often the gold standard for evaluating:
- Creativity and originality
- Brand voice alignment
- Subtle quality differences
- Real-world usefulness

**When to Use A/B Testing:**
- Optimizing prompts for production systems
- Deciding between few-shot vs zero-shot approaches
- Testing different instruction phrasings
- Validating prompt improvements before deployment

In [1]:
import ipywidgets as widgets
from IPython.display import display
import pandas as pd
from src.openai_client import generate_text
import time
import random

## Section 1: Setup and Imports

Import the necessary libraries for our A/B testing workflow.

In [2]:
# Define two variants of the prompt
prompt_A = """Product description: A pair of shoes that can fit any foot size.
Themes: adaptable, fit, omni-fit.
Product names:

# output format:
comma separated list."""

prompt_B = """Product description: A home milkshake maker.
Themes: fast, healthy, compact, flexible.
Product names: HomeShaker, Fit Shaker, QuickShake, Shake Maker

Product description: A watch that can tell accurate time in space.
Themes: astronaut, space-hardened, elliptical orbit, outer space.
Product names: AstroTime, SpaceGuard, Orbit-Accurate, EliptoTime.

Product description: A pair of shoes that can fit any foot size.
Themes: adaptable, fit, omni-fit.
Product names:

# output format:
comma separated list."""

def delay(ll_delay=1, ul_delay=2):
    return round(random.uniform(ll_delay, ul_delay), 2)


## Section 2: Define Prompt Variants

### 🔬 Experimental Design

We'll test two approaches for generating product names:

**Variant A (Zero-shot):**  
Minimal context, direct instruction

**Variant B (Few-shot with examples):**  
Provides 2 examples to guide the model's creative process

### 🤔 Hypothesis
Few-shot prompting (Variant B) will generate more creative and contextually appropriate product names compared to zero-shot (Variant A).

**Key Variables to Control:**
- Same task (generate product names for adaptive shoes)
- Same themes provided
- Same output format requested
- Only difference: presence/absence of examples
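
One way to confirm that the examples really are the only difference is to check that the few-shot prompt ends with exactly the zero-shot prompt. The sketch below is an illustration rather than part of the original notebook: the helper name `shares_task_block` is hypothetical, and the few-shot prompt is abbreviated to a single example to keep it short.

```python
# Illustrative sanity check (hypothetical helper, not from the notebook):
# the few-shot variant should be "examples + the exact zero-shot prompt".
prompt_A = """Product description: A pair of shoes that can fit any foot size.
Themes: adaptable, fit, omni-fit.
Product names:

# output format:
comma separated list."""

# Abbreviated to one example for this sketch; the notebook uses two.
few_shot_examples = """Product description: A home milkshake maker.
Themes: fast, healthy, compact, flexible.
Product names: HomeShaker, Fit Shaker, QuickShake, Shake Maker

"""

prompt_B = few_shot_examples + prompt_A


def shares_task_block(zero_shot: str, few_shot: str) -> bool:
    """Return True if the few-shot prompt ends with the full zero-shot prompt."""
    return few_shot.endswith(zero_shot)


print(shares_task_block(prompt_A, prompt_B))
```

If this check fails, some other wording differs between the variants, and any difference in ratings can no longer be attributed to the examples alone.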

In [3]:
output = generate_text(prompt_A)
print(output)

FlexiFit, AdaptStep, OmniShoe, FitAll, VersatileWalk, SizeFlex, AnyFit, ShapeShifter, UniversalStride, ComfortAdjust.


### Quick Test: Preview Variant A

Let's see a single response from Variant A to understand what we're evaluating.

In [12]:
# Iterate through the prompts and get responses
test_prompts = [prompt_A, prompt_B]
responses = []
num_tests_per_prompt = 6
max_retries = 3

for idx, prompt in enumerate(test_prompts):
    # prompt number as a letter
    var_name = chr(ord('A') + idx)

    for i in range(num_tests_per_prompt):
        # Get a response from the model
        response = None
        retries = 0
        while response is None and retries < max_retries:
            response = generate_text(prompt)
            # print(response)
            if response is None:
                seconds = delay()
                print(f"{var_name}: {i}: delaying for {seconds} seconds...")
                time.sleep(seconds)
                retries += 1
                continue
            data = {
                "variant": var_name,
                "prompt": prompt,
                "response": response
                }
            responses.append(data)
            # print(data)


# Convert responses into a DataFrame
df = pd.DataFrame(responses)


## Section 3: Generate Test Responses

### 📊 Sample Size Considerations

We'll generate **6 responses per variant** (12 total). This gives us:
- Enough variety to see response diversity
- Manageable review time (~2-3 minutes)
- A rough directional signal for comparison (too few samples for a statistically reliable conclusion)

**Why multiple samples?**  
LLMs are non-deterministic. A single response might not represent typical quality. Multiple samples help us:
- Identify consistent patterns
- Spot outliers
- Make more reliable comparisons

**Note:** The code includes retry logic and delays to handle API rate limits gracefully.
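
The retry pattern can also be factored into a small helper. This is a minimal sketch, assuming (as the loop above implies) that the text generator returns `None` when a call fails. The names `call_with_retries` and `make_flaky_generator` are hypothetical, and the flaky generator is a stand-in for `generate_text` so the example runs without an API key.

```python
import random
import time


def call_with_retries(generate, prompt, max_retries=3,
                      min_delay=1.0, max_delay=2.0, sleep=time.sleep):
    """Call generate(prompt) until it returns a non-None value.

    Waits a random delay between failed attempts and gives up
    (returning None) after max_retries attempts.
    """
    for attempt in range(1, max_retries + 1):
        response = generate(prompt)
        if response is not None:
            return response
        if attempt < max_retries:
            # Random jitter so repeated calls do not hit the API in lockstep
            sleep(round(random.uniform(min_delay, max_delay), 2))
    return None


def make_flaky_generator(failures):
    """Stand-in for generate_text: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def generate(prompt):
        state["calls"] += 1
        if state["calls"] <= failures:
            return None
        return f"response to: {prompt}"

    return generate


# Demo with sleeping disabled so the example finishes instantly
result = call_with_retries(make_flaky_generator(2), "demo prompt",
                           sleep=lambda seconds: None)
print(result)
```

Returning `None` after the last attempt makes a dropped sample explicit, so the caller can check that each variant ended up with the same number of responses.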

In [43]:
df

| | variant | prompt | response | feedback |
|---|---|---|---|---|
| 0 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniStep, AdaptFit, SizeWise, Univer... | |
| 1 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniShoe, AdaptStep, FitAll, ShapeSh... | |
| 2 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniStep, Adaptable Sole, FitAll Sho... | |
| 3 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniShoe, Adaptable Sole, FitAll, Un... | |
| 4 | A | Product description: A pair of shoes that can ... | FitFlex, AdaptFit, OmniShoe, SizeWise, FlexiFi... | |
| 5 | A | Product description: A pair of shoes that can ... | OmniFit Shoes, Adaptable Step, FlexiFit Footwe... | |
| 6 | B | Product description: A home milkshake maker.\n... | FitAll, OmniShoe, AdaptFit, FlexiStep, AnySize... | |
| 7 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniShoe, Adaptable Steps, FitAll, S... | |
| 8 | B | Product description: A home milkshake maker.\n... | FitFlex, OmniShoe, AdaptFit, UniversalStride, ... | |
| 9 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniShoes, AdaptStep, FitAll, ShapeS... | |


### View Generated Responses

The DataFrame contains:
- `variant`: Which prompt version (A or B)
- `prompt`: The actual prompt text
- `response`: The model's generated product names
- `feedback`: Will be populated during manual review

---

## Section 4: Manual Review Process

### 🎨 Human Evaluation Workflow

**Why shuffle responses?**  
To avoid bias! If all Variant A responses appear first, you might:
- Get fatigued and rate later ones lower
- Notice patterns and adjust your ratings
- Compare within groups instead of independently
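
Shuffling helps most when the reviewer also cannot see which variant produced each response. The following is a minimal sketch of that idea in plain Python, not the notebook's own code: `blind_shuffle` is a hypothetical helper, and the sample rows are made up for illustration. With pandas, the shuffle itself is what `df.sample(frac=1)` does in the review cell of this notebook.

```python
import random


def blind_shuffle(rows, seed=None):
    """Shuffle responses and separate the variant labels from the review items.

    Returns (review_items, answer_key): the reviewer sees only review_items,
    and answer_key maps each review_id back to its variant for the analysis.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    shuffled = list(rows)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    review_items = [
        {"review_id": i, "response": row["response"]}
        for i, row in enumerate(shuffled)
    ]
    answer_key = {i: row["variant"] for i, row in enumerate(shuffled)}
    return review_items, answer_key


# Made-up rows, for illustration only
rows = [
    {"variant": "A", "response": "FlexiFit, AdaptStep"},
    {"variant": "A", "response": "OmniShoe, FitAll"},
    {"variant": "B", "response": "SizeFlex, AnyFit"},
    {"variant": "B", "response": "ShapeShifter, UniversalStride"},
]

review_items, answer_key = blind_shuffle(rows, seed=42)
for item in review_items:
    print(item["review_id"], item["response"])
```

Keeping the answer key separate means the variant labels only re-enter the picture when the ratings are aggregated.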

**Evaluation Criteria (consider these when rating):**
- **Creativity**: Are the names unique and memorable?
- **Relevance**: Do they relate to the "adaptive fit" theme?
- **Marketability**: Would these work as real product names?
- **Clarity**: Is the meaning clear without explanation?

**Instructions:**
1. Run cell 8 to initialize the review state
2. Run cell 9 to display the review interface
3. Click 👍 for good responses, 👎 for poor ones
4. Each rating is written to the DataFrame in memory as soon as you click; saving to CSV is a separate, later step

In [42]:
# ============================================================================
# REVIEW STATE INITIALIZATION
# Run this cell to initialize or RESET the review process
# ============================================================================

# Ensure df exists
try:
    df  # noqa: F821
except NameError:
    import pandas as pd
    df = pd.DataFrame(columns=["variant", "prompt", "response", "feedback"])
    print("⚠️  df not found. Created empty DataFrame.")

# Shuffle if needed
if len(df) > 0 and "feedback" not in df.columns:
    df = df.sample(frac=1).reset_index(drop=True)
    print(f"✓ Shuffled {len(df)} responses")

# Clear existing feedback (for re-runs)
if "feedback" in df.columns:
    df["feedback"] = pd.Series(dtype="str")
    print("✓ Cleared previous feedback (ready for new review)")
else:
    df["feedback"] = pd.Series(dtype="str")
    print("✓ Added feedback column")

# RESET global state
response_index = 0
buttons_to_reset = []

print(f"\n✓ State initialized: response_index = 0, {len(df)} responses ready for review")
print("=" * 80)

# Initialize UI components
response = widgets.HTML()
count_label = widgets.Label()

def update_response():
    """Update the displayed response and counter."""
    global response_index
    if response_index < len(df) and len(df) > 0:
        new_response = df.iloc[response_index]["response"]
        new_response = f"<p>{new_response}</p>" if pd.notna(new_response) else "<p>No response</p>"
        response.value = new_response
        count_label.value = f"Response: {response_index + 1} / {len(df)}"
    else:
        response.value = "<p>No responses available.</p>"
        count_label.value = f"Response: {response_index} / {len(df)}"

def on_button_clicked(b):
    """Record 1 (👍) or 0 (👎) for the current response, then advance."""
    global response_index
    from IPython.display import Javascript, display as js_display

    if response_index < len(df):
        user_feedback = 1 if b.description == "👍" else 0
        df.at[response_index, "feedback"] = user_feedback
        response_index += 1
        print(f"Feedback recorded: {response_index - 1} → {response_index}")

        # Blur the clicked button to remove focus state
        js_display(Javascript("document.activeElement.blur();"))

        if response_index < len(df):
            update_response()
        else:
            response.value = "<p>✅ Review complete! All responses evaluated.</p>"
            count_label.value = f"Response: {response_index} / {len(df)}"
    else:
        response.value = "<p>All responses reviewed.</p>"
        count_label.value = f"Response: {response_index} / {len(df)}"


✓ Cleared previous feedback (ready for new review)

✓ State initialized: response_index = 0, 12 responses ready for review


### Step 1: Initialize Review State

**⚠️ Important:** Run this cell to start a fresh review or reset your ratings.

This cell:
- Shuffles responses on the first run to prevent bias (re-runs keep the existing order)
- Clears any previous feedback
- Resets the review counter to 0
- Prepares the UI components

In [44]:
from IPython.display import clear_output

# ============================================================================
# REVIEW UI SETUP
# Run this cell AFTER cell 8 to display the review interface
# ============================================================================

print(f"📊 Starting review UI for {len(df)} responses")
print("=" * 80)

update_response()

# Create buttons
thumbs_down_button = widgets.Button(description="👎", tooltip="Not good")
thumbs_down_button.on_click(on_button_clicked)

thumbs_up_button = widgets.Button(description="👍", tooltip="Good")
thumbs_up_button.on_click(on_button_clicked)

# Store button references for state management
buttons_to_reset = [thumbs_down_button, thumbs_up_button]

# Arrange buttons horizontally
button_box = widgets.HBox([thumbs_up_button, thumbs_down_button])

# Display UI elements
display(response, button_box, count_label)

print("\n💡 Tips:")
print("  • Click 👍 or 👎 to record feedback and move to next response")
print("  • To restart review: Re-run cell 8 (state will reset), then re-run this cell")
print("  • To save results: Run cell 11 (Save DataFrame) and cell 12 (Summary)")


📊 Starting review UI for 12 responses


HTML(value='<p>FlexiFit, OmniStep, AdaptFit, SizeWise, Universal Sole, FitAll, ShapeShift, EverFit, TotalComfo…

HBox(children=(Button(description='👍', style=ButtonStyle(), tooltip='Good'), Button(description='👎', style=But…

Label(value='Response: 1 / 12')


💡 Tips:
  • Click 👍 or 👎 to record feedback and move to next response
  • To restart review: Re-run cell 8 (state will reset), then re-run this cell
  • To save results: Run cell 11 (Save DataFrame) and cell 12 (Summary)



### Step 2: Display Review Interface

Run this cell to show the interactive review UI. Click the buttons to rate each response.

In [40]:
df

| | variant | prompt | response | feedback |
|---|---|---|---|---|
| 0 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniStep, AdaptFit, SizeWise, Univer... | 0 |
| 1 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniShoe, AdaptStep, FitAll, ShapeSh... | 0 |
| 2 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniStep, Adaptable Sole, FitAll Sho... | 1 |
| 3 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniShoe, Adaptable Sole, FitAll, Un... | 1 |
| 4 | A | Product description: A pair of shoes that can ... | FitFlex, AdaptFit, OmniShoe, SizeWise, FlexiFi... | 1 |
| 5 | A | Product description: A pair of shoes that can ... | OmniFit Shoes, Adaptable Step, FlexiFit Footwe... | 1 |
| 6 | B | Product description: A home milkshake maker.\n... | FitAll, OmniShoe, AdaptFit, FlexiStep, AnySize... | 0 |
| 7 | B | Product description: A home milkshake maker.\n... | FlexiFit, OmniShoe, Adaptable Steps, FitAll, S... | 1 |
| 8 | B | Product description: A home milkshake maker.\n... | FitFlex, OmniShoe, AdaptFit, UniversalStride, ... | 1 |
| 9 | A | Product description: A pair of shoes that can ... | FlexiFit, OmniShoes, AdaptStep, FitAll, ShapeS... | 0 |


### View Results with Feedback

Check your ratings: the `feedback` column now shows 1 (👍) or 0 (👎) for each response.

In [80]:
# Save the DataFrame as a CSV file
csv_file = "../data/responses.csv"
df.to_csv(csv_file, index=False)

---

## Section 5: Analyze Results

### 💾 Save Your Data

Persist the feedback data to CSV for future reference or sharing with team members.

In [41]:
print("A/B testing completed. Here are the results:")

if "variant" in df.columns:
    summary_df = df.groupby("variant").agg(
        count=("feedback", "count"),
        score=("feedback", "mean")
    ).reset_index()

    display(summary_df)
    print("Summary displayed!")  # DEBUG
else:
    print("No 'variant' column found. Summary cannot be generated.")

A/B testing completed. Here are the results:


| | variant | count | score |
|---|---|---|---|
| 0 | A | 6 | 0.5 |
| 1 | B | 6 | 0.833333 |


Summary displayed!


### 📈 Statistical Summary

Compare the two variants using aggregated metrics:
- **count**: Number of responses evaluated per variant
- **score**: Average rating (0.0 to 1.0, higher is better)

**Interpreting Results:**
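- With 6 ratings per variant, a single rating moves the score by about 0.17, so a gap of 0.2 or less amounts to roughly one rating and is likely noise
- Larger gaps (such as 0.5 vs 0.83 here, a two-rating difference) suggest a direction, but should be confirmed with more samples
- Consider both average score AND consistency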
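
To gauge whether a gap between two variants could plausibly be chance, one option is Fisher's exact test on the thumbs-up counts. The sketch below is an illustration rather than part of the original notebook: the function name `fisher_exact_two_sided` is hypothetical, and it uses only the standard library. The counts come from the summary above (score 0.5 of 6 is 3 thumbs-up for A, score 0.83 of 6 is 5 for B).

```python
from math import comb


def fisher_exact_two_sided(success_a, n_a, success_b, n_b):
    """Two-sided Fisher's exact test p-value for two groups of 0/1 ratings.

    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one under "no difference".
    """
    total = n_a + n_b
    total_success = success_a + success_b
    total_failure = total - total_success

    def weight(k):
        # Unnormalized hypergeometric probability of k successes in group A
        return comb(total_success, k) * comb(total_failure, n_a - k)

    k_min = max(0, n_a - total_failure)
    k_max = min(n_a, total_success)
    observed = weight(success_a)
    # Integer weights avoid floating-point trouble in the <= comparison
    extreme = sum(weight(k) for k in range(k_min, k_max + 1)
                  if weight(k) <= observed)
    return extreme / comb(total, n_a)


# 3 of 6 thumbs-up for A vs 5 of 6 for B
p_value = fisher_exact_two_sided(3, 6, 5, 6)
print(f"p = {p_value:.3f}")  # p = 0.545
```

A p-value around 0.55 means a gap this large would be common even if the two prompts were equally good, which is why the result at this sample size is best read as a hint rather than a verdict.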

---

## 🎓 Reflection and Next Steps

### Questions to Consider:
1. **Did few-shot prompting (Variant B) outperform zero-shot (Variant A)?**
2. **What quality patterns did you notice in the better-performing variant?**
3. **Were there any unexpected results or outliers?**
4. **How would you refine the winning prompt further?**

### 🚀 Extensions and Experiments:

**Try These Variations:**
- Test 3+ variants simultaneously
- Vary the number of examples in few-shot prompts
- Add temperature parameter testing (creativity vs consistency)
- Use different evaluation criteria (speed, cost, etc.)

**Advanced Techniques:**
- **Inter-rater reliability**: Have multiple people rate the same responses
- **Automated metrics**: Use semantic similarity scores or LLM-based judges
- **Larger sample sizes**: 20-50 or more responses per variant, so that moderate quality differences can reach statistical significance
- **Sequential testing**: Continuously improve prompts based on feedback
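
For inter-rater reliability, a common summary is Cohen's kappa, which measures how much two raters agree beyond what chance alone would produce. The sketch below is a minimal standard-library version for illustration: the function name `cohens_kappa` is hypothetical, and the two rating lists are made-up examples, not data from this notebook.

```python
def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters who labeled the same items.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("Both raters must rate the same non-empty set of items")

    n = len(rater_1)
    # Fraction of items on which the raters gave the same label
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # Agreement expected if each rater assigned labels independently
    labels = set(rater_1) | set(rater_2)
    expected = sum(
        (rater_1.count(label) / n) * (rater_2.count(label) / n)
        for label in labels
    )

    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined here, and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up thumbs-up (1) / thumbs-down (0) ratings for six responses
ratings_reviewer_1 = [1, 1, 0, 1, 0, 0]
ratings_reviewer_2 = [1, 0, 0, 1, 0, 1]

kappa = cohens_kappa(ratings_reviewer_1, ratings_reviewer_2)
print(f"kappa = {kappa:.2f}")  # kappa = 0.33
```

Low agreement between raters suggests the evaluation criteria need tightening before the variant scores can be trusted.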

### 💡 Key Takeaways:
- A/B testing provides empirical evidence for prompt improvements
- Human evaluation captures nuances automated metrics miss
- Small prompt changes can have significant quality impact
- Always test with representative samples before deploying to production

### 📚 Additional Resources:
- [A/B Testing Best Practices](https://en.wikipedia.org/wiki/A/B_testing)
- [Prompt Engineering Guide - Evaluation](https://www.promptingguide.ai/)