# Notebook 2 – Automated Evaluation & A/B Testing
Goals:
1. Run controlled A/B prompt experiments
2. Log results with **PromptLayer** (or fallback logging)
3. Integrate with an **OpenAI Evals**-style harness


In [None]:
!pip -q install promptlayer "openai<1" rouge-score

## 1. Environment Setup
Make sure you have set the environment variable `OPENAI_API_KEY` before running.

In [None]:
import os, openai, json, uuid, time
try:
    from promptlayer import promptlayer  # optional; the notebook still runs without it
except ImportError:
    promptlayer = None
openai.api_key = os.getenv('OPENAI_API_KEY', 'sk-...')
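
Because of the `'sk-...'` default, a missing key only shows up when the first API call fails. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of `openai` or `promptlayer`):

```python
import os

def require_env(name):
    """Return the value of environment variable `name`; raise if unset or blank.

    Hypothetical notebook helper, not a library function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could replace the default above, e.g. `openai.api_key = require_env('OPENAI_API_KEY')`.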


### Helper: `run_prompt(prompt)`

In [None]:
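def run_prompt(prompt, model='gpt-3.5-turbo', temperature=0):
    """Send `prompt` as a single user message and return the reply text."""
    # Uses the legacy (pre-1.0) `openai` client interface.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        temperature=temperature)
    return response['choices'][0]['message']['content']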

## 2. A/B Experiment Function

In [None]:
def ab_test(prompt_a, prompt_b, n=5):
    """Run both prompts `n` times; return one dict per trial with outputs 'A' and 'B'."""
    results = []
    for i in range(n):
        out_a = run_prompt(prompt_a)
        out_b = run_prompt(prompt_b)
        results.append({'trial': i, 'A': out_a, 'B': out_b})
    return results
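
With `temperature=0`, repeating the same prompt `n` times tends to return near-identical outputs, so it can be more informative to vary the input instead. A minimal sketch under that assumption: `ab_test_inputs` is a hypothetical variant that fills an `${input}` placeholder via `string.Template` and takes the model call as a `run` argument, so it can be tried offline with a stub.

```python
from string import Template

def ab_test_inputs(template_a, template_b, inputs, run):
    """Run two prompt templates over several inputs.

    `run` is any callable mapping a prompt string to an output string,
    for example the `run_prompt` helper. Templates use `${input}`.
    """
    tmpl_a, tmpl_b = Template(template_a), Template(template_b)
    results = []
    for i, text in enumerate(inputs):
        results.append({
            'trial': i,
            'input': text,
            'A': run(tmpl_a.substitute(input=text)),
            'B': run(tmpl_b.substitute(input=text)),
        })
    return results
```

For a real run, pass `run_prompt` as `run`; for a dry run, pass a stub such as `lambda p: p.upper()`.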


Run your own prompts below:

In [None]:
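from string import Template
input_text = 'Your input text here.'  # text to summarise; fills the ${input} placeholder
prompt1 = Template('Summarise the following text in one sentence: ${input}').substitute(input=input_text)
prompt2 = Template('Provide a concise one-sentence summary: ${input}').substitute(input=input_text)
test_results = ab_test(prompt1, prompt2)
test_results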
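
For the fallback-logging goal, one option is to append each trial to a local JSON Lines file. This is a sketch only: `log_results`, the default file name `ab_results.jsonl`, and the `run_id` / `logged_at` fields are assumptions for illustration, not PromptLayer's format.

```python
import json
import time
import uuid

def log_results(results, path='ab_results.jsonl'):
    """Append each trial as one JSON line, tagged with a shared run id and timestamp."""
    run_id = str(uuid.uuid4())   # groups all trials from one experiment
    logged_at = time.time()      # seconds since the epoch
    with open(path, 'a', encoding='utf-8') as f:
        for row in results:
            record = {'run_id': run_id, 'logged_at': logged_at, **row}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
    return run_id
```

Calling `log_results(test_results)` after an experiment keeps a plain-text record that can be reloaded line by line with `json.loads`.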

## 3. Quick Metric – ROUGE on A/B Outputs
Replace the `reference` string below with the ground-truth summary.

In [None]:
from rouge_score import rouge_scorer
reference = 'Your reference summary here.'
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
for row in test_results:
    score_a = scorer.score(reference, row['A'])['rougeL'].fmeasure
    score_b = scorer.score(reference, row['B'])['rougeL'].fmeasure
    row['rougeL_A'] = score_a
    row['rougeL_B'] = score_b
test_results
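
Per-trial scores are easier to compare once aggregated. A minimal sketch, assuming each row carries the `rougeL_A` and `rougeL_B` keys added above; `summarise_scores` is a hypothetical helper and expects at least one row.

```python
from statistics import mean

def summarise_scores(rows, key_a='rougeL_A', key_b='rougeL_B'):
    """Aggregate per-trial scores into means and win counts for each variant."""
    a = [row[key_a] for row in rows]
    b = [row[key_b] for row in rows]
    return {
        'mean_A': mean(a),
        'mean_B': mean(b),
        'wins_A': sum(x > y for x, y in zip(a, b)),   # trials where A scored higher
        'wins_B': sum(y > x for x, y in zip(a, b)),   # trials where B scored higher
        'ties': sum(x == y for x, y in zip(a, b)),
    }
```

`summarise_scores(test_results)` gives a quick side-by-side view; with only a handful of trials and a single reference, small differences between the means should be read with caution.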