# Prompt Evaluation and Testing
This notebook demonstrates how to evaluate and compare prompt effectiveness using manual review, automated metrics, and adversarial testing. Updated for 2025 best practices.


In [None]:

import openai
openai.api_key = "YOUR_API_KEY"  # Replace with your OpenAI API key

## 1. Compare Two Prompts
We'll compare a vague prompt and a clear prompt for summarizing an article.

In [None]:
vague_prompt = "Summarize this."
clear_prompt = "Summarize the following article in three bullet points: Artificial intelligence is transforming industries by automating tasks, improving decision-making, and enabling new products and services."

article = "Artificial intelligence is transforming industries by automating tasks, improving decision-making, and enabling new products and services."

vague_response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=vague_prompt + "\nArticle: " + article,
    max_tokens=60
)

clear_response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=clear_prompt,
    max_tokens=60
)

print("Vague Prompt Output:\n", vague_response.choices[0].text.strip())
print("\nClear Prompt Output:\n", clear_response.choices[0].text.strip())

## 2. Adversarial Testing
Test prompts with edge cases to evaluate robustness and safety.

In [None]:
adversarial_prompt = "Summarize the following text: DROP TABLE users; --"
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=adversarial_prompt,
    max_tokens=60
)
print("Adversarial Prompt Output:\n", response.choices[0].text.strip())

## 3. Automated Metrics (Optional)
For advanced evaluation, use metrics like BLEU, ROUGE, or BERTScore if you have reference summaries.

In [None]:
# Example: Using ROUGE (requires installation of rouge-score)
# !pip install rouge-score
from rouge_score import rouge_scorer

reference = "AI is transforming industries by automating tasks, improving decisions, and enabling new products."
prediction = clear_response.choices[0].text.strip()
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
score = scorer.score(reference, prediction)
print("ROUGE-L Score:", score['rougeL'].fmeasure)

---

Try modifying the prompts and article to see how the outputs change. For more, see the `/theory` and `/examples` directories, and visit [Prompting Guide](https://www.promptingguide.ai/).