<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-12/Prompt_Production/Prompt_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lab Exercises: Integrating Prompts & Measuring Prompt Performance




***

### Setup:

- Use the OpenAI API or any other language model API accessible from your Colab notebook.
- If an API key is required, load it from Colab Secrets or an environment variable, or prompt the user for it; do not hard-code it in the notebook.

In [41]:
from google.colab import userdata
import os

# Add your OpenAI API key once in the Colab Secrets panel (key icon in the left sidebar),
# name it OPENAI_API_KEY, and grant this notebook access to it.

# Retrieve key in your notebook
openai_api_key = userdata.get("OPENAI_API_KEY")
if openai_api_key:
    os.environ["OPENAI_API_KEY"] = openai_api_key
    print("✅ OpenAI API key loaded safely")
else:
    print("❌ OpenAI API key not found. Please set it using Colab Secrets.")

✅ OpenAI API key loaded safely


In [42]:

!pip install --quiet openai
# Create client
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [43]:
# Helper Function to Send Prompts
def generate_response(prompt, model="gpt-4o-mini", temperature=0.7):
    """
    Send a prompt to the OpenAI model and return the response text.
    """
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"


***

# Exercise 1: Integrating Prompts in Applications and Pipelines

**Objective:** Learn how to integrate prompts into application workflows with versioning, automated testing, and CI/CD principles.

**Tasks:**

1. **Simulate prompt versioning:**
    - Create two versions of a prompt for a customer support chatbot (e.g., one generic, one with explicit instructions).
    - Store them as Python strings and switch between them in your code.

In [None]:
prompt_v1 = "Answer customer queries about returns."
prompt_v2 = "You are a helpful assistant answering customer return questions clearly and politely."

# Use either version
prompt = prompt_v2

response = generate_response(prompt + "\nCustomer: How do I return a product?")
print(response)
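As the number of versions grows, switching by reassigning a variable becomes error-prone. One possible alternative, sketched below, keeps the versions in a dictionary keyed by a version label. The names `PROMPT_VERSIONS` and `build_prompt` are hypothetical and not part of the lab's helper code.

```python
# Hypothetical registry: the labels and helper name are illustrative only.
PROMPT_VERSIONS = {
    "v1": "Answer customer queries about returns.",
    "v2": "You are a helpful assistant answering customer return questions clearly and politely.",
}

def build_prompt(version, customer_message, registry=PROMPT_VERSIONS):
    """Return the full prompt text for the given version label."""
    if version not in registry:
        raise KeyError(f"Unknown prompt version {version!r}; available: {sorted(registry)}")
    return f"{registry[version]}\nCustomer: {customer_message}"

full_prompt = build_prompt("v2", "How do I return a product?")
print(full_prompt)
# In the notebook, pass the result to the helper defined earlier:
# response = generate_response(full_prompt)
```

Selecting a version by label makes it easy to log which version produced a given response and to fail fast on a mistyped label.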

2. **Build a simple automated prompt test:**
    - Write a small function that runs a set of input prompts through your prompt versions.
    - Check whether each response includes the key expected phrases (e.g., "return policy", "contact support").
    - Output pass/fail for each test.

In [None]:
test_inputs = [
    "How do I return my purchase?",
    "What is the refund policy?"
]
expected_keywords = ["return policy", "contact support"]

def test_prompt(prompt):
    print(f"Testing prompt:\n{prompt}\n")
    for input_text in test_inputs:
        response = generate_response(prompt + "\nCustomer: " + input_text)
        print(f"Input: {input_text}\nResponse: {response}\n")
        passed = all(keyword in response.lower() for keyword in expected_keywords)
        print("Pass" if passed else "Fail", "\n")

# Test both versions
test_prompt(prompt_v1)
test_prompt(prompt_v2)

3. **Discuss integration with CI/CD:**
    - Outline how this kind of automated prompt testing can be triggered on every prompt update in a CI/CD pipeline to ensure prompt quality before deployment.
    - Optionally, simulate prompt update by changing the prompt string and re-running tests.
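A minimal sketch of what such a CI check could look like is shown below. It assumes the pipeline runs a Python script and treats a non-zero exit code as a failed build. The function names, the test cases, and the canned `fake_generate` reply are all hypothetical; in a real pipeline `generate` would be a model-calling function such as the notebook's `generate_response`.

```python
PROMPT_UNDER_TEST = (
    "You are a helpful assistant answering customer return questions clearly and politely."
)

def missing_keywords(response, required_keywords):
    """Return the required keywords that do not appear in the response."""
    text = response.lower()
    return [kw for kw in required_keywords if kw.lower() not in text]

def run_prompt_checks(prompt, cases, generate):
    """Run every (customer_input, required_keywords) case and collect failures."""
    failures = []
    for customer_input, required_keywords in cases:
        response = generate(f"{prompt}\nCustomer: {customer_input}")
        missing = missing_keywords(response, required_keywords)
        if missing:
            failures.append((customer_input, missing))
    return failures

def fake_generate(prompt):
    """Offline stand-in for the model call so the sketch runs without an API key."""
    return "Please see our return policy or contact support for help."

# Illustrative cases; a real suite would be maintained alongside the prompt.
cases = [
    ("How do I return my purchase?", ["return policy"]),
    ("What is the refund policy?", ["contact support"]),
]

failures = run_prompt_checks(PROMPT_UNDER_TEST, cases, fake_generate)
exit_code = 1 if failures else 0  # a CI script would pass this to sys.exit()
print(f"{len(cases) - len(failures)}/{len(cases)} checks passed, exit code {exit_code}")
```

Returning the failures as data, rather than only printing them, lets the pipeline decide how to report them and whether to block the deployment.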

***

# Exercise 2: Prompt Performance Measurement and Evaluation

**Objective:** Measure prompt quality using key metrics and understand how to evaluate and improve prompts systematically.

**Tasks:**

1. **Define and implement simple metrics:**
    - For each prompt response, measure:
        - **Relevance**: Does the response include certain keywords or meet a relevance criterion?
        - **Length**: Count the number of words or tokens.
        - **Consistency**: Run the same prompt multiple times and compare responses for similarity (string match or cosine similarity using embeddings if available).

Example:

In [None]:
import difflib

prompt = "Explain the benefits of renewable energy."

# Run same prompt 3 times
responses = [generate_response(prompt) for _ in range(3)]

print("Responses:")
for i, resp in enumerate(responses):
    print(f"Run {i+1}: {resp}\n")

# Simple consistency check: ratio of similarity between runs 1 and 2
similarity = difflib.SequenceMatcher(None, responses[0], responses[1]).ratio()
print(f"Similarity between first two runs: {similarity:.2f}")
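Comparing only the first two runs ignores the third. A possible extension, sketched here with a hypothetical helper `mean_pairwise_similarity`, averages the `difflib` ratio over every pair of runs. The sample strings are made up so the cell runs without an API call.

```python
import difflib
from itertools import combinations

def mean_pairwise_similarity(texts):
    """Average SequenceMatcher ratio over every pair of texts (1.0 means identical)."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("Need at least two texts to compare.")
    ratios = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)

# Made-up sample runs; in the notebook, pass the `responses` list instead.
sample_runs = [
    "Solar power cuts emissions.",
    "Solar power cuts emissions.",
    "Wind power lowers costs.",
]
print(f"Mean pairwise similarity: {mean_pairwise_similarity(sample_runs):.2f}")
```

Character-level similarity is a rough proxy: two responses can say the same thing in different words and still score low, which is why the task also mentions embedding-based cosine similarity.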

2. **Automated evaluation function:**
    - Build a function that takes a prompt and expected keywords (or a reference text) and computes pass/fail or similarity score.

In [None]:
def evaluate_prompt(prompt, expected_keywords):
    response = generate_response(prompt)
    print(f"Response:\n{response}\n")
    relevance = all(keyword.lower() in response.lower() for keyword in expected_keywords)
    length = len(response.split())
    print(f"Relevance: {'Passed' if relevance else 'Failed'}")
    print(f"Response length (words): {length}\n")

evaluate_prompt("Explain the importance of water conservation.", ["water", "conservation", "importance"])
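The function above prints its results and checks keywords only. A variant that returns the metrics as a dictionary, and optionally compares against a reference text as the task suggests, might look like the following sketch. `score_response` and its result keys are hypothetical names, and it scores an already generated response so it can run offline.

```python
import difflib

def score_response(response, expected_keywords=(), reference_text=None):
    """Compute simple metrics for one response and return them as a dict."""
    text = response.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    result = {
        "relevance_passed": not missing,
        "missing_keywords": missing,
        "length_words": len(response.split()),
    }
    if reference_text is not None:
        # Character-level similarity to the reference, between 0.0 and 1.0.
        result["reference_similarity"] = difflib.SequenceMatcher(
            None, text, reference_text.lower()
        ).ratio()
    return result

# Made-up response and reference for illustration.
sample_response = "Water conservation is important because fresh water is limited."
metrics = score_response(
    sample_response,
    expected_keywords=["water", "conservation", "importance"],
    reference_text="Conserving water matters because fresh water is a limited resource.",
)
print(metrics)
```

Note that the sample fails the relevance check because it says "important" rather than "importance": exact substring matching is strict, so keyword lists need to be chosen with that in mind.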

3. **Incorporate user feedback (simulated):**
    - Simulate or collect simple user feedback scores (e.g., a rating from 1 to 5).
    - Show how this qualitative data complements automated metrics.
4. **Discuss trade-offs:**
    - Latency vs. response quality
    - Length vs. clarity
    - Consistency vs. creativity
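To make the feedback and latency points concrete, the sketch below times a single call and summarizes a list of simulated ratings. Everything here is an assumption for illustration: the helper names, the stub model, and the ratings themselves are made up rather than collected data.

```python
import time
from statistics import mean

def timed_call(generate, prompt):
    """Call generate(prompt) and return (response, elapsed_seconds)."""
    start = time.perf_counter()
    response = generate(prompt)
    return response, time.perf_counter() - start

def summarize_ratings(ratings):
    """Summarize 1-to-5 ratings: count, mean, and share of low (<= 2) ratings."""
    if not ratings:
        raise ValueError("No ratings provided.")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("Ratings must be between 1 and 5.")
    return {
        "count": len(ratings),
        "mean": mean(ratings),
        "low_share": sum(r <= 2 for r in ratings) / len(ratings),
    }

def stub_generate(prompt):
    """Offline stand-in; in the notebook, use generate_response instead."""
    return "Renewable energy reduces emissions."

response, seconds = timed_call(stub_generate, "Explain the benefits of renewable energy.")
simulated_ratings = [4, 5, 3, 2, 4]  # made-up scores, not real user data
summary = summarize_ratings(simulated_ratings)
print(f"Latency: {seconds:.4f}s, ratings: {summary}")
```

Recording latency and ratings next to the automated metrics for each prompt version makes the trade-offs listed above visible in one place.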

***

## Summary for Students:

- Practice managing prompt versions and automating simple tests of prompt integrations.
- Use keyword checks and repeated runs as simple measures of prompt effectiveness.
- Understand that prompt evaluation involves both quantitative metrics and qualitative feedback.
- Consider embedding prompt evaluation into automated pipelines for continuous prompt improvement.

***