# Lesson 10.4: Prompt Optimization and A/B Testing

---

In previous lessons, we explored basic and advanced Prompt Engineering techniques. In practice, a prompt is rarely right on the first attempt: **prompt optimization** is an iterative, continuous process. This lesson covers the optimization process itself, tools and techniques for testing and comparing prompt variations, A/B testing for prompts in production, and ethical and safety considerations in Prompt Engineering.

## 1. Iterative Prompt Optimization Process

Prompt optimization is a continuous cycle of experimentation, learning, and improvement.

1.  **Define Goals and Evaluation Criteria:**
    * What do you want the LLM to do? (e.g., summarize, classify, answer questions).
    * How will you know the LLM is performing well? (e.g., accuracy, relevance, fluency, lack of hallucinations).
2.  **Design Initial Prompt:**
    * Start with a simple, clear, and specific prompt.
    * Apply basic Prompt Engineering principles (context, formatting, etc.).
3.  **Create Test Dataset:**
    * Assemble a set of input examples representative of real-world use cases.
    * Include both normal cases and edge cases or difficult scenarios.
    * If possible, include "ground truth" answers for automated evaluation.
4.  **Run Experiments and Collect Results:**
    * Run your prompt on the test dataset.
    * Collect the LLM's responses.
    * Record metrics like latency and token cost.
5.  **Evaluate Results:**
    * **Manual:** Read and score responses based on predefined criteria.
    * **Automated:** Use metrics suited to the task (exact match or F1 for classification and extraction; ROUGE, BLEU, or BERTScore for summarization and translation) or LLM-as-a-Judge.
    * Identify common errors and cases where the prompt performs poorly.
6.  **Analyze and Refine Prompt:**
    * Based on evaluation results, identify the root causes of issues.
    * Adjust the prompt:
        * Add/remove context.
        * Change the wording or phrasing.
        * Add few-shot examples.
        * Modify formatting instructions.
        * Add constraints or rules.
        * Adjust `temperature` or other LLM parameters.
7.  **Iterate:** Return to step 4 with the refined prompt and repeat until the evaluation criteria from step 1 are met.
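
Steps 3 to 5 can be sketched as a small evaluation harness. This is a minimal sketch, not a definitive implementation: `fake_llm` is a hypothetical stand-in for a real model call, and the sentiment dataset, the `evaluate_prompt` helper, and the exact-match scoring are assumptions chosen for illustration.

```python
import time
from typing import Callable

# Hypothetical test dataset: (input text, ground-truth label) pairs.
TEST_CASES = [
    ("The delivery was fast and the product works great.", "positive"),
    ("The package arrived broken and support never replied.", "negative"),
    ("Great, another charger that died after two days.", "negative"),  # edge case: sarcasm
]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption); replace with your own client."""
    review = prompt.rsplit("Review:", 1)[-1].lower()
    return "positive" if "great" in review else "negative"


def evaluate_prompt(template: str, llm: Callable[[str], str]) -> dict:
    """Run one prompt template over the dataset; collect accuracy, latency, failures."""
    failures = []
    start = time.perf_counter()
    for text, expected in TEST_CASES:
        answer = llm(template.format(text=text)).strip().lower()
        if answer != expected:
            failures.append({"input": text, "expected": expected, "got": answer})
    elapsed = time.perf_counter() - start
    return {
        "accuracy": 1 - len(failures) / len(TEST_CASES),
        "avg_latency_s": elapsed / len(TEST_CASES),
        "failures": failures,
    }


report = evaluate_prompt(
    "Classify the sentiment as positive or negative.\nReview: {text}", fake_llm
)
print(f"accuracy={report['accuracy']:.2f}")
for failure in report["failures"]:
    print("FAILED:", failure)
```

With this stub, the sarcastic review is the only failure. Cases like that are what step 6 is meant to address, for example by adding a few-shot example that covers sarcasm.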

![An iterative optimization loop diagram](https://placehold.co/600x400/ccddff/ffffff?text=Iterative+Optimization+Loop)


---

## 2. Tools and Techniques for Testing and Comparing Prompt Variations

To support the iterative optimization process, we need effective tools and techniques.

* **LangChain Runnables and LCEL:**
    * LangChain Expression Language (LCEL) allows you to build runnable chains in a modular and flexible way.
    * You can easily swap out prompts, LLMs, or other components in your chain and test them.
* **LangSmith:**
    * LangSmith, introduced in Lesson 9.4, covers three needs of prompt optimization: tracing, experimentation, and evaluation.
    * **Tracing:** Logs every LLM call, prompt, and response, helping you visualize and debug each step.
    * **Experimentation:** Allows you to create "experiments" to compare the performance of different prompts or models on the same dataset. You can run multiple prompt variations in parallel and compare metrics (accuracy, latency, cost).
    * **Evaluation:** Integrates automated evaluators (LLM-as-a-Judge) and manual annotation capabilities to quantify output quality.
* **Evaluation Datasets:**
    * Creating and managing high-quality test datasets is crucial.
    * These datasets should include diverse input cases and, if possible, "ground truth" answers for automated evaluation.
* **Unit Testing and Regression Testing:**
    * Write automated tests for critical prompts to ensure that changes do not degrade performance on known cases.
    * Run these tests after every prompt or model change.
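
A regression check for prompts can be as simple as a set of known-good cases that every new prompt version must still pass. The sketch below is illustrative only: `run_prompt_v1` and `run_prompt_v2` are hypothetical stubs standing in for two prompt versions (so the example runs offline), and the case names and the `find_regressions` helper are assumptions.

```python
from typing import Callable

# Hypothetical golden cases: inputs whose correct output is already known.
GOLDEN_CASES = {
    "simple_order": ("2 t-shirts", "2"),
    "written_number": ("three pairs of socks", "3"),
}

WORDS_TO_DIGITS = {"one": "1", "two": "2", "three": "3"}


def run_prompt_v1(text: str) -> str:
    """Stub for prompt version 1: handles digits and written numbers."""
    for token in text.split():
        if token.isdigit():
            return token
        if token in WORDS_TO_DIGITS:
            return WORDS_TO_DIGITS[token]
    return ""


def run_prompt_v2(text: str) -> str:
    """Stub for prompt version 2: a 'refinement' that lost written-number support."""
    for token in text.split():
        if token.isdigit():
            return token
    return ""


def find_regressions(
    baseline: Callable[[str], str], candidate: Callable[[str], str]
) -> list[str]:
    """Return the golden cases the baseline passes but the candidate fails."""
    regressions = []
    for name, (text, expected) in GOLDEN_CASES.items():
        if baseline(text) == expected and candidate(text) != expected:
            regressions.append(name)
    return regressions


regressions = find_regressions(run_prompt_v1, run_prompt_v2)
print("Regressions:", regressions)
```

In a real project the same comparison would typically live in a test suite and run automatically whenever a prompt or model changes.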

![Person using a dashboard to compare different prompt versions](https://placehold.co/600x400/ddeeffcc/ffffff?text=Prompt+Comparison+Dashboard)


---

## 3. Introduction to A/B Testing for Prompts in Production

Once you have optimized your prompts in a development/testing environment, the next step to validate their effectiveness is A/B testing in a production environment.

* **Concept:** Deploy two or more versions of the application (where the main difference is the prompt) to different groups of real users.
* **Purpose:** Measure the impact of prompt changes on business metrics and user experience under real-world conditions.
* **Steps:**
    1.  **Define Variations:** Version A (current prompt) and Version B (newly optimized prompt).
    2.  **Split User Groups:** Randomly assign users to groups A and B.
    3.  **Collect Data:** Monitor user interaction metrics (e.g., satisfaction rate, conversion rate, session duration, error rate) for both groups.
    4.  **Statistical Analysis:** Compare metrics between groups with a statistical test to determine whether the observed difference is statistically significant rather than due to chance.
    5.  **Decision Making:** Based on the results, decide whether to roll out the new prompt to all users.
* **Benefits:** Provides empirical evidence of prompt effectiveness and reduces risk when rolling out major changes.
* **Challenges:** Requires A/B testing infrastructure and enough traffic to reach statistically significant results, and it can be time-consuming.
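
The group split and the statistical analysis can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: hash-based assignment and a two-proportion z-test are one common choice among several, the `assign_variant` and `two_proportion_z_test` helpers are hypothetical, and the counts are made up for illustration.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "prompt-test") -> str:
    """Deterministically map a user to 'A' or 'B' using a hash of the user ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Made-up counts: satisfied sessions out of total sessions for each variant.
z, p_value = two_proportion_z_test(success_a=420, n_a=1000, success_b=465, n_b=1000)
print(f"variant for user-123: {assign_variant('user-123')}")
print(f"z={z:.2f}, p={p_value:.4f}")
```

Hashing the user ID keeps each user in the same group across sessions without storing the assignment. The significance threshold (0.05 is a common choice) and the sample size should be fixed before the test starts, not after looking at the data.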

![An A/B testing setup with two user groups](https://placehold.co/600x400/aaccaa/ffffff?text=A/B+Testing+Prompts)


---

## 4. Ethical and Safety Considerations in Prompt Engineering

Prompt Engineering is not just about performance but also about ensuring LLM applications are used responsibly.

* **Bias Mitigation:**
    * **Identification:** LLMs can reflect biases from training data. Prompts and responses need to be checked for signs of bias (e.g., gender, racial, religious discrimination).
    * **Techniques:**
        * **Neutral Prompting:** Use neutral language, avoid words that might suggest bias.
        * **Ethical Guidelines:** Add instructions to the prompt requiring the LLM to provide fair, objective responses.
        * **Robustness Testing:** Test responses with input variations (e.g., changing names, genders) to see if the LLM responds unfairly.
* **Harmful Content Prevention:**
    * **Identification:** LLMs can generate harmful content (hate speech, violence, sexual content, self-harm, misinformation).
    * **Techniques:**
        * **Moderation Systems:** Use content moderation models or APIs to filter inputs and outputs.
        * **Safety Instructions in Prompt:** Instruct the LLM not to generate harmful or unethical content.
        * **Scope Limitation:** Restrict the LLM to respond only within a safe topic range.
* **Privacy and Sensitive Data Protection:**
    * **Identification:** LLMs can inadvertently reveal personal or sensitive information if they were trained on such data or if the prompt contains such information.
    * **Techniques:**
        * **Anonymization/Redaction:** Remove or mask sensitive information from input data.
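        * **Privacy Instructions:** Instruct the LLM not to repeat or disclose personal information in its responses. A prompt instruction cannot control what the application or provider stores, so it complements redaction and secure architecture but does not replace them.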
        * **Secure Architecture:** Ensure data is processed and stored securely.
* **Transparency and Explainability:**
    * **Purpose:** Helps users understand why the LLM produced a particular response, especially in critical applications.
    * **Techniques:**
        * **Chain-of-Thought:** Encourage the LLM to articulate its reasoning steps.
        * **Source Citation:** Ask the LLM to cite the information sources it used (especially in RAG).
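
As one concrete instance of the anonymization/redaction technique above, the sketch below masks email addresses and phone-like numbers before text is sent to the LLM. The regular expressions and the `redact` helper are illustrative assumptions, not a complete PII detector; production systems usually need broader coverage (names, addresses, ID numbers).

```python
import re

# Illustrative patterns only (assumption); real PII detection needs broader coverage.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask email addresses and phone-like numbers before text reaches the LLM."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


message = "Contact me at jane.doe@example.com or +84 912 345 678 about order 42."
print(redact(message))
# prints: Contact me at [EMAIL] or [PHONE] about order 42.
```

Short numbers such as the order ID are left untouched because the phone pattern requires at least nine characters of digits and separators.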

![Ethical guidelines for AI](https://placehold.co/600x400/ccffcc/ffffff?text=AI+Ethics+Guidelines)


---

## 5. Practical: Refining an Existing Prompt to Improve Output

We will take a simple prompt and try to refine it to improve its accuracy and output format for an information extraction task.

**Problem:** Extract product name, quantity, and price from an order description sentence.

**Preparation:**
* Ensure you have the `langchain-openai` library installed.
* Set the `OPENAI_API_KEY` environment variable.

````python
# Install the libraries if they are not already installed:
# pip install langchain-openai openai

import os
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set the OpenAI API key if it is not already set in your environment
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # Low temperature for consistent results

# Test data (Vietnamese order descriptions; English translations in the comments)
order_descriptions = [
    "Tôi muốn mua 2 cái áo thun màu xanh với giá 150.000 VNĐ mỗi cái.",  # I want to buy 2 blue t-shirts for 150,000 VND each.
    "Đơn hàng gồm 1 chiếc quần jean giá 300.000 VNĐ.",  # The order includes 1 pair of jeans for 300,000 VND.
    "Vui lòng thêm 3 đôi tất đen, giá 50.000 VNĐ/đôi và 1 áo khoác da giá 1.200.000 VNĐ.",  # Please add 3 pairs of black socks, priced at 50,000 VND/pair, and 1 leather jacket for 1,200,000 VND.
    "Mua 5 quyển sách 'Lập trình Python' với giá 250.000 VNĐ.",  # Buy 5 'Python Programming' books for 250,000 VND. Edge case: quantity and price (per book or total?) may be confused.
]

# --- 1. Initial Prompt (Baseline) ---
print("--- 1. Initial Prompt (Baseline) ---")
initial_prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are an information extraction assistant. Extract product name, quantity, and price from the order description."),
    ("user", "Order description: {description}")
])
initial_chain = initial_prompt_template | llm | StrOutputParser()

for i, desc in enumerate(order_descriptions):
    print(f"\n--- Order {i+1} ---")
    print(f"Description: {desc}")
    response = initial_chain.invoke({"description": desc})
    print(f"Initial response:\n{response}")

# --- 2. Refined Prompt (Add JSON format and examples) ---
# JSON keys used in the examples: ten_san_pham = product name, so_luong = quantity,
# gia = price, don_vi_gia = price currency.
# Literal braces in the JSON examples are doubled ({{ and }}) so that the prompt
# template does not interpret them as input variables.
print("\n--- 2. Refined Prompt (Add JSON format and examples) ---")
refined_prompt_template = ChatPromptTemplate.from_messages([
    ("system", """You are a professional information extraction assistant.
    From the order description, extract PRODUCT NAME, QUANTITY, and PRICE.
    If there are multiple products, list all of them.
    Respond in JSON format. Here are two examples:

    Description: Tôi muốn mua 2 cái áo thun màu xanh với giá 150.000 VNĐ mỗi cái.
    JSON:
    ```json
    [
      {{
        "ten_san_pham": "áo thun màu xanh",
        "so_luong": 2,
        "gia": 150000,
        "don_vi_gia": "VNĐ"
      }}
    ]
    ```

    Description: Vui lòng thêm 3 đôi tất đen, giá 50.000 VNĐ/đôi và 1 áo khoác da giá 1.200.000 VNĐ.
    JSON:
    ```json
    [
      {{
        "ten_san_pham": "tất đen",
        "so_luong": 3,
        "gia": 50000,
        "don_vi_gia": "VNĐ"
      }},
      {{
        "ten_san_pham": "áo khoác da",
        "so_luong": 1,
        "gia": 1200000,
        "don_vi_gia": "VNĐ"
      }}
    ]
    ```
    """),
    ("user", "Order description: {description}\nJSON:")  # Cue the LLM to answer with JSON
])
refined_chain = refined_prompt_template | llm | StrOutputParser()

for i, desc in enumerate(order_descriptions):
    print(f"\n--- Order {i+1} ---")
    print(f"Description: {desc}")
    response_raw = refined_chain.invoke({"description": desc})
    print(f"Refined response (raw):\n{response_raw}")

    # Try to parse the response as JSON to check its format
    try:
        # Extract the JSON part if the response wraps it in a ```json fence
        if "```json" in response_raw:
            json_str = response_raw.split("```json")[1].split("```")[0].strip()
        else:
            json_str = response_raw.strip()
        parsed_json = json.loads(json_str)
        print(f"Refined response (parsed JSON):\n{json.dumps(parsed_json, indent=2, ensure_ascii=False)}")
    except (json.JSONDecodeError, IndexError) as e:
        print(f"JSON parsing error: {e}")
        print("Response is not in valid JSON format.")

print("\n--- End of Practical ---")
````
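
To move from reading the outputs to measuring them, the parsed JSON can be compared against hand-written ground truth. The sketch below shows one way to do that and makes no model call: the `extract_json` and `score_extraction` helpers, the sample response, and the expected items are all hypothetical, and only the field names come from the refined prompt.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed


def extract_json(raw: str):
    """Parse a model response that may or may not wrap its JSON in a fenced block."""
    marker = FENCE + "json"
    if marker in raw:
        raw = raw.split(marker, 1)[1].split(FENCE, 1)[0]
    return json.loads(raw.strip())


def score_extraction(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of expected items reproduced exactly (name, quantity, price)."""
    keys = ("ten_san_pham", "so_luong", "gia")
    predicted_set = {tuple(item.get(k) for k in keys) for item in predicted}
    hits = sum(tuple(item[k] for k in keys) in predicted_set for item in expected)
    return hits / len(expected)


# Hypothetical model response for the third test order, with one wrong price.
sample_response = FENCE + """json
[
  {"ten_san_pham": "tất đen", "so_luong": 3, "gia": 50000, "don_vi_gia": "VNĐ"},
  {"ten_san_pham": "áo khoác da", "so_luong": 1, "gia": 120000, "don_vi_gia": "VNĐ"}
]
""" + FENCE

# Hand-written ground truth for the same order.
expected_items = [
    {"ten_san_pham": "tất đen", "so_luong": 3, "gia": 50000},
    {"ten_san_pham": "áo khoác da", "so_luong": 1, "gia": 1200000},
]

score = score_extraction(extract_json(sample_response), expected_items)
print(f"Extraction score: {score:.2f}")
```

Averaging this score over the whole test set gives a single number for comparing the baseline and refined prompts from one iteration to the next.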