# Module 8: Prompt Engineering

Prompt engineering is the art and science of communicating effectively with large language models (LLMs). It is the **interface between humans and AI** -- the way we translate our intent into instructions that a model can act on.

Why does prompt engineering matter?

- **Same model, vastly different results**: A well-crafted prompt can turn a mediocre output into an excellent one, without changing the model or fine-tuning.
- **Cost efficiency**: Better prompts mean fewer retries, less post-processing, and lower API costs.
- **Reliability**: Systematic prompt engineering produces more consistent, predictable outputs -- critical for production systems.
- **Unlocking capabilities**: Models have latent abilities (reasoning, structured output, role-play) that often surface only with the right prompting techniques.

### What you'll learn

1. **Prompt anatomy** -- system, user, and assistant messages
2. **Zero-shot prompting** -- direct instructions without examples
3. **Few-shot prompting** -- teaching by example
4. **Chain-of-thought (CoT) prompting** -- eliciting step-by-step reasoning
5. **Role prompting** -- setting persona and expertise
6. **Structured output** -- getting reliable JSON, tables, and formatted data
7. **Common failures and mitigations** -- hallucinations, instruction-following issues
8. **Iterative refinement** -- the workflow for evolving prompts
9. **Prompt templates** -- reusable, parameterized prompts

Let's get started!

---
## 2. Setup

In [None]:
!pip install -q openai python-dotenv

In [None]:
from dotenv import load_dotenv
import os
import json

# Adjust this path to wherever your own .env file (containing OPENAI_API_KEY) lives
load_dotenv("/home/amir/source/.env")

In [None]:
from openai import OpenAI

client = OpenAI()


def chat(messages, model="gpt-4o-mini", temperature=0.7):
    """Helper function to call the OpenAI Chat Completions API.
    
    Args:
        messages: List of message dicts with 'role' and 'content' keys.
        model: Model identifier (default: gpt-4o-mini).
        temperature: Sampling temperature (0 = least random, higher = more varied and creative).
    
    Returns:
        The assistant's response as a string.
    """
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content


# Quick test
print(chat([{"role": "user", "content": "Say hello in one sentence."}]))

---
## 3. Prompt Anatomy

The OpenAI Chat Completions API (like similar APIs) uses a **messages** format with three role types:

| Role | Purpose | When to use |
|------|---------|-------------|
| **system** | Sets behavior, role, constraints, and tone | Once, at the start of the conversation |
| **user** | The actual request or question | Every turn |
| **assistant** | The model's response (or a synthetic example) | Few-shot examples, multi-turn context |

Think of it like directing an actor:
- **System message** = the character description and stage directions
- **User message** = the scene prompt
- **Assistant message** = the actor's previous lines (for continuity)
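
To make the structure concrete, here is a minimal sketch of a checker for a `messages` list. `validate_messages` is a hypothetical helper written for this notebook, not part of the OpenAI SDK, and the rules it applies (system message first, last turn from the user) are the conventions from the table above rather than checks the API itself performs.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of problems found in a chat `messages` list (empty if none).

    Hypothetical helper for illustration only; it encodes the conventions
    described above, not validation done by the API.
    """
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message should come first")
    # A request normally ends with the user turn the model should answer
    if messages and messages[-1].get("role") != "user":
        problems.append("last message should be a user turn")
    return problems


conversation = [
    {"role": "system", "content": "You are a concise technical writer."},
    {"role": "user", "content": "What is a neural network?"},
    {"role": "assistant", "content": "A neural network is ..."},
    {"role": "user", "content": "How is it trained?"},
]
print(validate_messages(conversation))  # []
```

Running a check like this before each API call is cheap and catches typos in role names early.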

In [None]:
# Example: All three message types in action

messages = [
    {
        "role": "system",
        "content": "You are a concise technical writer. Always respond in exactly 2 sentences."
    },
    {
        "role": "user",
        "content": "What is a neural network?"
    }
]

response = chat(messages)
print("Response:", response)
print("---")

# Now continue the conversation with the assistant's prior response
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "How is it trained?"})

response2 = chat(messages)
print("Follow-up:", response2)

In [None]:
# The system message dramatically changes behavior
# Same question, different system messages

question = "Explain what a database index is."

# Version 1: Expert audience
expert_response = chat([
    {"role": "system", "content": "You are a database expert speaking to senior engineers. Be technical and precise."},
    {"role": "user", "content": question}
])

# Version 2: Beginner audience
beginner_response = chat([
    {"role": "system", "content": "You are a patient teacher explaining to a 10-year-old. Use simple analogies."},
    {"role": "user", "content": question}
])

print("=== Expert version ===")
print(expert_response)
print()
print("=== Beginner version ===")
print(beginner_response)

---
## 4. Zero-Shot Prompting

**Zero-shot prompting** means giving the model a direct instruction **without any examples**. The model relies entirely on its pre-trained knowledge to understand and complete the task.

This is the simplest form of prompting and works surprisingly well for many tasks.

**Tips for effective zero-shot prompts:**
- Be specific about the desired output format
- Use clear, unambiguous language
- Specify constraints (length, style, format)
- Tell the model what NOT to do if needed
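
The tips above can be folded into a small prompt builder. This is only a sketch: `build_zero_shot_prompt` and its parameter names are made up for illustration, and the "Output format / Constraint / Do not" labels are one possible layout, not a required one.

```python
def build_zero_shot_prompt(instruction, text, output_format=None, constraints=(), avoid=()):
    """Assemble a zero-shot prompt from an instruction plus optional guidance.

    Hypothetical helper: the parameter names and line labels are illustrative.
    """
    lines = [instruction.strip()]
    if output_format:
        lines.append(f"Output format: {output_format}")
    for constraint in constraints:
        lines.append(f"Constraint: {constraint}")
    for item in avoid:
        lines.append(f"Do not: {item}")
    lines.append("")  # blank line between the instructions and the input text
    lines.append(f"Text: {text.strip()}")
    return "\n".join(lines)


prompt = build_zero_shot_prompt(
    instruction="Classify the sentiment of the review below.",
    text="Great screen, but the battery drains fast.",
    output_format="one word: POSITIVE, NEGATIVE, or MIXED",
    constraints=["No punctuation"],
    avoid=["explain your answer"],
)
print(prompt)
```

The resulting string can be passed as the `content` of a single user message to `chat()`.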

In [None]:
# Zero-shot: Sentiment Classification

review = "The battery life is incredible and the camera quality exceeded my expectations, but the phone heats up during gaming."

response = chat([
    {"role": "system", "content": "You are a sentiment analysis classifier."},
    {"role": "user", "content": f"Classify the sentiment of the following review as POSITIVE, NEGATIVE, or MIXED.\n\nReview: {review}\n\nSentiment:"}
], temperature=0)

print(f"Review: {review}")
print(f"Sentiment: {response}")

In [None]:
# Zero-shot: Text Summarization

article = """
Researchers at MIT have developed a new AI system that can predict protein structures 
with unprecedented accuracy. The system, called ProteinFlow, uses a novel graph neural 
network architecture that models amino acid interactions at multiple scales simultaneously. 
In benchmarks against existing methods including AlphaFold, ProteinFlow achieved a 15% 
improvement in prediction accuracy for proteins with more than 500 residues. The team 
believes this could accelerate drug discovery by reducing the time needed to understand 
protein-drug interactions from months to hours. The research was published in Nature 
Methods and the code has been released as open source.
"""

summary = chat([
    {"role": "user", "content": f"Summarize the following article in exactly 2 sentences:\n\n{article}"}
], temperature=0)

print("Summary:")
print(summary)

In [None]:
# Zero-shot: Translation

text = "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet."

for language in ["French", "Japanese", "Spanish"]:
    translation = chat([
        {"role": "user", "content": f"Translate the following English text to {language}. Output only the translation, nothing else.\n\n{text}"}
    ], temperature=0)
    print(f"{language}: {translation}")
    print()

---
## 5. Few-Shot Prompting

**Few-shot prompting** provides the model with **2-3 examples** of the desired input-output behavior before presenting the actual task. This is one of the most powerful techniques because:

- It **demonstrates** the expected format and style
- It reduces **ambiguity** in the instruction
- It acts like lightweight task adaptation at inference time (often called in-context learning), with no change to the model's weights

The examples can be provided either as user/assistant message pairs or within a single prompt.
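
The two layouts mentioned above can be sketched side by side. Both helper names are hypothetical, and the `Input:` / `Output:` labels in the single-prompt variant are just one reasonable convention.

```python
def few_shot_as_messages(system, examples, query):
    """Encode each (input, output) example as a user/assistant message pair."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


def few_shot_as_single_prompt(system, examples, query):
    """Pack the same examples into one user message, ending with an open slot."""
    blocks = [f"Input: {i}\nOutput: {o}" for i, o in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n\n".join(blocks)},
    ]


examples = [
    ("This laptop is amazing!", "POSITIVE"),
    ("Broke after one week.", "NEGATIVE"),
]
system = "Classify sentiment. Respond with one word."
paired = few_shot_as_messages(system, examples, "Fast delivery, dull colors.")
packed = few_shot_as_single_prompt(system, examples, "Fast delivery, dull colors.")
print(len(paired), len(packed))  # 6 2
```

Either list can be passed straight to `chat()`. The paired form mirrors a real conversation, while the packed form keeps all examples in one message.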

In [None]:
# Few-shot: Sentiment Classification (same task as zero-shot above)

messages = [
    {"role": "system", "content": "You are a sentiment classifier. Respond with exactly one word: POSITIVE, NEGATIVE, or MIXED."},
    # Example 1
    {"role": "user", "content": "Review: This laptop is amazing! Fast, lightweight, and the screen is gorgeous."},
    {"role": "assistant", "content": "POSITIVE"},
    # Example 2
    {"role": "user", "content": "Review: Terrible product. Broke after one week and customer service was unhelpful."},
    {"role": "assistant", "content": "NEGATIVE"},
    # Example 3
    {"role": "user", "content": "Review: The food was great but the service was slow and the restaurant was too noisy."},
    {"role": "assistant", "content": "MIXED"},
    # Actual task
    {"role": "user", "content": "Review: The battery life is incredible and the camera quality exceeded my expectations, but the phone heats up during gaming."}
]

response = chat(messages, temperature=0)
print(f"Few-shot sentiment: {response}")

In [None]:
# Few-shot: Custom entity extraction
# This task is hard for zero-shot because the output format is very specific

messages = [
    {"role": "system", "content": "Extract product names and their associated sentiment from reviews. Format: PRODUCT: sentiment"},
    # Example 1
    {"role": "user", "content": "I love my new MacBook Pro but the Magic Mouse is uncomfortable."},
    {"role": "assistant", "content": "MacBook Pro: positive\nMagic Mouse: negative"},
    # Example 2
    {"role": "user", "content": "The AirPods Max sound quality is decent for the price. My old Sony WH-1000XM4 were better though."},
    {"role": "assistant", "content": "AirPods Max: neutral\nSony WH-1000XM4: positive"},
    # Actual task
    {"role": "user", "content": "Switched from Slack to Microsoft Teams and I'm really struggling. At least the Outlook integration works well."}
]

response = chat(messages, temperature=0)
print(response)

In [None]:
# Comparing zero-shot vs few-shot on the same task:
# Classifying whether a headline is about technology, sports, or politics

headlines = [
    "New Quantum Chip Breaks Speed Record for Complex Calculations",
    "City Council Approves $2B Infrastructure Bill After Marathon Debate",
    "Underdog Team Clinches Championship in Overtime Thriller",
]

print("=== Zero-shot ===")
for h in headlines:
    resp = chat([
        {"role": "user", "content": f"Classify this headline into one category: TECHNOLOGY, SPORTS, or POLITICS.\n\nHeadline: {h}\n\nCategory:"}
    ], temperature=0)
    print(f"  {h[:50]}... -> {resp}")

print()
print("=== Few-shot ===")
for h in headlines:
    resp = chat([
        {"role": "system", "content": "Classify headlines. Respond with one word only."},
        {"role": "user", "content": "Headline: Apple Unveils New M4 Processor at WWDC Keynote"},
        {"role": "assistant", "content": "TECHNOLOGY"},
        {"role": "user", "content": "Headline: Senate Passes Bipartisan Climate Legislation"},
        {"role": "assistant", "content": "POLITICS"},
        {"role": "user", "content": "Headline: World Cup Final Draws Record 1.5 Billion Viewers"},
        {"role": "assistant", "content": "SPORTS"},
        {"role": "user", "content": f"Headline: {h}"}
    ], temperature=0)
    print(f"  {h[:50]}... -> {resp}")

### Exercise 1: Zero-Shot vs Few-Shot Comparison on Classification

Classify 5 customer reviews as **POSITIVE**, **NEGATIVE**, or **NEUTRAL** using both zero-shot and few-shot prompting. Compare the results.

The reviews below have ground-truth labels. Your goal is to see which approach produces more accurate classifications.

In [None]:
# Exercise 1: TODO

# Test data with ground truth labels
test_reviews = [
    {"text": "Absolutely love this product! Best purchase I've made all year.", "label": "POSITIVE"},
    {"text": "The item arrived damaged and the return process was a nightmare.", "label": "NEGATIVE"},
    {"text": "It works as described. Nothing special but gets the job done.", "label": "NEUTRAL"},
    {"text": "Waste of money. Stopped working after two days.", "label": "NEGATIVE"},
    {"text": "Pretty good overall. The quality is decent for the price point.", "label": "POSITIVE"},
]

# TODO: Implement zero-shot classification
# For each review, call chat() with a zero-shot prompt and collect the predicted label.
zero_shot_predictions = None  # Replace with a list of predicted labels

# TODO: Implement few-shot classification
# For each review, call chat() with a few-shot prompt (include 2-3 examples as
# user/assistant pairs) and collect the predicted label.
few_shot_predictions = None  # Replace with a list of predicted labels

# TODO: Compare accuracy
# Calculate accuracy for each approach by comparing predictions to ground truth labels.
zero_shot_accuracy = None  # Replace with calculated accuracy
few_shot_accuracy = None  # Replace with calculated accuracy

# TODO: Print a comparison table
# Print each review, its ground truth, zero-shot prediction, and few-shot prediction.

### Solution

In [None]:
# Exercise 1: Solution

test_reviews = [
    {"text": "Absolutely love this product! Best purchase I've made all year.", "label": "POSITIVE"},
    {"text": "The item arrived damaged and the return process was a nightmare.", "label": "NEGATIVE"},
    {"text": "It works as described. Nothing special but gets the job done.", "label": "NEUTRAL"},
    {"text": "Waste of money. Stopped working after two days.", "label": "NEGATIVE"},
    {"text": "Pretty good overall. The quality is decent for the price point.", "label": "POSITIVE"},
]

# --- Zero-shot ---
zero_shot_predictions = []
for review in test_reviews:
    resp = chat([
        {"role": "system", "content": "You are a sentiment classifier. Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL."},
        {"role": "user", "content": f"Classify the sentiment of this review:\n\n{review['text']}"}
    ], temperature=0)
    # Extract just the label: take the first word, drop trailing punctuation
    # (e.g. "Positive." -> "POSITIVE") so it compares cleanly with the ground truth
    pred = resp.strip().split()[0].strip(".,:;!").upper()
    zero_shot_predictions.append(pred)

# --- Few-shot ---
few_shot_predictions = []
for review in test_reviews:
    resp = chat([
        {"role": "system", "content": "You are a sentiment classifier. Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL."},
        # Example 1
        {"role": "user", "content": "Review: This is hands down the best phone I have ever owned."},
        {"role": "assistant", "content": "POSITIVE"},
        # Example 2
        {"role": "user", "content": "Review: Terrible experience. The product broke immediately and support ignored me."},
        {"role": "assistant", "content": "NEGATIVE"},
        # Example 3
        {"role": "user", "content": "Review: It's okay. Does what it says, nothing more nothing less."},
        {"role": "assistant", "content": "NEUTRAL"},
        # Actual review
        {"role": "user", "content": f"Review: {review['text']}"}
    ], temperature=0)
    pred = resp.strip().split()[0].strip(".,:;!").upper()
    few_shot_predictions.append(pred)

# --- Compare accuracy ---
ground_truth = [r["label"] for r in test_reviews]
zero_shot_correct = sum(1 for gt, pred in zip(ground_truth, zero_shot_predictions) if gt == pred)
few_shot_correct = sum(1 for gt, pred in zip(ground_truth, few_shot_predictions) if gt == pred)

zero_shot_accuracy = zero_shot_correct / len(test_reviews)
few_shot_accuracy = few_shot_correct / len(test_reviews)

# --- Print comparison table ---
print(f"{'Review (truncated)':<55} {'Truth':<10} {'Zero-shot':<16} {'Few-shot':<16}")
print("-" * 100)
for review, zs, fs in zip(test_reviews, zero_shot_predictions, few_shot_predictions):
    text = review['text'][:52] + "..." if len(review['text']) > 52 else review['text']
    match_zs = "ok" if zs == review['label'] else "WRONG"
    match_fs = "ok" if fs == review['label'] else "WRONG"
    # Combine prediction and verdict so each column pads to a fixed width
    zs_cell = f"{zs} {match_zs}"
    fs_cell = f"{fs} {match_fs}"
    print(f"{text:<55} {review['label']:<10} {zs_cell:<16} {fs_cell:<16}")

print()
print(f"Zero-shot accuracy: {zero_shot_accuracy:.0%} ({zero_shot_correct}/{len(test_reviews)})")
print(f"Few-shot accuracy:  {few_shot_accuracy:.0%} ({few_shot_correct}/{len(test_reviews)})")

---
## 6. Chain-of-Thought (CoT) Prompting

**Chain-of-thought prompting** encourages the model to **show its reasoning step by step** before arriving at a final answer. This technique was introduced by Wei et al. (2022) and dramatically improves performance on tasks requiring:

- Mathematical reasoning
- Multi-step logic
- Word problems
- Common-sense reasoning

There are two variants:
1. **Zero-shot CoT**: Simply add "Let's think step by step" to the prompt
2. **Few-shot CoT**: Provide examples that include the reasoning steps
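
A CoT reply mixes reasoning with the result, so it helps to ask for a marked final line and parse only that. The sketch below assumes an `ANSWER: <number>` convention; both helper names are hypothetical, and `sample_reply` is a hand-written stand-in for a model response so the parser can be tried without an API call.

```python
import re


def make_zero_shot_cot_prompt(problem):
    """Append the zero-shot CoT trigger and ask for a parseable final line."""
    return (
        f"{problem.strip()}\n\n"
        "Let's think step by step. "
        "Give the final answer on the last line as 'ANSWER: <number>'."
    )


def parse_final_answer(response):
    """Return the number after the last 'ANSWER:' marker, or None if absent."""
    matches = re.findall(r"ANSWER:\s*\$?(-?\d[\d,]*(?:\.\d+)?)", response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


# Hand-written stand-in for a model reply (not real model output)
sample_reply = (
    "Step 1: 4 apples cost 4 x $2 = $8.\n"
    "Step 2: 3 oranges cost 3 x $3 = $9.\n"
    "Step 3: Total = $17, so change = $20 - $17 = $3.\n\n"
    "ANSWER: $3"
)
print(parse_final_answer(sample_reply))  # 3.0
```

Parsing the marked line is more robust than grabbing the last number in the text, since the reasoning steps usually contain many intermediate numbers.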

In [None]:
# Demo: Standard prompting vs CoT on a math word problem

problem = """A store sells apples for $2 each and oranges for $3 each. 
If Sarah buys 4 apples and 3 oranges, and she pays with a $20 bill, 
how much change does she receive?"""

# Standard prompting
standard_response = chat([
    {"role": "user", "content": f"{problem}\n\nAnswer:"}
], temperature=0)

print("=== Standard Prompting ===")
print(standard_response)
print()

# Zero-shot CoT: Just add "Let's think step by step"
cot_response = chat([
    {"role": "user", "content": f"{problem}\n\nLet's think step by step."}
], temperature=0)

print("=== Zero-shot CoT ===")
print(cot_response)

In [None]:
# A harder problem where CoT really shines

hard_problem = """A farmer has a rectangular field that is 120 meters long and 80 meters wide.
He wants to build a fence around the entire field, plus a fence down the middle 
dividing it into two equal halves (parallel to the shorter side).
If fencing costs $15 per meter, how much will the total fencing cost?"""

# Standard
print("=== Standard ===")
resp = chat([{"role": "user", "content": f"{hard_problem}\n\nProvide just the final answer."}], temperature=0)
print(resp)
print()

# CoT
print("=== Chain-of-Thought ===")
resp = chat([{"role": "user", "content": f"{hard_problem}\n\nLet's think step by step."}], temperature=0)
print(resp)
print()

# The correct answer:
# Perimeter = 2*(120+80) = 400m
# Middle fence (parallel to shorter side = 80m) = 80m
# Total fencing = 400 + 80 = 480m
# Cost = 480 * $15 = $7,200
print("Correct answer: $7,200")

In [None]:
# Few-shot CoT: Providing reasoning examples

messages = [
    {"role": "system", "content": "Solve math word problems step by step. Show your reasoning, then give the final answer on the last line as 'ANSWER: <number>'."},
    # Example with reasoning
    {"role": "user", "content": "If a train travels at 60 km/h for 2.5 hours, how far does it go?"},
    {"role": "assistant", "content": "Step 1: I need to find distance using the formula: distance = speed x time.\nStep 2: speed = 60 km/h, time = 2.5 hours\nStep 3: distance = 60 x 2.5 = 150 km\n\nANSWER: 150 km"},
    # Another example
    {"role": "user", "content": "A shirt costs $40. It's on sale for 25% off. What's the sale price?"},
    {"role": "assistant", "content": "Step 1: Calculate the discount amount: 25% of $40 = 0.25 x 40 = $10\nStep 2: Subtract the discount from the original price: $40 - $10 = $30\n\nANSWER: $30"},
    # Actual problem
    {"role": "user", "content": "A water tank holds 500 liters. It is currently 60% full. If water is added at a rate of 20 liters per minute, how many minutes until the tank is completely full?"}
]

response = chat(messages, temperature=0)
print("=== Few-shot CoT ===")
print(response)
print()
print("Correct answer: 10 minutes (need 200 liters at 20 L/min)")

### Exercise 2: Chain-of-Thought for Math Word Problems

Test 5 math problems with **standard prompting** vs **chain-of-thought prompting**. Compare the accuracy of each approach.

Use `temperature=0` to make results as repeatable as possible (outputs can still vary slightly between runs).

In [None]:
# Exercise 2: TODO

math_problems = [
    {
        "question": "A bookstore sells 3 books for $12 each and 2 books for $8 each. What is the total cost?",
        "answer": 52
    },
    {
        "question": "If you have 156 marbles and give away 1/3 of them, then receive 20 more, how many do you have?",
        "answer": 124
    },
    {
        "question": "A car travels 180 miles in 3 hours. If it then speeds up by 20 mph, how far will it travel in the next 2 hours?",
        "answer": 160
    },
    {
        "question": "A rectangular garden is 15m long and 8m wide. If you want to put a 1m wide path around the entire garden, what is the area of just the path?",
        "answer": 50
    },
    {
        "question": "Three friends split a dinner bill equally. The total was $87, and they add a 20% tip calculated on that total. How much does each person pay (bill + tip)?",
        "answer": 34.8
    },
]

# TODO: For each problem, get a response using standard prompting.
# Instruct the model to reply with just the numeric answer.
standard_results = None  # Replace with list of responses

# TODO: For each problem, get a response using CoT prompting.
# Add "Let's think step by step" and ask for the final answer on the last line.
cot_results = None  # Replace with list of responses

# TODO: Extract numeric answers from both and compare to ground truth.
# Print a comparison showing which approach got each problem right.

### Solution

In [None]:
# Exercise 2: Solution
import re

math_problems = [
    {
        "question": "A bookstore sells 3 books for $12 each and 2 books for $8 each. What is the total cost?",
        "answer": 52
    },
    {
        "question": "If you have 156 marbles and give away 1/3 of them, then receive 20 more, how many do you have?",
        "answer": 124
    },
    {
        "question": "A car travels 180 miles in 3 hours. If it then speeds up by 20 mph, how far will it travel in the next 2 hours?",
        "answer": 160
    },
    {
        "question": "A rectangular garden is 15m long and 8m wide. If you want to put a 1m wide path around the entire garden, what is the area of just the path?",
        "answer": 50  # (15 + 2) * (8 + 2) - 15 * 8 = 170 - 120
    },
    {
        "question": "Three friends split a dinner bill equally. The total was $87, and they add a 20% tip calculated on that total. How much does each person pay (bill + tip)?",
        "answer": 34.8
    },
]


def extract_number(text):
    """Extract the last number from a string (likely the final answer)."""
    numbers = re.findall(r'[\d,]+\.?\d*', text.replace(',', ''))
    if numbers:
        return float(numbers[-1])
    return None


# Standard prompting
standard_results = []
for p in math_problems:
    resp = chat([
        {"role": "user", "content": f"{p['question']}\n\nRespond with only the numeric answer."}
    ], temperature=0)
    standard_results.append(resp)

# CoT prompting
cot_results = []
for p in math_problems:
    resp = chat([
        {"role": "user", "content": f"{p['question']}\n\nLet's think step by step. After your reasoning, provide the final answer on the last line as ANSWER: <number>."}
    ], temperature=0)
    cot_results.append(resp)

# Compare
print(f"{'Problem':<6} {'Correct':<10} {'Standard':<12} {'CoT':<12} {'Std OK?':<10} {'CoT OK?'}")
print("-" * 62)

std_correct = 0
cot_correct = 0

for i, p in enumerate(math_problems):
    std_num = extract_number(standard_results[i])
    cot_num = extract_number(cot_results[i])
    
    # Compare against None explicitly so a parsed value of 0.0 is not treated as missing
    std_ok = std_num is not None and abs(std_num - p["answer"]) < 0.1
    cot_ok = cot_num is not None and abs(cot_num - p["answer"]) < 0.1
    
    if std_ok:
        std_correct += 1
    if cot_ok:
        cot_correct += 1
    
    print(f"{i+1:<6} {p['answer']:<10} {str(std_num):<12} {str(cot_num):<12} {'Yes' if std_ok else 'No':<10} {'Yes' if cot_ok else 'No'}")

print()
print(f"Standard accuracy: {std_correct}/{len(math_problems)} ({std_correct/len(math_problems):.0%})")
print(f"CoT accuracy:      {cot_correct}/{len(math_problems)} ({cot_correct/len(math_problems):.0%})")

# Show full CoT for one problem
print("\n=== Example CoT reasoning (Problem 4) ===")
print(cot_results[3])

---
## 7. Role Prompting

**Role prompting** sets a specific **persona** in the system message. This is powerful because it implicitly brings in:

- Domain-specific vocabulary and knowledge
- Appropriate level of detail
- The right tone and communication style
- Relevant frameworks and mental models

Think of it as casting the model in a role before it starts performing.

In [None]:
# Same question, different roles

question = "How should I handle errors in my Python code?"

roles = {
    "Senior Python Developer": "You are a senior Python developer with 15 years of experience. You write clean, production-quality code and follow best practices. Be specific and include code examples.",
    "Patient Teacher": "You are a patient, encouraging programming teacher for beginners. Use simple language, relatable analogies, and avoid jargon. Keep it short.",
    "Code Reviewer": "You are a strict code reviewer at a top tech company. Focus on what can go wrong, edge cases, and potential security issues. Be direct."
}

for role_name, system_msg in roles.items():
    print(f"=== {role_name} ===")
    resp = chat([
        {"role": "system", "content": system_msg},
        {"role": "user", "content": question}
    ])
    # Print just the first 400 chars to keep output manageable
    print(resp[:400] + "..." if len(resp) > 400 else resp)
    print()

In [None]:
# Role prompting for creative tasks

topic = "the importance of testing in software development"

# As a poet
poem = chat([
    {"role": "system", "content": "You are a witty poet who writes short, clever poems about technical topics. Write a 4-line poem."},
    {"role": "user", "content": f"Write a poem about: {topic}"}
], temperature=0.9)

# As a stand-up comedian
joke = chat([
    {"role": "system", "content": "You are a stand-up comedian who specializes in tech humor. Tell a short joke (2-3 sentences max)."},
    {"role": "user", "content": f"Tell a joke about: {topic}"}
], temperature=0.9)

print("=== Poem ===")
print(poem)
print()
print("=== Joke ===")
print(joke)

---
## 8. Structured Output

One of the most practical prompt engineering skills is getting the model to produce **reliably structured output** -- especially JSON. This is critical for:

- Building pipelines where LLM output feeds into downstream code
- API responses
- Data extraction and transformation
- Automated workflows

**Key techniques:**
1. Specify the exact JSON schema in the system message
2. Show an example of the desired output
3. Tell the model to output **only** JSON (no extra text)
4. Always validate with `json.loads()`
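
Validation can go one step beyond `json.loads()` by also checking that the expected keys are present. The following is a minimal sketch: `parse_json_response` is a hypothetical helper, and `fenced_reply` is a hand-written example of a reply wrapped in a markdown fence, not real model output.

```python
import json

FENCE = "`" * 3  # markdown code fence marker


def parse_json_response(raw, required_keys):
    """Parse a model reply as JSON and check that the expected keys exist.

    Returns (data, error): exactly one of the two is None.
    """
    cleaned = raw.strip()
    # Models sometimes wrap JSON in a markdown fence; drop the fence lines
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines[1:])
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "expected a JSON object at the top level"
    missing = sorted(set(required_keys) - set(data))
    if missing:
        return None, f"missing keys: {missing}"
    return data, None


fenced_reply = f'{FENCE}json\n{{"customer_name": "John Smith", "total": 2997}}\n{FENCE}'
data, error = parse_json_response(fenced_reply, ["customer_name", "total"])
print(data, error)
```

Returning an error string instead of raising makes it easy to log failures, or to feed the message back to the model in a follow-up turn.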

In [None]:
# Getting JSON output

text = "John Smith ordered 3 laptops at $999 each on January 15, 2025. The shipping address is 123 Main St, Springfield, IL 62701."

response = chat([
    {"role": "system", "content": """Extract structured data from text. Return ONLY valid JSON, no other text.
Use this exact schema:
{
  "customer_name": "string",
  "items": [{"product": "string", "quantity": number, "unit_price": number}],
  "total": number,
  "date": "YYYY-MM-DD",
  "shipping_address": "string"
}"""}, 
    {"role": "user", "content": text}
], temperature=0)

print("Raw response:")
print(response)
print()

# Parse and validate
try:
    # Handle case where model wraps JSON in markdown code blocks
    cleaned = response.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    
    data = json.loads(cleaned)
    print("Parsed successfully!")
    print(json.dumps(data, indent=2))
except json.JSONDecodeError as e:
    print(f"JSON parse error: {e}")

In [None]:
# Getting tabular output

response = chat([
    {"role": "system", "content": "You format data as clean markdown tables. Only output the table, nothing else."},
    {"role": "user", "content": "Compare Python, JavaScript, and Rust across these dimensions: typing (static/dynamic), speed, learning curve, and primary use case. Keep entries brief (1-3 words each)."}
], temperature=0)

print(response)

In [None]:
# Getting list output with specific formatting

response = chat([
    {"role": "system", "content": """You generate structured lists. For each item, use this exact format:
[NUMBER]. TERM -- DEFINITION (one sentence)"""},
    {"role": "user", "content": "List the 5 most important machine learning concepts for beginners."}
], temperature=0)

print(response)

### Exercise 3: Design Prompts for Reliable JSON Output

Create a prompt that extracts structured data from invoice text and returns valid JSON. The JSON should contain: `name`, `date`, `items` (list), and `total_amount`.

Test it on multiple invoice texts and validate that the output is always parseable JSON.

In [None]:
# Exercise 3: TODO

invoices = [
    "Invoice #1042 for Acme Corp, dated March 5, 2025. Items: 10 widgets at $5.99 each, 3 gadgets at $24.99 each. Total: $134.87.",
    "Bill to: Jane Doe. Date: 2025-02-28. Services rendered: Website redesign ($2,500), SEO audit ($800), Content writing ($1,200). Grand total: $4,500.00",
    "Receipt from Cloud Services Inc. on 01/15/2025 -- Monthly hosting $49.99, SSL certificate $12.00, Domain renewal $15.00. Amount due: $76.99.",
]

# TODO: Create a system message that instructs the model to extract invoice data as JSON.
# The JSON schema should be:
# {
#   "name": "customer or company name",
#   "date": "YYYY-MM-DD",
#   "items": [{"description": "string", "amount": number}],
#   "total_amount": number
# }
system_message = None  # Replace with your system message string

# TODO: Loop through invoices, call chat(), parse JSON, and print results.
# Handle potential parsing errors gracefully.
# Track how many invoices produced valid JSON.

### Solution

In [None]:
# Exercise 3: Solution

invoices = [
    "Invoice #1042 for Acme Corp, dated March 5, 2025. Items: 10 widgets at $5.99 each, 3 gadgets at $24.99 each. Total: $134.87.",
    "Bill to: Jane Doe. Date: 2025-02-28. Services rendered: Website redesign ($2,500), SEO audit ($800), Content writing ($1,200). Grand total: $4,500.00",
    "Receipt from Cloud Services Inc. on 01/15/2025 -- Monthly hosting $49.99, SSL certificate $12.00, Domain renewal $15.00. Amount due: $76.99.",
]

system_message = """You are a data extraction assistant. Extract invoice information from the provided text and return ONLY valid JSON.

Use this exact schema (no additional fields, no missing fields):
{
  "name": "customer or company name (string)",
  "date": "YYYY-MM-DD (string)",
  "items": [
    {"description": "item description (string)", "amount": <total for this line item as a number>}
  ],
  "total_amount": <total amount as a number>
}

Rules:
- Output ONLY the JSON object. No markdown, no explanation, no code blocks.
- All amounts should be numbers (not strings), without dollar signs.
- Dates must be in YYYY-MM-DD format.
- If a line item has quantity and unit price, compute the line total for the "amount" field."""

valid_count = 0

for i, invoice_text in enumerate(invoices):
    print(f"=== Invoice {i+1} ===")
    print(f"Input: {invoice_text[:80]}...")
    print()
    
    response = chat([
        {"role": "system", "content": system_message},
        {"role": "user", "content": invoice_text}
    ], temperature=0)
    
    # Clean up potential markdown code blocks
    cleaned = response.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    
    try:
        data = json.loads(cleaned)
        valid_count += 1
        print("  Valid JSON!")
        print(f"  Name: {data['name']}")
        print(f"  Date: {data['date']}")
        print(f"  Items: {len(data['items'])}")
        for item in data['items']:
            print(f"    - {item['description']}: ${item['amount']}")
        print(f"  Total: ${data['total_amount']}")
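    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # KeyError/TypeError: the JSON parsed but does not match the expected schema
        print(f"  PARSE OR SCHEMA ERROR: {e!r}")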
        print(f"  Raw response: {response[:200]}")
    print()

print(f"\nResults: {valid_count}/{len(invoices)} invoices produced valid JSON ({valid_count/len(invoices):.0%})")

---
## 9. Common Failures & Mitigations

Even well-crafted prompts can fail. Understanding common failure modes helps you build more robust prompts.

### Failure 1: Hallucination
The model confidently states things that are **factually incorrect**.

**Mitigations:**
- Ask the model to cite sources (and verify them, since citations can be fabricated too) or to say "I don't know"
- Add: "Only use information that is definitely true. If unsure, say so."
- Use retrieval-augmented generation (RAG) to ground responses in real data
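
As a rough illustration of grounding, the sketch below builds messages that restrict the model to passages you supply. The retrieval step itself is out of scope here: `build_grounded_messages` is a hypothetical helper, and `ExampleLib` and the passages are invented placeholder data.

```python
def build_grounded_messages(question, passages):
    """Build chat messages that restrict the answer to the supplied passages.

    `passages` is a list of strings assumed to come from a separate
    retrieval step that is not shown here.
    """
    numbered = "\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
    )
    system = (
        "Answer using only the numbered passages provided by the user. "
        "Cite passage numbers like [1]. "
        "If the passages do not contain the answer, reply exactly: I don't know."
    )
    user = f"Passages:\n{numbered}\n\nQuestion: {question.strip()}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Placeholder passages about a made-up library, for illustration only
messages = build_grounded_messages(
    "When was the library first released?",
    ["ExampleLib 1.0 was released in 2019.", "ExampleLib is written in Python."],
)
print(messages[1]["content"])
```

Numbering the passages gives the model something concrete to cite, which makes unsupported claims easier to spot when you review the output.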

### Failure 2: Instruction Following Failures
The model ignores or misinterprets part of the prompt.

**Mitigations:**
- Break complex instructions into numbered steps
- Use delimiters (```, ---, XML tags) to separate sections
- Repeat critical constraints
- Test with adversarial inputs

### Failure 3: Verbosity / Off-topic Responses
The model produces much more text than needed or goes off on tangents.

**Mitigations:**
- Specify exact length: "in 2 sentences", "in under 50 words"
- Add: "Be concise. Do not include preamble or explanation."
- Use structured output format to constrain the response

### Failure 4: Context Window Limits
For very long inputs, the model may lose track of information in the middle.

**Mitigations:**
- Put the most important information at the beginning or end
- Summarize long documents before querying
- Use chunking strategies
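
One simple chunking strategy is a sliding window over words with a small overlap, so that a sentence cut at a boundary still appears whole in one of the neighbouring chunks. The sketch below is illustrative only: `chunk_words` is a hypothetical helper, and word counts are used as a rough stand-in for tokens, which is an assumption rather than how the model actually measures length.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk could then be summarized separately with `chat`, and the partial summaries combined in a final call. A production version would more likely split on sentence or section boundaries and count tokens with the model's tokenizer.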

In [None]:
# Demo: Hallucination risk

# Bad: Encourages hallucination
bad_response = chat([
    {"role": "user", "content": "Tell me about the scientific research published by Dr. James Wellington of Stanford on quantum gravity in 2023."}
], temperature=0)

print("=== Without guardrails (may hallucinate) ===")
print(bad_response[:300])
print()

# Better: Includes honesty constraints
good_response = chat([
    {"role": "system", "content": "You are a research assistant. Only state facts you are confident about. If you are not sure about something, explicitly say 'I'm not certain about this' or 'I don't have verified information on this'. Never fabricate details."},
    {"role": "user", "content": "Tell me about the scientific research published by Dr. James Wellington of Stanford on quantum gravity in 2023."}
], temperature=0)

print("=== With honesty guardrails ===")
print(good_response[:300])

In [None]:
# Demo: Improving instruction following with delimiters and numbered steps

# Bad: Vague, run-on instruction (shown for contrast only; it is never sent to the model)
bad_prompt = "Analyze this text and give me the key points and also translate it to French and summarize it."

# Good: Structured with clear delimiters
good_prompt = """Perform the following 3 tasks on the text enclosed in <text> tags.

<text>
Machine learning is transforming healthcare by enabling early disease detection through 
medical imaging analysis. Recent studies show AI systems can identify certain cancers 
with accuracy comparable to experienced radiologists.
</text>

Tasks:
1. KEY POINTS: List the 2-3 main points as bullet points.
2. FRENCH TRANSLATION: Translate the original text to French.
3. ONE-LINE SUMMARY: Summarize in exactly one sentence.

Format your response with clear headers for each task."""

response = chat([{"role": "user", "content": good_prompt}], temperature=0)
print(response)

---
## 10. Iterative Refinement Workflow

Great prompts are rarely written on the first try. The best prompt engineers follow an **iterative process**:

1. **Start simple** -- Write a basic prompt
2. **Test** -- Run it on several representative inputs
3. **Analyze failures** -- Identify where it goes wrong and why
4. **Refine** -- Add constraints, examples, or restructure
5. **Repeat** -- Until quality meets your threshold

Let's walk through this process for a real task.

In [None]:
# Task: Extract action items from meeting notes
# We'll iterate through 4 versions of the prompt

meeting_notes = """Team standup - Feb 10, 2025
Attendees: Alice, Bob, Charlie

Alice mentioned the API integration is behind schedule by 2 days. She'll reach out 
to the vendor today for updated documentation. Bob said he'll review the pull request 
for the auth module by EOD Wednesday. Charlie raised a concern about test coverage -- 
currently at 72%, target is 85%. He'll write unit tests for the payment module this week.
Alice also noted we need someone to update the deployment runbook before Friday's release.
Bob volunteered to handle that."""

# ---- Version 1: Naive ----
v1 = chat([
    {"role": "user", "content": f"What are the action items from these meeting notes?\n\n{meeting_notes}"}
], temperature=0)

print("=== V1: Naive prompt ===")
print(v1)
print()
print("Issues: Output format is inconsistent, may miss details like deadlines.")
print("="*60)

In [None]:
# ---- Version 2: Add structure ----
v2 = chat([
    {"role": "system", "content": "Extract action items from meeting notes. For each action item, include: WHO is responsible, WHAT they need to do, and WHEN it's due."},
    {"role": "user", "content": meeting_notes}
], temperature=0)

print("=== V2: Added structure (who/what/when) ===")
print(v2)
print()
print("Better! But format still varies. Let's enforce a specific format.")
print("="*60)

In [None]:
# ---- Version 3: Enforce format with an explicit template ----
v3 = chat([
    {"role": "system", "content": """Extract action items from meeting notes. Return ONLY a numbered list.
Each item MUST follow this exact format:
N. [OWNER] -- ACTION -- Due: DEADLINE

If no deadline is mentioned, write "Due: Not specified".
Do not include any other text before or after the list."""}, 
    {"role": "user", "content": meeting_notes}
], temperature=0)

print("=== V3: Enforced format ===")
print(v3)
print()
print("Great format! But let's also add priority and make it JSON for downstream use.")
print("="*60)

In [None]:
# ---- Version 4: JSON output with priority ----
v4 = chat([
    {"role": "system", "content": """Extract action items from meeting notes. Return ONLY a valid JSON array.

Each action item must have these fields:
- "owner": person responsible (string)
- "action": what needs to be done (string, concise)
- "deadline": when it's due (string, or "unspecified")
- "priority": "high" if time-sensitive or blocking, "medium" otherwise

Return ONLY the JSON array. No markdown formatting, no explanation."""}, 
    {"role": "user", "content": meeting_notes}
], temperature=0)

print("=== V4: JSON with priority ===")

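# Clean and parse: strip a surrounding markdown code fence if the model added one
cleaned = v4.strip()
if cleaned.startswith("```"):
    # [-1] avoids an IndexError when the reply has no newline after the opening fence
    cleaned = cleaned.split("\n", 1)[-1].rsplit("```", 1)[0].strip()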

try:
    action_items = json.loads(cleaned)
    print(json.dumps(action_items, indent=2))
    print(f"\nExtracted {len(action_items)} action items. Valid JSON.")
except json.JSONDecodeError:
    print("Failed to parse. Raw response:")
    print(v4)

print()
print("Much closer to production-ready: each version fixed a specific weakness of the previous one.")
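
Parsing successfully with `json.loads` only shows that the reply is valid JSON, not that it matches the fields the V4 prompt asked for. A minimal sketch of a field check follows; `validate_action_items`, `REQUIRED_FIELDS`, and `ALLOWED_PRIORITIES` are hypothetical names introduced here, and the rules simply mirror the V4 system message.

```python
REQUIRED_FIELDS = ("owner", "action", "deadline", "priority")
ALLOWED_PRIORITIES = {"high", "medium"}


def validate_action_items(items):
    """Return a list of problems; an empty list means the data matches the schema."""
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        # Every field requested in the prompt must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: missing or empty '{field}'")
        # Priority must be one of the two labels the prompt allows.
        priority = item.get("priority")
        if isinstance(priority, str) and priority.strip() and priority not in ALLOWED_PRIORITIES:
            problems.append(f"item {i}: unexpected priority '{priority}'")
    return problems
```

Calling `validate_action_items(action_items)` after parsing gives a list that can be logged or used to trigger a retry. For larger schemas, a schema library would likely be a better fit than hand-written checks.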

### Exercise 4: Prompt Evaluation Harness

Create a mini evaluation framework:
1. Define 5 prompt variants for the same task (summarizing a paragraph)
2. Run each prompt on 5 test inputs
3. Score outputs on a simple metric (e.g., length within target range)
4. Print a comparison table showing which prompt variant performs best

In [None]:
# Exercise 4: TODO

# Test paragraphs to summarize
test_paragraphs = [
    "Artificial intelligence has transformed the healthcare industry in remarkable ways. From early disease detection through medical imaging to drug discovery acceleration, AI systems are helping doctors make better decisions. Machine learning algorithms can now analyze thousands of medical images in minutes, identifying patterns that might take human experts hours to find. This has led to earlier diagnoses and improved patient outcomes across multiple specialties.",
    "The global shift toward remote work has fundamentally changed how companies operate. Organizations have had to invest heavily in digital infrastructure, collaboration tools, and cybersecurity measures. While many employees appreciate the flexibility, some struggle with isolation and blurred work-life boundaries. Companies are now experimenting with hybrid models that aim to combine the best of both in-office and remote environments.",
    "Climate change continues to be one of the most pressing challenges facing humanity. Rising global temperatures are causing more frequent extreme weather events, from devastating wildfires to unprecedented flooding. Scientists warn that without significant reductions in greenhouse gas emissions, many coastal cities could face severe flooding by 2050. International cooperation and technological innovation are both critical to addressing this crisis.",
    "The electric vehicle market has experienced explosive growth over the past five years. Major automakers have committed billions of dollars to EV development, and charging infrastructure is expanding rapidly. Battery technology improvements have increased range while reducing costs, making EVs more accessible to average consumers. However, challenges remain around raw material sourcing, grid capacity, and recycling of spent batteries.",
    "Quantum computing represents a paradigm shift in computational capability. Unlike classical computers that use bits, quantum computers use qubits that can exist in multiple states simultaneously. This allows them to solve certain problems exponentially faster than traditional machines. While still in early stages, quantum computing shows promise for cryptography, materials science, financial modeling, and drug discovery."
]

# TODO: Define 5 different prompt variants for summarization.
# Each should aim to produce a 1-2 sentence summary, but use different techniques.
prompt_variants = None  # Replace with a list of 5 system message strings

# TODO: Run each variant on all test paragraphs and collect results.
# results should be a dict: {variant_name: [summary1, summary2, ..., summary5]}
results = None

# TODO: Score each output. A good score means:
# - Length is between 20 and 60 words (target range for a 1-2 sentence summary)
# - Score 1 if within range, 0 if outside

# TODO: Print a comparison table with average scores per variant.

### Solution

In [None]:
# Exercise 4: Solution

test_paragraphs = [
    "Artificial intelligence has transformed the healthcare industry in remarkable ways. From early disease detection through medical imaging to drug discovery acceleration, AI systems are helping doctors make better decisions. Machine learning algorithms can now analyze thousands of medical images in minutes, identifying patterns that might take human experts hours to find. This has led to earlier diagnoses and improved patient outcomes across multiple specialties.",
    "The global shift toward remote work has fundamentally changed how companies operate. Organizations have had to invest heavily in digital infrastructure, collaboration tools, and cybersecurity measures. While many employees appreciate the flexibility, some struggle with isolation and blurred work-life boundaries. Companies are now experimenting with hybrid models that aim to combine the best of both in-office and remote environments.",
    "Climate change continues to be one of the most pressing challenges facing humanity. Rising global temperatures are causing more frequent extreme weather events, from devastating wildfires to unprecedented flooding. Scientists warn that without significant reductions in greenhouse gas emissions, many coastal cities could face severe flooding by 2050. International cooperation and technological innovation are both critical to addressing this crisis.",
    "The electric vehicle market has experienced explosive growth over the past five years. Major automakers have committed billions of dollars to EV development, and charging infrastructure is expanding rapidly. Battery technology improvements have increased range while reducing costs, making EVs more accessible to average consumers. However, challenges remain around raw material sourcing, grid capacity, and recycling of spent batteries.",
    "Quantum computing represents a paradigm shift in computational capability. Unlike classical computers that use bits, quantum computers use qubits that can exist in multiple states simultaneously. This allows them to solve certain problems exponentially faster than traditional machines. While still in early stages, quantum computing shows promise for cryptography, materials science, financial modeling, and drug discovery."
]

# 5 prompt variants -- different strategies for the same summarization task
prompt_variants = {
    "V1-Basic": "Summarize the following text.",
    "V2-Constrained": "Summarize the following text in exactly 1-2 sentences. Be concise.",
    "V3-WordLimit": "Summarize the following text in 20-50 words. Do not exceed 50 words.",
    "V4-RoleBased": "You are a news editor writing headlines and brief summaries. Summarize the following text in 1-2 crisp sentences suitable for a news digest.",
    "V5-Template": "Summarize the following text using this template: '[Topic] is [key development], resulting in [impact/consequence].' Fill in the bracketed parts. One sentence only."
}

# Run all variants on all paragraphs
results = {}
for variant_name, system_msg in prompt_variants.items():
    summaries = []
    for paragraph in test_paragraphs:
        resp = chat([
            {"role": "system", "content": system_msg},
            {"role": "user", "content": paragraph}
        ], temperature=0)
        summaries.append(resp)
    results[variant_name] = summaries

# Score: word count within target range (20-60 words)
TARGET_MIN = 20
TARGET_MAX = 60

print(f"{'Variant':<16} {'Avg Words':<12} {'In Range':<12} {'Score':<8} {'Word Counts'}")
print("-" * 80)

for variant_name, summaries in results.items():
    word_counts = [len(s.split()) for s in summaries]
    in_range = sum(1 for wc in word_counts if TARGET_MIN <= wc <= TARGET_MAX)
    avg_words = sum(word_counts) / len(word_counts)
    score = in_range / len(summaries)
    
    print(f"{variant_name:<16} {avg_words:<12.1f} {in_range}/{len(summaries):<10} {score:<8.0%} {word_counts}")

# Show the best variant's outputs
print("\n" + "=" * 60)
best_variant = max(results.keys(), key=lambda v: sum(
    1 for s in results[v] if TARGET_MIN <= len(s.split()) <= TARGET_MAX
))
print(f"Best variant: {best_variant}")
print("\nSample outputs:")
for i, summary in enumerate(results[best_variant][:3]):
    print(f"  [{i+1}] ({len(summary.split())} words) {summary}")
    print()

---
## 11. Prompt Templates & Parameterization

In production systems, you rarely write one-off prompts. Instead, you create **reusable templates** that accept parameters. This ensures:

- **Consistency** across many invocations
- **Maintainability** -- change the template in one place
- **Testability** -- systematically test with different inputs

Python f-strings and `.format()` are the simplest approach.

In [None]:
# Simple template with f-strings

def summarize(text, length="1-2 sentences", audience="general"):
    """Reusable summarization prompt template."""
    system_msg = f"""You are a summarization assistant. Write summaries for a {audience} audience.
Always respond in {length}. Be concise and informative."""
    
    return chat([
        {"role": "system", "content": system_msg},
        {"role": "user", "content": f"Summarize:\n\n{text}"}
    ], temperature=0)


sample_text = """The James Webb Space Telescope has captured unprecedented images of distant 
galaxies, revealing structures that formed just 300 million years after the Big Bang. 
These observations challenge existing models of galaxy formation and suggest that the 
early universe was more complex than previously thought."""

print("=== General audience, 1-2 sentences ===")
print(summarize(sample_text))
print()

print("=== Expert audience, 1 sentence ===")
print(summarize(sample_text, length="exactly 1 sentence", audience="astrophysics researcher"))
print()

print("=== Child audience, 2-3 sentences ===")
print(summarize(sample_text, length="2-3 simple sentences", audience="10-year-old child"))

In [None]:
# More advanced template: a reusable analyzer

ANALYSIS_TEMPLATE = """Analyze the following {content_type} and provide:
1. MAIN TOPIC: One sentence describing the primary subject.
2. KEY POINTS: 2-3 bullet points with the most important information.
3. TONE: The overall tone (e.g., formal, casual, urgent, informative).
4. TARGET AUDIENCE: Who this content is written for.
{extra_instructions}

Content to analyze:
---
{content}
---"""


def analyze_content(content, content_type="text", extra_instructions=""):
    """Reusable content analysis template."""
    prompt = ANALYSIS_TEMPLATE.format(
        content_type=content_type,
        content=content,
        extra_instructions=extra_instructions
    )
    return chat([{"role": "user", "content": prompt}], temperature=0)


# Use the template
email = """Hi team, just a heads up that we're pushing the release date to next Friday 
due to the critical bug found in the payment module. Please prioritize fixing issue #342 
and update your sprint boards accordingly. Let's sync at tomorrow's standup."""

print(analyze_content(
    content=email,
    content_type="internal team email",
    extra_instructions="5. URGENCY: Rate from 1-5 how urgent this communication is."
))

In [None]:
# Template pattern: Building a prompt library

PROMPT_LIBRARY = {
    "classify_sentiment": {
        "system": "You are a sentiment classifier. Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.",
        "user": "Classify the sentiment of this text:\n\n{text}"
    },
    "extract_keywords": {
        "system": "Extract the {count} most important keywords from the text. Return them as a comma-separated list. No other text.",
        "user": "{text}"
    },
    "rewrite_tone": {
        "system": "Rewrite the given text in a {tone} tone. Keep the same meaning but change the style. Output only the rewritten text.",
        "user": "{text}"
    }
}


def run_prompt(template_name, **kwargs):
    """Run a prompt from the library with the given parameters."""
    template = PROMPT_LIBRARY[template_name]
    messages = [
        {"role": "system", "content": template["system"].format(**kwargs)},
        {"role": "user", "content": template["user"].format(**kwargs)}
    ]
    return chat(messages, temperature=0)


sample = "The new software update completely broke my workflow. Three hours wasted trying to fix compatibility issues."

print("Sentiment:", run_prompt("classify_sentiment", text=sample))
print("Keywords:", run_prompt("extract_keywords", text=sample, count=5))
print("Formal rewrite:", run_prompt("rewrite_tone", text=sample, tone="formal and professional"))
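
A library like this fails with a bare `KeyError` when a caller forgets a parameter. One way to give a clearer message is to inspect each template's placeholders before formatting. This is a minimal sketch using the standard library's `string.Formatter`; `template_fields` and `missing_fields` are hypothetical helper names, and the sketch assumes templates use only simple named placeholders such as `{text}`.

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders used in a str.format template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skips literal-only segments (None) and positional "{}" ("")
    }


def missing_fields(template_entry, **kwargs):
    """List placeholders that a library entry needs but kwargs does not supply."""
    needed = set()
    for message_template in template_entry.values():
        needed |= template_fields(message_template)
    return sorted(needed - set(kwargs))
```

For example, `missing_fields(PROMPT_LIBRARY["extract_keywords"], text=sample)` would report that `count` is missing, which `run_prompt` could turn into a descriptive `ValueError` before any API call is made.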

---
## 12. Summary & References

### Key Takeaways

1. **Prompt anatomy matters**: The system message sets behavior; user messages provide the task; assistant messages enable few-shot learning and multi-turn context.

2. **Zero-shot works for simple tasks**: Direct, specific instructions can get surprisingly good results without examples.

3. **Few-shot improves consistency**: Providing 2-3 examples usually improves format adherence and edge-case handling.

4. **Chain-of-thought unlocks reasoning**: Adding "Let's think step by step" or providing reasoning examples can substantially improve accuracy on math, logic, and multi-step problems (Wei et al., 2022; Kojima et al., 2022).

5. **Roles shape output**: Setting a persona in the system message implicitly adjusts vocabulary, depth, tone, and style.

6. **Structured output requires explicit schemas**: Specify the exact JSON schema, provide examples, and always validate with `json.loads()`.

7. **Iterate systematically**: Start simple, test on representative inputs, identify failure modes, and refine. Track quality at each iteration.

8. **Use templates in production**: Parameterized prompts ensure consistency, maintainability, and testability.

### References

**Papers:**
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022). [arXiv:2201.11903](https://arxiv.org/abs/2201.11903)
- Brown et al., "Language Models are Few-Shot Learners" (GPT-3, 2020). [arXiv:2005.14165](https://arxiv.org/abs/2005.14165)
- Kojima et al., "Large Language Models are Zero-Shot Reasoners" (2022). [arXiv:2205.11916](https://arxiv.org/abs/2205.11916)

**Guides:**
- [OpenAI Prompt Engineering Guide](https://platform.openai.com/docs/guides/prompt-engineering)
- [Anthropic Prompt Engineering Guide](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview)
- [learnprompting.org](https://learnprompting.org/) -- Community-driven prompt engineering tutorials