# Prompt Engineering

In this notebook, you'll practice systematic prompt construction—treating prompt design as programming rather than conversation. Each exercise builds a specific prompt engineering skill with immediate, visible feedback.

**What you'll do:**
- Construct format specifications that constrain model output into bullet points, JSON, and markdown tables—and observe how format tokens anchor the output distribution
- Write role prompts that bias attention toward specific features of a code snippet, discovering that roles change *what* the model finds, not just *how* it phrases the answer
- Run a controlled experiment on few-shot example selection, comparing diversity vs quantity vs category bias across multiple trials
- Design a complete structured prompt for a real task (meeting summary), composing 3+ techniques and evaluating consistency across inputs

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones—they reveal gaps in your mental model.

In [None]:
# Setup — self-contained for Google Colab
!pip install -q openai

import os
import json
import textwrap
import random
from openai import OpenAI
import matplotlib.pyplot as plt
import numpy as np

# --- API Key Setup ---
# Option 1: Set your API key as an environment variable (recommended)
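#   In Colab: open Secrets (key icon in the left sidebar), add OPENAI_API_KEY,
#   enable notebook access, then load it, since secrets are not exported as
#   environment variables automatically:
#     from google.colab import userdata
#     os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")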
# Option 2: Paste it directly (less secure, don't commit this)
#   os.environ["OPENAI_API_KEY"] = "sk-..."

# You can also use any OpenAI-compatible API (e.g., local Ollama, Together AI)
# by changing the base_url:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

client = OpenAI()

# Use a small, cheap model for the exercises
MODEL = "gpt-4o-mini"

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Reproducible results where possible
random.seed(42)
np.random.seed(42)


def call_llm(prompt: str, temperature: float = 0.0, max_tokens: int = 300) -> str:
    """Call the LLM with a single prompt. Returns the response text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # content can be None (e.g., on a refusal), so fall back to an empty string
    return (response.choices[0].message.content or "").strip()


def call_llm_with_system(system_prompt: str, user_prompt: str,
                         temperature: float = 0.0, max_tokens: int = 300) -> str:
    """Call the LLM with a system prompt and user prompt. Returns the response text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # content can be None (e.g., on a refusal), so fall back to an empty string
    return (response.choices[0].message.content or "").strip()


def print_wrapped(text: str, width: int = 80, prefix: str = ""):
    """Print text with word wrapping for readability."""
    for line in text.split("\n"):
        wrapped = textwrap.fill(line, width=width, initial_indent=prefix,
                                subsequent_indent=prefix)
        print(wrapped)


# Quick test to verify the API is working
test = call_llm("Say 'API connection successful' and nothing else.")
print(test)
print(f"\nUsing model: {MODEL}")
print("Setup complete.")

## Shared Data

All exercises in this notebook work with realistic text: an invoice, a code snippet, and short news headlines. The data is defined here so exercises can share it.

In [None]:
# --- Shared text for Exercise 1: Format Specification ---
INVOICE_TEXT = """On January 15, 2024, Acme Corporation issued invoice #INV-2024-0892
to Globex Industries for consulting services rendered in Q4 2023.
The total amount was $12,450.00, with payment terms of Net 30.
The invoice covered three line items: Strategic Planning ($5,200),
Market Analysis ($4,750), and Implementation Support ($2,500).
Contact: billing@acmecorp.com. Tax ID: 82-1234567."""


# --- Shared code snippet for Exercise 2: Role Prompting ---
CODE_SNIPPET = '''def get_user_data(user_id, db_connection):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    result = db_connection.execute(query)
    users = []
    for row in result:
        user = {}
        for i in range(len(row)):
            user[result.description[i][0]] = row[i]
        users.append(user)
    data = users[0]
    data["full_name"] = data["first_name"] + " " + data["last_name"]
    data["age"] = 2024 - int(data["birth_year"])
    return data
'''

# Known issues in the code snippet (for reference during Exercise 2):
# Security:    SQL injection via f-string, SELECT * exposes all columns
# Performance: Iterating all rows when only one is needed, building dicts manually
#              instead of using fetchone() or row factories
# Style:       No type hints, no docstring, magic number 2024, no error handling,
#              index-based column access, mutating the dict in place


# --- Shared data for Exercise 3: Few-Shot Example Selection ---
# Simple text classification: Tech / Sports / Politics / Entertainment
LABELED_EXAMPLES = [
    # Tech
    ("Apple released a new M4 chip with improved neural engine performance", "Tech"),
    ("The latest Python 3.13 update includes a new JIT compiler", "Tech"),
    ("OpenAI announced GPT-5 with significantly improved reasoning", "Tech"),
    ("Samsung unveiled a foldable tablet at CES this year", "Tech"),
    # Sports
    ("The Lakers won the championship in a thrilling overtime game", "Sports"),
    ("Usain Bolt's world record still stands after fifteen years", "Sports"),
    ("The World Cup final drew over a billion viewers worldwide", "Sports"),
    ("Tennis star broke her ankle during the quarterfinal match", "Sports"),
    # Politics
    ("The Senate passed a bipartisan infrastructure bill today", "Politics"),
    ("New trade tariffs were announced affecting imports from Asia", "Politics"),
    ("The Supreme Court ruled on the landmark privacy case", "Politics"),
    ("Election results showed a surprising shift in rural voting patterns", "Politics"),
    # Entertainment
    ("The new Dune sequel broke opening weekend box office records", "Entertainment"),
    ("Grammy nominations surprised fans with several indie picks", "Entertainment"),
    ("Netflix reported record subscriber growth after releasing the hit series", "Entertainment"),
    ("Broadway ticket sales reached an all-time high this quarter", "Entertainment"),
]

# Test examples for classification (held out)
TEST_CLASSIFICATION = [
    ("Microsoft acquired a robotics startup for $2 billion", "Tech"),
    ("The marathon runner set a new personal best time", "Sports"),
    ("Congress debated the new healthcare reform proposal", "Politics"),
    ("The animated film won Best Picture at the festival", "Entertainment"),
    ("A new quantum computing breakthrough was published in Nature", "Tech"),
    ("The soccer team qualified for the next round of the tournament", "Sports"),
    ("Diplomatic talks between the two nations resumed after months", "Politics"),
    ("The rock band announced a reunion tour starting in June", "Entertainment"),
    ("The startup raised $50M in Series B for their AI platform", "Tech"),
    ("Olympic committee announced new rules for athlete eligibility", "Sports"),
]

CATEGORIES = ["Tech", "Sports", "Politics", "Entertainment"]

print(f"Invoice text: {len(INVOICE_TEXT)} chars")
print(f"Code snippet: {len(CODE_SNIPPET.strip().splitlines())} lines")
print(f"Labeled examples: {len(LABELED_EXAMPLES)} ({len(LABELED_EXAMPLES)//4} per category)")
print(f"Test examples: {len(TEST_CLASSIFICATION)}")
print("\nData loaded.")

---

## Exercise 1: Format Specification (Guided)

The lesson showed that format specification constrains the output distribution—format tokens in the prompt create structural anchors for attention, and autoregressive generation maintains format consistency once the first format token is generated.

In this exercise, you'll see that claim in action. You'll extract structured data from an invoice paragraph using four prompts:
1. A conversational prompt (no format specification)
2. A prompt requesting bullet points
3. A prompt requesting JSON
4. A prompt requesting a markdown table

The first format (bullet points) is fully worked. You'll construct the JSON and markdown table prompts yourself.

**Before running each cell, predict:**
- What format will the conversational prompt produce? Will it be the same every time?
- When you specify bullet points, will the model follow the format exactly?
- For JSON: what happens if you include an example schema vs just saying "return JSON"?

In [None]:
# --- Step 1: Conversational prompt (no format specification) ---
# This is the baseline. No structure, just a natural language request.

conversational_prompt = f"""Please extract the key information from the following
invoice text and organize it nicely.

{INVOICE_TEXT}"""

print("CONVERSATIONAL PROMPT (no format spec)")
print("=" * 60)
print("\nRunning 3 times with temperature=0.7 to see variation...\n")

for run in range(3):
    result = call_llm(conversational_prompt, temperature=0.7, max_tokens=400)
    print(f"--- Run {run + 1} ---")
    print_wrapped(result)
    print()

print("Notice: Does the format stay consistent across runs?")
print("The model might use bullets, paragraphs, bold text, tables...")
print("The instruction is clear English. But 'organize it nicely' is a wish, not a constraint.")

In [None]:
# --- Step 2: Bullet point format (fully worked) ---
# We specify the exact output structure: what fields to extract, in what format.

bullet_prompt = f"""Extract the following information from the invoice text below.
Return ONLY a bullet-point list with these exact fields:
- Vendor:
- Client:
- Invoice Number:
- Date:
- Total Amount:
- Payment Terms:
- Line Items: (list each with amount)

Invoice text:
{INVOICE_TEXT}"""

print("BULLET POINT FORMAT")
print("=" * 60)
print("\nRunning 3 times with temperature=0.7...\n")

bullet_results = []
for run in range(3):
    result = call_llm(bullet_prompt, temperature=0.7, max_tokens=400)
    bullet_results.append(result)
    print(f"--- Run {run + 1} ---")
    print_wrapped(result)
    print()

# Check consistency: do all 3 runs have the same structure?
all_have_vendor = all("Vendor:" in r for r in bullet_results)
all_have_amount = all("12,450" in r or "12450" in r for r in bullet_results)
print(f"All runs include 'Vendor:' field: {all_have_vendor}")
print(f"All runs include correct amount: {all_have_amount}")
print("\nThe format specification created structural anchors. The bullet-point")
print("template in the prompt constrains the output — each '- Field:' token")
print("creates an attention anchor that the model's generation follows.")

In [None]:
# --- Step 3: JSON format ---
# TODO: Write a prompt that extracts the same invoice data as JSON.
#
# Requirements:
# - Include an explicit JSON schema showing the expected structure
# - Include the fields: vendor, client, invoice_number, date, total_amount,
#   payment_terms, line_items (array of {description, amount})
# - Tell the model to return ONLY valid JSON (no markdown fences, no explanation)
#
# Hint: Showing the schema as an example is more powerful than describing it
# in words. The format tokens in the schema become attention anchors.

# TODO: Write the json_prompt (5-15 lines)
# YOUR CODE HERE
json_prompt = f"""YOUR PROMPT HERE

Invoice text:
{INVOICE_TEXT}"""

print("JSON FORMAT")
print("=" * 60)
print("\nRunning 3 times with temperature=0.7...\n")

json_results = []
for run in range(3):
    result = call_llm(json_prompt, temperature=0.7, max_tokens=500)
    json_results.append(result)
    print(f"--- Run {run + 1} ---")
    print_wrapped(result)
    print()

# Test: is the output valid JSON?
valid_count = 0
for i, r in enumerate(json_results):
    # Strip markdown code fences if present
    cleaned = r.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[-1].rsplit("```", 1)[0].strip()
    try:
        parsed = json.loads(cleaned)
        valid_count += 1
        print(f"Run {i+1}: Valid JSON with keys: {list(parsed.keys())}")
    except json.JSONDecodeError as e:
        print(f"Run {i+1}: INVALID JSON — {e}")

print(f"\n{valid_count}/3 runs produced valid JSON.")
print("The JSON schema in the prompt creates strong format anchors.")
print("Once the model generates '{', every later token is conditioned on the")
print("JSON so far, which makes valid JSON very likely but not guaranteed.")

In [None]:
# --- Step 4: Markdown table format ---
# TODO: Write a prompt that extracts the same invoice data as a markdown table.
#
# Requirements:
# - The table should have columns: Field, Value
# - Line items should each have their own row
# - Show the expected table header in the prompt as a format anchor
#
# Hint: Including the first row of the table (e.g., the header + separator)
# as part of the prompt is a powerful anchoring technique.

# TODO: Write the table_prompt (5-15 lines)
# YOUR CODE HERE
table_prompt = f"""YOUR PROMPT HERE

Invoice text:
{INVOICE_TEXT}"""

print("MARKDOWN TABLE FORMAT")
print("=" * 60)
print("\nRunning 3 times with temperature=0.7...\n")

table_results = []
for run in range(3):
    result = call_llm(table_prompt, temperature=0.7, max_tokens=500)
    table_results.append(result)
    print(f"--- Run {run + 1} ---")
    print_wrapped(result)
    print()

# Check: do all runs contain table structure?
has_pipes = [("|" in r and "---" in r) for r in table_results]
print(f"Runs with table structure (pipes + separators): {sum(has_pipes)}/3")
print("\nThree different formats, same data, same model, same weights.")
print("The ONLY difference is the format tokens in the prompt.")
print("Format specification is not about being polite — it is about")
print("providing structural anchors that constrain the output distribution.")

**What just happened:** You extracted the same data from the same text using four different prompts. The conversational prompt produced inconsistent formats. The structured prompts—bullet points, JSON, markdown table—each produced consistent output matching the requested format.

The lesson explained why: format tokens in the prompt create **structural anchors** for attention. When the model sees `| Field | Value |` in the prompt, its attention mechanism picks up those pipe characters as a strong signal. And once autoregressive generation produces the first format token (a `{` for JSON, a `|` for a table), each subsequent token is conditioned on it, so continuing the format becomes by far the most likely path. The first format token strongly shapes everything after it, but it does not guarantee validity, which is why the cells above still parse and check the output.

This is the difference between a wish and a program. "Organize it nicely" is a wish—it allows any output format. An explicit schema is a program—it constrains the output space to one specific structure.
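
One way to make "consistent" measurable is to label each response with a coarse format tag and count the distinct tags across runs. The sketch below is a rough heuristic, not part of the lesson: `detect_format` and its rules are illustrative assumptions, and the sample strings are made up rather than real model output.

```python
import json
from collections import Counter


def detect_format(text: str) -> str:
    """Return a coarse format label for a model response (heuristic)."""
    stripped = text.strip()
    # JSON: the whole response parses as an object or array.
    try:
        parsed = json.loads(stripped)
        if isinstance(parsed, (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    # Markdown table: at least two lines that start with a pipe.
    if sum(ln.startswith("|") for ln in lines) >= 2:
        return "table"
    # Bullets: at least half of the lines start with a list marker.
    bullet_lines = sum(ln.startswith(("- ", "* ", "• ")) for ln in lines)
    if bullet_lines / len(lines) >= 0.5:
        return "bullets"
    return "prose"


# Made-up responses standing in for three runs of the same prompt.
sample_runs = [
    "- Vendor: Acme Corporation\n- Client: Globex Industries",
    "| Field | Value |\n| --- | --- |\n| Vendor | Acme Corporation |",
    "The invoice was issued by Acme Corporation to Globex Industries.",
]
format_counts = Counter(detect_format(r) for r in sample_runs)
print(format_counts)
```

In the notebook, you could pass the conversational results and `bullet_results` through the same function and compare how many distinct labels each prompt yields.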

<details>
<summary>Solution</summary>

**Why including a schema is more powerful than describing the format in words:** A JSON schema like `{"vendor": "...", "date": "..."}` puts the actual format tokens into the context. The model's attention directly matches these tokens during generation. Describing the format in natural language ("return a JSON object with a vendor field") is weaker because the model must translate the description into format tokens—an extra step that introduces variance.

**JSON prompt:**
```python
json_prompt = f"""Extract information from the invoice text below.
Return ONLY valid JSON (no markdown, no explanation) matching this exact schema:

{{
  "vendor": "string",
  "client": "string",
  "invoice_number": "string",
  "date": "string",
  "total_amount": number,
  "payment_terms": "string",
  "line_items": [
    {{"description": "string", "amount": number}}
  ]
}}

Invoice text:
{INVOICE_TEXT}"""
```
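
The doubled braces above are needed because the schema sits inside an f-string. An alternative sketch, assuming you prefer to keep the schema as a Python dict, is to serialize an example object with `json.dumps` and concatenate it into the prompt. `SCHEMA_EXAMPLE`, `build_json_prompt`, and the placeholder values are illustrative names, not part of the lesson.

```python
import json

# Example object whose shape the model should copy (values are placeholders).
SCHEMA_EXAMPLE = {
    "vendor": "string",
    "client": "string",
    "invoice_number": "string",
    "date": "string",
    "total_amount": 0.0,
    "payment_terms": "string",
    "line_items": [{"description": "string", "amount": 0.0}],
}


def build_json_prompt(invoice_text: str) -> str:
    """Build an extraction prompt with the schema rendered by json.dumps."""
    schema_block = json.dumps(SCHEMA_EXAMPLE, indent=2)
    return (
        "Extract information from the invoice text below.\n"
        "Return ONLY valid JSON (no markdown, no explanation) "
        "matching this exact structure:\n\n"
        f"{schema_block}\n\n"
        f"Invoice text:\n{invoice_text}"
    )


prompt = build_json_prompt("Acme Corporation issued invoice #INV-2024-0892.")
print(prompt)
```

The trade-off is that numeric placeholders such as `0.0` are real values rather than the word `number`, so the schema shown to the model is itself valid JSON, but the model may occasionally echo a placeholder.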

**Markdown table prompt:**
```python
table_prompt = f"""Extract information from the invoice text below into a markdown table.
Use exactly this format—two columns, Field and Value.
Each line item gets its own row.

Start your response with this header:

| Field | Value |
| --- | --- |

Invoice text:
{INVOICE_TEXT}"""
```

**Common mistake:** Not including the actual format tokens in the prompt. Saying "return a table" is weaker than showing the table header. The format tokens themselves are the attention anchors.

</details>
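
The checks in the exercise cells only confirm that the output parses as JSON or contains pipes and dashes. A stricter sketch, under the assumption that you also want to verify the requested fields, might look like the following. `REQUIRED_KEYS`, `missing_json_fields`, and `parse_markdown_table` are hypothetical helpers, and the sample strings are hand-written rather than real model output.

```python
import json

REQUIRED_KEYS = {
    "vendor", "client", "invoice_number", "date",
    "total_amount", "payment_terms", "line_items",
}


def missing_json_fields(response: str) -> set[str]:
    """Return the required keys absent from a JSON object response.

    Raises json.JSONDecodeError for invalid JSON and TypeError when the
    top-level value is not an object.
    """
    parsed = json.loads(response)
    if not isinstance(parsed, dict):
        raise TypeError("expected a JSON object at the top level")
    return REQUIRED_KEYS - set(parsed)


def parse_markdown_table(response: str) -> list[dict[str, str]]:
    """Parse a simple pipe table into one dict per data row.

    Assumes the second pipe row is the --- separator.
    """
    rows = []
    for line in response.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose around the table
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]
    return [dict(zip(header, cells)) for cells in body]


sample_json = '{"vendor": "Acme Corporation", "client": "Globex Industries"}'
sample_table = (
    "| Field | Value |\n"
    "| --- | --- |\n"
    "| Vendor | Acme Corporation |\n"
    "| Total Amount | $12,450.00 |"
)
missing = missing_json_fields(sample_json)
table_rows = parse_markdown_table(sample_table)
print(sorted(missing))
print(table_rows)
```

Applied to `json_results` and `table_results`, these helpers would report which fields a run dropped, something the parse-only checks cannot show.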

---

## Exercise 2: Role Prompting Effects (Supported)

The lesson demonstrated that role prompts do not just change the *style* of the response—they change *what the model attends to* in the input. A security-focused role surfaces security issues that a generic review misses entirely.

In this exercise, you'll test that claim. You have a Python function with known issues spanning three categories: security vulnerabilities, performance problems, and style issues. You'll write prompts with three different roles and compare which issues each role finds.

The first role (security auditor) is provided. You'll write the performance engineer and code style reviewer roles, then observe how each role biases the model's attention toward different features of the same code.

**Before running, predict:**
- Will the security auditor find the SQL injection? What about the performance issues?
- Will the performance engineer mention security at all?
- What happens when you try a "combined" role that covers all three concerns?

In [None]:
# --- Role 1: Security Auditor (provided) ---

security_system = """You are a senior application security auditor. Your sole focus
is identifying security vulnerabilities in code. You look for: injection attacks,
authentication/authorization flaws, data exposure risks, and input validation
issues. You do NOT comment on code style, performance, or readability — only
security. List each vulnerability with a severity rating (Critical/High/Medium/Low)."""

review_user_prompt = f"""Review this Python function for issues:\n\n```python\n{CODE_SNIPPET}```"""

print("ROLE 1: SECURITY AUDITOR")
print("=" * 60)
security_result = call_llm_with_system(security_system, review_user_prompt, max_tokens=600)
print_wrapped(security_result)
print("\n" + "-" * 60)
print("Note which issues this role found. Does it mention performance?")
print("Does it mention style? The role constrains what the model attends to.")

In [None]:
# --- Role 2: Performance Engineer ---
# TODO: Write a system prompt for a performance engineer.
#
# Requirements:
# - The role should focus ONLY on performance issues (not security, not style)
# - Mention what performance concerns to look for: unnecessary iteration,
#   memory usage, database query efficiency, etc.
# - Ask for specific improvement suggestions
#
# The key insight: the role tokens bias attention toward performance-related
# features of the code. "Unnecessary iteration" in the system prompt makes
# the for-loop more salient to the attention mechanism.

# TODO: Write performance_system (3-6 lines of text)
# YOUR CODE HERE
performance_system = """YOUR SYSTEM PROMPT HERE"""

print("ROLE 2: PERFORMANCE ENGINEER")
print("=" * 60)
performance_result = call_llm_with_system(performance_system, review_user_prompt, max_tokens=600)
print_wrapped(performance_result)
print("\n" + "-" * 60)
print("Compare to the security auditor. Different issues from the same code?")
print("The role tokens shifted what the model attended to in the code.")

In [None]:
# --- Role 3: Code Style Reviewer ---
# TODO: Write a system prompt for a code style / readability reviewer.
#
# Requirements:
# - Focus ONLY on readability, naming, documentation, Pythonic idioms
# - Do NOT mention security or performance
# - Ask for specific style improvements following Python best practices

# TODO: Write style_system (3-6 lines of text)
# YOUR CODE HERE
style_system = """YOUR SYSTEM PROMPT HERE"""

print("ROLE 3: CODE STYLE REVIEWER")
print("=" * 60)
style_result = call_llm_with_system(style_system, review_user_prompt, max_tokens=600)
print_wrapped(style_result)
print("\n" + "-" * 60)
print("Three roles, three different sets of issues from the SAME code.")
print("The model's weights are identical. The role text in the system prompt")
print("biased attention toward different features of the input.")

In [None]:
# --- Comparison: What did each role find? ---
# Let's use the LLM itself to extract a structured comparison.

comparison_prompt = f"""I asked three different reviewers to review the same code.
Summarize what each found in a brief list. Use ONLY the issues they actually
mentioned — do not add issues they missed.

Security Auditor said:
{security_result}

Performance Engineer said:
{performance_result}

Code Style Reviewer said:
{style_result}

Format your response as:
SECURITY AUDITOR found: [brief list]
PERFORMANCE ENGINEER found: [brief list]
STYLE REVIEWER found: [brief list]
OVERLAP: [issues mentioned by more than one reviewer, or "none"]"""

print("COMPARISON ACROSS ROLES")
print("=" * 60)
comparison = call_llm(comparison_prompt, max_tokens=500)
print_wrapped(comparison)
print("\n" + "-" * 60)
print("\nThe code has the SAME issues regardless of who reviews it.")
print("But each role surfaces different issues because the role tokens")
print("bias attention toward different features of the code.")
print("\nRemember from the lesson: 'SFT teaches format, not knowledge.'")
print("Role prompts apply a similar principle at inference time: they shape")
print("WHICH parts of its existing knowledge the model surfaces, not WHAT it knows.")

**What just happened:** Three different roles produced three different reviews of the same code. The security auditor found injection vulnerabilities. The performance engineer found inefficient iteration and query patterns. The style reviewer found naming and documentation issues. The overlap was likely minimal—each role focused the model's attention on different features of the input.

This is not just a style change. The roles caused the model to *find different things*. The role tokens in the system prompt add entries to the K/V cache that bias the attention distribution. When the model's Q vectors process the code, they attend to both the code tokens AND the role tokens. The word "security" in the role makes security-related code patterns (like `f"SELECT..."`) more salient to the attention mechanism.

But notice: the role did not give the model any new knowledge about security or performance. It shifted the model's focus within what it already knows. This is exactly the lesson's point: role prompts shape focus, not knowledge.
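
The comparison cell asks the model to summarize its own reviews, which adds another source of variation. A deterministic cross-check is to scan each review for keywords tied to the known issues. The sketch below assumes a small hand-picked keyword map (`ISSUE_KEYWORDS` is illustrative and far from exhaustive) and uses made-up review text.

```python
# Hypothetical keyword map: a few phrases per issue category.
ISSUE_KEYWORDS = {
    "security": ["sql injection", "parameterized", "select *"],
    "performance": ["fetchone", "all rows", "unnecessary iteration"],
    "style": ["type hint", "docstring", "magic number"],
}


def tag_categories(review: str) -> dict[str, list[str]]:
    """Map each issue category to the keywords found in the review."""
    text = review.lower()
    return {
        category: [kw for kw in keywords if kw in text]
        for category, keywords in ISSUE_KEYWORDS.items()
    }


# Made-up review text, standing in for `security_result`.
sample_review = (
    "Critical: SQL injection via the f-string query. "
    "Use parameterized queries instead."
)
hits = tag_categories(sample_review)
print(hits)
```

Running it on `security_result`, `performance_result`, and `style_result` gives a rough overlap table. Keyword matching misses paraphrases, so an empty list means "not detected" rather than "not mentioned".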

<details>
<summary>Solution</summary>

**Why focused roles outperform generic "review everything" prompts:** A focused role creates stronger attention bias. If the system prompt contains both "security" and "performance" and "style" tokens, the attention is distributed across all three concerns. A focused role concentrates attention on one concern, making it more likely to find subtle issues in that category.

**Performance Engineer system prompt:**
```python
performance_system = """You are a senior performance engineer. Your sole focus is
identifying performance bottlenecks and inefficiencies in code. You look for:
unnecessary iteration, excessive memory allocation, inefficient database queries,
redundant computation, and missed optimization opportunities. You do NOT comment
on security, code style, or readability—only performance. For each issue,
explain the performance impact and suggest a more efficient alternative."""
```

**Code Style Reviewer system prompt:**
```python
style_system = """You are a Python code style reviewer focused on readability and
Pythonic idioms. You look for: missing type hints, missing docstrings, unclear
variable names, non-Pythonic patterns (e.g., index-based iteration instead of
unpacking), magic numbers, and violations of PEP 8. You do NOT comment on
security or performance—only style and readability. Suggest specific
improvements for each issue."""
```

**Common finding:** The security auditor finds SQL injection (Critical) and SELECT * data exposure. The performance engineer finds: iterating all rows when only one is needed, manual dict construction instead of row factories, and possibly the redundant string concatenation. The style reviewer finds: no type hints, no docstring, magic number 2024, no error handling, and index-based column access. Overlap is minimal.
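
To test the "combined role" prediction from the start of the exercise, you could generate focused and combined system prompts from one template, so that the wording stays constant and only the list of concerns changes. This is a sketch under that assumption; `FOCUS_AREAS` and `build_role_prompt` are hypothetical helpers, not part of the lesson.

```python
# Hypothetical focus descriptions, loosely based on the roles above.
FOCUS_AREAS = {
    "security": "injection attacks, data exposure, and input validation",
    "performance": "unnecessary iteration, memory use, and query efficiency",
    "style": "type hints, docstrings, naming, and Pythonic idioms",
}


def build_role_prompt(concerns: list[str]) -> str:
    """Build a reviewer system prompt restricted to the given concerns."""
    unknown = [c for c in concerns if c not in FOCUS_AREAS]
    if unknown:
        raise ValueError(f"unknown concerns: {unknown}")
    excluded = [c for c in FOCUS_AREAS if c not in concerns]
    lines = [f"You are a senior code reviewer focused on {', '.join(concerns)}."]
    for concern in concerns:
        lines.append(f"For {concern}, look for: {FOCUS_AREAS[concern]}.")
    if excluded:
        # Only focused roles get an explicit exclusion line.
        lines.append(f"Do NOT comment on {' or '.join(excluded)}.")
    lines.append("List each issue with a one-line suggested fix.")
    return "\n".join(lines)


focused_prompt = build_role_prompt(["security"])
combined_prompt = build_role_prompt(["security", "performance", "style"])
print(focused_prompt)
print()
print(combined_prompt)
```

Either string can be passed as `system_prompt` to `call_llm_with_system` together with `review_user_prompt`, which lets you compare how many issues per category the combined role surfaces against each focused role.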

</details>

---

## Exercise 3: Few-Shot Example Selection (Supported)

The lesson taught three principles for few-shot example selection: diversity over quantity, format consistency, and difficulty calibration. In this exercise, you'll test the first principle empirically.

You have a 4-category text classification task (Tech / Sports / Politics / Entertainment) with 16 labeled examples and 10 test examples. You'll compare four example selection strategies:

1. **3 random examples**—baseline
2. **3 diverse examples**—one from each of three different categories
3. **3 same-category examples**—all from one category (Tech)
4. **5 random examples**—more examples, but no diversity guarantee

For each strategy, you'll run 5 trials (with different random selections where applicable) and measure accuracy. The prompt-building helpers and the evaluation loop are provided. Your job is to design the example sets and interpret the results.

**Before running, predict:**
- Will 3 diverse examples outperform 5 random examples? The lesson says diversity > quantity.
- What will 3 same-category examples do? They provide format consistency but no category diversity.
- How much variance will you see across the 5 trials for random selection?
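
Before predicting, it may help to know how often a random draw is already diverse. The sketch below enumerates every possible draw, assuming the label layout used in this notebook (four categories with four labeled examples each); `LABELS` and `coverage_distribution` are illustrative names.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Labels only: 4 categories x 4 examples, mirroring LABELED_EXAMPLES.
LABELS = [cat for cat in ("Tech", "Sports", "Politics", "Entertainment")
          for _ in range(4)]


def coverage_distribution(labels: list[str], k: int) -> dict[int, Fraction]:
    """Exact probability of covering n distinct categories in a draw of k."""
    counts = Counter()
    # Iterate over index combinations so equal labels stay distinct items.
    for idx in combinations(range(len(labels)), k):
        counts[len({labels[i] for i in idx})] += 1
    total = sum(counts.values())
    return {n: Fraction(c, total) for n, c in sorted(counts.items())}


for k in (3, 5):
    dist = coverage_distribution(LABELS, k)
    summary = ", ".join(f"{n} categories: {float(p):.1%}" for n, p in dist.items())
    print(f"draw of {k} -> {summary}")
```

Under this layout, fewer than half of the 3-example random draws cover three categories (16/35), and about a third of the 5-example draws cover all four (32/91), so "random" and "diverse" overlap more than the strategy names suggest.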

In [None]:
# --- Helper functions (provided) ---

def build_classification_prompt(examples: list[tuple[str, str]], test_text: str) -> str:
    """Build a few-shot classification prompt from labeled examples."""
    lines = ["Classify the following text into one of these categories: "
             "Tech, Sports, Politics, Entertainment."]
    lines.append("Respond with ONLY the category name.\n")
    for text, label in examples:
        lines.append(f'Text: "{text}"\nCategory: {label}\n')
    lines.append(f'Text: "{test_text}"\nCategory:')
    return "\n".join(lines)


def extract_category(response: str) -> str:
    """Extract a category from a model response."""
    response_lower = response.lower().strip()
    for cat in CATEGORIES:
        if cat.lower() in response_lower:
            return cat
    return "Unknown"


def run_classification_trial(examples: list[tuple[str, str]],
                              test_data: list[tuple[str, str]]) -> float:
    """Run one classification trial. Returns accuracy."""
    correct = 0
    for text, true_label in test_data:
        prompt = build_classification_prompt(examples, text)
        response = call_llm(prompt)
        pred = extract_category(response)
        if pred == true_label:
            correct += 1
    return correct / len(test_data)


print("Helper functions loaded.")
print("\nSample classification prompt:")
sample = build_classification_prompt(LABELED_EXAMPLES[:2], "Test headline here")
print(sample)

In [None]:
# --- Define the four example selection strategies ---

rng = random.Random(42)

# Strategy 1: 3 random examples (5 different random draws)
random_3_trials = []
for _ in range(5):
    selected = rng.sample(LABELED_EXAMPLES, 3)
    random_3_trials.append(selected)

# Strategy 2: 3 diverse examples — one per category, covering 3 of 4 categories
# TODO: Create 5 sets of 3 examples, each covering 3 different categories.
# For each set, pick one example from each of 3 categories.
#
# Hint: You can group LABELED_EXAMPLES by category, then sample one from
# each of three different category groups.
#
# YOUR CODE HERE (5-12 lines)
diverse_3_trials = []
# Group examples by category
by_category = {}
for text, label in LABELED_EXAMPLES:
    by_category.setdefault(label, []).append((text, label))

# TODO: Create 5 diverse sets. For each, pick 3 categories and one example from each.


# Strategy 3: 3 same-category examples (all Tech)
# TODO: Create 5 sets of 3 examples, all from the Tech category.
# Use different Tech examples for each set when possible.
#
# YOUR CODE HERE (3-6 lines)
same_category_trials = []


# Strategy 4: 5 random examples (5 different random draws)
random_5_trials = []
for _ in range(5):
    selected = rng.sample(LABELED_EXAMPLES, 5)
    random_5_trials.append(selected)


print("Example sets created.")
print(f"  Strategy 1 (3 random):        {len(random_3_trials)} trial sets")
print(f"  Strategy 2 (3 diverse):        {len(diverse_3_trials)} trial sets")
print(f"  Strategy 3 (3 same-category):  {len(same_category_trials)} trial sets")
print(f"  Strategy 4 (5 random):         {len(random_5_trials)} trial sets")
print("\nSample diverse set categories:", [ex[1] for ex in diverse_3_trials[0]] if diverse_3_trials else "(empty — fill in the TODO)")
print("Sample same-category set categories:", [ex[1] for ex in same_category_trials[0]] if same_category_trials else "(empty — fill in the TODO)")

In [None]:
# --- Run all trials ---
# This will make 4 strategies x 5 trials x 10 test examples = 200 API calls.
# With gpt-4o-mini, this should take about 1-2 minutes.

strategies = {
    "3 Random": random_3_trials,
    "3 Diverse": diverse_3_trials,
    "3 Same-Cat": same_category_trials,
    "5 Random": random_5_trials,
}

results = {}
for name, trials in strategies.items():
    print(f"Running {name}...", end=" ")
    trial_accs = []
    for trial_examples in trials:
        acc = run_classification_trial(trial_examples, TEST_CLASSIFICATION)
        trial_accs.append(acc)
    results[name] = trial_accs
    mean_acc = np.mean(trial_accs)
    std_acc = np.std(trial_accs)
    print(f"mean={mean_acc:.0%}, std={std_acc:.0%}, trials={[f'{a:.0%}' for a in trial_accs]}")

print("\nAll trials complete.")

In [None]:
# --- Visualize the results ---

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: mean accuracy with error bars
strategy_names = list(results.keys())
means = [np.mean(results[s]) * 100 for s in strategy_names]
stds = [np.std(results[s]) * 100 for s in strategy_names]
colors = ["#f59e0b", "#6366f1", "#ef4444", "#10b981"]

bars = ax1.bar(strategy_names, means, color=colors, edgecolor="white",
               linewidth=0.5, width=0.6, yerr=stds, capsize=5,
               error_kw={"color": "white", "linewidth": 1.5})
for bar, val in zip(bars, means):
    ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 3,
             f"{val:.0f}%", ha="center", va="bottom", fontsize=12,
             fontweight="bold", color="white")

ax1.set_ylabel("Accuracy (%)", fontsize=11)
ax1.set_title("Mean Accuracy by Example Selection Strategy", fontsize=13,
              fontweight="bold")
ax1.set_ylim(0, 110)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

# Right plot: individual trial results (scatter + box)
positions = range(len(strategy_names))
for i, (name, color) in enumerate(zip(strategy_names, colors)):
    trial_accs = [a * 100 for a in results[name]]
    # Jittered scatter
    jitter = np.random.uniform(-0.15, 0.15, len(trial_accs))
    ax2.scatter([i + j for j in jitter], trial_accs, color=color,
                s=60, alpha=0.8, edgecolors="white", linewidth=0.5, zorder=3)
    # Mean line
    ax2.hlines(np.mean(trial_accs), i - 0.3, i + 0.3, color=color,
               linewidth=2, zorder=4)

ax2.set_xticks(list(positions))
ax2.set_xticklabels(strategy_names)
ax2.set_ylabel("Accuracy (%)", fontsize=11)
ax2.set_title("Individual Trial Results (5 trials each)", fontsize=13,
              fontweight="bold")
ax2.set_ylim(0, 110)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)

plt.suptitle("Example Selection: Diversity vs Quantity",
             fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

# Print interpretation
print("\nInterpretation:")
diverse_mean = np.mean(results["3 Diverse"])
random5_mean = np.mean(results["5 Random"])
samecat_mean = np.mean(results["3 Same-Cat"])
random3_mean = np.mean(results["3 Random"])

print(f"  3 Diverse ({diverse_mean:.0%}) vs 5 Random ({random5_mean:.0%}):")
if diverse_mean >= random5_mean:
    print("    -> Diversity matched or beat quantity: 3 well-chosen >= 5 random.")
else:
    print("    -> Quantity won this time. But check the variance.")

print(f"  3 Same-Cat ({samecat_mean:.0%}) vs 3 Diverse ({diverse_mean:.0%}):")
if diverse_mean >= samecat_mean:
    print("    -> Category diversity matters. Same-category examples")
    print("       create K/V patterns biased toward one category.")
else:
    print("    -> Same-category performed well, but likely only for")
    print("       test examples in that category.")

print(f"\n  Random 3 std dev: {np.std(results['3 Random']):.0%}")
print(f"  Random 5 std dev: {np.std(results['5 Random']):.0%}")
print("  Higher spread = more sensitivity to which examples are selected.")
print("  This connects to the ICL lesson's ordering sensitivity finding:")
print("  when the model's behavior depends on surface features of the prompt,")
print("  the CHOICE of examples matters as much as their ORDERING.")

**What just happened:** You ran a controlled experiment comparing four example selection strategies on the same classification task. The results should show that:

1. **Diverse examples outperform or match random selection**—even with fewer examples. Diversity creates richer K/V patterns in attention, covering more of the input space.
2. **Same-category examples create bias**—the model does well on that category but poorly on others, because the attention pattern is dominated by one type of content.
3. **More random examples help, but not as much as better examples**—5 random may beat 3 random, but 3 diverse often beats both.
4. **Random selection has high variance**—different random draws produce very different accuracy. This is the same fragility as ordering sensitivity: surface features of the prompt change behavior.

The lesson's principle: **diversity over quantity, format consistency over more examples.** You can now see the empirical evidence.
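
With only 5 trials per strategy, it helps to look at the spread alongside the mean before declaring a winner. The sketch below is one way to do that using only the standard library. `summarize_trials` is a hypothetical helper, not part of the notebook, and the accuracies in `example_results` are placeholder values, not measured results. Note that `np.std` in the cell above defaults to the population standard deviation; this sketch uses the sample standard deviation, which is slightly larger for small samples.

```python
import math
import statistics


def summarize_trials(results):
    """Return {strategy: (mean, sample_std, standard_error)}.

    `results` maps a strategy name to a non-empty list of per-trial
    accuracies in [0, 1], the same shape as the `results` dict above.
    """
    summary = {}
    for name, accs in results.items():
        n = len(accs)
        mean = statistics.mean(accs)
        # Sample standard deviation needs at least two trials.
        std = statistics.stdev(accs) if n > 1 else 0.0
        sem = std / math.sqrt(n)
        summary[name] = (mean, std, sem)
    return summary


# Placeholder accuracies for illustration only (NOT real results).
example_results = {
    "3 Diverse": [0.9, 1.0, 0.9, 1.0, 0.9],
    "3 Random": [0.6, 0.9, 0.7, 1.0, 0.8],
}

for name, (mean, std, sem) in summarize_trials(example_results).items():
    print(f"{name:10s} mean={mean:.0%}  std={std:.0%}  sem={sem:.0%}")
```

If the gap between two strategy means is smaller than their standard errors, treat the comparison as inconclusive and run more trials.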

<details>
<summary>Solution</summary>

**Why diverse examples work:** Each example adds K/V entries to the attention context. Diverse examples create K/V patterns that cover the input space—the test input's Q vectors can find a good match regardless of its category. Same-category examples create a narrow K/V pattern that only matches one type of input well.

**Diverse trials:**
```python
diverse_3_trials = []
cat_list = list(by_category.keys())
for trial in range(5):
    # Pick 3 of 4 categories
    chosen_cats = rng.sample(cat_list, 3)
    trial_examples = []
    for cat in chosen_cats:
        trial_examples.append(rng.choice(by_category[cat]))
    diverse_3_trials.append(trial_examples)
```

**Same-category trials:**
```python
same_category_trials = []
tech_examples = by_category["Tech"]
for trial in range(5):
    selected = rng.sample(tech_examples, min(3, len(tech_examples)))
    same_category_trials.append(selected)
```

**Common finding:** With `gpt-4o-mini` on this simple classification task, diverse examples usually achieve 80-100% accuracy. Same-category examples often get Tech examples right but miss others. The variance tells the story: good selection is more reliable than good luck.
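
**Checking the category bias directly:** Overall accuracy can hide the same-category effect. A per-category breakdown makes it visible. The sketch below assumes you have parallel lists of true and predicted labels; `accuracy_by_category` is a hypothetical helper, and the labels shown are placeholders (only "Tech" is a category name taken from this notebook).

```python
from collections import defaultdict


def accuracy_by_category(true_labels, predicted_labels):
    """Return {category: accuracy} computed over the true labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        # Compare case-insensitively and ignore stray whitespace in model output.
        if pred.strip().lower() == truth.strip().lower():
            correct[truth] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Placeholder labels for illustration only (NOT real model output).
truth = ["Tech", "Tech", "Sports", "Sports"]
preds = ["Tech", "tech ", "Tech", "Sports"]

for cat, acc in accuracy_by_category(truth, preds).items():
    print(f"{cat}: {acc:.0%}")
```

A strategy that scores high on one category and low on the rest is showing exactly the bias described above, even if its overall accuracy looks acceptable.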

</details>

---

## Exercise 4: Build a Structured Prompt (Independent)

You have now practiced the three core techniques from this lesson: format specification (Exercise 1), role prompting (Exercise 2), and example selection (Exercise 3). In this exercise, you'll compose them into a single structured prompt for a real task.

**Your task:** Design a complete prompt that generates a meeting summary from raw meeting notes. Your prompt must use **at least 3 techniques** from the lesson:

- Format specification (define the output structure)
- Role / system prompt (set the behavioral frame)
- Few-shot example (show one input-output pair)
- Output constraints (specify length, tone, what to include/exclude)

**Specification:**
1. Write the structured prompt (system + user, or a single prompt with all components)
2. Test it on 3 different sets of meeting notes (provided below)
3. Evaluate consistency: does the output follow the same format each time?
4. Reflect: which components contributed most to consistency?

**No skeleton is provided.** Build the prompt from scratch. Think about which components you need and why. The solution is in the `<details>` block below.

In [None]:
# --- Three sets of raw meeting notes to test your prompt on ---

MEETING_NOTES_1 = """Team standup March 3
- Sarah: finished auth module, PR is up, needs review from David
- David: stuck on database migration, index creation taking too long on prod-size data,
  might need to do it in batches. Will pair with Maria tomorrow
- Maria: launched A/B test for new onboarding flow, results in 2 weeks.
  Also mentioned we need to update the API docs before the partner integration next month
- Jake: out sick, sent update via Slack — mobile bug fix is ready for QA
- Action items: David to create migration batching plan by Thursday.
  Sarah to update API docs. Maria to share A/B test setup with the team."""

MEETING_NOTES_2 = """Product planning meeting — Q2 roadmap review
Date: March 5
Attendees: Lisa (PM), Tom (eng lead), Priya (design), Ahmed (data)

Lisa presented Q2 priorities: 1) self-serve onboarding, 2) analytics dashboard,
3) API v2. Budget approved for 2 new hires (1 frontend, 1 data eng).

Tom raised concern about API v2 timeline — current architecture won't scale.
Needs to do a spike on new approach. 2 weeks.

Priya showed mockups for analytics dashboard. Team liked the direction but
wants more filtering options. Priya to revise by March 12.

Ahmed needs access to production data for analytics work. Lisa to file
security review request.

Decision: push API v2 start date to April to allow for architecture spike.
Decision: hire frontend engineer first, data eng in May."""

MEETING_NOTES_3 = """Incident retro — March 4 outage
What happened: deployment at 2:15pm triggered a memory leak in the notification
service. Service OOMed at 2:47pm. Users couldn't receive notifications for
52 minutes. Rollback completed at 3:39pm.

Root cause: new feature added an unbounded in-memory cache for user preferences.
No cache eviction policy. Cache grew until OOM.

What went well: monitoring caught it quickly (alert at 2:48pm, 1 min after OOM).
Rollback process worked smoothly.

What went wrong: no load testing on the new feature. Code review didn't catch
the missing eviction policy. No memory limits on the container.

Action items:
- Add memory limits to all containers (ops team, by March 8)
- Add load testing to CI pipeline (platform team, by March 15)
- Code review checklist update: add "cache eviction" check (eng leads, by March 7)
- Post-mortem doc to be shared with eng org (Tom, by March 6)"""

all_notes = [
    ("Team Standup", MEETING_NOTES_1),
    ("Product Planning", MEETING_NOTES_2),
    ("Incident Retro", MEETING_NOTES_3),
]

print("Three sets of meeting notes loaded:")
for name, notes in all_notes:
    print(f"  - {name} ({len(notes)} chars)")

In [None]:
# Your code here. Design and test your structured prompt.
#
# 1. Define your prompt (system prompt + user prompt, or a single combined prompt)
# 2. Test it on all 3 sets of meeting notes
# 3. Print each result
# 4. Evaluate: does the format stay consistent across different meeting types?
#
# Use at least 3 techniques from the lesson:
#   - Format specification
#   - Role / system prompt
#   - Few-shot example OR output constraints
#   - (bonus) Any other technique



In [None]:
# --- Reflection ---
# After testing your prompt, answer these questions:
#
# 1. Did the output format stay consistent across all 3 meeting types?
#    (standup vs planning vs retro)
#
# 2. Which component of your prompt contributed MOST to consistency?
#    (format spec? role? example?)
#
# 3. If you removed the format specification but kept everything else,
#    what would happen? (Try it if you want.)
#
# 4. What would you change if the task switched from "meeting summary"
#    to "extract only action items with owners and deadlines"?
#    Which components of your prompt would stay, which would change?

# Print your reflections here (or just think through them):
print("Reflection:")
print("  1. Format consistency across meeting types: ...")
print("  2. Most impactful component: ...")
print("  3. Without format spec: ...")
print("  4. For a different task: ...")

<details>
<summary>Solution</summary>

**The key insight:** A structured prompt for meeting summaries needs three things: (1) a role to set the behavioral frame, (2) a format specification to constrain the output, and (3) explicit instructions about what to include and exclude. The few-shot example is optional but powerful—it shows the model the exact contract you expect.

**Why this combination works:** The role ("executive assistant") biases attention toward summary-relevant information. The format spec (markdown with specific sections) constrains the output structure. The output constraints ("max 4 bullet points", "every action item MUST have an owner and a date") narrow the distribution further. Together, these components create a prompt that produces consistent, useful summaries regardless of the meeting type.

```python
# System prompt: role + output constraints
summary_system = """You are an experienced executive assistant who writes concise,
actionable meeting summaries. You focus on decisions, action items, and key
discussion points. You omit small talk, redundant details, and anything not
actionable. You always follow the exact format specified by the user."""

# User prompt: format specification + the actual notes (no few-shot example in this version)
def build_summary_prompt(meeting_notes: str) -> str:
    return f"""Summarize the following meeting notes using this EXACT format:

## Summary
[1-2 sentence overview of what the meeting was about]

## Key Decisions
- [decision 1]
- [decision 2]
(if no decisions were made, write "No decisions recorded.")

## Action Items
- [ ] [task]—**[owner]**, due [date]
(every action item MUST have an owner and a date)

## Discussion Highlights
- [important point 1]
- [important point 2]
(max 4 bullet points)

---

Meeting notes:
{meeting_notes}"""


# Test on all 3
print("STRUCTURED MEETING SUMMARIES")
print("=" * 60)
for name, notes in all_notes:
    print(f"\n{'='*60}")
    print(f"MEETING: {name}")
    print(f"{'='*60}")
    prompt = build_summary_prompt(notes)
    result = call_llm_with_system(summary_system, prompt, max_tokens=600)
    print_wrapped(result)
```
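
Step 3 of the specification asks whether the output follows the same format each time. Reading three summaries by eye works, but a small checker scales better. The sketch below assumes the four-section template from the solution above; `split_sections` and `check_summary_format` are hypothetical helpers, and `sample` is a hand-written string, not model output.

```python
import re

REQUIRED_SECTIONS = ["Summary", "Key Decisions", "Action Items", "Discussion Highlights"]


def split_sections(summary):
    """Map each '## Header' to the list of lines that follow it."""
    sections = {}
    current = None
    for line in summary.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def check_summary_format(summary, max_highlights=4):
    """Return a dict of pass/fail checks against the template."""
    sections = split_sections(summary)
    highlights = [l for l in sections.get("Discussion Highlights", [])
                  if l.lstrip().startswith("- ")]
    actions = [l for l in sections.get("Action Items", [])
               if l.lstrip().startswith("- [")]
    return {
        # All four headers present, in template order, with no extras.
        "headers_ok": list(sections) == REQUIRED_SECTIONS,
        "highlights_ok": len(highlights) <= max_highlights,
        # Heuristic: a bolded owner and the word "due" on every action item.
        "actions_ok": all("**" in l and "due" in l.lower() for l in actions),
    }


# Hand-written sample for illustration (NOT real model output).
sample = """## Summary
Daily standup covering auth, migration, and the onboarding A/B test.

## Key Decisions
No decisions recorded.

## Action Items
- [ ] Create migration batching plan, **David**, due Thursday

## Discussion Highlights
- Index creation is slow on prod-size data
- A/B test results expected in 2 weeks
"""

print(check_summary_format(sample))
```

In the notebook you would call `check_summary_format(result)` inside the test loop and compare the three dictionaries. The checks are deliberately loose heuristics; tighten them if your template differs.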

**Techniques used:**
1. **Role prompt** (system): "experienced executive assistant" biases toward concise, actionable summaries
2. **Format specification**: Exact markdown template with sections (Summary, Key Decisions, Action Items, Discussion Highlights)
3. **Output constraints**: Max 4 bullet points for discussion, action items must have owner and date, "No decisions" fallback

**Reflection answers:**
1. The format should stay consistent—all three summaries follow the same markdown template, even though the meetings are very different types.
2. The format specification contributes the most to consistency (just as Exercise 1 showed). The markdown template with section headers is the strongest structural anchor.
3. Without the format spec, the role alone would produce reasonable summaries but in varying formats—sometimes paragraphs, sometimes bullets, sometimes with or without action items.
4. For "extract action items only": keep the role, change the format spec to just an action items list, add a constraint about ignoring non-actionable content. The role is reusable; the format spec is task-specific.
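
**Sketch for reflection question 4:** One way the action-items variant could look. The role in `summary_system` stays as it is; only the user prompt changes. `build_action_items_prompt` is a hypothetical name, and the exact rules and the `|` separator are assumptions you should adapt to your own needs.

```python
def build_action_items_prompt(meeting_notes: str) -> str:
    """Build a user prompt that extracts only action items (illustrative sketch)."""
    return f"""Extract ONLY the action items from the meeting notes below, using this EXACT format:

## Action Items
- [ ] [task] | **[owner]** | due [date]

Rules:
- One line per action item.
- If the notes give no owner or no date, write "unassigned" or "no date" instead of guessing.
- Ignore discussion, decisions, and status updates that do not assign work.
- If there are no action items, write "No action items recorded."

---

Meeting notes:
{meeting_notes}"""


# Inspect the prompt before sending it to the model.
print(build_action_items_prompt("Jake: mobile bug fix is ready for QA"))
```

Notice what changed and what did not: the format spec shrank to one section and the constraints became stricter, while the role and the "notes go last" layout carried over unchanged.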

</details>

---

## Key Takeaways

1. **Format specification is the single most impactful technique.** Explicit output schemas (JSON, markdown, bullet templates) constrain the output distribution more reliably than any other component. Format tokens create structural anchors for attention, and autoregressive generation maintains format consistency. A wish produces variable output. A schema produces consistent output.

2. **Role prompts change what the model attends to, not what it knows.** The security auditor, performance engineer, and style reviewer found different issues in the same code—not because the model gained new knowledge, but because the role tokens biased its attention toward different features of the input. Roles shape focus. They do not add capabilities.

3. **Example diversity matters more than example quantity.** Three diverse examples covering different categories often match or outperform five random examples on classification. Diverse examples create richer K/V patterns for attention, covering more of the input space. Same-category examples create bias. Random selection creates variance.

4. **Prompt engineering is composing techniques deliberately.** The meeting summary exercise combined role prompting, format specification, and output constraints into a single structured prompt. Each component serves a specific function in shaping the attention pattern. The skill is knowing which components to include for a given task—not memorizing templates.

5. **The prompt is a program; attention is the interpreter. Prompt engineering is writing better programs.** Every technique works because of attention. Understanding the mechanism—format tokens anchor attention, role tokens bias it, examples create retrieval patterns—lets you reason about which techniques will help, rather than relying on trial and error.
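
To make the "prompt as program" framing concrete, the components from takeaways 1 through 4 can be treated as fields of a small data structure and rendered in a fixed order. This is a minimal sketch, not a recommended library design; `PromptSpec` and its field names are hypothetical, and the section labels it emits are arbitrary choices.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptSpec:
    """Hypothetical container for the components of a structured prompt."""
    role: str = ""
    format_spec: str = ""
    constraints: List[str] = field(default_factory=list)
    examples: List[Tuple[str, str]] = field(default_factory=list)

    def system_prompt(self) -> str:
        return self.role.strip()

    def user_prompt(self, task_input: str) -> str:
        parts = []
        # Each component is optional; only non-empty ones are rendered.
        if self.format_spec:
            parts.append("Use this EXACT format:\n" + self.format_spec.strip())
        if self.constraints:
            parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints))
        for example_in, example_out in self.examples:
            parts.append(f"Example input:\n{example_in}\n\nExample output:\n{example_out}")
        # The actual input always goes last.
        parts.append("Input:\n" + task_input.strip())
        return "\n\n".join(parts)


spec = PromptSpec(
    role="You are an experienced executive assistant.",
    format_spec="## Summary\n[1-2 sentence overview]",
    constraints=["Omit small talk", "Every action item needs an owner"],
)
print(spec.user_prompt("Team standup March 3 ..."))
```

Writing the prompt this way makes ablations cheap: drop one field, rerun, and see which component the output consistency actually depended on.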