# In-Context Learning

In this notebook, you'll explore in-context learning (ICL) empirically — how transformers learn new tasks from examples in the prompt without any weight update, and where that ability breaks down.

**What you'll do:**
- Compare zero-shot and few-shot classification on the same sentiment task, measuring how examples in the prompt change model behavior
- Create novel mappings the model has never seen in training data, testing where ICL generalizes and where it fails
- Run a controlled experiment on example ordering, measuring the accuracy swings that reveal attention-based (not comprehension-based) behavior
- Compare ICL directly against a finetuned classifier on the same task, discovering when each approach wins

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

In [None]:
# Setup — self-contained for Google Colab
!pip install -q openai

import os
import textwrap
import random
from openai import OpenAI
import matplotlib.pyplot as plt
import numpy as np

# --- API Key Setup ---
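# Option 1: Set your API key as an environment variable (recommended)
#   In Colab: click the key icon in the left sidebar, add a secret named
#   OPENAI_API_KEY, grant this notebook access, then load it into the environment:
#     from google.colab import userdata
#     os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
# Option 2: Paste it directly (less secure, don't commit this)
#   os.environ["OPENAI_API_KEY"] = "sk-..."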

# You can also use any OpenAI-compatible API (e.g., local Ollama, Together AI)
# by changing the base_url:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

client = OpenAI()

# Use a small, cheap model for the exercises
MODEL = "gpt-4o-mini"

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Reproducible results where possible
random.seed(42)
np.random.seed(42)


def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Call the LLM with a single prompt (completion-style). Returns the response text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=100,
    )
    # message.content is typed as Optional, so fall back to "" instead of crashing
    return (response.choices[0].message.content or "").strip()


def call_llm_with_system(system_prompt: str, user_prompt: str, temperature: float = 0.0) -> str:
    """Call the LLM with a system prompt and user prompt. Returns the response text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
        max_tokens=100,
    )
    # message.content is typed as Optional, so fall back to "" instead of crashing
    return (response.choices[0].message.content or "").strip()


def print_wrapped(text: str, width: int = 80, prefix: str = ""):
    """Print text with word wrapping for readability."""
    for line in text.split("\n"):
        wrapped = textwrap.fill(line, width=width, initial_indent=prefix, subsequent_indent=prefix)
        print(wrapped)


# Quick test to verify the API is working
test = call_llm("Say 'API connection successful' and nothing else.")
print(test)
print(f"\nUsing model: {MODEL}")
print("Setup complete.")

## Shared Data

We'll use a common set of sentiment examples across multiple exercises. These are simple, unambiguous movie reviews — the task itself is easy, so we can focus on how ICL behavior changes with different prompting strategies.
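
Before relying on these examples, it can help to check their label balance, since a lopsided set of demonstrations may nudge predictions toward the majority label. The sketch below tallies labels with `collections.Counter`. The label lists are transcribed by hand from the data cell that follows, and the variable names are illustrative only.

```python
from collections import Counter

# Labels transcribed from FEW_SHOT_EXAMPLES and TEST_EXAMPLES (defined in the next cell)
few_shot_labels = ["Positive", "Negative", "Positive", "Negative", "Positive"]
test_set_labels = ["Positive", "Negative"] * 5

all_demo_counts = Counter(few_shot_labels)
first3_counts = Counter(few_shot_labels[:3])   # the slice used in Exercise 1
test_counts = Counter(test_set_labels)

print("All 5 demonstrations:  ", dict(all_demo_counts))
print("First 3 demonstrations:", dict(first3_counts))
print("Test set:              ", dict(test_counts))
```

The test set is balanced, while the three-example slice used later contains two Positive demonstrations and one Negative. Keep that in mind when reading any tilt toward Positive predictions.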

In [None]:
# --- Shared sentiment data ---
# 5 examples we'll use as few-shot demonstrations
FEW_SHOT_EXAMPLES = [
    ("This movie was absolutely amazing, I loved every minute", "Positive"),
    ("Terrible film, complete waste of time and money", "Negative"),
    ("A beautiful and moving story that touched my heart", "Positive"),
    ("The acting was awful and the plot made no sense", "Negative"),
    ("One of the best films I have seen this year", "Positive"),
]

# 10 test examples with ground truth labels
TEST_EXAMPLES = [
    ("The cinematography was stunning and the soundtrack was perfect", "Positive"),
    ("I fell asleep halfway through, incredibly boring", "Negative"),
    ("A masterpiece of modern cinema", "Positive"),
    ("The worst movie I have ever seen", "Negative"),
    ("Heartwarming and funny, great for the whole family", "Positive"),
    ("Predictable plot with flat characters", "Negative"),
    ("An unforgettable experience that left me speechless", "Positive"),
    ("Dull, lifeless, and painfully long", "Negative"),
    ("Brilliant performances from the entire cast", "Positive"),
    ("A complete disappointment from start to finish", "Negative"),
]


def build_few_shot_prompt(examples: list[tuple[str, str]], test_review: str) -> str:
    """Build a few-shot prompt from example pairs and a test review."""
    lines = []
    for review, label in examples:
        lines.append(f'Review: "{review}" -> {label}')
    lines.append(f'Review: "{test_review}" ->')
    return "\n".join(lines)


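def extract_sentiment(response: str) -> str:
    """Extract 'Positive' or 'Negative' from a model response.

    If both words appear, the one mentioned first wins, so a reply like
    'Negative (not positive)' is not misread as Positive.
    """
    response_lower = response.lower()
    pos_idx = response_lower.find("positive")
    neg_idx = response_lower.find("negative")
    if pos_idx == -1 and neg_idx == -1:
        return "Unknown"
    if neg_idx == -1 or (pos_idx != -1 and pos_idx < neg_idx):
        return "Positive"
    return "Negative"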


def compute_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Compute accuracy as fraction of correct predictions."""
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return correct / len(ground_truth)


print(f"Few-shot examples: {len(FEW_SHOT_EXAMPLES)}")
print(f"Test examples: {len(TEST_EXAMPLES)}")
print("\nSample few-shot prompt:")
print(build_few_shot_prompt(FEW_SHOT_EXAMPLES[:3], TEST_EXAMPLES[0][0]))

---

## Exercise 1: Zero-Shot vs Few-Shot Comparison (Guided)

The lesson showed a few-shot sentiment prompt and claimed that examples in the prompt genuinely change model behavior. Let's measure that claim.

You'll classify the same 10 reviews twice:
1. **Zero-shot:** Just the instruction, no examples
2. **Few-shot (3 examples):** Three input-output pairs before the test input

**Before running, predict:**
- Will the zero-shot approach get any wrong? Sentiment is a simple task — maybe zero-shot is already good enough?
- If few-shot is better, by how much? 5%? 20%? 50%?
- Will there be any reviews where zero-shot succeeds but few-shot fails?
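
To check the third prediction once both runs have finished, a small helper along these lines can group test indices by which run got them right. This is a sketch: `compare_runs` is a hypothetical name, and the prediction lists below are synthetic stand-ins, not real model output.

```python
def compare_runs(preds_a, preds_b, truth):
    """Group test indices by which of two runs predicted them correctly."""
    if not (len(preds_a) == len(preds_b) == len(truth)):
        raise ValueError("all three lists must have the same length")
    groups = {"both_correct": [], "only_a": [], "only_b": [], "both_wrong": []}
    for i, (a, b, t) in enumerate(zip(preds_a, preds_b, truth)):
        if a == t and b == t:
            groups["both_correct"].append(i)
        elif a == t:
            groups["only_a"].append(i)
        elif b == t:
            groups["only_b"].append(i)
        else:
            groups["both_wrong"].append(i)
    return groups


# Synthetic predictions for illustration only (not real model output)
demo_truth = ["Positive", "Negative", "Positive", "Negative"]
demo_zero_shot = ["Positive", "Negative", "Negative", "Positive"]
demo_few_shot = ["Positive", "Positive", "Positive", "Positive"]

demo_groups = compare_runs(demo_zero_shot, demo_few_shot, demo_truth)
for name, indices in demo_groups.items():
    print(f"{name:>12}: {indices}")
```

In the notebook you would pass `zero_shot_preds`, `few_shot_preds`, and `ground_truth` after running the two classification cells; the `only_a` group then lists reviews where zero-shot succeeded and few-shot failed.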

In [None]:
# --- Zero-shot classification ---
# The model gets only an instruction, no examples.

ZERO_SHOT_TEMPLATE = (
    'Classify the sentiment of the following movie review as Positive or Negative.\n'
    'Respond with a single word: Positive or Negative.\n\n'
    'Review: "{review}"\n'
    'Sentiment:'
)

ground_truth = [label for _, label in TEST_EXAMPLES]

# Zero-shot predictions
print("ZERO-SHOT CLASSIFICATION")
print("=" * 60)
zero_shot_preds = []
for review, true_label in TEST_EXAMPLES:
    prompt = ZERO_SHOT_TEMPLATE.format(review=review)
    raw_response = call_llm(prompt)
    pred = extract_sentiment(raw_response)
    zero_shot_preds.append(pred)
    match = "correct" if pred == true_label else "WRONG"
    print(f"  [{match:>7}] {true_label:>8} | {pred:>8} | {review[:50]}...")

zero_shot_acc = compute_accuracy(zero_shot_preds, ground_truth)
print(f"\nZero-shot accuracy: {zero_shot_acc:.0%}")

In [None]:
# --- Few-shot classification (3 examples) ---
# The model sees 3 labeled examples before each test input.
# Same model, same weights. The only difference: examples in the prompt.

few_shot_examples = FEW_SHOT_EXAMPLES[:3]  # Use first 3 examples

print("FEW-SHOT CLASSIFICATION (3 examples)")
print("=" * 60)
few_shot_preds = []
for review, true_label in TEST_EXAMPLES:
    prompt = build_few_shot_prompt(few_shot_examples, review)
    raw_response = call_llm(prompt)
    pred = extract_sentiment(raw_response)
    few_shot_preds.append(pred)
    match = "correct" if pred == true_label else "WRONG"
    print(f"  [{match:>7}] {true_label:>8} | {pred:>8} | {review[:50]}...")

few_shot_acc = compute_accuracy(few_shot_preds, ground_truth)
print(f"\nFew-shot accuracy: {few_shot_acc:.0%}")

In [None]:
# --- Compare the two approaches ---

fig, ax = plt.subplots(figsize=(8, 5))

labels = ["Zero-Shot\n(instruction only)", "Few-Shot\n(3 examples)"]
accs = [zero_shot_acc * 100, few_shot_acc * 100]
colors = ["#f59e0b", "#6366f1"]

bars = ax.bar(labels, accs, color=colors, edgecolor="white", linewidth=0.5, width=0.5)
for bar, val in zip(bars, accs):
    ax.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
        f"{val:.0f}%", ha="center", va="bottom", fontsize=14, fontweight="bold", color="white",
    )

ax.set_ylabel("Accuracy (%)", fontsize=12)
ax.set_title("Zero-Shot vs Few-Shot: Same Model, Same Task", fontsize=13, fontweight="bold")
ax.set_ylim(0, 110)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.show()

diff = few_shot_acc - zero_shot_acc
print(f"Difference: {diff:+.0%}")
print(f"\nThe model's weights did not change between these two runs.")
print(f"No gradients. No optimizer. No training. The only difference")
print(f"is what's in the prompt.")
print(f"\nIf the accuracy is similar: sentiment is 'easy enough' that the")
print(f"model's pretraining is sufficient. The examples help less when the")
print(f"task is already well-represented in training data.")
print(f"\nIf few-shot is clearly better: the examples provided a retrieval")
print(f"structure — the model's attention matched the test input against")
print(f"example inputs and retrieved the output pattern.")
print(f"\nEither result is informative. The question is not 'which is better'")
print(f"but 'what do the examples change about the model's computation?'")

**What just happened:** You measured the effect of in-context learning on a simple task. The model's weights are frozen — no gradients, no optimizer, no training loop. The only difference between zero-shot and few-shot is the content of the prompt.

The lesson explained the mechanism: when examples are in the prompt, the test input's query vectors attend to the example inputs (structural matching via QK) and retrieve the output pattern through the value vectors. The few-shot examples create a retrieval structure in the attention weights.
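
As a rough illustration of that retrieval picture, the sketch below runs single-head scaled dot-product attention on hand-built toy vectors. The keys, values, and query are assumptions chosen for clarity, not activations from a real model: each key stands in for an example input, each value for its label, and the query is built to resemble the second example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


# Toy keys: one row per few-shot example input (hand-built, illustrative)
keys = np.array([
    [1.0, 0.0, 0.0],   # example 1 input
    [0.0, 1.0, 0.0],   # example 2 input
    [0.0, 0.0, 1.0],   # example 3 input
])

# Toy values: one row per example label, as [positive, negative] evidence
values = np.array([
    [1.0, 0.0],   # example 1 label: Positive
    [0.0, 1.0],   # example 2 label: Negative
    [1.0, 0.0],   # example 3 label: Positive
])

# Toy query for the test input, deliberately close to example 2's key
query = np.array([0.1, 4.0, 0.2])

d_k = keys.shape[1]
scores = keys @ query / np.sqrt(d_k)   # QK matching
weights = softmax(scores)              # attention distribution over examples
output = weights @ values              # weighted mix of the value vectors

print("attention weights:", np.round(weights, 3))
print("retrieved output: ", np.round(output, 3))
```

Most of the attention weight lands on the second key, so the output is dominated by that example's value vector. Real models spread this computation across many heads and layers, so treat the sketch as intuition rather than a description of what `gpt-4o-mini` does internally.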

For a well-known task like sentiment classification, the improvement may be modest — the model already "knows" sentiment from pretraining. The real power of ICL shows on novel tasks. That's Exercise 2.

---

## Exercise 2: Novel Task ICL (Supported)

The lesson argued that ICL is more than memorized pattern matching — it can generalize to tasks the model has never seen in training data. The novel symbol mapping ("sdag" -> "happy") was the proof.

In this exercise, you'll test that claim yourself. You'll create made-up mappings of increasing complexity and see where ICL succeeds and where it fails. The boundary between success and failure reveals what attention can compute in a single forward pass.

Fill in the TODOs below. Each TODO is 1-3 lines.

<details>
<summary>Hint</summary>

For your custom mappings, think about functions of increasing complexity:
- A mapping the model could plausibly pattern-match (e.g., first letter -> a word starting with that letter)
- A mapping that requires a simple transformation (e.g., reverse the word, count the letters)
- A mapping that is arbitrary and has no learnable pattern from the examples alone

ICL should succeed on the first two (attention can match the structural pattern) and struggle on the third (arbitrary mappings require memorization, not computation).

</details>

In [None]:
# --- Mapping 1 (provided): Made-up words -> emotions ---
# This mapping is arbitrary but the OUTPUT domain is familiar (emotions).
# The model can't have seen these nonsense words mapped to emotions in training.

mapping_1 = {
    "name": "Nonsense words -> emotions",
    "examples": [
        ("sdag", "happy"),
        ("trel", "sad"),
        ("blix", "angry"),
        ("worp", "scared"),
    ],
    "test_input": "frum",
    "expected_behavior": "Should produce an emotion word (the specific emotion doesn't matter — the model should follow the PATTERN of nonsense->emotion)",
}


def test_mapping(mapping: dict) -> str:
    """Test an ICL mapping. Returns the model's response."""
    lines = []
    for inp, out in mapping["examples"]:
        lines.append(f"{inp} -> {out}")
    lines.append(f"{mapping['test_input']} ->")
    prompt = "\n".join(lines)
    return call_llm(prompt)


# Test mapping 1
print(f"MAPPING 1: {mapping_1['name']}")
print("=" * 50)
print("Prompt:")
for inp, out in mapping_1["examples"]:
    print(f"  {inp} -> {out}")
print(f"  {mapping_1['test_input']} -> ???")
print(f"\nExpected: {mapping_1['expected_behavior']}")

result_1 = test_mapping(mapping_1)
print(f"Model output: {result_1}")
print(f"\nDid it follow the pattern? The model has never seen 'sdag' mapped to")
print(f"'happy' in training data. If it produced an emotion word, ICL worked")
print(f"on a truly novel mapping.")

In [None]:
# --- Mapping 2: A learnable function ---
# TODO: Create a mapping where the output is a COMPUTABLE FUNCTION of the input.
# Examples: word -> its length as a digit, word -> first letter repeated,
#           word -> the word reversed, number -> number * 2, etc.
#
# Choose a function where you can verify the answer is correct.

# TODO: Fill in the mapping dict. Provide 4 examples and 1 test input.
# YOUR CODE HERE (6-10 lines)
mapping_2 = {
    "name": "",  # Give your mapping a descriptive name
    "examples": [
        # (input, output), ...
    ],
    "test_input": "",
    "expected_output": "",  # What the correct answer should be
    "expected_behavior": "",  # Describe what you expect
}

# Test it
print(f"MAPPING 2: {mapping_2['name']}")
print("=" * 50)
print("Prompt:")
for inp, out in mapping_2["examples"]:
    print(f"  {inp} -> {out}")
print(f"  {mapping_2['test_input']} -> ???")
print(f"\nExpected output: {mapping_2['expected_output']}")
print(f"Expected behavior: {mapping_2['expected_behavior']}")

result_2 = test_mapping(mapping_2)
print(f"Model output: {result_2}")

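# Guard against an unfilled template: "" is a substring of every string,
# so an empty expected_output would otherwise always count as correct.
expected_2 = mapping_2["expected_output"].strip().lower()
correct = bool(expected_2) and expected_2 in result_2.lower()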
print(f"\nCorrect? {'Yes' if correct else 'No'}")
if correct:
    print("The model learned a computable function from 4 examples — no weight updates.")
else:
    print("The model did not get the expected output. What does this tell you")
    print("about the limits of what attention can compute in a single forward pass?")

In [None]:
# --- Mapping 3: A harder or arbitrary mapping ---
# TODO: Create a mapping that is HARDER for ICL.
# Ideas:
#   - A completely arbitrary mapping (no learnable pattern)
#   - A function that requires multi-step reasoning
#   - A mapping where the output domain is unfamiliar
#
# The goal is to find a mapping where ICL FAILS or produces inconsistent results.
# Finding the boundary is more informative than finding another success.

# TODO: Fill in the mapping dict. Provide 4-5 examples and 1 test input.
# YOUR CODE HERE (6-10 lines)
mapping_3 = {
    "name": "",  # Give your mapping a descriptive name
    "examples": [
        # (input, output), ...
    ],
    "test_input": "",
    "expected_output": "",  # What the correct answer should be (or "any X" if pattern-based)
    "expected_behavior": "",  # What you think will happen
}

# Test it
print(f"MAPPING 3: {mapping_3['name']}")
print("=" * 50)
print("Prompt:")
for inp, out in mapping_3["examples"]:
    print(f"  {inp} -> {out}")
print(f"  {mapping_3['test_input']} -> ???")
print(f"\nExpected: {mapping_3['expected_behavior']}")

result_3 = test_mapping(mapping_3)
print(f"Model output: {result_3}")

In [None]:
# --- Summary of all three mappings ---

print("NOVEL TASK ICL: SUMMARY")
print("=" * 60)
print(f"\n  Mapping 1 ({mapping_1['name']}): {result_1}")
print(f"  Mapping 2 ({mapping_2['name']}): {result_2}")
print(f"  Mapping 3 ({mapping_3['name']}): {result_3}")
print()
print("What this tells you about ICL:")
print("- Mapping 1 (arbitrary but familiar output domain): ICL can follow")
print("  the pattern even when the input is nonsense, because the output")
print("  domain (emotions) is well-represented in pretraining.")
print("- Mapping 2 (computable function): ICL can learn simple functions")
print("  from examples — the attention mechanism computes the transformation.")
print("- Mapping 3 (harder / arbitrary): ICL struggles when the mapping")
print("  requires multi-step reasoning or has no learnable pattern.")
print("\nThe boundary: ICL works within the scope of what attention can")
print("compute in a single forward pass. Simple structural patterns and")
print("familiar output domains succeed. Complex multi-step reasoning or")
print("truly arbitrary mappings fail. This is consistent with the lesson's")
print("framing: ICL is attention-based computation, not comprehension.")

<details>
<summary>Solution</summary>

**Why the mapping complexity matters:** ICL works because attention creates retrieval patterns — the test input's queries match example inputs' keys, and the values carry the output pattern. This mechanism can compute structural transformations (reverse a word, count letters) but cannot compute arbitrary lookups (random input-output pairs with no pattern).

**Example Mapping 2 (computable function — word reversal):**
```python
mapping_2 = {
    "name": "Word reversal",
    "examples": [
        ("cat", "tac"),
        ("dog", "god"),
        ("star", "rats"),
        ("loop", "pool"),
    ],
    "test_input": "hello",
    "expected_output": "olleh",
    "expected_behavior": "Should output 'olleh' — the model can learn reversal from examples",
}
```

**Example Mapping 3 (arbitrary — random code mapping):**
```python
mapping_3 = {
    "name": "Arbitrary code (no pattern)",
    "examples": [
        ("apple", "7X"),
        ("banana", "3Q"),
        ("cherry", "9M"),
        ("date", "1F"),
        ("elderberry", "5R"),
    ],
    "test_input": "fig",
    "expected_output": "no correct answer possible",
    "expected_behavior": "Should fail or produce a plausible-looking but meaningless output. There is no pattern to learn — each mapping is arbitrary.",
}
```

**Common finding:** The model often succeeds on word reversal (a computable structural transformation) and produces a plausible-looking but incorrect output for the arbitrary mapping (e.g., "2K": it follows the format but cannot know the correct answer). This is the boundary: ICL can compute a pattern from the examples, but it cannot recover an arbitrary pairing it was never shown.

**Alternative Mapping 2 (letter counting):**
```python
mapping_2 = {
    "name": "Letter counting",
    "examples": [
        ("cat", "3"),
        ("hello", "5"),
        ("a", "1"),
        ("elephant", "8"),
    ],
    "test_input": "python",
    "expected_output": "6",
    "expected_behavior": "Should output '6' — counting letters is a simple computable function",
}
```

</details>
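
A single test input is thin evidence that the model picked up a function, so it may be worth scoring a computable mapping on several held-out inputs. The sketch below assumes a `predict` callable that returns the model's answer for one input. `score_mapping`, `reverse_word`, and `stub_predict` are hypothetical names, and the stub is a stand-in for the model, not real output.

```python
from typing import Callable


def score_mapping(
    target_fn: Callable[[str], str],
    test_inputs: list[str],
    predict: Callable[[str], str],
) -> float:
    """Fraction of held-out inputs where predict() matches target_fn() exactly."""
    if not test_inputs:
        raise ValueError("need at least one test input")
    hits = sum(
        1 for x in test_inputs
        if predict(x).strip().lower() == target_fn(x).lower()
    )
    return hits / len(test_inputs)


def reverse_word(word: str) -> str:
    """Ground-truth function for the word-reversal mapping."""
    return word[::-1]


def stub_predict(word: str) -> str:
    """Stand-in predictor (NOT a real model): only reverses words of 5 letters or fewer."""
    return word[::-1] if len(word) <= 5 else word


held_out = ["hello", "tree", "notebook", "transformer"]
demo_score = score_mapping(reverse_word, held_out, stub_predict)
print(f"Held-out accuracy: {demo_score:.0%}")
```

In the notebook, `predict` could be a thin wrapper that copies your mapping dict, swaps in each `test_input`, and calls `test_mapping`. Varying input length, as the stub does here, is one way to probe where a learned transformation stops working.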

---

## Exercise 3: Ordering Sensitivity Experiment (Supported)

The lesson claimed that reordering the same few-shot examples can swing accuracy by 20-30 percentage points. That is a strong claim. If the model truly "understood" the task, order would not matter — the same examples convey the same information regardless of arrangement.

In this exercise, you'll run a controlled experiment: same 5 examples, same 10 test inputs, 5 different orderings. You'll measure accuracy per ordering and plot the results.

Fill in the TODOs below. Each TODO is 1-3 lines.

<details>
<summary>Hint</summary>

Generate 5 random permutations of the `FEW_SHOT_EXAMPLES` list. For each permutation, classify all 10 test examples and compute accuracy. The key insight is whether accuracy varies significantly across orderings — if it does, the model's behavior depends on surface features (position, recency) rather than task comprehension.

</details>

In [None]:
# --- Generate 5 distinct orderings of the same 5 examples ---

# We use a fixed seed for reproducibility but generate genuine permutations
rng = random.Random(42)

orderings = []
seen = set()
base = list(range(len(FEW_SHOT_EXAMPLES)))

while len(orderings) < 5:
    perm = base[:]
    rng.shuffle(perm)
    perm_tuple = tuple(perm)
    if perm_tuple not in seen:
        seen.add(perm_tuple)
        orderings.append(perm)

print("5 orderings of the same 5 examples:")
for i, order in enumerate(orderings):
    labels = [FEW_SHOT_EXAMPLES[j][1][:3] for j in order]  # Pos/Neg abbreviated
    print(f"  Ordering {i + 1}: indices {order} -> labels [{', '.join(labels)}]")

print("\nNotice the different patterns: some end with Positive, some with")
print("Negative. Causal masking lets the test input attend to every example,")
print("but the LAST example sits closest to it and tends to exert the")
print("strongest recency effect.")

In [None]:
# --- Run the experiment ---

ground_truth = [label for _, label in TEST_EXAMPLES]

ordering_results = []

print("ORDERING SENSITIVITY EXPERIMENT")
print("=" * 60)

for i, order in enumerate(orderings):
    # Build the reordered example list
    reordered_examples = [FEW_SHOT_EXAMPLES[j] for j in order]

    # TODO: Classify all 10 test examples using the reordered few-shot examples.
    # Use build_few_shot_prompt(reordered_examples, review) to build each prompt,
    # call_llm() to get the response, and extract_sentiment() to parse it.
    # Collect predictions in a list called `preds`.
    # YOUR CODE HERE (4-6 lines)


    # TODO: Compute accuracy using compute_accuracy(preds, ground_truth)
    # Store it in a variable called `acc`.
    # YOUR CODE HERE (1 line)


    ordering_results.append({
        "ordering_idx": i + 1,
        "order": order,
        "accuracy": acc,
        "predictions": preds,
    })

    last_label = FEW_SHOT_EXAMPLES[order[-1]][1]
    print(f"  Ordering {i + 1}: accuracy = {acc:.0%}  (last example: {last_label})")

# Summary statistics
accs = [r["accuracy"] for r in ordering_results]
print(f"\nAccuracy range: {min(accs):.0%} — {max(accs):.0%}")
print(f"Spread: {(max(accs) - min(accs)) * 100:.0f} percentage points")
print(f"Mean: {np.mean(accs):.0%}, Std: {np.std(accs):.0%}")

In [None]:
# --- Visualize the results ---

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left: accuracy per ordering
ordering_labels = [f"Ordering {r['ordering_idx']}" for r in ordering_results]
accs_pct = [r["accuracy"] * 100 for r in ordering_results]
last_labels = [FEW_SHOT_EXAMPLES[r["order"][-1]][1] for r in ordering_results]
colors = ["#6366f1" if l == "Positive" else "#ef4444" for l in last_labels]

bars = ax1.bar(ordering_labels, accs_pct, color=colors, edgecolor="white", linewidth=0.5, width=0.6)
for bar, val in zip(bars, accs_pct):
    ax1.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
        f"{val:.0f}%", ha="center", va="bottom", fontsize=12, fontweight="bold", color="white",
    )

ax1.set_ylabel("Accuracy (%)", fontsize=11)
ax1.set_title("Accuracy by Example Ordering", fontsize=13, fontweight="bold")
ax1.set_ylim(0, 110)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

# Add legend for last-example color
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="#6366f1", label="Last example: Positive"),
    Patch(facecolor="#ef4444", label="Last example: Negative"),
]
ax1.legend(handles=legend_elements, loc="lower right", fontsize=9)

# Right: per-test-example consistency across orderings
# How many orderings got each test example correct?
per_example_correct = []
for j in range(len(TEST_EXAMPLES)):
    correct_count = sum(
        1 for r in ordering_results
        if r["predictions"][j] == ground_truth[j]
    )
    per_example_correct.append(correct_count)

test_labels = [f"Test {j+1}" for j in range(len(TEST_EXAMPLES))]
bar_colors = ["#10b981" if c == 5 else "#f59e0b" if c >= 3 else "#ef4444" for c in per_example_correct]

bars2 = ax2.bar(test_labels, per_example_correct, color=bar_colors, edgecolor="white", linewidth=0.5, width=0.6)
ax2.set_ylabel("Orderings Correct (out of 5)", fontsize=11)
ax2.set_title("Per-Example Consistency Across Orderings", fontsize=13, fontweight="bold")
ax2.set_ylim(0, 6)
ax2.axhline(y=5, color="#10b981", linestyle="--", alpha=0.3)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)

legend2 = [
    Patch(facecolor="#10b981", label="Always correct (robust)"),
    Patch(facecolor="#f59e0b", label="Sometimes correct (fragile)"),
    Patch(facecolor="#ef4444", label="Mostly wrong (ordering-dependent)"),
]
ax2.legend(handles=legend2, loc="lower right", fontsize=9)

plt.suptitle(
    "Same Examples, Different Orders: ICL Is Not Comprehension",
    fontsize=14, fontweight="bold", y=1.02,
)
plt.tight_layout()
plt.show()

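# Count test examples whose prediction actually differs between orderings.
# (Counting "not always correct" would also flag examples that are wrong every time.)
fragile = sum(
    1 for j in range(len(TEST_EXAMPLES))
    if len({r["predictions"][j] for r in ordering_results}) > 1
)
print(f"\n{fragile} out of {len(TEST_EXAMPLES)} test examples changed their prediction")
print("depending on example ordering.")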
print(f"\nIf the model 'understood' the task, ordering would not matter.")
print(f"The sensitivity to ordering is consistent with the attention-based")
print(f"mechanism: causal masking creates recency bias (later examples have")
print(f"stronger influence), and the position of positive vs negative examples")
print(f"changes the attention pattern over the context.")

<details>
<summary>Solution</summary>

**Why ordering matters from the attention perspective:** With causal masking, each token can only attend to tokens at the same position or earlier. The test input's tokens come last, so they can attend to ALL example tokens, but the attention distribution is not uniform: tokens closer to the end of the context tend to have stronger influence, due to positional encoding patterns and the left-to-right structure of causal attention.

The TODO code:

```python
# Classify all 10 test examples with this ordering
preds = []
for review, _ in TEST_EXAMPLES:
    prompt = build_few_shot_prompt(reordered_examples, review)
    response = call_llm(prompt)
    preds.append(extract_sentiment(response))

# Compute accuracy
acc = compute_accuracy(preds, ground_truth)
```

**What to look for in the results:**
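- With only 10 test reviews, each prediction is worth 10 points, so a 10-point spread means a single review flipped. Treat a spread of 20+ points, or a flip that repeats when you rerun the cell, as clearer evidence of ordering sensitivity.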
- Check whether orderings ending with a Positive example bias the model toward Positive predictions (and vice versa). This is the recency effect from causal masking.
- The per-example consistency chart reveals which test examples are "easy" (always correct regardless of ordering) and which are "fragile" (ordering-dependent). Fragile examples are likely the ones closest to the decision boundary.
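
The recency check in the second bullet can be sketched as a helper that groups predictions by the label of the last demonstration. It assumes the `ordering_results` structure built in the experiment cell (dicts with `"order"` and `"predictions"` keys). The function name and the demo data are hypothetical, and the demo values are synthetic, not real model output.

```python
from collections import defaultdict


def positive_rate_by_last_label(ordering_results, few_shot_examples):
    """Share of 'Positive' predictions, grouped by the last example's label."""
    counts = defaultdict(lambda: [0, 0])  # last_label -> [n_positive, n_total]
    for r in ordering_results:
        last_label = few_shot_examples[r["order"][-1]][1]
        for pred in r["predictions"]:
            counts[last_label][0] += int(pred == "Positive")
            counts[last_label][1] += 1
    return {label: pos / total for label, (pos, total) in counts.items()}


# Synthetic data for illustration only (not real model output)
demo_examples = [("good", "Positive"), ("bad", "Negative")]
demo_results = [
    {"order": [1, 0], "predictions": ["Positive", "Positive", "Negative", "Positive"]},
    {"order": [0, 1], "predictions": ["Negative", "Positive", "Negative", "Negative"]},
]

demo_rates = positive_rate_by_last_label(demo_results, demo_examples)
for label, rate in demo_rates.items():
    print(f"Last example {label}: {rate:.0%} of predictions were Positive")
```

`TEST_EXAMPLES` contains five Positive and five Negative reviews, so an unbiased classifier would sit near 50% in both groups. A clear gap between the groups would be consistent with a recency effect, though five orderings is a small sample.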

**Common finding:** With `gpt-4o-mini` on simple sentiment, the effect may be modest (a 10-point spread, i.e. one flipped review, or none at all) because the task is too easy. On harder tasks or with more ambiguous examples, the effect is larger. The lesson's claim of 20-30 points comes from research on more challenging benchmarks.

</details>

---

## Exercise 4: ICL vs Finetuning (Independent)

You have now seen ICL work on sentiment classification (Exercise 1) and on novel tasks (Exercise 2), and you have measured its fragility (Exercise 3). The last question: how does ICL compare to the approach you already know, finetuning?

In Module 4.4, you classified sentiment by adding a classification head and training on labeled data. That required gradient descent, a labeled dataset, and weight updates. ICL uses 3-5 examples and no training.

**Your task:** Simulate a comparison between ICL and a finetuned classifier on the same sentiment task. Since we cannot run finetuning in this notebook, you will use the LLM to simulate a finetuned model's behavior with a strong system prompt, and compare against ICL.

**Specification:**
1. Create an "ICL classifier" that uses 3-5 few-shot examples to classify sentiment
2. Create a "simulated finetuned classifier" that uses a detailed system prompt (acting as a purpose-trained sentiment classifier with high confidence)
3. Test both on the same 10 test examples
4. Compare accuracy, but also compare **response consistency** — run each classifier 3 times on the same inputs with temperature=0.3 and measure how often the prediction changes
5. Visualize the comparison

Think about: when would you choose ICL over finetuning? When would finetuning be worth the cost?
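
Step 4 leaves "how often the prediction changes" open to interpretation. One possible definition, sketched below, is the fraction of inputs whose prediction is identical across every run. `consistency_rate` is a hypothetical helper, and the runs shown are synthetic, not real model output.

```python
def consistency_rate(runs):
    """Fraction of inputs whose prediction is identical across all runs.

    runs: list of prediction lists, one per run, all covering the same inputs.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one non-empty run")
    n = len(runs[0])
    if any(len(run) != n for run in runs):
        raise ValueError("all runs must cover the same inputs")
    stable = sum(1 for j in range(n) if len({run[j] for run in runs}) == 1)
    return stable / n


# Synthetic runs for illustration only (not real model output)
demo_runs = [
    ["Positive", "Negative", "Positive", "Negative"],
    ["Positive", "Negative", "Negative", "Negative"],
    ["Positive", "Negative", "Positive", "Negative"],
]

demo_consistency = consistency_rate(demo_runs)
print(f"Consistency: {demo_consistency:.0%}")
```

Note that this metric ignores correctness: a classifier that is wrong the same way every time still scores 100%, which is why the specification asks for accuracy and consistency side by side.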

<details>
<summary>Solution</summary>

**The reasoning behind the comparison:** ICL and finetuning are not competing approaches — they are different tools for different constraints. ICL is fast (no training), flexible (change the task by changing the prompt), and cheap (no compute for training). Finetuning is more accurate (thousands of examples vs 3-5), more robust (not sensitive to example ordering), and more consistent (learned decision boundary vs context-dependent computation).

```python
# --- ICL Classifier ---
icl_examples = FEW_SHOT_EXAMPLES[:3]

def icl_classify(review: str) -> str:
    prompt = build_few_shot_prompt(icl_examples, review)
    response = call_llm(prompt, temperature=0.3)
    return extract_sentiment(response)


# --- Simulated Finetuned Classifier ---
FINETUNED_SYSTEM = (
    "You are a sentiment classification model that has been finetuned on "
    "50,000 movie reviews. You classify reviews as exactly 'Positive' or "
    "'Negative'. You are highly accurate and confident. Respond with a "
    "single word: Positive or Negative."
)

def finetuned_classify(review: str) -> str:
    response = call_llm_with_system(
        FINETUNED_SYSTEM,
        f'Classify: "{review}"',
        temperature=0.3,
    )
    return extract_sentiment(response)


# --- Accuracy comparison ---
ground_truth = [label for _, label in TEST_EXAMPLES]

icl_preds = [icl_classify(review) for review, _ in TEST_EXAMPLES]
ft_preds = [finetuned_classify(review) for review, _ in TEST_EXAMPLES]

icl_acc = compute_accuracy(icl_preds, ground_truth)
ft_acc = compute_accuracy(ft_preds, ground_truth)

print(f"ICL accuracy: {icl_acc:.0%}")
print(f"Finetuned accuracy: {ft_acc:.0%}")


# --- Consistency comparison (3 runs each) ---
icl_runs = []
ft_runs = []
for _ in range(3):
    icl_runs.append([icl_classify(r) for r, _ in TEST_EXAMPLES])
    ft_runs.append([finetuned_classify(r) for r, _ in TEST_EXAMPLES])

# Count inconsistencies per example
icl_inconsistent = sum(
    1 for j in range(len(TEST_EXAMPLES))
    if len(set(run[j] for run in icl_runs)) > 1
)
ft_inconsistent = sum(
    1 for j in range(len(TEST_EXAMPLES))
    if len(set(run[j] for run in ft_runs)) > 1
)

print(f"\nICL inconsistent predictions: {icl_inconsistent}/{len(TEST_EXAMPLES)}")
print(f"Finetuned inconsistent predictions: {ft_inconsistent}/{len(TEST_EXAMPLES)}")


# --- Visualization ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Accuracy
methods = ["ICL\n(3 examples)", "Finetuned\n(simulated)"]
acc_vals = [icl_acc * 100, ft_acc * 100]
bars = ax1.bar(methods, acc_vals, color=["#6366f1", "#f59e0b"], width=0.5)
for bar, val in zip(bars, acc_vals):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height()+1,
             f"{val:.0f}%", ha="center", fontsize=14, fontweight="bold", color="white")
ax1.set_ylabel("Accuracy (%)")
ax1.set_title("Accuracy")
ax1.set_ylim(0, 110)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

# Consistency
cons_vals = [
    (len(TEST_EXAMPLES) - icl_inconsistent) / len(TEST_EXAMPLES) * 100,
    (len(TEST_EXAMPLES) - ft_inconsistent) / len(TEST_EXAMPLES) * 100,
]
bars2 = ax2.bar(methods, cons_vals, color=["#6366f1", "#f59e0b"], width=0.5)
for bar, val in zip(bars2, cons_vals):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height()+1,
             f"{val:.0f}%", ha="center", fontsize=14, fontweight="bold", color="white")
ax2.set_ylabel("Consistency (%)")
ax2.set_title("Consistency (3 runs, same inputs)")
ax2.set_ylim(0, 110)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)

plt.suptitle("ICL vs Finetuning: Accuracy and Consistency",
             fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

print("\nWhen to choose ICL:")
print("  - You need to classify a new task RIGHT NOW (no training time)")
print("  - You have very few labeled examples (3-10)")
print("  - The task may change frequently (just update the prompt)")
print("  - Accuracy does not need to be perfect")
print("\nWhen to choose finetuning:")
print("  - You have thousands of labeled examples")
print("  - You need high, consistent accuracy")
print("  - The task is stable (won't change frequently)")
print("  - Robustness matters (no sensitivity to prompt format)")
print("\nThey are complementary tools, not competing approaches.")
```

**The meta-insight:** This comparison reveals that ICL and finetuning trade off along different axes. ICL is fast and flexible but fragile. Finetuning is slow and rigid but robust. The choice depends on your constraints: how much labeled data do you have? How much time? How stable is the task? How important is consistency?

This is the same tradeoff thinking from the Alignment Techniques Landscape (Module 5.1, Lesson 2) — there is no universally best approach, only tradeoffs along axes.

</details>

In [None]:
# Your code here. Refer to the specification above and the solution if needed.
# Build an ICL classifier, a simulated finetuned classifier, test both on
# the 10 test examples, measure accuracy and consistency, and visualize.



---

## Key Takeaways

1. **Examples in the prompt genuinely change model behavior.** Zero-shot and few-shot classification use the same model with the same frozen weights. The only difference is the prompt content. The improvement (or lack thereof) is the empirical signature of in-context learning.

2. **ICL works on novel tasks, not just memorized patterns.** Made-up mappings that could not appear in training data still work — at least when the transformation is computable by attention in a single forward pass. The boundary between success and failure reveals the scope of what attention can compute.

3. **Ordering sensitivity confirms the attention-based mechanism.** Same examples, different order, different accuracy. If the model truly "understood" the task, order would not matter. The sensitivity to ordering is consistent with causal masking (recency bias) and attention-based retrieval, not comprehension.

4. **ICL and finetuning are complementary, not competing.** ICL is fast, flexible, and cheap but fragile and sensitive to prompt design. Finetuning is accurate, robust, and consistent but requires labeled data, compute, and time. The choice depends on your constraints — not on which is "better."

5. **The prompt is a program; attention is the interpreter.** Every prompt configures the model for a different task through attention over the context. Different examples, different behavior. Same weights, different programs. This is the core mental model for understanding ICL.