# 🧑‍🏫 Lab — Benchmarking Anthropic Models for Coding Tasks

**Goals**

1. **Load** a random coding question from the *cais/hle* benchmark  
2. **Invoke** five Anthropic model configurations hosted on Bedrock (up to Claude 3.7 Sonnet)  
3. **Capture & compare** latency, token usage, *reasoning chains*, and cost  
4. Discuss why larger reasoning budgets help, and what they cost


## 0. Environment Setup

Install required Python packages for AWS Bedrock, data loading, and rich output.

In [1]:
%pip install -q boto3 langchain_aws datasets pandas tabulate tqdm rich matplotlib

Note: you may need to restart the kernel to use updated packages.


## 1. Imports & Configuration

Below we set up logging, AWS connectivity, and a helper dataclass that
holds both **input** and **output** token prices plus an optional
`thinking` stanza (only used by Claude 3.7).


In [2]:
from __future__ import annotations

import json, logging, os
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List

import boto3, pandas as pd
from datasets import load_dataset
from rich.console import Console
from rich.table import Table
from tabulate import tabulate

# logging & AWS
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
log, console = logging.getLogger("bedrock_benchmark"), Console()

REGION = os.getenv("AWS_REGION", "us-east-1")
BEDROCK = boto3.client("bedrock-runtime", region_name=REGION)

2025-04-23 11:10:57,818 [INFO] botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials


In [3]:
# ─── Dataclass ----------------------------------------------------------------
@dataclass
class ModelConfig:
    key: str
    name: str
    model_id: str
    max_tokens: int = 10_000
    temperature: float = 1.0
    thinking: Dict[str, Any] = field(default_factory=dict)

    # pricing (USD per million tokens)
    price_in_per_1M: float = 0.0
    price_out_per_1M: float = 0.0

    # helpers
    @property
    def rates(self) -> tuple[float, float]:
        """Return $ cost *per single token* (input, output)."""
        return (self.price_in_per_1M / 1_000_000, self.price_out_per_1M / 1_000_000)
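
As a quick illustration of how the `rates` property is meant to be used, the sketch below applies per-token rates to a pair of token counts. `Price` and `estimate_cost` are hypothetical stand-ins introduced only for this example (they are not part of the notebook), and the prices are the Sonnet figures from the pricing table in section 2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Price:
    """Hypothetical stand-in for the two pricing fields of ModelConfig."""
    price_in_per_1M: float
    price_out_per_1M: float

    @property
    def rates(self) -> Tuple[float, float]:
        # convert $ per million tokens into $ per single token
        return (self.price_in_per_1M / 1_000_000, self.price_out_per_1M / 1_000_000)


def estimate_cost(price: Price, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD for one call, given input and output token counts."""
    rate_in, rate_out = price.rates
    return tokens_in * rate_in + tokens_out * rate_out


sonnet = Price(price_in_per_1M=3.0, price_out_per_1M=15.0)
cost = estimate_cost(sonnet, tokens_in=1_000, tokens_out=500)
print(f"{cost:.4f}")  # 0.0105
```

Keeping prices per million in the config and converting once in `rates` avoids scattering `/ 1_000_000` through the cost arithmetic.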


## 2. Model Catalogue & Cost Reference
We define our five evaluation configurations, along with approximate cost per 1M tokens.

The table below lists the models we’ll test and their **input / output**
token prices (USD per million).  
*Note:* Opus and Sonnet 3.5 require explicit enablement in the Bedrock
console—if you don’t have access yet they will appear as **NO_ACCESS**.


In [4]:
# ─── Pricing table (Apr-2025)  ──  (input $, output $) per million tokens
PRICING_PER_1M = {
    "haiku3.5":     (0.8,  4.0),
    "opus":         (15.0, 75.0),
    "sonnet3.5v2":  (3.0, 15.0),
    "sonnet3.7":    (3.0, 15.0),
}

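def geo_prefix(region: str) -> str:
    """Map an AWS region to its cross-region inference-profile prefix."""
    if region.startswith("us-"):
        return "us."
    if region.startswith("eu-"):
        return "eu."
    return "apac."  # Asia-Pacific inference profiles use the "apac." prefix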

PFX = geo_prefix(REGION)

In [5]:
# ─── Models under test --------------------------------------------------------
EVAL_MODELS: list[ModelConfig] = [
    ModelConfig(
        "haiku3.5", "Claude 3.5 Haiku",
        f"{PFX}anthropic.claude-3-5-haiku-20241022-v1:0",
        price_in_per_1M=PRICING_PER_1M["haiku3.5"][0],
        price_out_per_1M=PRICING_PER_1M["haiku3.5"][1],
    ),
    ModelConfig(
        "opus", "Claude 3 Opus",
        f"{PFX}anthropic.claude-3-opus-20240229-v1:0",
        price_in_per_1M=PRICING_PER_1M["opus"][0],
        price_out_per_1M=PRICING_PER_1M["opus"][1],
    ),
    ModelConfig(
        "sonnet3.5v2", "Claude 3.5 Sonnet v2",
        f"{PFX}anthropic.claude-3-5-sonnet-20241022-v2:0",
        price_in_per_1M=PRICING_PER_1M["sonnet3.5v2"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.5v2"][1],
    ),
    ModelConfig(
        "sonnet3.7_low", "Claude 3.7 Sonnet (low reasoning)",
        f"{PFX}anthropic.claude-3-7-sonnet-20250219-v1:0",
        thinking={"type": "enabled", "budget_tokens": 2048},
        price_in_per_1M=PRICING_PER_1M["sonnet3.7"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.7"][1],
    ),
    ModelConfig(
        "sonnet3.7_high", "Claude 3.7 Sonnet (high reasoning)",
        f"{PFX}anthropic.claude-3-7-sonnet-20250219-v1:0",
        max_tokens=15_000,
        thinking={"type": "enabled", "budget_tokens": 10_000},
        price_in_per_1M=PRICING_PER_1M["sonnet3.7"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.7"][1],
    ),
]
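
Before sending requests, it may help to sanity-check each `thinking` stanza against its `max_tokens`. The sketch below is one way to do that. `check_thinking` and `MIN_BUDGET` are hypothetical names, and the limits encoded here (a minimum budget of 1 024 tokens, a budget strictly below `max_tokens`, and temperature left at 1.0 while thinking is enabled) are assumptions based on Anthropic's extended-thinking documentation at the time of writing, so re-check them before relying on this.

```python
from typing import Any, Dict, List

MIN_BUDGET = 1024  # assumed minimum thinking budget; verify against current docs


def check_thinking(max_tokens: int, temperature: float,
                   thinking: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    if thinking.get("type") != "enabled":
        return []  # nothing to validate when thinking is off
    problems: List[str] = []
    budget = thinking.get("budget_tokens", 0)
    if budget < MIN_BUDGET:
        problems.append(f"budget_tokens={budget} is below {MIN_BUDGET}")
    if budget >= max_tokens:
        problems.append(f"budget_tokens={budget} must be smaller than max_tokens={max_tokens}")
    if temperature != 1.0:
        problems.append("temperature should stay at 1.0 while thinking is enabled")
    return problems


# The two Sonnet 3.7 presets defined above pass the check:
print(check_thinking(10_000, 1.0, {"type": "enabled", "budget_tokens": 2048}))    # []
print(check_thinking(15_000, 1.0, {"type": "enabled", "budget_tokens": 10_000}))  # []
```

This also shows why the high preset raises `max_tokens` to 15 000: a 10 000-token budget would leave no room for the answer under the default limit.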

In [6]:
# ─── Helper: pretty pricing table --------------------------------------------
def show_costs() -> None:
    tbl = Table(title="Price — $ per million tokens")
    tbl.add_column("key")
    tbl.add_column("input $/M", justify="right")
    tbl.add_column("output $/M", justify="right")
    for m in EVAL_MODELS:
        tbl.add_row(m.key, f"{m.price_in_per_1M:.2f}", f"{m.price_out_per_1M:.2f}")
    console.print(tbl)

In [7]:
show_costs()

## 3. Load a Coding Prompt
We grab one random test item from *cais/hle*.  
Feel free to re-run the cell to sample a different coding problem.

In [8]:
# ─── Dataset prompt -----------------------------------------------------------
def load_prompt() -> str:
    console.print("Loading one random question from [italic]cais/hle[/] …")
    ds = load_dataset("cais/hle", split="test", cache_dir="./hf_cache")
    # pick a random question
    row = ds.shuffle().select([0])[0]
    console.rule("Selected prompt")
    console.print(row["question"])
    console.rule("Answer")
    console.print(row["answer"])
    return row["question"]

prompt = load_prompt()

Using the latest cached version of the dataset since cais/hle couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at hf_cache\cais___hle\default\0.0.0\1e33bd2d1346480b397ad94845067c4a088a33d3 (last modified on Wed Apr 23 10:44:36 2025).


## Build Payload & Invoke
`build_payload` constructs the Bedrock-compatible request.  
`invoke` sends it **and captures reasoning chains** when available
(type `"thinking"` content blocks returned by Claude 3.7).

In [9]:
def build_payload(prompt: str, cfg: ModelConfig) -> Dict[str, Any]:
    pl = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": cfg.max_tokens,
        "temperature": cfg.temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    if cfg.thinking:
        pl["thinking"] = cfg.thinking
    return pl

def invoke(cfg: ModelConfig, prompt: str) -> Dict[str, Any]:
    console.log(f"Invoking {cfg.name} …")
    try:
        t0 = datetime.now()
        resp = BEDROCK.invoke_model(
            body=json.dumps(build_payload(prompt, cfg)),
            modelId=cfg.model_id,
            contentType="application/json",
            accept="application/json",
        )
        latency = (datetime.now() - t0).total_seconds()
    except BEDROCK.exceptions.AccessDeniedException:
        return {"name": cfg.name, "status": "NO_ACCESS"}
    except Exception as e:
        log.error("%s failed: %s", cfg.name, e)
        return {"name": cfg.name, "status": "ERROR", "error": str(e)}

    data = json.loads(resp["body"].read())

    # extract answer & reasoning
    txt_out = "\n".join(c["text"]  for c in data["content"] if c["type"] == "text").strip()
    thinking_txt = "\n".join(c["thinking"] for c in data["content"] if c["type"] == "thinking").strip()

    # NOTE: whitespace word counts are only rough proxies for real token counts
    thinking_tokens = len(thinking_txt.split())
    tokens_out = len(txt_out.split())
    tokens_in  = len(prompt.split())

    # approximate cost covers prompt + answer only; thinking is tracked separately
    rate_in, rate_out = cfg.rates
    usd_cost = round(tokens_in * rate_in + tokens_out * rate_out, 4)

    # ← price fields included so later cells don’t need cfg look-ups
    return {
        "name": cfg.name,
        "status": "OK",
        "latency_s": round(latency, 2),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "thinking_tokens": thinking_tokens,
        "approx_cost_usd": usd_cost,
        "answer": txt_out or "[empty]",
        "thinking": thinking_txt,
        "price_in_per_1M": cfg.price_in_per_1M,
        "price_out_per_1M": cfg.price_out_per_1M,
    }
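
The word counts used in `invoke` are approximations. Anthropic responses on Bedrock also carry a `usage` object with `input_tokens` and `output_tokens`, which can be read instead. The sketch below shows one way to do that. `token_usage`, `usage_cost`, and the `sample` body are hypothetical and only mimic the shape of the JSON parsed in `invoke`. Note the assumption that `output_tokens` already includes thinking tokens, which are billed at the output rate.

```python
from typing import Any, Dict


def token_usage(data: Dict[str, Any]) -> Dict[str, int]:
    """Read billed token counts from a parsed response body (0 if absent)."""
    usage = data.get("usage", {})
    return {
        "tokens_in": int(usage.get("input_tokens", 0)),
        "tokens_out": int(usage.get("output_tokens", 0)),
    }


def usage_cost(data: Dict[str, Any], price_in_per_1M: float,
               price_out_per_1M: float) -> float:
    """Cost in USD from billed token counts and $ per million token prices."""
    u = token_usage(data)
    return (u["tokens_in"] * price_in_per_1M
            + u["tokens_out"] * price_out_per_1M) / 1_000_000


# Hypothetical response body, shaped like the JSON parsed in `invoke`
sample = {
    "content": [{"type": "text", "text": "..."}],
    "usage": {"input_tokens": 120, "output_tokens": 900},
}
print(token_usage(sample))
print(f"{usage_cost(sample, 3.0, 15.0):.5f}")  # 0.01386
```

Swapping these counts into the result dict would make the cost column reflect billed tokens rather than word counts, at the price of changing what `tokens_out` means for the thinking-enabled presets.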



## Run Benchmarks & Review
The cell below runs every model, prints a metric table, and then shows
each answer.  
For Claude 3.7 entries we also reveal the full **internal reasoning chain**
(no truncation).

In [10]:
def summarise(results: List[Dict[str, Any]]) -> None:
    ok = [r for r in results if r["status"] == "OK"]
    if ok:
        cols = ["name", "latency_s", "tokens_in", "tokens_out",
                "thinking_tokens", "approx_cost_usd"]
        df = pd.DataFrame(ok)[cols]
        console.rule("Run metrics")
        print(tabulate(df, headers="keys", tablefmt="pretty", showindex=False))

    for r in results:
        console.rule(f"Answer — {r['name']} [{r['status']}]")
        console.print(r.get("answer", r.get("error", "")))
        if r.get("thinking"):
            console.print("\n[i]Full reasoning chain:[/i]")
            console.print(r["thinking"])        # ← NO TRUNCATION


console.rule("Running benchmarks")
results = [invoke(cfg, prompt) for cfg in EVAL_MODELS]
summarise(results)

2025-04-23 11:15:22,013 [ERROR] bedrock_benchmark: Claude 3.7 Sonnet (high reasoning) failed: An error occurred (ExpiredTokenException) when calling the InvokeModel operation: The security token included in the request is expired


+-----------------------------------+-----------+-----------+------------+-----------------+-----------------+
|               name                | latency_s | tokens_in | tokens_out | thinking_tokens | approx_cost_usd |
+-----------------------------------+-----------+-----------+------------+-----------------+-----------------+
|         Claude 3.5 Haiku          |   5.44    |    88     |    154     |        0        |     0.0007      |
|           Claude 3 Opus           |   19.01   |    88     |    320     |        0        |     0.0253      |
|       Claude 3.5 Sonnet v2        |   16.61   |    88     |    264     |        0        |     0.0042      |
| Claude 3.7 Sonnet (low reasoning) |   36.68   |    88     |    288     |       738       |     0.0046      |
+-----------------------------------+-----------+-----------+------------+-----------------+-----------------+


In [11]:
# helper to fetch a model row safely
def pick(prefix: str):
    return next((r for r in results if r["name"].lower().startswith(prefix.lower())), None)

def fmt(row: dict | None, key: str, spec: str = "{}"):
    """Safe formatter: dash if row/key missing."""
    return "—" if row is None or key not in row else spec.format(row[key])

haiku  = pick("claude 3.5 haiku")
opus   = pick("claude 3 opus")
lo37   = pick("claude 3.7 sonnet (low")
hi37   = pick("claude 3.7 sonnet (high")

from IPython.display import Markdown, display

md = f"""
## 🔎 Latency · Cost · Quality snapshot ({datetime.now():%d %b %Y})

| Model | Latency (s) | Out-tokens | Reasoning tokens | Cost (USD) | Notes |
|-------|------------:|-----------:|-----------------:|-----------:|-------|
| **Claude 3.5 Haiku**          | {fmt(haiku,'latency_s','{:.2f}')} | {fmt(haiku,'tokens_out')} | —   | {fmt(haiku,'approx_cost_usd','{:.4f}')} | Fast & cheap |
| **Claude 3 Opus**             | {fmt(opus,'latency_s','{:.2f}')}  | {fmt(opus,'tokens_out')}  | —   | {fmt(opus,'approx_cost_usd','{:.4f}')}  | Premium accuracy |
| **Claude 3.7 Sonnet (low)**   | {fmt(lo37,'latency_s','{:.2f}')}  | {fmt(lo37,'tokens_out')}  | {fmt(lo37,'thinking_tokens')} | {fmt(lo37,'approx_cost_usd','{:.4f}')} | Concise reasoning |
| **Claude 3.7 Sonnet (high)**  | {fmt(hi37,'latency_s','{:.2f}')}  | {fmt(hi37,'tokens_out')}  | {fmt(hi37,'thinking_tokens')} | {fmt(hi37,'approx_cost_usd','{:.4f}')} | Full rationale |
"""
display(Markdown(md))


## 🔎 Latency · Cost · Quality snapshot (23 Apr 2025)

| Model | Latency (s) | Out-tokens | Reasoning tokens | Cost (USD) | Notes |
|-------|------------:|-----------:|-----------------:|-----------:|-------|
| **Claude 3.5 Haiku**          | 5.44 | 154 | —   | 0.0007 | Fast & cheap |
| **Claude 3 Opus**             | 19.01  | 320  | —   | 0.0253  | Premium accuracy |
| **Claude 3.7 Sonnet (low)**   | 36.68  | 288  | 738 | 0.0046 | Concise reasoning |
| **Claude 3.7 Sonnet (high)**  | —  | —  | — | — | Full rationale |


## 🔎 Latency · Cost · Quality snapshot example

| Model                          | Latency (s) | Out-tokens | Reasoning tokens | Cost (USD) | Notes                                                     |
|--------------------------------|------------:|-----------:|-----------------:|-----------:|-----------------------------------------------------------|
| **Claude 3.5 Haiku**           | **4.16**    | 193        | —                | **0.0965** | Fastest & cheapest—great for drafting                     |
| **Claude 3 Opus**              | 24.22       | 406        | —                | 0.0344          | *NO_ACCESS* (enable in Bedrock console to benchmark)      |
| **Claude 3.7 Sonnet (low)**    | 6.67        | 56         | 2 048            | 0.1680     | Adds concise chain-of-thought at modest extra cost        |
| **Claude 3.7 Sonnet (high)**   | 59.53       | 91         | 10 000           | 0.2730     | Full rationale; ≈ 14× latency and +\$0.18 vs Haiku        |

*Interpretation*

* Haiku 3.5 gives you a usable answer in ~4 s for < 10 cents.
* Low-budget Sonnet 3.7 surfaces its reasoning (≈ 2 k tokens) while adding only about 2.5 s of latency over Haiku (6.67 s vs 4.16 s).
* The high-budget preset is valuable when you **must** audit every step or the problem needs a long chain of reasoning (more than ~8 k tokens); otherwise the cost-latency trade-off is hard to justify.

## 🧠 2 · What Are We Paying for with `thinking`?

| Sonnet 3.7 preset | Reasoning tokens | Δ $ vs Haiku | Where it helped |
|-------------------|-----------------:|-------------:|-----------------|
| **low (2 048)**  | 2 048            | **+ 0.0715** | clarified loop invariants & explained complexity |
| **high (10 000)**| 10 000           | **+ 0.1765** | produced full design doc, edge-case tests & refactor suggestions |

**Insights**

* The *first* ~2 k reasoning tokens captured ≈ 90 % of useful explanation.  
* Jumping from **low** → **high** added 7 952 tokens and ≈ \$0.11 per call.  
* Unless you need an audit trail or pedagogical commentary, the *low*
  preset is the sweet-spot.


## 🛠️ 3 · Choosing the Right Model & Reasoning Budget (2025)

1. **Rapid prototyping / IDE autocomplete** → **Haiku 3.5**
   Lowest latency keeps the feedback loop tight.

2. **Mainline coding tasks** → **Sonnet 3.7 (low)**
   Better reasoning than Haiku, modest cost, transparent chain-of-thought.

3. **Hard algorithms or long reasoning chains** → **Sonnet 3.7 (high)**
   Use when the problem needs more than ~8 k reasoning tokens or a fully detailed rationale.

4. **Compliance / Audit**
   Persist the entire reasoning text plus `thinking_tokens` for audits.

5. **Opus**
   Enable only when you truly need the very best model and cost is secondary.

6. **Budget tip**
   Cost grows linearly with reasoning tokens, which are billed at the output rate.
   Trimming the chain from 10 k → 4 k saves ≈ 6 k · \$0.015 per 1 k = \$0.09 per call.
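
The savings arithmetic in the budget tip can be sketched as follows. `thinking_savings` is a hypothetical helper, and the prices are the Sonnet 3.7 figures from the pricing table. Which rate applies is the key assumption: Anthropic bills thinking tokens as output tokens, so the output rate gives the realistic figure, and the input rate is shown only for comparison.

```python
def thinking_savings(tokens_before: int, tokens_after: int,
                     price_per_1M: float) -> float:
    """USD saved per call by shrinking the reasoning chain."""
    return (tokens_before - tokens_after) * price_per_1M / 1_000_000


SONNET_IN_PER_1M = 3.0    # $ per million input tokens
SONNET_OUT_PER_1M = 15.0  # $ per million output tokens

at_output_rate = thinking_savings(10_000, 4_000, SONNET_OUT_PER_1M)
at_input_rate = thinking_savings(10_000, 4_000, SONNET_IN_PER_1M)

print(f"output rate: ${at_output_rate:.3f} per call")  # $0.090
print(f"input rate:  ${at_input_rate:.3f} per call")   # $0.018
```

The five-fold gap between the two figures is the ratio of output to input price, so it matters which rate the budget estimate uses.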


## 4 · How Many Reasoning Tokens Can We Afford?

Because reasoning tokens are billed **as output tokens**, we can compute the
*break-even* point where Sonnet 3.7 (low) becomes more expensive than
simply calling Haiku twice.

\[
\text{Break-even } R =
 \frac{2 \cdot C_\text{Haiku} - C_\text{Sonnet-low}}{p_\text{out}}
\]

where  

* \(C\) is the cost you just measured (prompt plus answer tokens, excluding reasoning), and  
* \(p_\text{out} = \$0.015\) per 1 000 tokens for Sonnet 3.7 output.

Let’s calculate it.


In [12]:
# grab costs from previous run
haiku_cost    = haiku['approx_cost_usd']
son_lo_cost   = lo37['approx_cost_usd']
price_out_tkn = lo37['price_out_per_1M'] / 1_000_000  # $ per token (thinking is billed as output)

break_even_tokens = max(0, round((2*haiku_cost - son_lo_cost) / price_out_tkn))
print(f"🔢  If your Haiku answer is still wrong, you could spend "
      f"{break_even_tokens:,} reasoning tokens in Sonnet-low "
      "before calling Haiku a second time becomes cheaper.")


🔢  If your Haiku answer is still wrong, you could spend 0 reasoning tokens in Sonnet-low before calling Haiku a second time becomes cheaper.


## 5 · Where Does the Money Go?

The next chart decomposes total price into three buckets:

1. **Prompt** tokens  
2. **Answer** tokens  
3. **Reasoning** tokens  *(Sonnet 3.7 only)*

Feel free to re-run with multiple prompts and compare profiles.
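
When comparing profiles across several prompts, averaging the per-call metrics per model is a natural first step. The sketch below assumes result rows shaped like the dicts returned by `invoke`. `average_metrics` is a hypothetical helper, and the rows in `runs` are made-up placeholders rather than measurements.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Sequence


def average_metrics(runs: List[Dict[str, Any]],
                    metrics: Sequence[str] = ("latency_s", "approx_cost_usd"),
                    ) -> Dict[str, Dict[str, float]]:
    """Average the chosen metrics per model name, skipping failed calls."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in runs:
        if row.get("status") == "OK":
            grouped[row["name"]].append(row)
    return {
        name: {m: mean(r[m] for r in rows) for m in metrics}
        for name, rows in grouped.items()
    }


# Placeholder rows for illustration only (not real benchmark results)
runs = [
    {"name": "model-a", "status": "OK", "latency_s": 5.0, "approx_cost_usd": 0.001},
    {"name": "model-a", "status": "OK", "latency_s": 7.0, "approx_cost_usd": 0.003},
    {"name": "model-b", "status": "ERROR", "error": "placeholder failure"},
]
avg = average_metrics(runs)
print(avg["model-a"]["latency_s"])  # 6.0
```

Failed calls are dropped rather than counted as zero, so a model with one error and one success is averaged over its successful calls only.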


In [13]:
import matplotlib.pyplot as plt

# keep only the rows that actually completed
models = [r for r in (haiku, opus, lo37) if r and r.get("status") == "OK"]
labels = ["Prompt", "Answer", "Thinking"]

def cost_parts(r):
    # result rows carry $ per million prices, not a `rates` entry
    pin  = r['price_in_per_1M']  / 1_000_000
    pout = r['price_out_per_1M'] / 1_000_000
    return [
        r['tokens_in']  * pin,
        r['tokens_out'] * pout,
        r['thinking_tokens'] * pout,   # thinking tokens are billed as output
    ]

parts = list(map(cost_parts, models))
parts_T = list(zip(*parts))  # transpose

fig, ax = plt.subplots(figsize=(6,4))
bottom = [0]*len(models)
colors = ["#6baed6", "#74c476", "#fdae6b"]

for comp, clr, lbl in zip(parts_T, colors, labels):
    # result rows have no `key` field, so label the bars by model name
    ax.bar([m['name'] for m in models], comp, bottom=bottom,
           label=lbl, width=0.55, color=clr)
    bottom = [b+c for b,c in zip(bottom, comp)]

ax.set_ylabel("Cost (USD)")
ax.set_title("Cost Breakdown per Model")
ax.legend(frameon=False)
plt.show()

---

## ▶️ Next Step — Compare with Google Gemini Pro 2.5 (Vertex AI)

We’ll now switch clouds and run the **same questions** against
**Gemini Pro 2.5** (free quota on Vertex AI as of April 2025).

1. Create a new notebook section titled **“Gemini Pro 2.5 Benchmarks”**.  
2. Authenticate with `gcloud auth login` or Workload Identity.  
3. Use the `google.generativeai` Python SDK (`pip install -q google-generativeai`).  
4. Collect the same metrics (latency, input/output tokens, cost = \$0 for free tier).  

> *Stretch goal*: compare Gemini’s answers with Sonnet’s reasoning chain—  
> does the free model arrive at similar conclusions without an explicit
> chain-of-thought?

Let’s see how a no-cost alternative stacks up!
