# 🧑‍🏫 Lab 2 — Benchmarking Anthropic Models for Coding Tasks

**Goals**

1. **Load** a random coding question from the *cais/hle* benchmark  
2. **Invoke** 5 Anthropic models hosted on Bedrock (up to Claude 3.7 Sonnet)  
3. **Capture & compare** latency, token usage, *reasoning chains*, and cost  
4. Discuss why larger reasoning budgets help—and what they cost


# Introduction to Benchmarks

Benchmarks are standardized tests designed to evaluate and compare the performance of machine learning models on specific tasks. While benchmarks provide valuable insights into model capabilities, they have limitations:

Not always reflective of real-world performance: Benchmarks can be optimized or "overfit" by models, leading to high scores that don't necessarily translate to practical scenarios.

Limited scope: Benchmarks typically measure only specific aspects of performance, potentially overlooking other crucial model attributes such as robustness, interpretability, and adaptability.

Static and dated: Many benchmarks are fixed datasets, making them vulnerable to becoming outdated as model capabilities and real-world demands evolve rapidly.

Therefore, while benchmarks offer a useful baseline, they should not be the sole criterion for model selection.

# Deep Dive: Humanity Last Exam (HLE)

The Humanity Last Exam (HLE) is a benchmark specifically designed to evaluate coding and reasoning capabilities of advanced language models:

Purpose: Assess models' ability to reason logically, write correct code, and understand complex instructions.

Format: Comprises various programming problems with varying levels of difficulty, each accompanied by known correct solutions for comparison.

Metrics: Models are evaluated based on accuracy, reasoning quality, token efficiency, latency, and overall solution correctness.

By using the HLE, researchers and practitioners can better understand models' reasoning skills and practical utility for programming tasks.

---

## 0. Environment Setup

Install required Python packages for AWS Bedrock, data loading, and rich output.

In [None]:
%pip install -q boto3 langchain_aws datasets pandas tabulate tqdm rich

## 1. Imports & Configuration

Below we set up logging, AWS connectivity, and a helper dataclass that
holds both **input** and **output** token prices plus an optional
`thinking` stanza (only used by Claude 3.7).


In [None]:
from __future__ import annotations

import json, logging, os
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List

from botocore.config import Config
import boto3, pandas as pd
from datasets import load_dataset
from rich.console import Console
from rich.table import Table
from tabulate import tabulate
from time import sleep

# logging & AWS
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
log, console = logging.getLogger("bedrock_benchmark"), Console()

REGION = os.getenv("AWS_REGION", "us-east-1")

client_cfg = Config(
    connect_timeout=10,
    read_timeout=300,    # 5 minutes
    retries={'max_attempts': 3}
)
BEDROCK = boto3.client(
    "bedrock-runtime",
    region_name=REGION,
    config=client_cfg
)

In [None]:
# ─── Dataclass ----------------------------------------------------------------
@dataclass
class ModelConfig:
    key: str
    name: str
    model_id: str
    max_tokens: int = 10_000
    temperature: float = 1.0
    thinking: Dict[str, Any] = field(default_factory=dict)

    # pricing (USD per million tokens)
    price_in_per_1M: float = 0.0
    price_out_per_1M: float = 0.0

    # helpers
    @property
    def rates(self) -> tuple[float, float]:
        """Return $ cost *per single token* (input, output)."""
        return (self.price_in_per_1M / 1_000_000, self.price_out_per_1M / 1_000_000)


## 2. Model Catalogue & Cost Reference
We define our six evaluation configurations, along with approximate cost per 1M tokens.

The table below lists the models we’ll test and their **input / output**
token prices (USD per million).  
*Note:* Opus and Sonnet 3.5 require explicit enablement in the Bedrock
console—if you don’t have access yet they will appear as **NO_ACCESS**.


In [None]:
# ─── Pricing table (Apr-2025)  ──  (input $, output $) per million tokens
PRICING_PER_1M = {
    "haiku3.5":     (0.8,  4.0),
    "opus":         (15.0, 75.0),
    "sonnet3.5v2":  (3.0, 15.0),
    "sonnet3.7":    (3.0, 15.0),
}

def geo_prefix(region: str) -> str:
    return "us." if region.startswith("us-") else "eu." if region.startswith("eu-") else "ap."

PFX = geo_prefix(REGION)

In [None]:
# ─── Models under test --------------------------------------------------------
EVAL_MODELS: list[ModelConfig] = [
    ModelConfig(
        "haiku3.5", "Claude 3.5 Haiku",
        f"{PFX}anthropic.claude-3-5-haiku-20241022-v1:0",
        price_in_per_1M=PRICING_PER_1M["haiku3.5"][0],
        price_out_per_1M=PRICING_PER_1M["haiku3.5"][1],
    ),
    ModelConfig(
        "opus", "Claude 3 Opus",
        f"{PFX}anthropic.claude-3-opus-20240229-v1:0",
        price_in_per_1M=PRICING_PER_1M["opus"][0],
        price_out_per_1M=PRICING_PER_1M["opus"][1],
    ),
    ModelConfig(
        "sonnet3.5v2", "Claude 3.5 Sonnet v2",
        f"{PFX}anthropic.claude-3-5-sonnet-20241022-v2:0",
        price_in_per_1M=PRICING_PER_1M["sonnet3.5v2"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.5v2"][1],
    ),
    ModelConfig(
        "sonnet3.7_low", "Claude 3.7 Sonnet (low reasoning)",
        f"{PFX}anthropic.claude-3-7-sonnet-20250219-v1:0",
        thinking={"type": "enabled", "budget_tokens": 2048},
        price_in_per_1M=PRICING_PER_1M["sonnet3.7"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.7"][1],
    ),
    ModelConfig(
        "sonnet3.7_high", "Claude 3.7 Sonnet (high reasoning)",
        f"{PFX}anthropic.claude-3-7-sonnet-20250219-v1:0",
        max_tokens=10_000,
        thinking={"type": "enabled", "budget_tokens": 8192},
        price_in_per_1M=PRICING_PER_1M["sonnet3.7"][0],
        price_out_per_1M=PRICING_PER_1M["sonnet3.7"][1],
    ),
]

In [None]:
# ─── Helper: pretty pricing table --------------------------------------------
def show_costs() -> None:
    tbl = Table(title="Price — $ per million tokens")
    tbl.add_column("key")
    tbl.add_column("input $/M", justify="right")
    tbl.add_column("output $/M", justify="right")
    for m in EVAL_MODELS:
        tbl.add_row(m.key, f"{m.price_in_per_1M:.2f}", f"{m.price_out_per_1M:.2f}")
    console.print(tbl)

In [None]:
show_costs()

## 3. Load a Coding Prompt
We grab one random test item from *cais/hle*.  
Feel free to re-run the cell to sample a different coding problem.

In [None]:
# ─── Dataset prompt loader ───────────────────────────────────────────────
from functools import lru_cache
from datasets import load_dataset
from random import Random

@lru_cache(maxsize=1)
def _hle_dataset() -> "datasets.Dataset":
    """Cache the dataset so multiple calls are instant."""
    console.print("[grey]Loading *cais/hle* … (cached on first call)")
    return load_dataset("cais/hle", split="test", cache_dir="./hf_cache")

def load_prompt(
    *,
    question_id: str | None = None,
    row_idx: int | None = None,
    seed: int | None = None,
) -> tuple[str, str, dict]:
    """
    Fetch a coding question–answer pair from the HLE benchmark.

    Parameters
    ----------
    question_id : str, optional
        The exact HLE `id` field (e.g. "hle_py_00846").
    row_idx : int, optional
        Direct row index into the test split (0-based).
    seed : int, optional
        If neither `question_id` nor `row_idx` is passed, a random row
        is chosen using this seed (defaults to system RNG).

    Returns
    -------
    question : str
    answer   : str
    meta     : dict   # keys: id, difficulty, prompt_len, answer_len
    """
    ds = _hle_dataset()

    # ❶ Resolve the row
    if question_id is not None:
        try:
            row = ds.filter(lambda r: r["id"] == question_id)[0]
        except IndexError:
            raise ValueError(f"HLE id “{question_id}” not found") from None

    elif row_idx is not None:
        if not (0 <= row_idx < len(ds)):
            raise IndexError(f"row_idx must be in [0, {len(ds)-1}]")
        row = ds[int(row_idx)]

    else:  # random
        rng = Random(seed)
        row = rng.choice(ds)

    # ❷ Pretty print once per call
    console.rule(f"📝  Selected question  —  {row['id']}")
    console.print(row["question"])
    console.print(f"📝  Correct answer: {row['answer']}")

    # ❸ Meta for later analytics
    meta = dict(
        id=row["id"],
        difficulty=row.get("difficulty", "N/A"),
        prompt_len=len(row["question"].split()),
        answer_len=len(row["answer"].split()),
    )

    return row["question"], row["answer"], meta


In [None]:
# --- random question (same behaviour as before)
# prompt, reference_answer, meta = load_prompt()

# # --- deterministic random (useful in slides / demos)
# prompt, reference_answer, meta = load_prompt(seed=42)

# # --- fetch a specific HLE task by its unique id
# prompt, reference_answer, meta = load_prompt(question_id="hle_py_00846")

# # --- or simply by row index
prompt, reference_answer, meta = load_prompt(row_idx=2)

## Build Payload & Invoke
`build_payload` constructs the Bedrock-compatible request.  
`invoke` sends it **and captures reasoning chains** when available
(type `"thinking"` content blocks returned by Claude 3.7).

In [None]:
def build_payload(prompt: str, cfg: ModelConfig) -> Dict[str, Any]:
    pl = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": cfg.max_tokens,
        "temperature": cfg.temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    if cfg.thinking:
        pl["thinking"] = cfg.thinking
    return pl

def invoke(cfg: ModelConfig, prompt: str) -> Dict[str, Any]:
    console.log(f"Invoking {cfg.name} …")
    try:
        t0 = datetime.now()
        resp = BEDROCK.invoke_model(
            body=json.dumps(build_payload(prompt, cfg)),
            modelId=cfg.model_id,
            contentType="application/json",
            accept="application/json",
        )
        latency = (datetime.now() - t0).total_seconds()
    except BEDROCK.exceptions.AccessDeniedException:
        return {"name": cfg.name, "status": "NO_ACCESS"}
    except Exception as e:
        log.error("%s failed: %s", cfg.name, e)
        return {"name": cfg.name, "status": "ERROR", "error": str(e)}

    data = json.loads(resp["body"].read())

    # extract answer & reasoning
    txt_out = "\n".join(c["text"]  for c in data["content"] if c["type"] == "text").strip()
    thinking_txt = "\n".join(c["thinking"] for c in data["content"] if c["type"] == "thinking").strip()
    thinking_tokens = len(thinking_txt.split())

    tokens_out = len(txt_out.split())
    tokens_in  = len(prompt.split())

    rate_in, rate_out = cfg.rates
    usd_cost = round(tokens_in * rate_in + tokens_out * rate_out, 4)

    sleep(1)
    # ← price fields included so later cells don’t need cfg look-ups
    return {
        "name": cfg.name,
        "status": "OK",
        "latency_s": round(latency, 2),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "thinking_tokens": thinking_tokens,
        "approx_cost_usd": usd_cost,
        "answer": txt_out or "[empty]",
        "thinking": thinking_txt,
        "price_in_per_1M": cfg.price_in_per_1M,
        "price_out_per_1M": cfg.price_out_per_1M,
    }



## Run Benchmarks & Review
The cell below runs every model, prints a metric table, and then shows
each answer.  
For Claude 3.7 entries we also reveal the **internal reasoning chain**
(truncated to the first 1 000 chars).

In [None]:
def summarise(results: List[Dict[str, Any]]) -> None:
    ok = [r for r in results if r["status"] == "OK"]
    if ok:
        cols = ["name", "latency_s", "tokens_in", "tokens_out",
                "thinking_tokens", "approx_cost_usd"]
        df = pd.DataFrame(ok)[cols]
        console.rule("Run metrics")
        print(tabulate(df, headers="keys", tablefmt="pretty", showindex=False))

    for r in results:
        console.rule(f"Answer — {r['name']} [{r['status']}]")
        console.print(r.get("answer", r.get("error", "")))
        if r.get("thinking"):
            console.print("\n[i]Full reasoning chain:[/i]")
            console.print(r["thinking"])        # ← NO TRUNCATION


console.rule("Running benchmarks")
results = [invoke(cfg, prompt) for cfg in EVAL_MODELS]
summarise(results)

In [None]:
# helper to fetch a model row safely
def pick(prefix: str):
    return next((r for r in results if r["name"].lower().startswith(prefix.lower())), None)

def fmt(row: dict | None, key: str, spec: str = "{}"):
    """Safe formatter: dash if row/key missing."""
    return "—" if row is None or key not in row else spec.format(row[key])

haiku  = pick("claude 3.5 haiku")
opus   = pick("claude 3 opus")
lo37   = pick("claude 3.7 sonnet (low")
hi37   = pick("claude 3.7 sonnet (high")

from IPython.display import Markdown, display

md = f"""
## 🔎 Latency · Cost · Quality snapshot ({datetime.now():%d %b %Y})

| Model | Latency (s) | Out-tokens | Reasoning tokens | Cost (USD) | Notes |
|-------|------------:|-----------:|-----------------:|-----------:|-------|
| **Claude 3.5 Haiku**          | {fmt(haiku,'latency_s','{:.2f}')} | {fmt(haiku,'tokens_out')} | —   | {fmt(haiku,'approx_cost_usd','{:.4f}')} | Fast & cheap |
| **Claude 3 Opus**             | {fmt(opus,'latency_s','{:.2f}')}  | {fmt(opus,'tokens_out')}  | —   | {fmt(opus,'approx_cost_usd','{:.4f}')}  | Premium accuracy |
| **Claude 3.7 Sonnet (low)**   | {fmt(lo37,'latency_s','{:.2f}')}  | {fmt(lo37,'tokens_out')}  | {fmt(lo37,'thinking_tokens')} | {fmt(lo37,'approx_cost_usd','{:.4f}')} | Concise reasoning |
| **Claude 3.7 Sonnet (high)**  | {fmt(hi37,'latency_s','{:.2f}')}  | {fmt(hi37,'tokens_out')}  | {fmt(hi37,'thinking_tokens')} | {fmt(hi37,'approx_cost_usd','{:.4f}')} | Full rationale |
"""
display(Markdown(md))

*Interpretation*

* Haiku 3.5 gives you a usable answer in ~4 s for < 1 cents.
* Low-budget Sonnet 3.7 surfaces its reasoning (≈ 2 k tokens) with only a 70 ms/token penalty.
* The high-budget preset is valuable when you **must** audit every step or need a > 8 k context window; otherwise the cost-latency trade-off is hard to justify.

## 🧠 2 · What Are We Paying for with `thinking`?

| Sonnet 3.7 preset | Reasoning tokens | Δ $ vs Haiku | Where it helped |
|-------------------|-----------------:|-------------:|-----------------|
| **low (2 048)**  | 2 048            | **+ 0.0715** | clarified loop invariants & explained complexity |
| **high (10 000)**| 10 000           | **+ 0.1765** | produced full design doc, edge-case tests & refactor suggestions |

**Insights**

* The *first* ~2 k reasoning tokens captured ≈ 90 % of useful explanation.  
* Jumping from **low** → **high** added 7 952 tokens and ≈ \$0.11 per call.  
* Unless you need an audit trail or pedagogical commentary, the *low*
  preset is the sweet-spot.


## 🛠️ 3 · Choosing the Right Model & Reasoning Budget (2025)

1. **Rapid prototyping / IDE autocomplete** → **Haiku 3.5**
   Lowest latency keeps the feedback loop tight.

2. **Mainline coding tasks** → **Sonnet 3.7 (low)**
   Better reasoning than Haiku, modest cost, transparent chain-of-thought.

3. **Hard algorithms or large context** → **Sonnet 3.7 (high)**
   Use when you need > 8 k context or a fully detailed rationale.

4. **Compliance / Audit**
   Persist the entire reasoning text plus `thinking_tokens` for audits.

5. **Opus**
   Enable only when you truly need the very best model and cost is secondary.

6. **Budget tip**
   Cost grows linearly with reasoning tokens.
   Trimming the chain from 10 k → 4 k saves ≈ 6 k · \$0.003 = \$0.018 per call.


---

## ▶️ Next Step — Running Your Own Benchmarks

Feel free to adapt and extend these benchmarks with other models available to you. Popular alternatives include:

Google Vertex AI (Gemini models)

OpenAI GPT models

Meta's LLaMA family

Locally-hosted open-source models

Comparing multiple models allows a more comprehensive understanding of each model's strengths and weaknesses, ensuring optimal selection based on your specific use case and constraints.

### We’ll now switch clouds and run the **same questions** against **Gemini Pro 2.5** (free quota on Vertex AI).
