# pass@k evaluation — single-model (dannkoh/warp-1.0)

## Overview
This notebook evaluates a single local model (`dannkoh/warp-1.0`) on the WARP benchmark (`dannkoh/WARP-benchmark`, `test` split) using pass@k, with m = 10 sampled candidates per problem and k = 1..10. Prompts are not wrapped in an instruct-style chat template; the preset instruction below is applied to every prompt instead. Results are saved as JSON in `src/results_dannkoh/pass@k/`.

## Preset instruction (applied to every prompt)
````text
All per-variable constraints must be combined using a top-level (assert (and ...)) clause.
The output must be in exact, canonical SMT-LIB format without extra commentary in the constraint string.
Show your work in <think> </think> tags. And return the final SMT-LIB constraint string in <answer> </answer> tags.
For example: <answer>(assert (and  ( >=  in0 97)  ( <=  in0 122)))</answer>.
````

## Configuration
- Model: `dannkoh/warp-1.0` (single model)
- Dataset: `dannkoh/WARP-benchmark`, split=`test`
- Samples per problem: m = 10
- k values: integers 1 through 10
- Results dir: `src/results_dannkoh/pass@k/`
- Output format: JSON (per-problem and overall summary)
- Decoding (fixed) — documented in metadata:
  - temperature = 0.8
  - top_p = 0.95
  - seed = 42 (use seed offsets per sample to produce independent draws)
  - max_tokens = model-appropriate (e.g., 8000)

## Methodology
1. For each test item:
   - Compose prompt with the preset instruction (no instruct-style wrapper).
   - Generate m = 10 independent sampled candidate responses.
   - Extract SMT-LIB answer from `<answer>...</answer>` tags.
   - Use the repository Z3-based checker (check_logical_equivalence) to mark each candidate correct/incorrect.
2. Compute exact pass@k per problem using the combinatorial formula (below) for k = 1..10.
3. Aggregate pass@k across all problems by averaging per-problem pass@k values.
4. Save per-problem records and overall summary JSON files to `src/results_dannkoh/pass@k/`.
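
The extraction step above can be sketched as follows. This is only an illustration, not the repository's `Loader.extract_response`: the helper name `extract_answer` and the choice to take the last `<answer>` block when several are present are assumptions made for this example.

```python
import re
from typing import Optional

# Matches the content of an <answer>...</answer> block, including newlines.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the stripped content of the last <answer> block, or None.

    Hypothetical helper for illustration; the repository's extractor may
    treat multiple or malformed tags differently.
    """
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


sample = (
    "<think>in0 is a lowercase ASCII letter</think>"
    "<answer>(assert (and  ( >=  in0 97)  ( <=  in0 122)))</answer>"
)
print(extract_answer(sample))
print(extract_answer("no tags here"))
```

A response without a well-formed `<answer>` block yields `None` here, which a caller would record as an extraction failure and count as incorrect.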

## pass@k (exact) formula
For a problem with m samples and c correct samples among them:
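pass@k = 1 − C(m − c, k) / C(m, k)  for 1 ≤ k ≤ m, where C(n, k) is the binomial coefficient and C(n, k) = 0 when k > n. Consequently pass@k = 0 when c = 0, and pass@k = 1 whenever k > m − c.

The estimator is not defined for k > m; in that case this notebook reports pass@k = 1 if c > 0 and 0 otherwise.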

Compute this per problem and then average across problems for each k.
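
As a worked example of the formula, the sketch below evaluates it for m = 10 and c = 3. The function names are chosen for this example; `pass_at_k_product` is an algebraically equivalent running-product form that avoids large binomial coefficients, included only as a cross-check.

```python
import math


def pass_at_k(m: int, c: int, k: int) -> float:
    """Exact pass@k for m samples with c correct, valid for 1 <= k <= m."""
    if not 0 <= c <= m or not 1 <= k <= m:
        raise ValueError("require 0 <= c <= m and 1 <= k <= m")
    # math.comb(n, k) is 0 when k > n, so the ratio vanishes once k > m - c.
    return 1.0 - math.comb(m - c, k) / math.comb(m, k)


def pass_at_k_product(m: int, c: int, k: int) -> float:
    """Same quantity written as a product over i = m - c + 1 .. m."""
    if m - c < k:
        return 1.0
    prod = 1.0
    for i in range(m - c + 1, m + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


# m = 10 samples, c = 3 correct
for k in (1, 2, 8):
    print(k, round(pass_at_k(10, 3, k), 4), round(pass_at_k_product(10, 3, k), 4))
```

With m = 10 and c = 3, pass@1 is 0.3 (the fraction of correct samples), pass@2 is 1 − 21/45 ≈ 0.533, and pass@8 is 1 because any 8 of the 10 samples must include at least one of the 3 correct ones.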

## Per-problem JSON schema (saved to individual_stats.json)
- index: int
- tier: string
- prompt: string
- responses: list[string] (length m)
- extracted: list[string] (length m)
- correctness: list[bool] (length m)
- reasons: list[Optional[string]] (Z3/parse failure reasons)
- c: int (number correct)
- pass_at_k: dict { "1": float, ..., "10": float }
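
A record in this schema can be checked for internal consistency before it is written. The sketch below is a hypothetical helper, not part of the repository; `validate_record` and the example tier value `"small"` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def validate_record(rec: Dict[str, Any], m: int, k_values: List[int]) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    # Every per-sample list should have exactly m entries.
    for field in ("responses", "extracted", "correctness", "reasons"):
        if len(rec[field]) != m:
            problems.append(f"{field} has length {len(rec[field])}, expected {m}")
    # c must equal the number of True entries in correctness.
    if rec["c"] != sum(bool(v) for v in rec["correctness"]):
        problems.append("c does not match the number of True entries in correctness")
    # pass_at_k is keyed by the string form of each k.
    missing = [str(k) for k in k_values if str(k) not in rec["pass_at_k"]]
    if missing:
        problems.append(f"pass_at_k is missing keys: {missing}")
    return problems


# Toy record with m = 2 (the tier name is made up for this example).
example = {
    "index": 0,
    "tier": "small",
    "prompt": "...",
    "responses": ["<answer>(assert (>= in0 97))</answer>", "no answer"],
    "extracted": ["(assert (>= in0 97))", ""],
    "correctness": [True, False],
    "reasons": [None, "extraction_failure"],
    "c": 1,
    "pass_at_k": {"1": 0.5, "2": 1.0},
}
print(validate_record(example, m=2, k_values=[1, 2]))
```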

## Overall JSON schema (saved to overall_stats.json)
- model: "dannkoh/warp-1.0"
- dataset: "dannkoh/WARP-benchmark"
- m: 10
- k_values: [1,2,...,10]
- decoding: {temperature, top_p, seed, ...}
- total_problems: int
- pass_at_k: {"1": float, ..., "10": float}

Example:
````json
{
  "model": "dannkoh/warp-1.0",
  "dataset": "dannkoh/WARP-benchmark",
  "m": 10,
  "k_values": [1,2,3,4,5,6,7,8,9,10],
  "decoding": {"temperature": 0.8, "top_p": 0.95, "seed": 42},
  "total_problems": 1234,
  "pass_at_k": {"1": 0.12, "2": 0.17, "3": 0.20, "4": 0.22, "5": 0.24, "6": 0.25, "7": 0.26, "8": 0.27, "9": 0.27, "10": 0.28}
}
````
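
The overall `pass_at_k` block is the unweighted mean of the per-problem values for each k. A minimal sketch of that aggregation, assuming records shaped like the per-problem schema (the function name `aggregate_pass_at_k` is chosen for this example):

```python
from typing import Any, Dict, List


def aggregate_pass_at_k(records: List[Dict[str, Any]], k_values: List[int]) -> Dict[str, float]:
    """Average per-problem pass@k values for each k (every problem weighted equally)."""
    if not records:
        raise ValueError("no records to aggregate")
    return {
        str(k): sum(rec["pass_at_k"][str(k)] for rec in records) / len(records)
        for k in k_values
    }


# Two toy problems; only the fields needed for aggregation are shown.
records = [
    {"index": 0, "pass_at_k": {"1": 0.0, "2": 0.0}},
    {"index": 1, "pass_at_k": {"1": 0.5, "2": 1.0}},
]
print(aggregate_pass_at_k(records, [1, 2]))  # {'1': 0.25, '2': 0.5}
```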

## Reproducibility and choices
- Fix decoding hyperparameters and record them in the JSON header.
- Use deterministic RNG seed base (e.g., 42) and vary by sample index (42 + i) to create independent draws while allowing reproducibility.
- Choose whether to deduplicate identical extracted SMT-LIB strings before counting c, and document the choice (recommended: do not deduplicate unless justified).
- Log any parse/extraction failures and include reasons in per-problem output.
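
The seed and deduplication choices above can be made explicit in code. The sketch below is illustrative only: `sample_seeds` and `count_samples` are hypothetical helpers, and how a seed is actually passed to the generation backend is not shown here.

```python
from typing import List, Tuple


def sample_seeds(base_seed: int, m: int) -> List[int]:
    """One seed per sample index: base_seed + i for i in 0..m-1."""
    return [base_seed + i for i in range(m)]


def count_samples(
    extracted: List[str], correctness: List[bool], dedupe: bool = False
) -> Tuple[int, int]:
    """Return (number of samples counted, number correct).

    With dedupe=True each distinct extracted string is counted once,
    which changes both numbers fed into the pass@k formula.
    """
    if not dedupe:
        return len(correctness), sum(bool(v) for v in correctness)
    seen = set()
    total = correct = 0
    for ans, ok in zip(extracted, correctness):
        if ans in seen:
            continue
        seen.add(ans)
        total += 1
        correct += int(bool(ok))
    return total, correct


print(sample_seeds(42, 3))
print(count_samples(["a", "a", "b"], [True, True, False]))
print(count_samples(["a", "a", "b"], [True, True, False], dedupe=True))
```

Deduplicating shrinks the effective number of samples as well as c, so the resulting values are no longer directly comparable to a run with a fixed m = 10; the chosen mode should be recorded alongside the results.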

## Post-processing & visualization
- Use the overall `pass_at_k` curve (k = 1..10) to show how performance improves with more samples.
- Optionally break down pass@k by tier and save tiered summaries compatible with the repository aggregator.
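
A per-tier breakdown can be derived from the per-problem records by grouping on `tier` before averaging. This is a minimal sketch; the function name and the output layout are assumptions and may need adapting to whatever format the repository aggregator expects.

```python
from collections import defaultdict
from typing import Any, Dict, List


def pass_at_k_by_tier(
    records: List[Dict[str, Any]], k_values: List[int]
) -> Dict[str, Dict[str, Any]]:
    """Group records by tier and average pass@k within each group."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for rec in records:
        grouped[rec.get("tier", "unknown")].append(rec)

    summary: Dict[str, Dict[str, Any]] = {}
    for tier, recs in sorted(grouped.items()):
        summary[tier] = {
            "total_problems": len(recs),
            "pass_at_k": {
                str(k): sum(r["pass_at_k"][str(k)] for r in recs) / len(recs)
                for k in k_values
            },
        }
    return summary


# Toy records; tier names are made up for this example.
toy = [
    {"tier": "a", "pass_at_k": {"1": 0.0}},
    {"tier": "a", "pass_at_k": {"1": 1.0}},
    {"tier": "b", "pass_at_k": {"1": 0.2}},
]
print(pass_at_k_by_tier(toy, [1]))
```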

## Running summary
- Ensure HF token and vLLM environment are configured.
- Run the notebook/script that implements the above steps and writes JSON to `src/results_dannkoh/pass@k/`.
- Feed the results into `aggregate.py` (adapted as needed) for tiered summaries and percentages.

In [None]:
from __future__ import annotations

import json
import math
import os
from pathlib import Path
from typing import Any, Dict, List, Optional

import matplotlib.pyplot as plt
from datasets import load_dataset
from tqdm.auto import tqdm

from utils.configs import ModelConfig
from utils.evaluation import LLMHelper, Loader, check_logical_equivalence

# Configuration (edit as needed)
MODEL = "dannkoh/warp-1.0"
DATASET = "dannkoh/WARP-benchmark"
SPLIT = "test"
M = 10
K_VALUES = list(range(1, 11))
BATCH_SIZE = 64
RESULTS_DIR = Path("results_dannkoh/pass@k")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

In [2]:
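def compute_pass_at_k(m: int, c: int, k_values: List[int]) -> Dict[int, float]:
    """Exact pass@k for one problem with m samples, c of them correct."""
    out: Dict[int, float] = {}
    for k in k_values:
        if c == 0:
            # No correct sample: no subset of any size can pass.
            out[k] = 0.0
        elif k >= m:
            # c > 0 here, so taking all m samples always includes a correct one.
            out[k] = 1.0
        else:
            # Probability that a random k-subset of the m samples contains a correct one.
            out[k] = 1.0 - math.comb(m - c, k) / math.comb(m, k)
    return out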


def run_passk(
    model_id: str,
    dataset_id: str = DATASET,
    split: str = SPLIT,
    m: int = M,
    k_values: Optional[List[int]] = None,
    batch_size: int = BATCH_SIZE,
    results_dir: Path = RESULTS_DIR,
) -> Dict[str, Any]:
    if k_values is None:
        k_values = list(range(1, m + 1))

    modelcfg = ModelConfig(
        model_name=model_id,
        quantization_mode=None,
        token=os.getenv("HUGGINGFACE_TOKEN"),
        instruct=False,
    )
    llm = LLMHelper(modelconfig=modelcfg)


    ds = load_dataset(dataset_id, split=split)
    items = []
    for idx, ex in enumerate(ds):
        question_field = ex.get("question") or ex.get("prompt") or ex.get("input") or ex.get("problem") or ""
        prompt = Loader.apply_chat_template(prompt=question_field, instruct=False)
        items.append({
            "index": int(ex.get("index", idx)),
            "tier": ex.get("tier", "unknown"),
            "prompt": prompt,
            "truth": (ex.get("answer") or "").strip(),
            "constants": ex.get("constants", None),
        })

    records: List[Dict[str, Any]] = []
    total = len(items)
    if total == 0:
        raise RuntimeError("No examples found in dataset/split")

    for batch_start in tqdm(range(0, total, batch_size), desc="Batches"):
        batch = items[batch_start: batch_start + batch_size]
        prompts = [b["prompt"] for b in batch]
        responses_per_example: List[List[str]] = [[] for _ in batch]

        for _ in range(m):
            outs = llm.get_response(prompts)
            if len(outs) != len(prompts):
                outs = (outs + [""] * len(prompts))[: len(prompts)]
            for i, o in enumerate(outs):
                responses_per_example[i].append(o)

        for meta, responses in zip(batch, responses_per_example):
            extracted: List[str] = []
            correctness: List[bool] = []
            reasons: List[Optional[str]] = []
            for resp in responses:
                try:
                    ans = Loader.extract_response(resp)
                except Exception as e:
                    ans = ""
                    extracted.append(ans)
                    correctness.append(False)
                    reasons.append(f"extraction_failure: {e}")
                    continue

                extracted.append(ans)
                try:
                    result = check_logical_equivalence(
                        original_assertions=meta["truth"],
                        generated_assertions=ans,
                        constants=meta["constants"],
                    )
                    is_correct = bool(result.get("result", False))
                    reason = result.get("reason")
                except Exception as e:
                    is_correct = False
                    reason = f"checker_exception: {e}"

                correctness.append(is_correct)
                reasons.append(reason)

            c = sum(1 for v in correctness if v)
            pass_at_k = compute_pass_at_k(m=m, c=c, k_values=k_values)

            records.append({
                "index": meta["index"],
                "tier": meta["tier"],
                "prompt": meta["prompt"],
                "responses": responses,
                "extracted": extracted,
                "correctness": correctness,
                "reasons": reasons,
                "c": int(c),
                "pass_at_k": {str(k): pass_at_k[k] for k in k_values},
            })

    overall_pass: Dict[str, float] = {str(k): 0.0 for k in k_values}
    for rec in records:
        for k in k_values:
            overall_pass[str(k)] += rec["pass_at_k"][str(k)]
    if records:
        for k in k_values:
            overall_pass[str(k)] /= len(records)

    overall = {
        "model": model_id,
        "dataset": dataset_id,
        "split": split,
        "m": m,
        "k_values": [int(k) for k in k_values],
        "total_problems": len(records),
        "pass_at_k": overall_pass,
    }

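    # Create the target directory in case a non-default results_dir was passed in.
    results_dir.mkdir(parents=True, exist_ok=True)
    # Write UTF-8 explicitly so the files do not depend on the platform default encoding.
    (results_dir / "individual_stats.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
    (results_dir / "overall_stats.json").write_text(json.dumps(overall, indent=2), encoding="utf-8")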

    return overall

In [None]:
overall = run_passk(
    model_id=MODEL,
    dataset_id=DATASET,
    split=SPLIT,
    m=M,
    k_values=K_VALUES,
    batch_size=BATCH_SIZE,
    results_dir=RESULTS_DIR,
)

print("Overall summary:")
print(json.dumps(overall, indent=2))

# plot pass@k curve
k_vals = [int(k) for k in overall["k_values"]]
p_vals = [overall["pass_at_k"][str(k)] for k in k_vals]
plt.figure(figsize=(6,4))
plt.plot(k_vals, p_vals, marker="o")
plt.xlabel("k")
plt.ylabel("pass@k")
plt.title(f"pass@k — {MODEL} on {DATASET}")
plt.grid(True)
plt.xticks(k_vals)
plt.tight_layout()
plt.savefig(RESULTS_DIR / "pass_at_k_curve.png", dpi=300)

INFO 08-19 11:23:53 [__init__.py:241] Automatically detected platform cpu.
INFO 08-19 11:23:54 [utils.py:326] non-default args: {'model': 'dannkoh/warp-1.0', 'trust_remote_code': True, 'max_model_len': 8000, 'max_num_batched_tokens': 8000, 'disable_log_stats': True, 'disable_custom_all_reduce': True}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 08-19 11:24:00 [__init__.py:711] Resolved architecture: Qwen2ForCausalLM
INFO 08-19 11:24:00 [__init__.py:1750] Using max model len 8000
INFO 08-19 11:24:00 [arg_utils.py:1083] Chunked prefill is not supported for ARM and POWER CPUs; disabling it for V1 backend.
INFO 08-19 11:24:06 [__init__.py:241] Automatically detected platform cpu.
(EngineCore_0 pid=47360) INFO 08-19 11:24:07 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=47360) INFO 08-19 11:24:07 [core.py:74] Initializing a V1 LLM engine (v0.10.2.dev7+g5f5664b3e) with config: model='dannkoh/warp-1.0', speculative_config=None, tokenizer='dannkoh/warp-1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:21<00:21, 21.35s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:36<00:00, 17.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:36<00:00, 18.00s/it]

(EngineCore_0 pid=47360) INFO 08-19 11:28:08 [default_loader.py:262] Loading weights took 36.06 seconds
(EngineCore_0 pid=47360) INFO 08-19 11:28:08 [kv_cache_utils.py:849] GPU KV cache size: 116,496 tokens
(EngineCore_0 pid=47360) INFO 08-19 11:28:08 [kv_cache_utils.py:853] Maximum concurrency for 8,000 tokens per request: 14.56x
(EngineCore_0 pid=47360) INFO 08-19 11:28:08 [cpu_model_runner.py:99] Warming up model for the compilation...
(EngineCore_0 pid=47360) INFO 08-19 11:29:19 [cpu_model_runner.py:103] Warming up done.
(EngineCore_0 pid=47360) INFO 08-19 11:29:19 [core.py:214] init engine (profile, create kv cache, warmup model) took 70.92 seconds
INFO 08-19 11:29:21 [llm.py:298] Supported_tasks: ['generate']


README.md: 0.00B [00:00, ?B/s]

dataset.parquet:   0%|          | 0.00/427k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/671 [00:00<?, ? examples/s]

Batches:   0%|          | 0/11 [00:00<?, ?it/s]

Adding requests:   0%|          | 0/64 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]



KeyboardInterrupt: 

ERROR 08-19 11:55:10 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
