WritePolicyBench is a deterministic, streaming benchmark for memory write policies: deciding what to WRITE / MERGE / EXPIRE / SKIP under strict byte budgets.
The goal is to isolate “the write problem” (what enters memory, how it is updated, and what gets evicted) as a first-class evaluation target, with byte-accurate accounting and reproducible grading.
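To make the four actions and the byte budget concrete, here is a minimal sketch of a byte-budgeted store. It is not the benchmark's actual memory interface: `Action`, `ByteBudgetMemory`, and `apply` are hypothetical names chosen for illustration. The sketch also assumes that MERGE means string concatenation and that cost is the UTF-8 length of key plus value.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # Hypothetical action names mirroring WRITE / MERGE / EXPIRE / SKIP.
    WRITE = "write"
    MERGE = "merge"
    EXPIRE = "expire"
    SKIP = "skip"


@dataclass
class ByteBudgetMemory:
    """Illustrative key-value memory that refuses operations exceeding a byte budget."""

    budget_bytes: int
    items: dict = field(default_factory=dict)

    @staticmethod
    def _cost(key: str, value: str) -> int:
        # Assumed cost model: UTF-8 bytes of the key plus UTF-8 bytes of the value.
        return len(key.encode("utf-8")) + len(value.encode("utf-8"))

    def used_bytes(self) -> int:
        return sum(self._cost(k, v) for k, v in self.items.items())

    def apply(self, action: Action, key: str, value: str = "") -> bool:
        """Apply one action; return False and leave memory unchanged if over budget."""
        if action is Action.SKIP:
            return True
        if action is Action.EXPIRE:
            self.items.pop(key, None)
            return True
        if action is Action.MERGE:
            # Assumption: merging appends to whatever is already stored under the key.
            new_value = self.items.get(key, "") + value
        else:
            # WRITE replaces any existing value.
            new_value = value
        old_cost = self._cost(key, self.items[key]) if key in self.items else 0
        new_cost = self._cost(key, new_value)
        if self.used_bytes() - old_cost + new_cost > self.budget_bytes:
            return False
        self.items[key] = new_value
        return True
```

With a 16-byte budget, writing `"a" -> "hello"` and then merging `" world"` both fit (12 bytes used), while a further 6-byte write is rejected until something is expired.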
- writepolicybench/: benchmark implementation (episode schema, memory interface, evaluator)
- data/episodes/: frozen episode sets used for reproducible comparisons (see MANIFEST.json)
- docs/: benchmark specification and runbook
- scripts/: utilities (freeze episodes; sanity checks)
- tests/: unit tests for metric/budget invariants
Create a virtualenv and install deps:
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

Run the evaluator (writes artifacts/results.csv):
python3 -m writepolicybench.evaluator

(Optional) Run sanity checks:
python3 -m scripts.sanity_checks

(Optional) Freeze episodes (recommended for reproducible comparisons):
python3 -m scripts.freeze_episodes

The benchmark supports two evaluation tracks:
- Unprivileged: policies see only the observation stream (and explicitly allowed benign metadata).
- Privileged: policies additionally observe a bounded priority signal p_t ∈ [0, 1].
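The difference between the two tracks can be sketched as a masking step applied before a policy sees each input. This is an illustration only: `Step` and `view_for_track` are hypothetical names, not part of the benchmark's API, and the sketch assumes the priority signal arrives as an optional float alongside the observation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical per-step input handed to a policy."""

    observation: str
    priority: Optional[float] = None  # p_t; None when the signal is hidden


def view_for_track(step: Step, privileged: bool) -> Step:
    """Return the step as a policy on the given track would see it."""
    if not privileged:
        # Unprivileged track: hide the priority signal entirely.
        return replace(step, priority=None)
    # Privileged track: the signal must be present and bounded in [0, 1].
    if step.priority is None or not 0.0 <= step.priority <= 1.0:
        raise ValueError("privileged track requires p_t in [0, 1]")
    return step
```

Keeping the mask in one place means the same policy code can be run on both tracks, with only the visible fields changing.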
License: MIT (see LICENSE).