wmh makes it easy to go from agent traces to a faithful replica of the production environment your agents run in.
In short, an LLM plays the part of a virtual machine executing instructions, but it is 5x faster than a real sandbox.
Just:
git clone https://github.com/experientiallabs/world-model-harness
cd world-model-harness
uv sync
uv run wmh build

The build command opens a wizard that walks you through creating your own world model from your traces.
Below is a comparison running 8 SWE-bench tasks: real sandboxes on the left, a world model acting as the sandbox on the right.
A frontier LLM acts as the environment your agent steps against, reconstructed from your own OpenTelemetry traces. Inspired by Qwen-AgentWorld (LLM-as-environment), GEPA (reflective prompt evolution), and DreamGym (retrieval over a trace replay buffer) — but with zero training: we get there with prompt optimization on a frontier model.
- Build from your OTel traces: ingest → normalize → split train/held-out → index a replay buffer → evolve the env prompt with GEPA against the held-out split.
- Serve: agents call WorldModel.step(action) (in-process or via the local HTTP backend). Each step retrieves the most similar past (state, action) → observation examples and predicts the next observation.
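
The retrieval half of the Serve step can be pictured with a small, self-contained sketch. This is not wmh's implementation: `Transition`, `similarity`, `retrieve`, and `build_prompt` are hypothetical names, and plain token overlap (Jaccard) stands in for whatever similarity measure the replay-buffer index actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One recorded (state, action) -> observation step from a trace (hypothetical type)."""
    state: str
    action: str
    observation: str


def _tokens(text: str) -> set[str]:
    # Crude lowercase whitespace tokenization; an assumption made for brevity.
    return set(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def retrieve(buffer: list[Transition], state: str, action: str, k: int = 2) -> list[Transition]:
    """Return the k transitions whose (state, action) best matches the query."""
    query = f"{state} {action}"
    ranked = sorted(
        buffer,
        key=lambda t: similarity(query, f"{t.state} {t.action}"),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(examples: list[Transition], state: str, action: str) -> str:
    """Format retrieved examples as few-shot context, ending at the observation to predict."""
    shots = "\n".join(
        f"state: {t.state}\naction: {t.action}\nobservation: {t.observation}"
        for t in examples
    )
    return f"{shots}\nstate: {state}\naction: {action}\nobservation:"


# A toy replay buffer; the contents are made up for illustration.
BUFFER = [
    Transition("cart is empty", "add_to_cart sku=A1", "cart now holds A1"),
    Transition("cart holds A1", "checkout", "order placed"),
    Transition("repo cloned", "run pytest", "3 passed"),
]

examples = retrieve(BUFFER, "cart is empty", "add_to_cart sku=B2", k=2)
prompt = build_prompt(examples, "cart is empty", "add_to_cart sku=B2")
```

The resulting `prompt` would then be sent to the frontier model, which fills in the next observation. A real index would more plausibly rank by embeddings than by token overlap.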
uv run wmh examples list # swe-bench, tau-bench, terminal-tasks
uv run wmh eval list # eval suites shipped with the examples
uv run wmh eval run tau-bench # replay + score reconstruction fidelity
uv run wmh play # step into the environment yourself
uv run wmh serve # local HTTP backend on :8000

Example-local prebuilt models live under examples/<task>/models/; pass --root examples/<task> to wmh list, wmh demo, wmh play, or wmh serve to use one without rebuilding.
from wmh import Action, ActionKind
from wmh.config.store import WorldModelStore
from wmh.engine.loader import load_world_model

# Look up the built model named "airline" in the project's .wmh store.
model_dir = WorldModelStore(".wmh").resolve("airline")
wm, _provider = load_world_model(model_dir)

# Open a session for a task, then step it with a tool call.
session = wm.new_session(task="check out the cart")
obs = wm.step(
    session.id,
    Action(kind=ActionKind.TOOL_CALL, name="add_to_cart", arguments={"sku": "A1"}),
)
print(obs.content)  # the predicted observation

Or over HTTP (same code path), with routes namespaced by model name: GET /world_models, then POST /world_models/{name}/sessions and POST /world_models/{name}/sessions/{id}/step.
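
As a rough illustration of the HTTP routes, the sketch below builds the three requests with the standard library without sending them. The routes and port come from this README. The host (`localhost`), the JSON field names, and the serialized kind `"tool_call"` are assumptions that mirror the Python API above, so check them against the server's actual schema before relying on them.

```python
import json
from urllib.request import Request

# Port 8000 is the `wmh serve` default stated in this README; localhost is an assumption.
BASE_URL = "http://localhost:8000"
JSON_HEADERS = {"Content-Type": "application/json"}


def list_models_request() -> Request:
    """GET /world_models"""
    return Request(f"{BASE_URL}/world_models", method="GET")


def new_session_request(model: str, task: str) -> Request:
    """POST /world_models/{name}/sessions"""
    # Assumption: the body carries the task, mirroring wm.new_session(task=...).
    body = json.dumps({"task": task}).encode()
    return Request(
        f"{BASE_URL}/world_models/{model}/sessions",
        data=body,
        headers=JSON_HEADERS,
        method="POST",
    )


def step_request(model: str, session_id: str, name: str, arguments: dict) -> Request:
    """POST /world_models/{name}/sessions/{id}/step"""
    # Assumption: field names mirror Action(kind=..., name=..., arguments=...).
    body = json.dumps({"kind": "tool_call", "name": name, "arguments": arguments}).encode()
    return Request(
        f"{BASE_URL}/world_models/{model}/sessions/{session_id}/step",
        data=body,
        headers=JSON_HEADERS,
        method="POST",
    )
```

With the server running, each request can be sent with `urllib.request.urlopen(req)`. Building the requests separately from sending them keeps the route layout easy to inspect and test offline.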
One interface, four backends, verified on startup. Credentials are read from the environment:
| Provider | Model | Env vars |
|---|---|---|
| Anthropic | Claude Opus | ANTHROPIC_API_KEY |
| AWS Bedrock | Claude Opus | AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| Azure OpenAI | GPT | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT |
| OpenAI | GPT | OPENAI_API_KEY |
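
To see which backends the current environment could satisfy, a small check over the table above might look like the sketch below. The provider keys (`"anthropic"`, `"bedrock"`, and so on) and both function names are hypothetical labels, not wmh identifiers, and the check only tests that the variables are set; it does not reproduce the startup verification wmh performs.

```python
import os
from collections.abc import Mapping

# Env vars per provider, copied from the table above; the dict keys are hypothetical labels.
REQUIRED_ENV = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "bedrock": ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "azure_openai": ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"),
    "openai": ("OPENAI_API_KEY",),
}


def missing_vars(provider: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for one provider."""
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


def available_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose required variables are all present."""
    return [provider for provider in REQUIRED_ENV if not missing_vars(provider, env)]
```

Calling `available_providers()` with no arguments inspects the real process environment; passing a plain dict makes the check easy to test.
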
Managed with uv; linting/formatting with ruff; type checking with ty. Conventions live in AGENTS.md.
uv sync --extra dev # env + dev tools
uv run ruff check . # lint
uv run ruff format . # format
uv run ty check # type check
uv run pytest -q # tests

wmh collects anonymous usage telemetry to measure how much it is used.
Telemetry is strictly metadata. It never includes prompts, traces, actions, observations, file paths,
model names, provider credentials, or raw user content.
Telemetry is enabled by default. To opt out for a project:
uv run wmh config telemetry disable

This writes .wmh/settings.toml. You can re-enable it with uv run wmh config telemetry enable,
check the current setting with uv run wmh config telemetry status, or disable it for a single process
with DO_NOT_TRACK=1 or WMH_TELEMETRY=0.
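
The opt-out rules above combine roughly as in this sketch. `telemetry_enabled` is a hypothetical helper, not part of wmh, and `project_enabled` stands in for whatever `.wmh/settings.toml` records, since its file format is not described here.

```python
import os
from collections.abc import Mapping


def telemetry_enabled(project_enabled: bool = True, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if neither the environment nor the project setting opts out."""
    # Per-process opt-outs named in this README.
    if env.get("DO_NOT_TRACK") == "1" or env.get("WMH_TELEMETRY") == "0":
        return False
    # Assumption: the project-level setting is a simple on/off flag, on by default.
    return project_enabled
```

Under these assumptions, either opt-out signal is enough to turn telemetry off.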
