It's 2am. Twenty-odd alerts just fired. A few are real — a database melting down and taking checkout with it, an auth cert that expired, a queue quietly saturating toward an outage. The rest are noise: a disk-usage alert that flaps every four minutes, two CPU blips that healed before you finished reading them, a duplicate of a ticket someone already owns, and a deploy that fixed itself. NoiseBench asks a frontier LLM to do what your on-call engineer does half-asleep: decide who to wake up — catching every real incident without drowning in the noise. Miss one real page and you've failed, no matter how clean the rest of your triage.
This is a benchmark for models, not products. Every model gets the same telemetry, the same tools, the same prompt. We measure the reasoning.
Modern observability stacks don't have a data problem — they have a paging problem. Alerts fire constantly. Most are noise: flaps, transients, deploy churn, duplicates of incidents already being worked. A few are real and need a human now. The skill that matters is triage: separating the one real incident from the pile of look-alikes — without missing the real one, and without waking someone for a blip.
NoiseBench gives a model a batch of fired pages plus the context a good engineer would pull —
recent metric values, clustered log patterns, deploy history, whether the alert auto-resolved,
how often it has fired this hour, and which incidents are already open — and asks it to label
each page page (wake a human) or suppress (noise).
For each page, the model decides page vs suppress. We score the positive class (page)
with precision / recall / F1, under one non-negotiable rule:
- You may not suppress a real incident. Each scenario contains at least one real, high-severity incident that must be paged. Suppress it and you score zero, no matter how clean the rest of your answer is. Missing a real SEV1 is the cardinal sin of on-call.
- Over-paging is penalized. Every piece of noise you page costs precision. Because a real incident should collapse to a single page, the bar is unforgiving: at full recall, one false page already drops you below threshold.
It rewards exactly one behavior: wake a human for the real thing, and nothing else.
NoiseBench uses the Terminal-Bench task format, run by the
Harbor harness with the default terminus-2 agent. We
ship only the tasks + datasets + scoring — the harness and the models are external. Every
model is dropped into an identical Docker sandbox with the telemetry under /workdir/ and
standard shell tools (jq, grep, cat, …), and must write its decisions to
/workdir/triage.json. There is no product in the loop; this is a pure reasoning eval.
EdgeDelta's query language is CQL (field equality like severity:"SEV1", boolean
AND/OR/negation, numeric comparisons like @value > 400; no regex, no mid-string
wildcards). The shipped data is plain JSON/CSV so the agent can reason with jq/grep as
though running CQL filters.
Requires Harbor (uv tool install harbor), Docker, and an
OpenRouter key (or your own model credentials).
git clone https://github.com/edgedelta/noise-bench.git
cd noise-bench
# put OPENROUTER_API_KEY=... in .env
source .env
# all scenarios, several models, 3 attempts each
uv run harbor run -c configs/all-models-docker.yaml
# quick single-model / single-scenario smoke test
uv run harbor run -c configs/smoke-docker.yamlSummarise a run into a markdown table:
uv run scripts/process_results.py jobs/<run-dir>Inspect agent trajectories with harbor view jobs.
Each scenario under datasets/noisebench/ is a Terminal-Bench task:
task.toml, instruction.md, environment/Dockerfile (+ the frozen telemetry in
environment/workdir/), solution/solve.sh (an oracle answer used to validate the grader),
and tests/ (the grader test_outputs.py + verifier-only ground_truth.json). See the
dataset README for the data schema and scoring details.
The first three scenarios use a synthetic microservices app. The remaining seven are
reconstructions of representative production incident classes — all service, host,
monitor, and identifier values are fictional stand-ins. They use realistic service names
(http-receiver, metric-ingestor-1, ai-agent-svc, …), monitor
names ([ignore] Default Log Threshold Monitor, Platform API HTTP 5xx Error, NodeNotReady Error - K8s Event, OnCall AI Workflow Errors, LLM 24 Hour Token Usage, …), and standard
Kubernetes event types (DisruptionBlocked, Unconsolidatable, NodeNotReady). See the
dataset README for the per-scenario notes.
| Scenario | Tier | Pages | The trap |
|---|---|---|---|
noisy-night-shift |
medium | 20 | A real DB cascade fires 4 correlated pages — collapse them to one. The rest is flaps, transients, a duplicate of an open incident, and a self-healed deploy. |
deploy-storm |
hard | 25 | Ten services deployed at once; almost all the churn self-heals. One deploy shipped a real regression that doesn't. Over-suppressing kills you. |
quiet-but-deadly |
medium | 12 | Mostly low-grade noise, plus a quiet slow-burn incident with no deploy to blame. Tests the "blame the deploy" and "ignore the quiet one" biases. |
disk-pressure-flapper-storm |
medium | 22 | The [ignore] Default Log Threshold Monitor flaps everywhere and disk warnings self-resolve on rotation. One node crosses into real DiskPressure eviction risk. |
escalation-loopback-noise |
medium | 16 | PagerDuty escalation-policy meta-noise (loop-back to the same responder, missed-ack reminders on transient staging CI). One genuine missed-ack on a live SEV1 Platform API 5xx. |
ci-e2e-test-noise |
hard | 24 | CircleCI / Playwright e2e failures wired into PagerDuty as incidents — test-env noise that shouldn't page prod. One e2e failure reflects a real web-app regression. |
warning-spike-transients |
medium | 14 | WARN-level spikes that self-heal in seconds (incl. the classic Workflow runMainLoop bursts). One is the leading edge of a real error cascade on http-receiver. |
ai-platform-alert-noise |
hard | 20 | LLM 24 Hour Token Usage Warns + Spending Cap budget alerts — cost noise, not outages. One real ai-agent-svc outage via OnCall AI Workflow Errors. |
queue-backlog-vs-blip |
hard | 20 | Transient queue-depth blips that drain on their own. One sustained, non-draining backlog on metric-ingest-queue-1 blocking the write path. |
node-event-noise |
medium | 18 | Normal Karpenter/PDB operational events (Pdb prevents pod evictions, SpotToSpotConsolidation disabled, Unconsolidatable, store validated). One real NodeNotReady drops capacity. |
Frozen run: 17 scenarios x 13 models x 3 attempts = 663 trials, Harbor terminus-2 over OpenRouter, 2026-06-30. Pass is the scenario grader's boolean verdict. Full per-trial results (outcome, cost, tokens, timing per model) + per-model/per-task rollups are committed under benchmark-results/.
| Model | Pass rate | easy | medium | hard |
|---|---|---|---|---|
| gpt-5.5 | 88% | 100% | 100% | 75% |
| claude-sonnet-4.6 | 88% | 100% | 96% | 79% |
| gpt-5.4 | 80% | 100% | 100% | 58% |
| kimi-k2.5 | 78% | 100% | 100% | 54% |
| gemini-3.5-flash | 76% | 100% | 100% | 50% |
| gemini-3.1-pro-preview | 75% | 100% | 100% | 46% |
| gpt-5.4-mini | 71% | 100% | 88% | 50% |
| kimi-k2-thinking | 69% | 100% | 92% | 42% |
| gemini-3.1-flash-lite | 63% | 100% | 96% | 25% |
| claude-opus-4.8 | 61% | 67% | 75% | 46% |
| claude-haiku-4.5 | 61% | 33% | 83% | 42% |
| gpt-oss-120b | 51% | 100% | 75% | 21% |
| gpt-oss-20b | 16% | 0% | 33% | 0% |
Each scenario is a frozen telemetry window built by fault injection:
- Run a microservices demo app under steady synthetic load.
- Inject one real fault tied to a specific git commit (the culprit) — a connection leak, a bad pool client, a slow upstream — and deploy it.
- Let it propagate; capture the pages that fire, the metrics, the clustered log patterns, the deploy log, and any already-open incidents.
- Freeze a small window of that telemetry.
- Inject realistic distractors: chronic flappers, sub-minute self-healing transients, downstream symptoms of the real incident (to test correlation/dedup), an innocent deploy near onset (to punish "blame the latest deploy"), and a duplicate of an already-open incident. Timestamps are kept internally consistent — onset always after the culprit deploy, with the innocent deploy placed near onset as bait.
- Emit
ground_truth.json: per-pagepage/suppresslabels plus themust_pagelist.
In v1 the root-cause label space is git commits (code changes); feature-flag changes
appear only as decoys. The pipeline and a skeleton generator live in
tools/generate_scenario.py.
uv run tools/generate_scenario.py weekend-cron-storm --noise 18 --difficulty hardThis emits a runnable scenario skeleton (one real incident + injected noise, all data files +
ground truth). Copy task.toml / instruction.md / environment/Dockerfile / tests/ /
solution/solve.sh from an existing scenario, then replace the synthetic data with a captured
window. Validate that your oracle (solution/solve.sh) passes your grader before publishing.
At Edge Delta we spend our days on the on-call-burden problem: turning a firehose of alerts into the handful that deserve a human. NoiseBench is our attempt to measure that reasoning honestly and in the open — for any model, with no product in the loop. EdgeDelta isn't on the leaderboard; the benchmark is neutral.
Apache-2.0. See LICENSE.