NoiseBench

Can AI tell a real incident from alert noise?

It's 2am. Twenty-odd alerts just fired. A few are real — a database melting down and taking checkout with it, an auth cert that expired, a queue quietly saturating toward an outage. The rest are noise: a disk-usage alert that flaps every four minutes, two CPU blips that healed before you finished reading them, a duplicate of a ticket someone already owns, and a deploy that fixed itself. NoiseBench asks a frontier LLM to do what your on-call engineer does half-asleep: decide who to wake up — catching every real incident without drowning in the noise. Miss one real page and you've failed, no matter how clean the rest of your triage.

This is a benchmark for models, not products. Every model gets the same telemetry, the same tools, the same prompt. We measure the reasoning.

The question

Modern observability stacks don't have a data problem — they have a paging problem. Alerts fire constantly. Most are noise: flaps, transients, deploy churn, duplicates of incidents already being worked. A few are real and need a human now. The skill that matters is triage: separating the one real incident from the pile of look-alikes — without missing the real one, and without waking someone for a blip.

NoiseBench gives a model a batch of fired pages plus the context a good engineer would pull (recent metric values, clustered log patterns, deploy history, whether the alert auto-resolved, how often it has fired this hour, and which incidents are already open) and asks it to label each page as `page` (wake a human) or `suppress` (noise).

What the benchmark measures

For each page, the model decides page vs suppress. We score the positive class (page) with precision / recall / F1, under one non-negotiable rule:

- You may not suppress a real incident. Each scenario contains at least one real, high-severity incident that must be paged. Suppress it and you score zero, no matter how clean the rest of your answer is. Missing a real SEV1 is the cardinal sin of on-call.
- Over-paging is penalized. Every piece of noise you page costs precision. Because a real incident should collapse to a single page, the bar is unforgiving: at full recall, one false page already drops you below threshold.

It rewards exactly one behavior: wake a human for the real thing, and nothing else.
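
The scoring rule can be sketched in a few lines of Python. This is an illustration only, not the shipped grader (that lives in each scenario's tests/test_outputs.py). The function name `score_triage`, the dict-of-labels input shape, and the page ids are all assumptions made for the example.

```python
def score_triage(decisions, truth, must_page):
    """Score page/suppress decisions on the positive class ("page").

    decisions, truth: dict mapping page id -> "page" or "suppress"
    must_page: set of page ids that must never be suppressed
    All names and shapes are hypothetical, not the real grader's API.
    """
    # Cardinal rule: suppressing any must-page incident zeroes the score.
    if any(decisions.get(pid) != "page" for pid in must_page):
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    paged = {pid for pid, label in decisions.items() if label == "page"}
    real = {pid for pid, label in truth.items() if label == "page"}

    tp = len(paged & real)  # real incidents that were paged
    precision = tp / len(paged) if paged else 0.0
    recall = tp / len(real) if real else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One real incident (p1) and two pieces of noise (p2, p3).
truth = {"p1": "page", "p2": "suppress", "p3": "suppress"}
clean = score_triage({"p1": "page", "p2": "suppress", "p3": "suppress"}, truth, {"p1"})
noisy = score_triage({"p1": "page", "p2": "page", "p3": "suppress"}, truth, {"p1"})
missed = score_triage({"p1": "suppress", "p2": "suppress", "p3": "suppress"}, truth, {"p1"})
print(clean["f1"], noisy["f1"], missed["f1"])
```

With a single real incident, paging one extra piece of noise halves precision to 0.5 and pulls F1 down to about 0.67 even at full recall, which is why a single false page is so costly. Suppressing the real incident scores zero regardless of everything else.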

How it works

NoiseBench uses the Terminal-Bench task format, run by the Harbor harness with the default terminus-2 agent. We ship only the tasks + datasets + scoring — the harness and the models are external. Every model is dropped into an identical Docker sandbox with the telemetry under /workdir/ and standard shell tools (jq, grep, cat, …), and must write its decisions to /workdir/triage.json. There is no product in the loop; this is a pure reasoning eval.

Edge Delta's query language is CQL (field equality like severity:"SEV1", boolean AND/OR/negation, numeric comparisons like @value > 400; no regex, no mid-string wildcards). The shipped data is plain JSON/CSV so the agent can reason with jq/grep as though running CQL filters.
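
To make the "plain JSON as though running CQL filters" idea concrete, here is a small Python sketch that applies the filter shapes described above to a few records. The records are invented, the mapping of `@value` to a `value` key is an assumption, and the negation syntax in the comments is only a rough rendering. The real schema is in the dataset README.

```python
import json

# Hypothetical page records; the real files under /workdir/ may differ.
raw = """
[
  {"id": "p1", "severity": "SEV1", "service": "checkout", "value": 950},
  {"id": "p2", "severity": "SEV3", "service": "checkout", "value": 410},
  {"id": "p3", "severity": "SEV1", "service": "auth", "value": 120}
]
"""
pages = json.loads(raw)

# severity:"SEV1"  (field equality)
sev1 = [p["id"] for p in pages if p["severity"] == "SEV1"]

# severity:"SEV1" AND @value > 400  (boolean AND plus numeric comparison)
sev1_high = [
    p["id"] for p in pages if p["severity"] == "SEV1" and p["value"] > 400
]

# Roughly: (not service:"checkout") OR @value > 900  (negation plus OR)
other = [
    p["id"] for p in pages if p["service"] != "checkout" or p["value"] > 900
]

print(sev1, sev1_high, other)
```

Each list comprehension mirrors one filter, so the same predicates translate directly to jq `select(...)` expressions inside the sandbox.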

Running it

Requires Harbor (uv tool install harbor), Docker, and an OpenRouter key (or your own model credentials).

git clone https://github.com/edgedelta/noise-bench.git
cd noise-bench

# put OPENROUTER_API_KEY=... in .env, then export it so child processes can see it
set -a; source .env; set +a

# all scenarios, several models, 3 attempts each
uv run harbor run -c configs/all-models-docker.yaml

# quick single-model / single-scenario smoke test
uv run harbor run -c configs/smoke-docker.yaml

Summarise a run into a markdown table:

uv run scripts/process_results.py jobs/<run-dir>

Inspect agent trajectories with harbor view jobs.

Task format

Each scenario under datasets/noisebench/ is a Terminal-Bench task: task.toml, instruction.md, environment/Dockerfile (+ the frozen telemetry in environment/workdir/), solution/solve.sh (an oracle answer used to validate the grader), and tests/ (the grader test_outputs.py + verifier-only ground_truth.json). See the dataset README for the data schema and scoring details.

Difficulty tiers

The first three scenarios use a synthetic microservices app. The remaining seven are reconstructions of representative production incident classes — all service, host, monitor, and identifier values are fictional stand-ins. They use realistic service names (http-receiver, metric-ingestor-1, ai-agent-svc, …), monitor names ([ignore] Default Log Threshold Monitor, Platform API HTTP 5xx Error, NodeNotReady Error - K8s Event, OnCall AI Workflow Errors, LLM 24 Hour Token Usage, …), and standard Kubernetes event types (DisruptionBlocked, Unconsolidatable, NodeNotReady). See the dataset README for the per-scenario notes.

| Scenario | Tier | Pages | The trap |
| --- | --- | --- | --- |
| `noisy-night-shift` | medium | 20 | A real DB cascade fires 4 correlated pages; collapse them to one. The rest is flaps, transients, a duplicate of an open incident, and a self-healed deploy. |
| `deploy-storm` | hard | 25 | Ten services deployed at once; almost all the churn self-heals. One deploy shipped a real regression that doesn't. Over-suppressing kills you. |
| `quiet-but-deadly` | medium | 12 | Mostly low-grade noise, plus a quiet slow-burn incident with no deploy to blame. Tests the "blame the deploy" and "ignore the quiet one" biases. |
| `disk-pressure-flapper-storm` | medium | 22 | The `[ignore] Default Log Threshold Monitor` flaps everywhere and disk warnings self-resolve on rotation. One node crosses into real `DiskPressure` eviction risk. |
| `escalation-loopback-noise` | medium | 16 | PagerDuty escalation-policy meta-noise (loop-back to the same responder, missed-ack reminders on transient staging CI). One genuine missed-ack on a live SEV1 `Platform API 5xx`. |
| `ci-e2e-test-noise` | hard | 24 | CircleCI / Playwright e2e failures wired into PagerDuty as incidents: test-env noise that shouldn't page prod. One e2e failure reflects a real `web-app` regression. |
| `warning-spike-transients` | medium | 14 | WARN-level spikes that self-heal in seconds (incl. the classic Workflow `runMainLoop` bursts). One is the leading edge of a real error cascade on `http-receiver`. |
| `ai-platform-alert-noise` | hard | 20 | `LLM 24 Hour Token Usage` Warns + `Spending Cap` budget alerts: cost noise, not outages. One real `ai-agent-svc` outage via `OnCall AI Workflow Errors`. |
| `queue-backlog-vs-blip` | hard | 20 | Transient queue-depth blips that drain on their own. One sustained, non-draining backlog on `metric-ingest-queue-1` blocking the write path. |
| `node-event-noise` | medium | 18 | Normal Karpenter/PDB operational events (`Pdb prevents pod evictions`, `SpotToSpotConsolidation disabled`, `Unconsolidatable`, `store validated`). One real `NodeNotReady` drops capacity. |

Leaderboard

Frozen run: 17 scenarios x 13 models x 3 attempts = 663 trials, Harbor terminus-2 over OpenRouter, 2026-06-30. Pass is the scenario grader's boolean verdict. Full per-trial results (outcome, cost, tokens, timing per model) + per-model/per-task rollups are committed under benchmark-results/.

| Model | Pass rate | easy | medium | hard |
| --- | --- | --- | --- | --- |
| gpt-5.5 | 88% | 100% | 100% | 75% |
| claude-sonnet-4.6 | 88% | 100% | 96% | 79% |
| gpt-5.4 | 80% | 100% | 100% | 58% |
| kimi-k2.5 | 78% | 100% | 100% | 54% |
| gemini-3.5-flash | 76% | 100% | 100% | 50% |
| gemini-3.1-pro-preview | 75% | 100% | 100% | 46% |
| gpt-5.4-mini | 71% | 100% | 88% | 50% |
| kimi-k2-thinking | 69% | 100% | 92% | 42% |
| gemini-3.1-flash-lite | 63% | 100% | 96% | 25% |
| claude-opus-4.8 | 61% | 67% | 75% | 46% |
| claude-haiku-4.5 | 61% | 33% | 83% | 42% |
| gpt-oss-120b | 51% | 100% | 75% | 21% |
| gpt-oss-20b | 16% | 0% | 33% | 0% |
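
The percentages are pass rates over trials, where each trial carries the grader's boolean verdict. As a rough illustration of that arithmetic, the sketch below rolls per-trial verdicts up by tier. The `pass_rates` helper and the trial tuples are invented for the example; the published rollups come from scripts/process_results.py, whose internals are not reproduced here.

```python
from collections import defaultdict


def pass_rates(trials):
    """trials: iterable of (tier, passed) pairs.

    Returns a dict of tier -> pass rate, plus an "overall" entry.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for tier, passed in trials:
        for key in (tier, "overall"):
            counts[key][0] += int(passed)
            counts[key][1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


# Hypothetical trials for one model: 3 attempts on each of three scenarios.
trials = [
    ("easy", True), ("easy", True), ("easy", True),
    ("medium", True), ("medium", True), ("medium", False),
    ("hard", True), ("hard", False), ("hard", False),
]
rates = pass_rates(trials)
print({tier: f"{rate:.0%}" for tier, rate in rates.items()})

# Trial count for the frozen run: scenarios x models x attempts.
total_trials = 17 * 13 * 3
print(total_trials)
```

Because each scenario is attempted three times, a single failed attempt on a one-scenario tier already shows up as 67%.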

How scenarios are generated

Each scenario is a frozen telemetry window built by fault injection:

1. Run a microservices demo app under steady synthetic load.
2. Inject one real fault tied to a specific git commit (the culprit), such as a connection leak, a bad pool client, or a slow upstream, and deploy it.
3. Let it propagate; capture the pages that fire, the metrics, the clustered log patterns, the deploy log, and any already-open incidents.
4. Freeze a small window of that telemetry.
5. Inject realistic distractors: chronic flappers, sub-minute self-healing transients, downstream symptoms of the real incident (to test correlation/dedup), an innocent deploy near onset (to punish "blame the latest deploy"), and a duplicate of an already-open incident. Timestamps are kept internally consistent: onset always comes after the culprit deploy, with the innocent deploy placed near onset as bait.
6. Emit ground_truth.json: per-page page/suppress labels plus the must_page list.
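
The timestamp invariant and the ground-truth step above can be sketched as follows. This is a minimal illustration, not the code in tools/generate_scenario.py: the function names, the page record shape, the `labels` key, and every field name except `must_page` are assumptions.

```python
import json
from datetime import datetime


def check_timeline(culprit_deploy_at, onset_at):
    """Enforce the invariant that incident onset follows the culprit deploy."""
    if onset_at <= culprit_deploy_at:
        raise ValueError("onset must come after the culprit deploy")


def build_ground_truth(pages):
    """pages: list of dicts with hypothetical keys 'id', 'is_real', 'must_page'."""
    labels = {p["id"]: "page" if p["is_real"] else "suppress" for p in pages}
    must_page = [p["id"] for p in pages if p["must_page"]]
    if not must_page:
        # Every scenario contains at least one incident that must be paged.
        raise ValueError("each scenario needs at least one must-page incident")
    return {"labels": labels, "must_page": must_page}


# Illustrative timestamps: culprit deploy at 01:50, onset at 02:00.
check_timeline(datetime(2026, 1, 1, 1, 50), datetime(2026, 1, 1, 2, 0))

pages = [
    {"id": "db-cascade", "is_real": True, "must_page": True},
    {"id": "disk-flapper", "is_real": False, "must_page": False},
    {"id": "cpu-blip", "is_real": False, "must_page": False},
]
ground_truth = build_ground_truth(pages)
print(json.dumps(ground_truth, indent=2))
```

Failing fast on a bad timeline or an empty must-page list keeps a malformed scenario from reaching the grader.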

In v1 the root-cause label space is git commits (code changes); feature-flag changes appear only as decoys. The pipeline and a skeleton generator live in tools/generate_scenario.py.

Building your own scenarios

uv run tools/generate_scenario.py weekend-cron-storm --noise 18 --difficulty hard

This emits a runnable scenario skeleton (one real incident + injected noise, all data files + ground truth). Copy task.toml / instruction.md / environment/Dockerfile / tests/ / solution/solve.sh from an existing scenario, then replace the synthetic data with a captured window. Validate that your oracle (solution/solve.sh) passes your grader before publishing.

Why we built this

At Edge Delta we spend our days on the on-call-burden problem: turning a firehose of alerts into the handful that deserve a human. NoiseBench is our attempt to measure that reasoning honestly and in the open, for any model, with no product in the loop. Edge Delta isn't on the leaderboard; the benchmark is neutral.

License

Apache-2.0. See LICENSE.
