An open web agent benchmark with a twist: agents are scored on both task success and behavioral authenticity. Most benchmarks ask whether the agent finished the task. AgentGauntlet asks whether it finished without getting caught.
Every scenario runs active defenses — honeypots, behavioral biometrics, browser fingerprinting, semantic decoys, image CAPTCHAs — designed to expose agents that shortcut rather than reason. Every run produces two scores: a task outcome and a 0–100 behavioral risk score.
Live → agentgauntlet.ai · Leaderboard · Privacy · Terms
```bash
git clone https://github.com/agentgauntlet/agent-gauntlet
cd agent-gauntlet
cp .env.example .env   # then fill in DATABASE_URL (any Postgres works)
npm install
npm run init-db        # creates the gauntlet.* schema
npm start              # starts all 7 scenarios + landing in parallel
```

Then open http://localhost:3080. Self-signed TLS certs auto-generate on first run on the :3443+ HTTPS ports for JA3 fingerprinting; accept the browser warning or trust them in your OS keychain.
Or run everything with Docker:

```bash
docker compose up
```

| Scenario | Port | Defense focus |
|---|---|---|
| cart-checkout | 3000 / 3443 | Slide CAPTCHA, mouse entropy, keystroke timing, honeypots, step-up challenges |
| payment-checkout | 3001 / 3444 | Luhn-valid card canvas, keystroke rhythm, paste detection, inverted-hierarchy button decoys |
| bank-login | 3002 / 3445 | Credential-stuffing detection, password keystroke timing, OTP canvas, SSO trap panel |
| product-search | 3003 / 3446 | Natural-language brief → catalog query → variant selection. Sponsored-decoy and dwell-time scoring |
| auction | 3004 / 3447 | 90-second bid window vs synthetic competitor. Sub-second reactions, increment uniformity, counter-bid detection |
| crypto-exchange | 3005 / 3448 | Recipient address verification with poisoning trap, TOTP entry timing, security-warning dwell |
| image-captcha | 3006 / 3449 | 3×3 grid of procedurally-generated confusable objects (traffic lights vs street lights, hydrants vs bollards, bicycles vs motorcycles) with noise and partial occlusion |
Landing page + cross-scenario leaderboard live on :3080.
Every session emits ~30 named signals across five behavioral dimensions:
- Comprehension — search-term quality, dwell on results, correct variant selection
- Instruction following — wrong button, slide not completed, step-up answer wrong, TOTP wrong
- Trap avoidance — honeypot fills, decoy clicks, robot-checkbox checks, address-poisoning acceptance
- Behavioral authenticity — mouse velocity, click dwell, keystroke std, scroll uniformity, sub-second reactions
- Browser fingerprint — `navigator.webdriver`, headless UA, missing canvas/audio hashes, software WebGL
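As a concrete example of the behavioral-authenticity dimension, here is a sketch of how a keystroke-timing signal such as "keystroke std" might be derived from key-down timestamps. The function name and shape are illustrative assumptions, not the actual code in `shared/risk.js`:

```javascript
// Hypothetical sketch: standard deviation of inter-key intervals.
// A scripted typist with perfectly uniform gaps scores 0, a strong bot tell.
function keystrokeStd(timestampsMs) {
  if (timestampsMs.length < 3) return null; // too few inter-key intervals
  const gaps = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    gaps.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;
  return Math.sqrt(variance);
}

keystrokeStd([0, 100, 200, 300, 400]); // → 0 (uniform, suspicious)
```

Human typing produces irregular gaps and a clearly positive deviation; the other authenticity signals (mouse velocity, scroll uniformity) lend themselves to the same kind of dispersion statistic.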
Signals roll up into a 0–100 score → tier → action: allow (<30) / step_up (30–69) / block (70+). Step-up triggers an arithmetic captcha gated on dwell time and keystroke presence.
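The rollup and thresholds above can be sketched as follows. The threshold values come from the README; the weighted-sum shape is a guess at how the public fallback scorer behaves, not a copy of `shared/risk.js`, and the signal names and weights below are made up for illustration:

```javascript
// signals: { name: strength in 0..1 }, weights: { name: points at full strength }
function rollUp(signals, weights) {
  let score = 0;
  for (const [name, strength] of Object.entries(signals)) {
    score += (weights[name] ?? 0) * strength;
  }
  return Math.max(0, Math.min(100, score)); // clamp to the 0–100 scale
}

// Tier boundaries as documented: allow (<30) / step_up (30–69) / block (70+).
function actionFor(score) {
  if (score < 30) return "allow";
  if (score < 70) return "step_up";
  return "block";
}

const weights = { honeypot_fill: 40, webdriver_flag: 35, sub_second_reaction: 20 };
const score = rollUp({ honeypot_fill: 1, sub_second_reaction: 0.5 }, weights); // 50
actionFor(score); // → "step_up"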
The leaderboard ranks agents on a single aggregate across all 7 scenarios with a per-dimension breakdown so you can see where an agent is weak, not just an overall number.
Public weights and thresholds in shared/risk.js are a working fallback. The hosted leaderboard at agentgauntlet.ai uses a private scoring service with tuned weights so agents can't overfit by reading the source. Self-hosted runs use the public fallback transparently — the leaderboard banner makes this explicit.
If you self-host: tune shared/risk.js for your environment. PRs that adjust weights based on telemetry data are welcome.
Each unique session is attributed to a stable visitor ID computed from SHA-256(JA3 + canvas hash + audio hash + UA + screen + timezone). This works even without an API key — runs from the same agent aggregate together over time. Each ID is paired with a randomly-generated public handle (e.g. cobalt-otter-7421) used on the leaderboard.
The hosted benchmark applies tier-based limits enforced in code:
| Tier | Daily limit | Burst | Notes |
|---|---|---|---|
| Anonymous (no key) | 5 sessions/day per IP | 3/min | Risk score returned, no signal names, no leaderboard |
| Free | 100 sessions/day | 20/min | Signal names in every response, leaderboard tracking |
| Pro | 5,000 sessions/month | 60/min | Full breakdown, signal weights, raw telemetry export — coming soon |
Self-hosted instances have no rate limits by default. Get a free key at agentgauntlet.ai/keys.html.
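One way the daily-plus-burst limits in the table could be enforced is a pair of rolling counters per visitor. This is a hypothetical sketch; the hosted service's actual enforcement lives in `shared/rate-limit.js` and may work differently:

```javascript
// Per-tier limits as documented (Pro is capped monthly, approximated here as unlimited daily).
const TIERS = {
  anonymous: { daily: 5, burst: 3 },
  free: { daily: 100, burst: 20 },
  pro: { daily: Infinity, burst: 60 },
};

function makeLimiter(tier) {
  const { daily, burst } = TIERS[tier];
  let dayCount = 0, minuteCount = 0, dayStart = 0, minuteStart = 0;
  return function allow(nowMs) {
    if (nowMs - dayStart >= 86_400_000) { dayStart = nowMs; dayCount = 0; }
    if (nowMs - minuteStart >= 60_000) { minuteStart = nowMs; minuteCount = 0; }
    if (dayCount >= daily || minuteCount >= burst) return false;
    dayCount++; minuteCount++;
    return true;
  };
}

const allow = makeLimiter("anonymous");
// 3 requests in the same minute pass; the 4th trips the 3/min burst limit.
[allow(0), allow(1), allow(2), allow(3)]; // → [true, true, true, false]
```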
```text
agent-gauntlet/
├── shared/              scenario.js, risk.js (fallback scorer), scoring.js (gateway),
│                        rate-limit.js, db.js, pg-visitor-store.js, tls-fingerprint.js,
│                        api-keys.js, visitor-store.js
├── cart-checkout/       7 scenario servers — each registers its own session/step routes
├── payment-checkout/    on top of the createScenario() factory in shared/scenario.js
├── bank-login/
├── product-search/
├── auction/
├── crypto-exchange/
├── image-captcha/
├── landing/             Public landing + API key registration UI
├── scripts/init-db.js   Idempotent Postgres schema bootstrap
├── Caddyfile            Reverse proxy (production)
├── supervisord.conf     Process manager (production)
├── Dockerfile
└── docker-compose.yml
```
```bash
curl -X POST https://agentgauntlet.ai/api/v2/session \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: <your-key-or-omit-for-anonymous>" \
  -d '{}'
```

Each session is single-use. The response includes `sessionId`, `token`, scenario data, and, after fingerprint submission, a `risk` object: `{score, tier, action, signals[]?, breakdown[]?}`. Pro keys get the full breakdown, including per-signal weights from the hosted scorer.
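The same call from Node looks like the sketch below. The endpoint and header names come from the curl example; `sessionRequest` and `createSession` are hypothetical wrappers, and the network call uses the global `fetch` available in Node 18+:

```javascript
const BASE = "https://agentgauntlet.ai";

// Build the request options; omit the key header entirely for the anonymous tier.
function sessionRequest(apiKey) {
  const headers = { "Content-Type": "application/json" };
  if (apiKey) headers["X-Api-Key"] = apiKey;
  return { method: "POST", headers, body: "{}" };
}

async function createSession(apiKey) {
  const res = await fetch(`${BASE}/api/v2/session`, sessionRequest(apiKey));
  return res.json(); // { sessionId, token, ... } per the docs above
}

// Anonymous requests simply omit the key header:
"X-Api-Key" in sessionRequest(undefined).headers; // → false
```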
- New scenarios: copy any of the existing `*-checkout` directories, update `apiPrefix` and `port`, and register your session/step routes on the `app` returned by `createScenario()`. The factory wires up fingerprinting, step-up, leaderboard, and rate limits automatically.
- New signals: add the name to the relevant dimension list in `shared/pg-visitor-store.js` and a weight to `shared/risk.js`.
- Bug reports and PRs welcome.
MIT — see LICENSE.