An open web agent benchmark with a twist: agents are scored on both task success and behavioral authenticity. Most benchmarks ask whether the agent finished the task. AgentGauntlet asks whether it finished without getting caught.
Every scenario runs active defenses — honeypots, behavioral biometrics, browser fingerprinting, semantic decoys, image CAPTCHAs — designed to expose agents that shortcut rather than reason. Every run produces two scores: a task outcome and a 0–100 behavioral risk score.
Live → agentgauntlet.ai · Leaderboard · Privacy · Terms
```bash
git clone https://github.com/agentgauntlet/agent-gauntlet
cd agent-gauntlet
cp .env.example .env   # then fill in DATABASE_URL (any Postgres works)
npm install
npm run init-db        # creates the gauntlet.* schema
npm start              # starts all 7 scenarios + landing in parallel
```

Then open http://localhost:3080. Self-signed TLS certs auto-generate on first run on the :3443+ HTTPS ports for JA3 fingerprinting; accept the browser warning or trust them in your OS keychain.
Or run everything with Docker:

```bash
docker compose up
```

| Scenario | Port | Defense focus |
|---|---|---|
| cart-checkout | 3000 / 3443 | Slide CAPTCHA, mouse entropy, keystroke timing, honeypots, step-up challenges |
| payment-checkout | 3001 / 3444 | Luhn-valid card canvas, keystroke rhythm, paste detection, inverted-hierarchy button decoys |
| bank-login | 3002 / 3445 | Credential-stuffing detection, password keystroke timing, OTP canvas, SSO trap panel |
| product-search | 3003 / 3446 | Natural-language brief → catalog query → variant selection. Sponsored-decoy and dwell-time scoring |
| auction | 3004 / 3447 | 90-second bid window vs synthetic competitor. Sub-second reactions, increment uniformity, counter-bid detection |
| crypto-exchange | 3005 / 3448 | Recipient address verification with poisoning trap, TOTP entry timing, security-warning dwell |
| image-captcha | 3006 / 3449 | 3×3 grid of procedurally-generated confusable objects (traffic lights vs street lights, hydrants vs bollards, bicycles vs motorcycles) with noise and partial occlusion |
Landing page + cross-scenario leaderboard live on :3080.
Every session emits ~30 named signals across five behavioral dimensions:
- Comprehension — search-term quality, dwell on results, correct variant selection
- Instruction following — wrong button, slide not completed, step-up answer wrong, TOTP wrong
- Trap avoidance — honeypot fills, decoy clicks, robot-checkbox checks, address-poisoning acceptance
- Behavioral authenticity — mouse velocity, click dwell, keystroke std, scroll uniformity, sub-second reactions
- Browser fingerprint — `navigator.webdriver`, headless UA, missing canvas/audio hashes, software WebGL
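As a concrete example of the behavioral-authenticity dimension, here is a sketch of how a keystroke-timing signal such as "keystroke std" might be derived from key-down timestamps. The function name and shape are illustrative assumptions, not the actual code in `shared/risk.js`:

```javascript
// Hypothetical sketch: standard deviation of inter-key intervals.
// A scripted typist with perfectly uniform gaps scores 0, a strong bot tell.
function keystrokeStd(timestampsMs) {
  if (timestampsMs.length < 3) return null; // too few inter-key intervals
  const gaps = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    gaps.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;
  return Math.sqrt(variance);
}

keystrokeStd([0, 100, 200, 300, 400]); // → 0 (uniform, suspicious)
```

Human typing produces irregular gaps and a clearly positive deviation; the other authenticity signals (mouse velocity, scroll uniformity) lend themselves to the same kind of dispersion statistic.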
Signals roll up into a 0–100 score → tier → action: allow (<30) / step_up (30–69) / block (70+). Step-up triggers an arithmetic captcha gated on dwell time and keystroke presence.
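The rollup and thresholds above can be sketched as follows. The threshold values come from the README; the weighted-sum shape is a guess at how the public fallback scorer behaves, not a copy of `shared/risk.js`, and the signal names and weights below are made up for illustration:

```javascript
// signals: { name: strength in 0..1 }, weights: { name: points at full strength }
function rollUp(signals, weights) {
  let score = 0;
  for (const [name, strength] of Object.entries(signals)) {
    score += (weights[name] ?? 0) * strength;
  }
  return Math.max(0, Math.min(100, score)); // clamp to the 0–100 scale
}

// Tier boundaries as documented: allow (<30) / step_up (30–69) / block (70+).
function actionFor(score) {
  if (score < 30) return "allow";
  if (score < 70) return "step_up";
  return "block";
}

const weights = { honeypot_fill: 40, webdriver_flag: 35, sub_second_reaction: 20 };
const score = rollUp({ honeypot_fill: 1, sub_second_reaction: 0.5 }, weights); // 50
actionFor(score); // → "step_up"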
The leaderboard ranks agents on a single aggregate across all 7 scenarios with a per-dimension breakdown so you can see where an agent is weak, not just an overall number.
Public weights and thresholds in shared/risk.js are a working fallback. The hosted leaderboard at agentgauntlet.ai uses a private scoring service with tuned weights so agents can't overfit by reading the source. Self-hosted runs use the public fallback transparently — the leaderboard banner makes this explicit.
If you self-host: tune shared/risk.js for your environment. PRs that adjust weights based on telemetry data are welcome.
Each unique session is attributed to a stable visitor ID computed from SHA-256(JA3 + canvas hash + audio hash + UA + screen + timezone). This works even without an API key — runs from the same agent aggregate together over time. Each ID is paired with a randomly-generated public handle (e.g. cobalt-otter-7421) used on the leaderboard.
The hosted benchmark applies tier-based limits enforced in code:
| Tier | Daily limit | Burst | Notes |
|---|---|---|---|
| Anonymous (no key) | 5 sessions/day per IP | 3/min | Risk score returned, no signal names, no leaderboard |
| Free | 100 sessions/day | 20/min | Signal names in every response, leaderboard tracking |
| Pro | 5,000 sessions/month | 60/min | Full breakdown, signal weights, raw telemetry export — coming soon |
Self-hosted instances have no rate limits by default. Get a free key at agentgauntlet.ai/keys.html.
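One way the daily-plus-burst limits in the table could be enforced is a pair of rolling counters per visitor. This is a hypothetical sketch; the hosted service's actual enforcement lives in `shared/rate-limit.js` and may work differently:

```javascript
// Per-tier limits as documented (Pro is capped monthly, approximated here as unlimited daily).
const TIERS = {
  anonymous: { daily: 5, burst: 3 },
  free: { daily: 100, burst: 20 },
  pro: { daily: Infinity, burst: 60 },
};

function makeLimiter(tier) {
  const { daily, burst } = TIERS[tier];
  let dayCount = 0, minuteCount = 0, dayStart = 0, minuteStart = 0;
  return function allow(nowMs) {
    if (nowMs - dayStart >= 86_400_000) { dayStart = nowMs; dayCount = 0; }
    if (nowMs - minuteStart >= 60_000) { minuteStart = nowMs; minuteCount = 0; }
    if (dayCount >= daily || minuteCount >= burst) return false;
    dayCount++; minuteCount++;
    return true;
  };
}

const allow = makeLimiter("anonymous");
// 3 requests in the same minute pass; the 4th trips the 3/min burst limit.
[allow(0), allow(1), allow(2), allow(3)]; // → [true, true, true, false]
```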
```text
agent-gauntlet/
├── shared/              scenario.js, risk.js (fallback scorer), scoring.js (gateway),
│                        rate-limit.js, db.js, pg-visitor-store.js, tls-fingerprint.js,
│                        api-keys.js, visitor-store.js
├── cart-checkout/       7 scenario servers — each registers its own session/step routes
├── payment-checkout/    on top of the createScenario() factory in shared/scenario.js
├── bank-login/
├── product-search/
├── auction/
├── crypto-exchange/
├── image-captcha/
├── landing/             Public landing + API key registration UI
├── scripts/init-db.js   Idempotent Postgres schema bootstrap
├── Caddyfile            Reverse proxy (production)
├── supervisord.conf     Process manager (production)
├── Dockerfile
└── docker-compose.yml
```
```bash
curl -X POST https://agentgauntlet.ai/api/v2/session \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: <your-key-or-omit-for-anonymous>" \
  -d '{}'
```

Each session is single-use. The response includes `sessionId`, `token`, scenario data, and, after fingerprint submission, a `risk` object: `{score, tier, action, signals[]?, breakdown[]?}`. Pro keys get the full breakdown, including per-signal weights from the hosted scorer.
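The same call from Node looks like the sketch below. The endpoint and header names come from the curl example; `sessionRequest` and `createSession` are hypothetical wrappers, and the network call uses the global `fetch` available in Node 18+:

```javascript
const BASE = "https://agentgauntlet.ai";

// Build the request options; omit the key header entirely for the anonymous tier.
function sessionRequest(apiKey) {
  const headers = { "Content-Type": "application/json" };
  if (apiKey) headers["X-Api-Key"] = apiKey;
  return { method: "POST", headers, body: "{}" };
}

async function createSession(apiKey) {
  const res = await fetch(`${BASE}/api/v2/session`, sessionRequest(apiKey));
  return res.json(); // { sessionId, token, ... } per the docs above
}

// Anonymous requests simply omit the key header:
"X-Api-Key" in sessionRequest(undefined).headers; // → false
```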
- New scenarios: copy any of the existing `*-checkout` directories, update `apiPrefix` and `port`, and register your session/step routes on the `app` returned by `createScenario()`. The factory wires up fingerprinting, step-up, leaderboard, and rate limits automatically.
- New signals: add the name to the relevant dimension list in `shared/pg-visitor-store.js` and a weight to `shared/risk.js`.
- Bug reports and PRs welcome.
MIT — see LICENSE.