v1.5.0 — Head-to-head benchmark comparison

@ankitlade12 released this 20 Apr 14:12
· 30 commits to main since this release
4da4b12

AgentArmor v1.5.0 ships the head-to-head benchmark comparison infrastructure: the project's first comparison of AgentArmor against established safety classifiers (LlamaGuard 3 and OpenAI Moderation) across six industry datasets, reported with per-sample verdicts and bootstrap confidence intervals.

See BENCHMARKS_HEAD_TO_HEAD.md for the results and RUNBOOK.md for operations.

Highlights

  • Head-to-head runner — sequential, resumable comparison with per-sample verdicts, bootstrap F1 / MCC / balanced-accuracy, adapter + config drift detection on resume, structured run.jsonl event log
  • Taxonomy applicability rubric with ensure_complete() CI gate — methodologically defensible (baseline, dataset) verdicts
  • BaselineChecker ABC migration: score(text) -> float contract with legacy auto-bridge and DeprecationWarning
  • Secret allow-list in config loader — rejects *_API_KEY / *_TOKEN / *_SECRET fields
  • JSON summary schema with additive-minor / major-bump semver enforcement
  • Deterministic markdown generator with byte-identical regeneration
  • Vendor-drift canary with 20 committed neutral samples + abort-on-delta pre-publish check
  • Operations runbook with 7 numbered procedures, including setup, key rotation, resume, publishing, rollback, and canary failure
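
To illustrate the secret check in the config loader, here is a minimal sketch of rejecting secret-looking field names. Only the three patterns come from these notes. The function names, the recursive walk over nested mappings, and the case-insensitive match are assumptions for illustration, not AgentArmor's actual loader.

```python
from fnmatch import fnmatchcase

# Patterns taken from the release notes; everything else here is hypothetical.
SECRET_PATTERNS = ("*_API_KEY", "*_TOKEN", "*_SECRET")


def find_secret_fields(config, prefix=""):
    """Return dotted paths of keys whose names look like secrets."""
    hits = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        # Assumption: compare upper-cased names so `openai_api_key` is caught too.
        if any(fnmatchcase(str(key).upper(), p) for p in SECRET_PATTERNS):
            hits.append(path)
        if isinstance(value, dict):
            hits.extend(find_secret_fields(value, prefix=f"{path}."))
    return hits


def validate_config(config):
    """Raise ValueError if the config carries any secret-looking field."""
    hits = find_secret_fields(config)
    if hits:
        raise ValueError(f"secret-like fields not allowed in config: {hits}")
    return config
```

Calling `validate_config({"baselines": {"openai": {"OPENAI_API_KEY": "..."}}})` raises `ValueError` and names the offending path, so credentials have to come from the environment rather than a committed file.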

Policy

  • No paper-number fallback — a failing baseline yields a blank cell, never a prior-paper-cited number
  • raw_response is null in committed per-sample JSONL; --keep-raw-responses writes raw responses only to gitignored files
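
Both policies can be sketched in a few lines. The helper names (`scrub_record`, `to_jsonl`, `format_cell`) and the record fields other than `raw_response` are hypothetical. The sketch only shows the idea of nulling the raw vendor response before a record is committed and leaving a table cell blank when a baseline produced no metric.

```python
import json


def scrub_record(record, keep_raw=False):
    """Return a copy of a per-sample record that is safe to commit."""
    cleaned = dict(record)
    if not keep_raw:
        cleaned["raw_response"] = None  # serialized as JSON null
    return cleaned


def to_jsonl(records, keep_raw=False):
    """Serialize records as one JSON object per line with stable key order."""
    return "".join(
        json.dumps(scrub_record(r, keep_raw), sort_keys=True) + "\n"
        for r in records
    )


def format_cell(metric):
    """Render a metric for the results table; a failed baseline stays blank."""
    return "" if metric is None else f"{metric:.3f}"
```

With `keep_raw=True` the same records could be written to a gitignored path instead, which is roughly what the `--keep-raw-responses` flag is described as doing.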

Pinned

  • numpy>=1.26,<2.0 for bootstrap determinism
  • PyYAML>=6.0,<7.0 for config loader
  • Optional head_to_head_llamaguard extra pulls llama-cpp-python>=0.2.0,<0.4.0
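
The determinism that the numpy pin protects comes down to seeded resampling: the same seed and the same generator implementation must give the same confidence interval. A minimal sketch of a percentile bootstrap for F1 follows. It deliberately uses the standard library's `random.Random` instead of numpy to stay dependency-free, and the function names, defaults, and nearest-rank percentile rule are assumptions rather than the project's implementation.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1, reproducible for a fixed seed."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample sample indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * (n_resamples - 1))]
    hi = scores[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi
```

Running `bootstrap_ci` twice with the same seed returns identical bounds. Pinning the library that supplies the random stream is what keeps that property stable across installs.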

Also shipped in the 1.2.0 → 1.5.0 gap (previously unreleased)

  • Explain Mode v2 (1.4.0) — structured trace recording; agentarmor.last_trace() shows which shields ran, what each decided, and why
  • Semantic Drift Detector (1.3.0) — embedding-based multi-turn conversation trajectory tracker
  • Pricing API (1.3.0) — register_pricing() for custom model entries; added o3, o4-mini, claude-opus-4-6, claude-sonnet-4-6, gemini-2.5-pro, gemini-2.5-flash
  • Strict mode + demo_attacks() (1.3.x) — catches typo'd kwargs at init() time; runs ~21 synthetic attacks through your active config
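
As a rough illustration of what catching typo'd kwargs can look like, the sketch below validates keyword names against a known set and suggests the closest match. The option names and the `check_kwargs` helper are hypothetical and do not reflect AgentArmor's actual `init()` signature.

```python
import difflib

# Hypothetical option names, for illustration only.
KNOWN_OPTIONS = {"budget", "shields", "strict", "record_trace"}


def check_kwargs(kwargs, known=KNOWN_OPTIONS):
    """Raise TypeError on unknown keyword names, suggesting a close match."""
    for name in kwargs:
        if name not in known:
            close = difflib.get_close_matches(name, sorted(known), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise TypeError(f"unknown option {name!r}{hint}")
    return dict(kwargs)
```

Failing at initialization time, rather than silently ignoring an unknown name, means a misspelled safety setting cannot go unnoticed until production.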

Full changelog: CHANGELOG.md