Runs getdebug + gitleaks + trufflehog against a corpus of public repos, captures findings + scan times, computes cross-tool overlap, and (when labels are present) per-tool precision/recall. The reference implementation of the open methodology in METHODOLOGY.md.
Scope. CodeSecBench is code-side and scanner-side: we benchmark how well security scanners detect vulnerabilities in AI-app code (secret leaks, framework env-var exposure, prompt-construction patterns, unsafe tool-output handling). This is distinct from aisecbench.com, an editorial site covering model-side evaluation (jailbreak resistance, prompt-injection of models, safety benchmarks like AdvBench / HarmBench).
Status. v0.1 — real-world corpus secrets-scanning comparison. v0.2 — hand-crafted fixture corpus for client-side LLM key exposure (live). v0.3 — prompt injection + unsafe tool output with mock-LLM SDK stubs (decision records locked, fixtures pending).
Who runs this today. getdebug is the maintainer of CodeSecBench as of June 2026 and is one of the tools the benchmark grades. This is acknowledged explicitly, not hidden — the conflict-of-interest reducer is openness, not brand distance: methodology, corpus, and harness are MIT-licensed; anyone can re-run, anyone can dispute via PR. See GOVERNANCE.md for the graduation plan to a neutral GitHub org once external co-maintainers want a seat.
- Findings count per tool per repo. A higher count means a more aggressive scanner; that can reflect real coverage or noise, so read the per-finding breakdown.
- Wall-clock scan time. Mean and median across the corpus.
- Cross-tool overlap: for each repo, findings flagged by all three tools, by each pair, and unique to one tool. Same-finding heuristic: `<file>:<line>:<snippet>`. Imperfect (tools redact snippets differently) but a useful first cut.
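The same-finding heuristic can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `Finding` shape and the `findingKey` and `overlapBuckets` names are hypothetical, and it assumes each tool's output has already been normalized to a file, line, and snippet.

```typescript
// Hypothetical finding shape; the real harness types may differ.
interface Finding {
  tool: string;
  file: string;
  line: number;
  snippet: string;
}

// Same-finding key, following the <file>:<line>:<snippet> heuristic.
function findingKey(f: Finding): string {
  return `${f.file}:${f.line}:${f.snippet}`;
}

// Maps each overlap bucket (sorted tool names joined by "+") to the number
// of distinct findings flagged by exactly that set of tools.
function overlapBuckets(findings: Finding[]): Record<string, number> {
  const toolsByKey: Record<string, string[]> = {};
  for (const f of findings) {
    const key = findingKey(f);
    const tools = toolsByKey[key] ?? [];
    // A tool reporting the same key twice counts once.
    if (tools.indexOf(f.tool) === -1) tools.push(f.tool);
    toolsByKey[key] = tools;
  }
  const buckets: Record<string, number> = {};
  for (const key of Object.keys(toolsByKey)) {
    const bucket = toolsByKey[key].slice().sort().join("+");
    buckets[bucket] = (buckets[bucket] ?? 0) + 1;
  }
  return buckets;
}
```

A bucket named after a single tool holds that tool's unique findings; a two-name bucket is a pair overlap. Because the key includes the snippet, two tools that redact the same secret differently land in separate buckets, which is the imperfection noted above.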
- Node ≥ 18, pnpm.
- `gitleaks` on PATH: `brew install gitleaks` (tested with 8.30.1).
- `trufflehog` on PATH: `brew install trufflehog` (tested with 3.95.3).
- A `getdebug` binary on PATH, OR set `$GETDEBUG_BIN`. Defaults to `/tmp/getdebug-bench` (the build target the parent project produces).
# Install workspace deps
pnpm install --filter codesecbench...
cd bench
# Full run (~5-10 min depending on repo count + network)
pnpm scan
# Quick smoke run (3 repos only)
pnpm scan --limit 3
# Keep clones around (useful when iterating on a runner)
pnpm scan --keep-clones --workdir /tmp/bench-keep
# Re-render the report from the most recent JSON
pnpm report
# Render a specific run
pnpm report --in results/run-2026-05-31T22-58-12-345Z.json
# Run the fixture corpus (precision/recall per tool against
# code-level trueStates, plus the bundle-grep oracle on the
# committed framework artifacts)
pnpm fixtures
# Rebuild framework fixture artifacts from their _build/ harness
# and confirm the rebuild still demonstrates the fixture verdict.
# Hard-fails on verdict drift; the nightly CI workflow runs the
# same check.
pnpm fixtures:rebuild
# Same, with informational byte-level diff vs the committed artifact
pnpm fixtures:rebuild --byte-diff
# Interactively adjudicate a scan's findings into bench/labels/
# (Track B — fingerprint-only, never raw secrets — see
# bench/labels/README.md and [[bench-no-raw-live-secrets]])
pnpm label # latest run, all repos
pnpm label --repo stackitcloud/rag-template --reviewer fafa
# Compute per-tool precision/recall against the committed labels
pnpm score

bench/results/
run-<ISO_TIMESTAMP>.json # raw structured data, full findings
run-<ISO_TIMESTAMP>.md # human-readable summary
latest.json # symlink-by-content to the most recent
latest.md
results/ is gitignored; commit a chosen snapshot manually when it's worth keeping.
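The per-tool precision/recall figures use the standard definitions, sketched below. How the harness derives true-positive, false-positive, and false-negative counts from the committed labels is not shown here; the `Counts` and `Score` types and the `score` function are hypothetical names for illustration.

```typescript
// Hypothetical per-tool tallies after comparing findings against labels.
interface Counts {
  tp: number; // findings labeled as real
  fp: number; // findings labeled as not real
  fn: number; // labeled secrets the tool missed
}

interface Score {
  precision: number | null; // null when the tool reported nothing
  recall: number | null; // null when there is nothing to find
}

// Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN).
function score(c: Counts): Score {
  return {
    precision: c.tp + c.fp === 0 ? null : c.tp / (c.tp + c.fp),
    recall: c.tp + c.fn === 0 ? null : c.tp / (c.tp + c.fn),
  };
}
```

Returning `null` instead of `0` for an empty denominator keeps "the tool reported nothing" distinguishable from "everything the tool reported was wrong".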
Three classes:
- Leaky-Repo baseline: `Plazmaz/leaky-repo` is a synthetic repo seeded with 150 known planted secrets across major providers. Used as a recall sanity check (tools should find ≥120).
- Popular references: Vercel AI SDK starter, LangChain official chat template, MCP servers reference. Expected near-zero findings; useful as a precision check.
- AI-starter sample: 20 mid-popularity (10-500 star) less-curated AI templates pulled from GitHub search in the 2026-05-31 sweep. The realistic "what a real dev's repo looks like" baseline.
Add new repos by editing src/repos.ts — keep the category field set so reports can group correctly.
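As a rough sketch of why the `category` field matters, the snippet below groups corpus entries by category the way a report might. Only the `category` field name comes from the text above; the `BenchRepo` type, the `slug` field, the category values, and the `example-org/example-ai-template` entry are all assumptions for illustration and may not match `src/repos.ts`.

```typescript
// Hypothetical category values; check src/repos.ts for the real ones.
type RepoCategory = "leaky-baseline" | "popular-reference" | "ai-starter";

// Hypothetical entry shape; only `category` is named in the README.
interface BenchRepo {
  slug: string; // "owner/name" on GitHub (assumed field)
  category: RepoCategory;
}

const exampleRepos: BenchRepo[] = [
  { slug: "Plazmaz/leaky-repo", category: "leaky-baseline" },
  // Placeholder entry, not a real corpus repo.
  { slug: "example-org/example-ai-template", category: "ai-starter" },
];

// Groups repo slugs by category, as a report renderer might.
function groupByCategory(repos: BenchRepo[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const repo of repos) {
    const bucket = groups[repo.category] ?? [];
    bucket.push(repo.slug);
    groups[repo.category] = bucket;
  }
  return groups;
}
```

An entry without a category would have no group to land in, which is presumably why the field must stay set.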
- `getdebug` runs with `--quiet --json`. Default detector set (secrets pass only, no `--local-llm` SAST).
- `gitleaks` runs with `--no-git --no-banner --exit-code 0`. Filesystem mode, no git-history walk, doesn't exit non-zero on findings.
- `trufflehog` runs with `--no-verification --results=verified,unverified,unknown`. Without `--no-verification` it would only emit live-API-verified leaks, which is a different (and slower) comparison. The verification mode is more honest if your goal is "real leaks that work right now"; the unverified mode is fairer for "scanner agreement on shape matches."
The same flags are reflected in the rendered report, so the reader knows exactly which configurations are being compared.
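The flag sets above could be assembled into per-tool argument lists roughly like this. The flags are the ones listed; the subcommands (`analyze`, `detect`, `filesystem`), the `--source` option, and the position of the repo path are assumptions about how each CLI is invoked, and `scanArgs` is a hypothetical helper rather than the harness's actual runner.

```typescript
type Tool = "getdebug" | "gitleaks" | "trufflehog";

// Builds the argv (excluding the binary name) for scanning one cloned repo.
function scanArgs(tool: Tool, repoDir: string): string[] {
  switch (tool) {
    case "getdebug":
      // Secrets pass only: no --local-llm.
      return ["analyze", repoDir, "--quiet", "--json"];
    case "gitleaks":
      // Filesystem mode; exit code 0 even when findings exist.
      return ["detect", "--source", repoDir, "--no-git", "--no-banner", "--exit-code", "0"];
    case "trufflehog":
      // Emit shape matches without calling provider APIs to verify them.
      return [
        "filesystem",
        repoDir,
        "--no-verification",
        "--results=verified,unverified,unknown",
      ];
  }
}
```

Keeping the arguments in one pure function makes it straightforward to print the exact flags into the report alongside the results.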
- A SAST comparison. The default `getdebug analyze` runs the secrets pass only. SAST + AI-app detectors live behind `--local-llm` (CLI) or the hosted analyzer; comparing those against Semgrep/CodeQL is a different harness.
- A recall+precision ground-truth study. Leaky-Repo gives us recall on a known dataset, but we don't label the real-world AI-starter findings as TP/FP/FN automatically. A `bench/triage` workflow could add that; it is flagged as future work.
- A trufflehog defense. We use `--no-verification` to compare scanner agreement on shape matches; trufflehog's killer feature is the live verifier. If you only care about verified leaks, run it with default flags and ignore the unverified column.
MIT.