getdebug-ai/codesecbench

CodeSecBench — AI-app code security benchmark (reference harness)

Runs getdebug + gitleaks + trufflehog against a corpus of public repos, captures findings + scan times, computes cross-tool overlap, and (when labels are present) per-tool precision/recall. The reference implementation of the open methodology in METHODOLOGY.md.

Scope. CodeSecBench is code-side and scanner-side: we benchmark how well security scanners detect vulnerabilities in AI-app code (secret leaks, framework env-var exposure, prompt-construction patterns, unsafe tool-output handling). This is distinct from aisecbench.com, an editorial site covering model-side evaluation (jailbreak resistance, prompt-injection of models, safety benchmarks like AdvBench / HarmBench).

Status. v0.1 — real-world corpus secrets-scanning comparison. v0.2 — hand-crafted fixture corpus for client-side LLM key exposure (live). v0.3 — prompt injection + unsafe tool output with mock-LLM SDK stubs (decision records locked, fixtures pending).

Who runs this today. getdebug is the maintainer of CodeSecBench as of June 2026 and is one of the tools the benchmark grades. This is acknowledged explicitly, not hidden — the conflict-of-interest reducer is openness, not brand distance: methodology, corpus, and harness are MIT-licensed; anyone can re-run, anyone can dispute via PR. See GOVERNANCE.md for the graduation plan to a neutral GitHub org once external co-maintainers want a seat.

What it measures

  • Findings count per tool per repo. Higher = more aggressive (could be better recall OR more noise; read the per-finding breakdown).
  • Scan time wall-clock. Mean and median across the corpus.
  • Cross-tool overlap. For each repo: findings flagged by all three tools, by each pair, and unique-to-one-tool. Same-finding heuristic: <file>:<line>:<snippet>. Imperfect (tools redact snippets differently, so the same finding can fail to match) but a useful first cut.
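
The overlap bucketing described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `Finding` shape and the `findingKey` and `overlapBuckets` names are assumptions; only the `<file>:<line>:<snippet>` key comes from this README.

```typescript
// Hypothetical finding shape; the real harness types may differ.
type Tool = "getdebug" | "gitleaks" | "trufflehog";

interface Finding {
  tool: Tool;
  file: string;
  line: number;
  snippet: string;
}

// Same-finding key from the README: <file>:<line>:<snippet>.
function findingKey(f: Finding): string {
  return `${f.file}:${f.line}:${f.snippet}`;
}

// Group findings by key, then count keys per set of tools that flagged them.
// Bucket labels look like "getdebug+gitleaks" or "trufflehog".
function overlapBuckets(findings: Finding[]): Map<string, number> {
  const toolsByKey = new Map<string, Set<Tool>>();
  for (const f of findings) {
    const key = findingKey(f);
    const tools = toolsByKey.get(key) ?? new Set<Tool>();
    tools.add(f.tool);
    toolsByKey.set(key, tools);
  }
  const buckets = new Map<string, number>();
  toolsByKey.forEach((tools) => {
    const label = Array.from(tools).sort().join("+");
    buckets.set(label, (buckets.get(label) ?? 0) + 1);
  });
  return buckets;
}

// Illustrative data only; snippets are redacted placeholders, not real secrets.
const sample: Finding[] = [
  { tool: "getdebug", file: "src/app.ts", line: 10, snippet: "sk-***" },
  { tool: "gitleaks", file: "src/app.ts", line: 10, snippet: "sk-***" },
  { tool: "trufflehog", file: "src/app.ts", line: 10, snippet: "sk-***" },
  { tool: "getdebug", file: ".env", line: 2, snippet: "AKIA***" },
  { tool: "gitleaks", file: ".env", line: 2, snippet: "AKIA***" },
  { tool: "trufflehog", file: "README.md", line: 5, snippet: "ghp_***" },
];

const buckets = overlapBuckets(sample);
```

With this sample, one finding lands in the all-three bucket, one in the getdebug+gitleaks pair bucket, and one is unique to trufflehog. Because the key includes the snippet, two tools that redact the same secret differently would be counted as disagreeing, which is the imperfection noted above.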

Prereqs

  • Node ≥ 18, pnpm.
  • gitleaks on PATH: brew install gitleaks (tested with 8.30.1).
  • trufflehog on PATH: brew install trufflehog (tested with 3.95.3).
  • A getdebug binary on PATH, OR set $GETDEBUG_BIN. Defaults to /tmp/getdebug-bench (the build target the parent project produces).
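
One plausible reading of the getdebug lookup, as a sketch: `$GETDEBUG_BIN` wins when set, otherwise the documented default path is used. The function name and the precedence are assumptions; the harness may resolve the binary differently (for example by checking PATH first).

```typescript
// Default path taken from the README.
const DEFAULT_GETDEBUG_BIN = "/tmp/getdebug-bench";

// Hypothetical helper: takes the environment as a plain object so it is easy to test.
function resolveGetdebugBin(env: Record<string, string | undefined>): string {
  const override = env["GETDEBUG_BIN"];
  // Treat an unset or blank variable as "not provided" (an assumption).
  if (override !== undefined && override.trim() !== "") {
    return override;
  }
  return DEFAULT_GETDEBUG_BIN;
}

// In Node this would be called as resolveGetdebugBin(process.env).
```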

Usage

# Install workspace deps
pnpm install --filter codesecbench...

cd bench

# Full run (~5-10 min depending on repo count + network)
pnpm scan

# Quick smoke run (3 repos only)
pnpm scan --limit 3

# Keep clones around (useful when iterating on a runner)
pnpm scan --keep-clones --workdir /tmp/bench-keep

# Re-render the report from the most recent JSON
pnpm report

# Render a specific run
pnpm report --in results/run-2026-05-31T22-58-12-345Z.json

# Run the fixture corpus (precision/recall per tool against
# code-level trueStates, plus the bundle-grep oracle on the
# committed framework artifacts)
pnpm fixtures

# Rebuild framework fixture artifacts from their _build/ harness
# and confirm the rebuild still demonstrates the fixture verdict.
# Hard-fails on verdict drift; the nightly CI workflow runs the
# same check.
pnpm fixtures:rebuild

# Same, with informational byte-level diff vs the committed artifact
pnpm fixtures:rebuild --byte-diff

# Interactively adjudicate a scan's findings into bench/labels/
# (Track B — fingerprint-only, never raw secrets — see
# bench/labels/README.md and [[bench-no-raw-live-secrets]])
pnpm label                              # latest run, all repos
pnpm label --repo stackitcloud/rag-template --reviewer fafa

# Compute per-tool precision/recall against the committed labels
pnpm score
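
For readers unfamiliar with the scoring step, here is a minimal sketch of per-tool precision and recall using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The input shape (lists of finding fingerprints) and the `scoreTool` name are assumptions; the real label format is described in bench/labels/README.md.

```typescript
interface ToolScore {
  tp: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
}

// reported: fingerprints one tool flagged; labeledTrue: fingerprints adjudicated as real.
function scoreTool(reported: string[], labeledTrue: string[]): ToolScore {
  const truth = new Set(labeledTrue);
  const flagged = new Set(reported); // de-duplicate repeated reports
  let tp = 0;
  flagged.forEach((fingerprint) => {
    if (truth.has(fingerprint)) tp += 1;
  });
  const fp = flagged.size - tp;
  const fn = truth.size - tp;
  return {
    tp,
    fp,
    fn,
    // Returning 0 for an empty denominator is a convention chosen for this sketch.
    precision: flagged.size === 0 ? 0 : tp / flagged.size,
    recall: truth.size === 0 ? 0 : tp / truth.size,
  };
}

// Illustrative fingerprints only.
const exampleScore = scoreTool(["a", "b", "c", "d"], ["a", "b", "e"]);
```

In this example the tool reports four fingerprints, two of which are labeled true, and misses one labeled finding, giving a precision of 0.5 and a recall of 2/3.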

Output

bench/results/
  run-<ISO_TIMESTAMP>.json   # raw structured data, full findings
  run-<ISO_TIMESTAMP>.md     # human-readable summary
  latest.json                # symlink-by-content to the most recent
  latest.md

results/ is gitignored; commit a chosen snapshot manually when it's worth keeping.

Corpus (src/repos.ts)

Three classes:

  1. Leaky-Repo baseline: Plazmaz/leaky-repo is a synthetic repo seeded with 150 known planted secrets across major providers. Used as a recall sanity check (tools should find ≥120).
  2. Popular references: Vercel AI SDK starter, LangChain official chat template, MCP servers reference. Expected near-zero findings; useful as a precision check.
  3. AI-starter sample: 20 mid-popularity (10 to 500 stars), less-curated AI templates pulled from GitHub search in the 2026-05-31 sweep. The realistic "what a real dev's repo looks like" baseline.

Add new repos by editing src/repos.ts — keep the category field set so reports can group correctly.
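
A corpus entry might look like the sketch below. Only the `category` field is named in this README; the `slug` field, the category string values, and the `groupByCategory` helper are hypothetical, so check the actual types in src/repos.ts before copying.

```typescript
// Hypothetical category values mirroring the three corpus classes above.
type RepoCategory = "leaky-baseline" | "popular-reference" | "ai-starter";

// Hypothetical entry shape; only `category` is confirmed by the README.
interface BenchRepo {
  slug: string; // "owner/name" on GitHub
  category: RepoCategory;
}

const exampleRepos: BenchRepo[] = [
  { slug: "Plazmaz/leaky-repo", category: "leaky-baseline" },
  // Placeholder entry, not a real corpus member.
  { slug: "example-org/example-ai-starter", category: "ai-starter" },
];

// Group slugs by category, as a report would need to do.
function groupByCategory(list: BenchRepo[]): Record<RepoCategory, string[]> {
  const groups: Record<RepoCategory, string[]> = {
    "leaky-baseline": [],
    "popular-reference": [],
    "ai-starter": [],
  };
  for (const repo of list) {
    groups[repo.category].push(repo.slug);
  }
  return groups;
}

const grouped = groupByCategory(exampleRepos);
```

Typing `category` as a closed union means an entry with a missing or misspelled category fails at compile time instead of silently dropping out of a report group.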

Tool invocation notes

  • getdebug runs with --quiet --json. Default detector set (secrets pass only — no --local-llm SAST).
  • gitleaks runs with --no-git --no-banner --exit-code 0. Filesystem mode, no git-history walk, doesn't exit non-zero on findings.
  • trufflehog runs with --no-verification --results=verified,unverified,unknown. Without --no-verification it would only emit live-API-verified leaks — a different (and slower) comparison. The verification mode is more honest if your goal is "real leaks that work right now"; the unverified mode is fairer for "scanner agreement on shape matches."

The same flags are echoed in the rendered report, so the reader knows exactly which configurations are being compared.
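
The flags above could be assembled into argument lists roughly as follows. This is a sketch under assumptions: the flags come from this README, but the subcommands (`analyze`, `detect --source`, `filesystem`), the positional directory arguments, trufflehog's `--json`, and the `buildInvocation` name are guesses at how the harness wires things up.

```typescript
type ScannerName = "getdebug" | "gitleaks" | "trufflehog";

interface Invocation {
  bin: string;
  args: string[];
}

// Build the command line for one scanner against a cloned repo directory.
function buildInvocation(
  scanner: ScannerName,
  dir: string,
  getdebugBin: string = "getdebug",
): Invocation {
  switch (scanner) {
    case "getdebug":
      // Secrets pass only: no --local-llm.
      return { bin: getdebugBin, args: ["analyze", dir, "--quiet", "--json"] };
    case "gitleaks":
      // Filesystem mode, no git-history walk, exit 0 even when findings exist.
      return {
        bin: "gitleaks",
        args: ["detect", "--source", dir, "--no-git", "--no-banner", "--exit-code", "0"],
      };
    case "trufflehog":
      // Skip live verification so all shape matches are reported.
      return {
        bin: "trufflehog",
        args: [
          "filesystem",
          dir,
          "--no-verification",
          "--results=verified,unverified,unknown",
          "--json",
        ],
      };
  }
}

const gitleaksCmd = buildInvocation("gitleaks", "/tmp/bench-keep/some-repo");
```

Returning the binary and arguments as data, instead of spawning the process directly, makes it straightforward to print the exact command into a report and to pass the same array to a process runner such as Node's `child_process.spawnSync`.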

What this isn't

  • A SAST comparison. The default getdebug analyze runs the secrets pass only. SAST + AI-app detectors live behind --local-llm (CLI) or the hosted analyzer; comparing those against Semgrep/CodeQL is a different harness.
  • A recall+precision ground-truth study. Leaky-Repo gives us recall on a known dataset, but we don't label the real-world AI-starter findings as TP/FP/FN automatically. A bench/triage workflow could add that — flagged as future work.
  • A trufflehog defense. We use --no-verification to compare scanner agreement on shape matches; trufflehog's killer feature is the live verifier. If you only care about verified leaks, run it with default flags and ignore the unverified column.

License

MIT.

About

Reproducible harness for benchmarking code-level AI-app security scanners. The reference implementation behind getdebug.dev/bench.
