SAST Benchmark

61 real-world security advisories from OpenClaw, Ghost, and the Cosmos ecosystem (cosmos-sdk, wasmd, ibc-go, cosmos/evm, cometbft), packaged as a scanner-agnostic benchmark. Each case includes the exact commit timeline — baseline and introducing commit(s) — so any security scanner can test whether it would have detected these vulnerabilities at the point they were introduced.

Detection note: benchmark detection is based on path match + class match. Severity is still recorded as severityMatch, but it no longer gates whether a case counts as detected.

Quick Start

# 1. Clone the repos you want to benchmark. The two below are enough to
#    work through the Quick Start; see "Running the Benchmark" for the full
#    seven-repo command (OpenClaw, Ghost, and five Cosmos ecosystem repos).
git clone https://github.com/openclaw/openclaw.git ../openclaw
git clone https://github.com/TryGhost/Ghost.git ../ghost

# 2. Pick a case
cat cases/GHSA-3c6h-g97w-fg78/case.json | jq '{repository, vulnerableHead: .timeline.vulnerableHead}'

# 3. Checkout the vulnerable commit in the matching repo
cd ../openclaw
git checkout <vulnerableHead>

# 4. Run your scanner
your-scanner scan .

# 5. Compare results against expectedOutcome
cat ../sast-benchmark/cases/GHSA-3c6h-g97w-fg78/case.json | jq '.expectedOutcome'

Benchmarking Workflow

For each case in cases/GHSA-*/case.json:

Baseline scan (optional) — checkout timeline.baselineCommit, scan. This is the pre-vulnerability clean state; findings here are false positives for this advisory.
Vulnerable scan — checkout timeline.vulnerableHead, scan. Compare results against expectedOutcome:
- Does the scanner report a finding with a matching vulnerabilityClass?
- Does at least one finding path overlap with expectedPaths?
- Does the finding also meet minimumSeverity? The runner records this separately as severityMatch, but detection is based on path + class.

Case Schema (`case.json`)

Each case directory contains a single case.json with these sections:

Section	Description
`repository`	Canonical source repository ID (`OWNER/NAME`) used by the runner and validator to select the right local checkout
`advisory`	Title, severity, CVSS, CWEs, affected package ranges from the GHSA
`timeline`	Baseline commit, introducing commit(s), vulnerable head
`expectedOutcome`	Vulnerability class, minimum severity, expected file paths, description
`verification`	Verification evidence and ancestry checks for baseline → intro → vulnerableHead relationships

See schema/case.schema.json for the full JSON Schema (draft-07).

For --strict validation, the required ancestry checks in verification.checks are:

baseline_ancestor_of_intro
intro_ancestor_of_vulnerable_head
baseline_ancestor_of_vulnerable_head

Those checks should carry machine-readable ancestor and descendant SHAs.

High confidence means the case has evidence mapping the advisory to the first commit that introduced the affected implementation. That can be a later regression, or a feature/file-inception commit when the parent lacks the vulnerable path and the patch hardens that same path. The documented content-evidence check for that claim is introducing_commit_contains_vulnerable_pattern; in strict mode, its details must reference one of the case's timeline.introducingCommits SHAs.

Manifest

manifest.json is the checked-in summary index for the benchmark. Each entry in manifest.cases should include:

id
severity
title
vulnerabilityClass
baselineCommit
vulnerableHead
verificationStatus
confidence

Validation checks these summary fields against the corresponding case.json. When manifest.repositories is present, validation also checks that it matches the set of repositories referenced by the loaded cases.

Vulnerability Class Taxonomy

Class	Description
`authbypass`	Authentication bypass
`brokenauthz`	Broken authorization / privilege escalation
`commandinjection`	Command injection / argument injection
`codeexec`	Arbitrary code execution
`csrf`	Cross-site request forgery / request-origin binding failure
`pathtraversal`	Path traversal / directory traversal
`sandboxescape`	Sandbox escape / isolation bypass
`secretdisclosure`	Secret / credential disclosure
`sqlinjection`	SQL injection
`ssrf`	Server-side request forgery
`abuse`	Resource exhaustion / denial of service
`xss`	Cross-site scripting

Running the Benchmark

The runner automates the full loop: checkout each vulnerable commit, run your scanner, parse results, and produce a scorecard.

If the scanner emits parseable SARIF/simple JSON, the benchmark evaluates that output even when the scanner exits non-zero. This accommodates tools that use exit status to signal findings.

The runner loads every case under cases/ by default. Each case needs a local checkout of its source repository, so every example below passes --repo for all seven repos in the corpus. To scope to a subset (e.g. only Ghost cases), add --filter <GHSA-ID> [<GHSA-ID> ...] and pass only the matching --repo flag — see the "Run a subset of cases" example.

# Run with SARIF against the full corpus
python3 scripts/run.py \
  --repo openclaw/openclaw=../openclaw \
  --repo TryGhost/Ghost=../ghost \
  --repo cosmos/cosmos-sdk=../cosmos-sdk \
  --repo CosmWasm/wasmd=../wasmd \
  --repo cosmos/ibc-go=../ibc-go \
  --repo cosmos/evm=../cosmos-evm \
  --repo cometbft/cometbft=../cometbft \
  --scanner-cmd "semgrep scan --sarif ."

# Run with a custom JSON-producing scanner
python3 scripts/run.py \
  --repo openclaw/openclaw=../openclaw \
  --repo TryGhost/Ghost=../ghost \
  --repo cosmos/cosmos-sdk=../cosmos-sdk \
  --repo CosmWasm/wasmd=../wasmd \
  --repo cosmos/ibc-go=../ibc-go \
  --repo cosmos/evm=../cosmos-evm \
  --repo cometbft/cometbft=../cometbft \
  --scanner-cmd "my-scanner scan --json ." \
  --format simple

# Run a subset of cases — only repos referenced by the filtered GHSAs need --repo
python3 scripts/run.py \
  --repo TryGhost/Ghost=../ghost \
  --scanner-cmd "semgrep scan --sarif ." \
  --filter GHSA-w52v-v783-gw97 GHSA-cgc2-rcrh-qr5x

# Custom timeout and output path
python3 scripts/run.py \
  --repo openclaw/openclaw=../openclaw \
  --repo TryGhost/Ghost=../ghost \
  --repo cosmos/cosmos-sdk=../cosmos-sdk \
  --repo CosmWasm/wasmd=../wasmd \
  --repo cosmos/ibc-go=../ibc-go \
  --repo cosmos/evm=../cosmos-evm \
  --repo cometbft/cometbft=../cometbft \
  --scanner-cmd "codeql database analyze ." \
  --timeout 600 \
  --output my-results.json

Runner Options

Flag	Description
`--repo`	`OWNER/NAME=PATH` mapping from case repository to local git checkout; repeat for multiple repos
`--openclaw-repo`	Legacy alias for `--repo openclaw/openclaw=PATH`
`--scanner-cmd`	Scanner command to run in the checked-out worktree; must write SARIF or simple JSON to stdout
`--format`	Force `auto`, `sarif`, or `simple` output parsing
`--cases-dir`	Override the default `cases/` directory
`--filter`	Run only specific GHSA IDs
`--timeout`	Per-case timeout for the main scanner command
`--output`	Path for the JSON results file
`--baseline-cmd`	Optional setup command to run at `timeline.baselineCommit` before the vulnerable scan
`--baseline-timeout`	Timeout for `--baseline-cmd` (defaults to `--timeout`)

Diff-aware Scanners

--baseline-cmd exists for scanners that need to warm caches, build indexes, or prepare context from the pre-vulnerability commit before scanning vulnerableHead.

It runs a setup command at timeline.baselineCommit, discards that command's stdout, then checks out timeline.vulnerableHead and runs the main scanner command.

python3 scripts/run.py \
  --repo openclaw/openclaw=../openclaw \
  --repo TryGhost/Ghost=../ghost \
  --repo cosmos/cosmos-sdk=../cosmos-sdk \
  --repo CosmWasm/wasmd=../wasmd \
  --repo cosmos/ibc-go=../ibc-go \
  --repo cosmos/evm=../cosmos-evm \
  --repo cometbft/cometbft=../cometbft \
  --scanner-cmd "my-scanner scan --json ." \
  --baseline-cmd "my-scanner index ." \
  --baseline-timeout 600 \
  --format simple

--baseline-cmd does not compare baseline findings against vulnerable-scan findings. The README's earlier "baseline scan" workflow describes a larger false-positive-diffing feature that the runner still does not implement.

Scanner Output Formats

The runner accepts two output formats (auto-detected by default):

SARIF 2.1.0 (recommended) — Industry standard. Severity is derived from security-severity rule properties when available, otherwise from the result level or the rule's defaultConfiguration.level. CWE IDs are extracted from result properties or rule tags.

Simple JSON — Minimal format for scanners without SARIF support:

{
  "findings": [
    {
      "path": "src/foo.ts",
      "severity": "high",
      "ruleId": "command-injection",
      "message": "Command injection via user input",
      "cweIds": ["CWE-78"]
    }
  ]
}

expectedOutcome.vulnerabilityClass is matched from the finding's CWE metadata. Findings without CWE/class metadata cannot satisfy the class-match requirement.

Some checked-in cases intentionally keep advisory.cweIds empty because the upstream GHSA did not publish CWE metadata, and the benchmark does not synthesize replacement advisory CWEs. Those cases still define an expected benchmark class in expectedOutcome.vulnerabilityClass, but scanners must emit their own CWE metadata for the runner to award a class match.

Results

The runner writes a JSON results file (default: results.json) and prints a console scorecard:

  Case ID                      Class                Sev    Result     Path  Cls  Sev    #
  --------------------------------------------------------------------------------------
  GHSA-3c6h-g97w-fg78          commandinjection     high   DETECTED      Y    Y    Y   12
  GHSA-474h-prjg-mmw3          sandboxescape        high   MISSED        -    -    -    0
  ...
  --------------------------------------------------------------------------------------
  Detected: 15/24 (62.5%)

When --baseline-cmd is provided, the scorecard adds a Base column that shows whether baseline setup for each case was OK or FAIL.

Each entry in the JSON results payload includes the case repository alongside caseId, detected, pathMatch, classMatch, severityMatch, and scanner metadata.

Validation

# Structural validation (schema + manifest consistency; requires: pip install jsonschema)
python3 scripts/validate.py

# Semantic validation (commit SHAs resolve, ancestry checks pass in git history)
# If jsonschema is missing, schema checks are skipped with a warning.
python3 scripts/validate.py \
  --repo openclaw/openclaw=../openclaw \
  --repo TryGhost/Ghost=../ghost \
  --repo cosmos/cosmos-sdk=../cosmos-sdk \
  --repo CosmWasm/wasmd=../wasmd \
  --repo cosmos/ibc-go=../ibc-go \
  --repo cosmos/evm=../cosmos-evm \
  --repo cometbft/cometbft=../cometbft

`--strict` mode

scripts/validate.py --strict enforces that every loaded case has verification.status == "pass" AND verification.confidence == "high". The current corpus is expected to pass strict mode when the relevant upstream repositories are supplied with --repo. Strict mode also checks the three ancestry relationships and validates any introducing_commit_contains_vulnerable_pattern content-evidence check against the case's introducing commits.

Outcome Matching Criteria

A scanner "detects" a case when scanning vulnerableHead if:

Class match — reported vulnerability class matches expectedOutcome.vulnerabilityClass
- For SARIF/simple JSON inputs, this is derived from the finding's CWE metadata.
Path overlap — at least one finding path overlaps with expectedOutcome.expectedPaths
Severity floor — reported severity >= expectedOutcome.minimumSeverity, recorded as severityMatch but not required for detection

The runner evaluates severity and class matching only across findings on matching paths. Class matching is derived from all relevant findings on those paths, even when the matching finding is below the configured severity floor.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.claude		.claude
cases		cases
schema		schema
scripts		scripts
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
manifest.json		manifest.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SAST Benchmark

Quick Start

Benchmarking Workflow

Case Schema (`case.json`)

Manifest

Vulnerability Class Taxonomy

Running the Benchmark

Runner Options

Diff-aware Scanners

Scanner Output Formats

Results

Validation

`--strict` mode

Outcome Matching Criteria

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SAST Benchmark

Quick Start

Benchmarking Workflow

Case Schema (case.json)

Manifest

Vulnerability Class Taxonomy

Running the Benchmark

Runner Options

Diff-aware Scanners

Scanner Output Formats

Results

Validation

--strict mode

Outcome Matching Criteria

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Case Schema (`case.json`)

`--strict` mode

Packages