61 real-world security advisories from OpenClaw, Ghost, and the Cosmos ecosystem (cosmos-sdk, wasmd, ibc-go, cosmos/evm, cometbft), packaged as a scanner-agnostic benchmark. Each case includes the exact commit timeline — baseline and introducing commit(s) — so any security scanner can test whether it would have detected these vulnerabilities at the point they were introduced.
Detection note: benchmark detection is based on path match + class match. Severity is still recorded as severityMatch, but it no longer gates whether a case counts as detected.
# 1. Clone the repos you want to benchmark. The two below are enough to
# work through the Quick Start; see "Running the Benchmark" for the full
# seven-repo command (OpenClaw, Ghost, and five Cosmos ecosystem repos).
git clone https://github.com/openclaw/openclaw.git ../openclaw
git clone https://github.com/TryGhost/Ghost.git ../ghost
# 2. Pick a case
cat cases/GHSA-3c6h-g97w-fg78/case.json | jq '{repository, vulnerableHead: .timeline.vulnerableHead}'
# 3. Checkout the vulnerable commit in the matching repo
cd ../openclaw
git checkout <vulnerableHead>
# 4. Run your scanner
your-scanner scan .
# 5. Compare results against expectedOutcome
cat ../sast-benchmark/cases/GHSA-3c6h-g97w-fg78/case.json | jq '.expectedOutcome'For each case in cases/GHSA-*/case.json:
- Baseline scan (optional) — checkout
timeline.baselineCommit, scan. This is the pre-vulnerability clean state; findings here are false positives for this advisory. - Vulnerable scan — checkout
timeline.vulnerableHead, scan. Compare results againstexpectedOutcome:- Does the scanner report a finding with a matching
vulnerabilityClass? - Does at least one finding path overlap with
expectedPaths? - Does the finding also meet
minimumSeverity? The runner records this separately asseverityMatch, but detection is based on path + class.
- Does the scanner report a finding with a matching
Each case directory contains a single case.json with these sections:
| Section | Description |
|---|---|
repository |
Canonical source repository ID (OWNER/NAME) used by the runner and validator to select the right local checkout |
advisory |
Title, severity, CVSS, CWEs, affected package ranges from the GHSA |
timeline |
Baseline commit, introducing commit(s), vulnerable head |
expectedOutcome |
Vulnerability class, minimum severity, expected file paths, description |
verification |
Verification evidence and ancestry checks for baseline → intro → vulnerableHead relationships |
See schema/case.schema.json for the full JSON Schema (draft-07).
For --strict validation, the required ancestry checks in verification.checks are:
baseline_ancestor_of_introintro_ancestor_of_vulnerable_headbaseline_ancestor_of_vulnerable_head
Those checks should carry machine-readable ancestor and descendant SHAs.
High confidence means the case has evidence mapping the advisory to the
first commit that introduced the affected implementation. That can be a
later regression, or a feature/file-inception commit when the parent
lacks the vulnerable path and the patch hardens that same path. The
documented content-evidence check for that claim is
introducing_commit_contains_vulnerable_pattern; in strict mode, its
details must reference one of the case's timeline.introducingCommits
SHAs.
manifest.json is the checked-in summary index for the benchmark. Each entry in manifest.cases should include:
idseveritytitlevulnerabilityClassbaselineCommitvulnerableHeadverificationStatusconfidence
Validation checks these summary fields against the corresponding case.json. When manifest.repositories is present, validation also checks that it matches the set of repositories referenced by the loaded cases.
| Class | Description |
|---|---|
authbypass |
Authentication bypass |
brokenauthz |
Broken authorization / privilege escalation |
commandinjection |
Command injection / argument injection |
codeexec |
Arbitrary code execution |
csrf |
Cross-site request forgery / request-origin binding failure |
pathtraversal |
Path traversal / directory traversal |
sandboxescape |
Sandbox escape / isolation bypass |
secretdisclosure |
Secret / credential disclosure |
sqlinjection |
SQL injection |
ssrf |
Server-side request forgery |
abuse |
Resource exhaustion / denial of service |
xss |
Cross-site scripting |
The runner automates the full loop: checkout each vulnerable commit, run your scanner, parse results, and produce a scorecard.
If the scanner emits parseable SARIF/simple JSON, the benchmark evaluates that output even when the scanner exits non-zero. This accommodates tools that use exit status to signal findings.
The runner loads every case under cases/ by default. Each case needs a
local checkout of its source repository, so every example below passes
--repo for all seven repos in the corpus. To scope to a subset
(e.g. only Ghost cases), add --filter <GHSA-ID> [<GHSA-ID> ...] and pass
only the matching --repo flag — see the "Run a subset of cases" example.
# Run with SARIF against the full corpus
python3 scripts/run.py \
--repo openclaw/openclaw=../openclaw \
--repo TryGhost/Ghost=../ghost \
--repo cosmos/cosmos-sdk=../cosmos-sdk \
--repo CosmWasm/wasmd=../wasmd \
--repo cosmos/ibc-go=../ibc-go \
--repo cosmos/evm=../cosmos-evm \
--repo cometbft/cometbft=../cometbft \
--scanner-cmd "semgrep scan --sarif ."
# Run with a custom JSON-producing scanner
python3 scripts/run.py \
--repo openclaw/openclaw=../openclaw \
--repo TryGhost/Ghost=../ghost \
--repo cosmos/cosmos-sdk=../cosmos-sdk \
--repo CosmWasm/wasmd=../wasmd \
--repo cosmos/ibc-go=../ibc-go \
--repo cosmos/evm=../cosmos-evm \
--repo cometbft/cometbft=../cometbft \
--scanner-cmd "my-scanner scan --json ." \
--format simple
# Run a subset of cases — only repos referenced by the filtered GHSAs need --repo
python3 scripts/run.py \
--repo TryGhost/Ghost=../ghost \
--scanner-cmd "semgrep scan --sarif ." \
--filter GHSA-w52v-v783-gw97 GHSA-cgc2-rcrh-qr5x
# Custom timeout and output path
python3 scripts/run.py \
--repo openclaw/openclaw=../openclaw \
--repo TryGhost/Ghost=../ghost \
--repo cosmos/cosmos-sdk=../cosmos-sdk \
--repo CosmWasm/wasmd=../wasmd \
--repo cosmos/ibc-go=../ibc-go \
--repo cosmos/evm=../cosmos-evm \
--repo cometbft/cometbft=../cometbft \
--scanner-cmd "codeql database analyze ." \
--timeout 600 \
--output my-results.json| Flag | Description |
|---|---|
--repo |
OWNER/NAME=PATH mapping from case repository to local git checkout; repeat for multiple repos |
--openclaw-repo |
Legacy alias for --repo openclaw/openclaw=PATH |
--scanner-cmd |
Scanner command to run in the checked-out worktree; must write SARIF or simple JSON to stdout |
--format |
Force auto, sarif, or simple output parsing |
--cases-dir |
Override the default cases/ directory |
--filter |
Run only specific GHSA IDs |
--timeout |
Per-case timeout for the main scanner command |
--output |
Path for the JSON results file |
--baseline-cmd |
Optional setup command to run at timeline.baselineCommit before the vulnerable scan |
--baseline-timeout |
Timeout for --baseline-cmd (defaults to --timeout) |
--baseline-cmd exists for scanners that need to warm caches, build indexes, or prepare context from the pre-vulnerability commit before scanning vulnerableHead.
It runs a setup command at timeline.baselineCommit, discards that command's stdout, then checks out timeline.vulnerableHead and runs the main scanner command.
python3 scripts/run.py \
--repo openclaw/openclaw=../openclaw \
--repo TryGhost/Ghost=../ghost \
--repo cosmos/cosmos-sdk=../cosmos-sdk \
--repo CosmWasm/wasmd=../wasmd \
--repo cosmos/ibc-go=../ibc-go \
--repo cosmos/evm=../cosmos-evm \
--repo cometbft/cometbft=../cometbft \
--scanner-cmd "my-scanner scan --json ." \
--baseline-cmd "my-scanner index ." \
--baseline-timeout 600 \
--format simple--baseline-cmd does not compare baseline findings against vulnerable-scan findings. The README's earlier "baseline scan" workflow describes a larger false-positive-diffing feature that the runner still does not implement.
The runner accepts two output formats (auto-detected by default):
SARIF 2.1.0 (recommended) — Industry standard. Severity is derived from security-severity rule properties when available, otherwise from the result level or the rule's defaultConfiguration.level. CWE IDs are extracted from result properties or rule tags.
Simple JSON — Minimal format for scanners without SARIF support:
{
"findings": [
{
"path": "src/foo.ts",
"severity": "high",
"ruleId": "command-injection",
"message": "Command injection via user input",
"cweIds": ["CWE-78"]
}
]
}expectedOutcome.vulnerabilityClass is matched from the finding's CWE metadata. Findings without CWE/class metadata cannot satisfy the class-match requirement.
Some checked-in cases intentionally keep advisory.cweIds empty because the upstream GHSA did not publish CWE metadata, and the benchmark does not synthesize replacement advisory CWEs. Those cases still define an expected benchmark class in expectedOutcome.vulnerabilityClass, but scanners must emit their own CWE metadata for the runner to award a class match.
The runner writes a JSON results file (default: results.json) and prints a console scorecard:
Case ID Class Sev Result Path Cls Sev #
--------------------------------------------------------------------------------------
GHSA-3c6h-g97w-fg78 commandinjection high DETECTED Y Y Y 12
GHSA-474h-prjg-mmw3 sandboxescape high MISSED - - - 0
...
--------------------------------------------------------------------------------------
Detected: 15/24 (62.5%)
When --baseline-cmd is provided, the scorecard adds a Base column that shows whether baseline setup for each case was OK or FAIL.
Each entry in the JSON results payload includes the case repository alongside caseId, detected, pathMatch, classMatch, severityMatch, and scanner metadata.
# Structural validation (schema + manifest consistency; requires: pip install jsonschema)
python3 scripts/validate.py
# Semantic validation (commit SHAs resolve, ancestry checks pass in git history)
# If jsonschema is missing, schema checks are skipped with a warning.
python3 scripts/validate.py \
--repo openclaw/openclaw=../openclaw \
--repo TryGhost/Ghost=../ghost \
--repo cosmos/cosmos-sdk=../cosmos-sdk \
--repo CosmWasm/wasmd=../wasmd \
--repo cosmos/ibc-go=../ibc-go \
--repo cosmos/evm=../cosmos-evm \
--repo cometbft/cometbft=../cometbftscripts/validate.py --strict enforces that every loaded case has
verification.status == "pass" AND verification.confidence == "high".
The current corpus is expected to pass strict mode when the relevant
upstream repositories are supplied with --repo. Strict mode also
checks the three ancestry relationships and validates any
introducing_commit_contains_vulnerable_pattern content-evidence check
against the case's introducing commits.
A scanner "detects" a case when scanning vulnerableHead if:
- Class match — reported vulnerability class matches
expectedOutcome.vulnerabilityClass- For SARIF/simple JSON inputs, this is derived from the finding's CWE metadata.
- Path overlap — at least one finding path overlaps with
expectedOutcome.expectedPaths - Severity floor — reported severity >=
expectedOutcome.minimumSeverity, recorded asseverityMatchbut not required for detection
The runner evaluates severity and class matching only across findings on matching paths. Class matching is derived from all relevant findings on those paths, even when the matching finding is below the configured severity floor.