AI-powered security review of code changes as a Claude Code plugin.
Run a single slash command — /stride-security-review:security-review — to get a structured, severity-graded list of security findings on whatever you've changed. Powered by a dedicated security-reviewer agent that uses semantic analysis, not pattern matching, and filters out low-impact noise so the findings you see are the ones worth acting on.
/plugin marketplace add cheezy/stride-marketplace
/plugin install stride-security-review@stride-marketplace
The plugin auto-discovers the slash command, the agent, and the skill on install. No further configuration is needed.
Claude Code ships with a built-in /security-review command (a diff-only review that does NOT understand this plugin's flags — --full, --json, --maestro, --rci, --baseline, --patches are silently ignored). To invoke this plugin, use the namespaced form /stride-security-review:security-review. All examples below use that form.
Renamed in v2.0.0. This plugin was previously named security-review, which created a namespace collision with the Claude Code built-in. The rename to stride-security-review resolves the collision: the bare /security-review cleanly belongs to the built-in, and this plugin lives at /stride-security-review:security-review. If you have scripted invocations of the old namespaced form, update them to the new one.
In any git repository, run:
# Diff mode (default) — review what's changed against HEAD
/stride-security-review:security-review # all working-tree changes (staged + unstaged) vs HEAD
/stride-security-review:security-review lib/auth.ex # scope to one file
/stride-security-review:security-review lib/ test/ # scope to directories
/stride-security-review:security-review --json # raw JSON output for piping into tools
/stride-security-review:security-review --json lib/foo # path-scoped, raw JSON
# Full mode (--full) — review the codebase end-to-end (new in v1.1.0)
/stride-security-review:security-review --full # every tracked file in the repo
/stride-security-review:security-review --full lib/ # full scan scoped to a path
/stride-security-review:security-review --full --json # full scan, raw JSON output
# MAESTRO 7-layer classification (--maestro) — group findings by agentic-AI threat layer
/stride-security-review:security-review --maestro # classify each finding by MAESTRO layer
/stride-security-review:security-review --maestro --full # full scan + layer classification
/stride-security-review:security-review --maestro --json lib/ # raw JSON with maestro_layer fields
# Recursive Criticism & Improvement (--rci [N]) — run N additional critique passes
/stride-security-review:security-review --rci # one extra critique pass after the first dispatch
/stride-security-review:security-review --rci 2 # two extra critique passes (clamped to 3)
/stride-security-review:security-review --rci --full # critique pass over a full scan (expensive)
# Baseline suppression (--baseline) — suppress already-acknowledged findings
/stride-security-review:security-review --baseline # auto-detect .security-review-baseline.json in repo root
/stride-security-review:security-review --baseline ci.json # explicit baseline path
/stride-security-review:security-review --update-baseline # rewrite the baseline from current findings
# Auto-remediation patches (--patches) — emit surgical-fix diffs alongside findings
/stride-security-review:security-review --patches # diff mode + per-finding patch suggestions
/stride-security-review:security-review --patches --json # raw JSON includes the patch field
Diff mode answers "is this change safe to merge?" — invoke it before pushing a PR. Full mode answers "what latent issues are in this codebase right now?" — invoke it when onboarding the plugin onto an existing repo, or on a periodic posture-check cadence. MAESTRO mode answers "which architectural layer needs the most attention?" — invoke it on codebases that wire LLMs / agents / Model Context Protocol clients into the request flow, so findings can be grouped by the seven-layer model from Cloud Security Alliance's MAESTRO framework. The flags compose: --maestro --full --json lib/ is valid. The output JSON schema is identical in diff and full modes; --maestro is the one flag that adds an optional field (maestro_layer) to each finding when set.
Sample output for a small diff with one finding:
Security review — 1 finding across 2 files
Critical: 0 High: 1 Medium: 0 Low: 0 Info: 0
## High
**[injection]** lib/users.ex:42 — confidence: high — CWE-89, A03:2021
User-supplied `username` parameter is concatenated directly into a SQL string at the call to
Repo.query/2 below. The trust boundary is the HTTP request handler at line 38, and the sink is
the raw query string passed to Postgres — classic SQL injection. Worst-case outcome is
full-table read for any user with credentials to reach this endpoint.
Fix: Use Ecto's parameterized query API. Replace the string-concatenated query with
Repo.query("SELECT * FROM users WHERE username = $1", [username]) so user input is bound as a
parameter rather than interpolated into the SQL text.
The security-reviewer agent (see agents/security-reviewer.md) reviews diffs across these vulnerability classes:
| Class | Examples |
|---|---|
| Injection | SQL, command, LDAP, NoSQL, XXE, template, header |
| Authentication | Missing auth check, timing-vulnerable comparison, weak password requirements |
| Authorization | IDOR, privilege escalation through parameter tampering, trusting client roles |
| Data exposure | Hardcoded secrets, secrets in logs, PII in error responses, sensitive data over plaintext |
| Cryptography | MD5/SHA1 for passwords, ECB mode, static IVs, predictable RNG for tokens |
| Input validation | Path traversal, SSRF, open redirect, zip-bomb decompression, trusting client validation |
| Race conditions | Filesystem TOCTOU, unlocked read-modify-write on security-sensitive state, symlink races |
| XSS / code execution | DOM/reflected/stored XSS, SSTI, deserialization of untrusted data |
| Insecure configuration | CORS * with credentials, disabled CSRF, debug mode in prod, missing security headers, disabled cert verification |
| Supply chain | Floating-tag container base images, curl \| sh installers, CI/CD references by branch/tag instead of immutable SHA, manifest/lockfile drift, typosquat or hallucinated package names; multi-platform (Docker, GitHub Actions, GitLab CI, CircleCI, Bitbucket Pipelines, Jenkins) and multi-ecosystem (npm/PyPI/RubyGems/Hex/crates.io/Maven/NuGet/Packagist/Go modules) |
In addition to the universal classes above, seven framework-specific rule packs ship by default. They activate based on file extension AND import detection (never extension alone):
| Pack | Activates on | Idiomatic rules |
|---|---|---|
| Android / Kotlin | .kt / .java / .gradle with android.* / androidx.* / kotlinx.* imports, or AndroidManifest.xml content | Exported <activity> / <service> / <receiver> without signature-level permission, WebView with setJavaScriptEnabled(true) + addJavascriptInterface on a dynamic-URL load, android:usesCleartextTraffic="true" without networkSecurityConfig, SharedPreferences storing tokens/passwords/PINs (should be EncryptedSharedPreferences), SQLiteDatabase.rawQuery / execSQL with string concatenation, trust-all HostnameVerifier / X509TrustManager |
| Django / Python | .py with django.* / rest_framework / django.db imports | mark_safe on user input, extra / raw() query interpolation, CSRF disabled, DEBUG=True / ALLOWED_HOSTS=['*'] / missing SECURE_* in prod settings, mass-assignment via cleaned_data, open redirect, unsafe deserialization (pickle.loads / yaml.load / signing.loads no max_age), SSRF via requests/urllib/httpx, DRF ModelSerializer fields='__all__' |
| Express / Node.js | .js / .mjs / .cjs / .ts with express / koa / fastify / @hapi/hapi / restify imports | Reflected XSS via res.send(req.query.x), eval / Function() / vm.runInNewContext with user input, child_process.exec / execSync with shell wrapper, Mongoose Model.find(req.body) NoSQL injection, prototype pollution via lodash.merge / _.set |
| iOS / Swift | .swift / .m / .mm / .h with UIKit / SwiftUI / Foundation / WebKit / Security / CommonCrypto / CryptoKit / Network imports, or Info.plist content | Sensitive data in UserDefaults (should be Keychain), WKWebView with allowFileAccessFromFileURLs or permissive JS bridge, ATS NSAllowsArbitraryLoads = true without NSExceptionDomains allow-list, URLSession delegate trusting any server certificate, deep-link handler dispatching state-changing actions without auth confirmation |
| Phoenix / Elixir | .ex / .exs / .heex / .eex with Phoenix.LiveView / Phoenix.Controller / Plug.Conn / Ecto.Query / Phoenix.HTML references | Phoenix.HTML.raw/1 on user-controlled data, missing force_ssl, Plug.CSRFProtection disabled, Ecto.Query.fragment with string interpolation, LiveView event handler trusting phx-value-id without re-scoping, Ecto.Changeset.cast/3 with no explicit allow-list, Plug.Conn.redirect(external:) open redirect, Phoenix.Token.verify without max_age, System.cmd shell wrapper, LiveView allow_upload missing guards |
| Rails / Ruby | .rb / .erb with ActionController / ActiveRecord / ApplicationController references | html_safe / raw() on user input, find_by_sql / connection.execute with interpolation, protect_from_forgery disabled, params.permit! mass-assignment, eval / send / instance_eval with user input, redirect_to params[:url] open redirect, Marshal.load / YAML.load / YAML.unsafe_load deserialization, missing authenticate_user! on state-changing controllers, unfiltered render json: @user leaking password_digest / *_token / *_secret |
| React / Next.js | .jsx / .tsx / .js / .ts with react / next/* imports, or pages/api/** / app/**/route.{js,ts} path | dangerouslySetInnerHTML with user HTML, Next.js API route without auth check, redirect() / rewrite() with user-controlled destination, <a href={user_url}> / <img src={user_url}> with javascript: or data: scheme bypass, getServerSideProps / getStaticProps / Route Handler leaking secrets via props |
Each pack's rules map to one of the universal vulnerability classes; there are NO per-framework enum values. Adding another pack (Spring, Gin, Laravel, FastAPI, etc.) follows the documented template in the agent prompt.
A dedicated CI/CD pipeline rule pack activates on recognized pipeline files across eight platforms (alphabetical): Azure Pipelines, Bitbucket Pipelines, CircleCI, Drone, GitHub Actions, GitLab CI, Jenkins, Tekton. The same five rules apply to every platform: (1) external action / orb / template not pinned to an immutable SHA, (2) overly-broad permissions or scopes, (3) untrusted-ref or fork-PR build patterns, (4) secrets exposed alongside attacker-controlled input, (5) expression / interpolation injection in shell-step bodies. Activation is by file path (.github/workflows/*.yml, .gitlab-ci.yml, .circleci/config.yml, bitbucket-pipelines.yml, Jenkinsfile, azure-pipelines.yml, .drone.yml, .tekton/*.yaml) — generic YAML never triggers these rules. Adding a ninth platform means listing its file path and walking the five existing rules; the rule count stays fixed.
A Web defense-in-depth pack covers HTTP-response hardening across every framework above: missing Content-Security-Policy on HTML responses, missing Strict-Transport-Security, missing X-Frame-Options / frame-ancestors, and Set-Cookie without Secure / HttpOnly / SameSite on session/auth cookies. The pack is framework-agnostic (it activates on middleware / endpoint / response-pipeline shapes rather than file extension) and is explicitly defense-in-depth — when a primary XSS or CSRF finding is already raised on the same response site, the missing-header finding is a sibling note, not a duplicate.
For codebases that integrate LLMs, AI agents, or Model Context Protocol clients, five additional MAESTRO-derived classes activate (the file must import a recognized LLM/agent/MCP SDK in any language — Python, JavaScript/TypeScript, Go, Ruby, Elixir, Java/Kotlin all supported):
| Class | Examples |
|---|---|
| Prompt injection | Untrusted text concatenated into an LLM prompt without separation; messages=[{"role":"user","content": user_input}] patterns; un-delimited RAG context |
| Tool abuse | Agent function-call / MCP tool layer exposing file/shell/DB/HTTP/credential operations without per-tool authorization or input validation |
| Agent trust boundary | Agent-to-agent (A2A) message passing where one agent's output flows into another's prompt without the receiver treating it as untrusted |
| Model output execution | LLM response text flowing into eval, exec, subprocess with shell=True, Function(), os/exec.Command, or any code-execution sink |
| Vector store poisoning | User-controllable content embedded into a vector DB (Pinecone, Weaviate, Chroma, pgvector, etc.) without sanitization or source attribution |
The agent uses semantic analysis: a grep hit on eval( is not a finding; eval(user_input) at a trust boundary is. The analysis methodology, severity rubric, and JSON output schema live in the agent prompt.
To keep signal-to-noise high, the agent suppresses findings whose only impact is:
- Denial-of-service that is not also a data-integrity or confidentiality issue.
- Rate limiting as a general concern — unless its absence is on a credential or token-generation endpoint (which falls under Authentication).
- Memory exhaustion unless it enables another vulnerability class.
- Hypothetical risks not realizable through the changed code.
- Code style disguised as security concerns.
If your organization needs those classes flagged, see Customization — extend the agent prompt rather than working around the filter.
The agent always returns a single fenced ```json document conforming to:
{
  "findings": [
    {
      "severity": "critical | high | medium | low | info",
      "file": "path/relative/to/repo/root.ext",
      "line": 42,
      "vulnerability_class": "injection | authentication | authorization | data_exposure | crypto | input_validation | race_condition | xss_or_code_exec | insecure_config | supply_chain | prompt_injection | tool_abuse | agent_trust_boundary | model_output_execution | vector_store_poisoning",
      "cwe": ["CWE-89"],
      "owasp": ["A03:2021"],
      "description": "What and why",
      "remediation": "Specific fix",
      "confidence": "high | medium | low",
      "maestro_layer": "data-operations",
      "patch": "--- a/lib/users.ex\n+++ b/lib/users.ex\n@@ ..."
    }
  ],
  "summary": {
    "files_reviewed": 7,
    "findings_by_severity": {"critical": 0, "high": 1, "medium": 2, "low": 0, "info": 0},
    "files_skipped": [{"path": "priv/static/app.js", "reason": "oversize"}],
    "suppressed_count": 0,
    "rci_passes": 0
  }
}
Required per-finding fields (always present): severity, file, line, vulnerability_class, cwe, owasp, description, remediation, confidence. Every finding carries cwe (array of CWE-IDs like ["CWE-89"]) and owasp (array of OWASP Top 10 2021 category strings like ["A03:2021"]) so triage tools can group findings by canonical class without parsing prose. Both arrays default to [] only when a finding doesn't map to any standard category (rare).
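A downstream consumer may want to confirm the required fields before trusting a document. The following is a minimal sketch in Python, not part of the plugin; the function name `check_document` and the wording of its messages are hypothetical, and only the field names and severity values come from the schema above.

```python
# Required per-finding fields, as listed in the schema description above.
REQUIRED_FIELDS = (
    "severity", "file", "line", "vulnerability_class",
    "cwe", "owasp", "description", "remediation", "confidence",
)
SEVERITIES = ("critical", "high", "medium", "low", "info")


def check_document(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    for index, finding in enumerate(doc.get("findings", [])):
        # Every required field must be present on every finding.
        for field in REQUIRED_FIELDS:
            if field not in finding:
                problems.append(f"finding {index}: missing {field!r}")
        # Severity must be one of the five documented values.
        if finding.get("severity") not in SEVERITIES:
            problems.append(
                f"finding {index}: unknown severity {finding.get('severity')!r}"
            )
        # cwe and owasp are arrays, possibly empty.
        for field in ("cwe", "owasp"):
            if field in finding and not isinstance(finding[field], list):
                problems.append(f"finding {index}: {field!r} should be an array")
    return problems
```

Calling `check_document(json.loads(text))` on the output of a `--json` run would return `[]` for a well-formed document under these assumptions.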
Optional per-finding fields (emitted only when the corresponding flag is set):
| Field | Emitted when | Notes |
|---|---|---|
| maestro_layer | --maestro is set | One of seven canonical layer IDs from CSA MAESTRO: foundation-models, data-operations, agent-frameworks, deployment-infrastructure, evaluation-observability, security-compliance, agent-ecosystem. Omitted entirely when --maestro is not set. |
| patch | --patches is set AND the agent can produce a surgical fix | A unified-diff string the user can pipe to git apply. The agent emits a patch only when the fix is surgical (1–20 lines, single file, no new deps, no API breaks), unambiguous, and verifiable from the supplied input alone. Most findings won't have one even with --patches set. |
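Because both per-finding fields are optional, consumers should read them defensively. Here is a small Python sketch of one way to do that; the helper names and the `"unclassified"` bucket are hypothetical choices for illustration, not something the plugin emits.

```python
from collections import defaultdict


def group_by_layer(doc: dict) -> dict[str, list[dict]]:
    """Group findings by maestro_layer, tolerating documents produced without --maestro."""
    groups = defaultdict(list)
    for finding in doc.get("findings", []):
        # "unclassified" is a placeholder label chosen for this sketch.
        groups[finding.get("maestro_layer", "unclassified")].append(finding)
    return dict(groups)


def collect_patches(doc: dict) -> list[str]:
    """Return the unified-diff strings for findings that carry a patch."""
    return [f["patch"] for f in doc.get("findings", []) if "patch" in f]
```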
Optional summary fields:
| Field | Emitted when | Notes |
|---|---|---|
| files_skipped | --full is set | Array of {path, reason} records for files the binary/size filters dropped. reason is one of binary, oversize, unreadable. Always emitted in full mode (even as [] to prove the filter ran); omitted in diff mode. |
| suppressed_count | --baseline is set | Integer count of findings filtered out by the baseline. Omitted entirely when no baseline is in play. |
| rci_passes | --rci [N] is set | Integer recording how many Recursive Criticism & Improvement passes ran on top of the initial dispatch. Omitted when --rci is not set. |
Cross-batch dedup (full mode). Full-mode batches are merged with an order-stable dedup pass keyed by (file, line, vulnerability_class) — duplicates that surface across batches or RCI passes collapse to the first occurrence. Diff mode is a single dispatch and dedup is a no-op there; the merged document is byte-identical to the agent's output.
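The merge can be pictured with a short Python sketch. It assumes each batch is an already-parsed document in the schema above; the function name `merge_batches` is hypothetical, and the plugin's real merge may differ in details such as how optional summary fields are carried over.

```python
from collections import Counter

SEVERITIES = ("critical", "high", "medium", "low", "info")


def merge_batches(batches: list[dict]) -> dict:
    """Merge per-batch documents in batch order; the first occurrence of a key wins."""
    seen = set()
    merged = []
    files_reviewed = 0
    for doc in batches:
        # files_reviewed is summed across batches.
        files_reviewed += doc.get("summary", {}).get("files_reviewed", 0)
        for finding in doc.get("findings", []):
            key = (finding["file"], finding["line"], finding["vulnerability_class"])
            if key in seen:
                continue  # duplicate from a later batch or RCI pass
            seen.add(key)
            merged.append(finding)
    # Severity counts are recomputed from the deduplicated list, not summed per batch.
    counts = Counter(f["severity"] for f in merged)
    return {
        "findings": merged,
        "summary": {
            "files_reviewed": files_reviewed,
            "findings_by_severity": {s: counts.get(s, 0) for s in SEVERITIES},
        },
    }
```

With a single batch and no duplicates the findings list comes back unchanged, which matches the no-op behavior described for diff mode.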
Flag composition. The flags above all compose: --maestro --rci 2 --patches --baseline --full --json lib/ is a valid invocation. The --json flag prints the document verbatim so other tools (CI gates, Stride hooks, dashboards) can consume it.
The default /stride-security-review:security-review invocation reviews the working-tree diff against HEAD. That answers "is this PR safe to merge?" — but not "what latent issues are in this codebase right now?" The --full flag (added in v1.1.0) answers the second question by reviewing whole files rather than hunks.
Typical reasons to reach for --full: onboarding the plugin onto an existing repo (establish a baseline before you start gating PRs); a periodic posture check (quarterly, or on a cron); vendoring or forking an upstream codebase (review the imported code end-to-end before integrating); or a class-wide audit where the changed code is not the unit of interest. For PR-time gating, stay in diff mode — it is faster and the right shape for that question.
/stride-security-review:security-review --full # review every tracked file
/stride-security-review:security-review --full lib/ apps/web/ # scope to listed paths
/stride-security-review:security-review --full --json # raw JSON for piping
--full is additive — it composes with path arguments and with --json. Diff mode remains the default and its behavior is unchanged. The output JSON schema is identical in both modes, so any tool already consuming the diff-mode JSON continues to work against a --full run.
This is the surface contract every downstream piece of the plugin (slash command, agent prompt, skill, fixtures) follows. Implementation details for each bullet live in the file that owns that piece.
| Concern | Decision |
|---|---|
| Flag | --full. Parsed in the slash command's argument step and stripped from the path list before it reaches enumeration. |
| Enumeration source | git ls-files (optionally narrowed to user-supplied paths). This honors .gitignore, untracked-exclusions, and sparse-checkout — none of which find or a filesystem walk would honor for free. Untracked files are intentionally out of scope; if you want them reviewed, git add -N them first. |
| Binary filter | A single-shot grep -Il . <file1> <file2> ... call lists every non-empty text file in the enumeration. Any candidate path NOT in grep's stdout is treated as binary and skipped. This preserves the original null-byte-in-prefix heuristic (grep -I) and avoids dispatching the agent on PNGs, compiled artifacts, or minified bundles whose null bytes break tokenization. The call is batched into chunks of ~50 paths to stay under ARG_MAX; each chunk is a single Bash invocation matching the slash command's Bash(grep:*) permission entry, so the filter runs unattended in CI (no per-file pipe to gate). |
| Size cap | Skip any file larger than 262,144 bytes (256 KiB). Above this threshold the file is almost always generated, vendored, or minified, and the agent's signal-to-noise on it collapses. A single-shot wc -c <file1> <file2> ... call (batched into chunks of ~50) yields one <bytes> <path> line per file; any path over the threshold lands in files_skipped with reason: "oversize". Each wc chunk is a single Bash invocation matching Bash(wc:*). |
| Batch size | Dispatch the security-reviewer agent on 10 files per batch. Below this we burn dispatch overhead; above it we crowd the context window and lose per-file fidelity. Batches MAY be dispatched in parallel via multiple Agent tool calls in a single response. |
| Findings merge rule | The output of each batch is a JSON document conforming to the schema in the previous section. Merge in batch order, then run an order-stable dedup pass keyed by (file, line, vulnerability_class) — first occurrence wins. Dedup catches the rare case where shared setup code drives different batches to converge on the same finding, and where RCI passes replay the same batch. summary.findings_by_severity is recomputed from the post-dedup findings list, not summed from per-batch counters (those drift after dedup). summary.files_reviewed is summed across batches. The dedup pass is a no-op in diff mode (one dispatch, no RCI) so diff-mode JSON output is byte-identical to the agent's response. |
| Skipped-files reporting | The Step 2b enumeration loop records every filtered file as {path, reason} where reason ∈ {binary, oversize, unreadable}. The merged document carries these as summary.files_skipped (always emitted in full mode, even as []) so users can audit coverage. The human-readable report renders a ## Skipped block capped at 50 entries with an ... and N more overflow line. |
| Empty-input short-circuit | If enumeration yields zero files after filtering (e.g., empty repo, all files binary or over-cap), print the same short-circuit message diff mode uses for an empty diff and stop without dispatching the agent. |
| Output schema | The JSON document downstream tools consume is identical in diff and full modes for the required fields; full mode additionally emits summary.files_skipped. Optional fields gated by flags (maestro_layer, patch, summary.rci_passes, summary.suppressed_count) compose the same way in both modes. |
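The size cap, batch size, and skipped-file reporting from the table can be sketched in Python as pure functions. This is an illustration under stated assumptions: the plugin performs these steps with git, grep, and wc from the slash command, and the helper names and the exact bullet format of the skipped block are hypothetical.

```python
SIZE_CAP_BYTES = 262_144   # 256 KiB, per the size-cap row
BATCH_SIZE = 10            # files per agent dispatch
SKIPPED_DISPLAY_CAP = 50   # entries shown in the human-readable report


def split_by_size(sizes: dict[str, int]) -> tuple[list[str], list[dict]]:
    """Partition {path: byte_count} into reviewable paths and oversize skips."""
    keep, skipped = [], []
    for path, size in sizes.items():
        if size > SIZE_CAP_BYTES:
            skipped.append({"path": path, "reason": "oversize"})
        else:
            keep.append(path)
    return keep, skipped


def make_batches(paths: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Chunk paths into dispatch batches, preserving enumeration order."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def render_skipped(skipped: list[dict]) -> list[str]:
    """Render the skipped block, capped with an overflow line (line format assumed)."""
    lines = ["## Skipped"]
    for entry in skipped[:SKIPPED_DISPLAY_CAP]:
        lines.append(f"- {entry['path']} ({entry['reason']})")
    overflow = len(skipped) - SKIPPED_DISPLAY_CAP
    if overflow > 0:
        lines.append(f"... and {overflow} more")
    return lines
```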
- It does not change vulnerability classes, severity rubric, or the false-positive filter. Those live in the agent prompt and apply to both modes.
- It does not become the default. Diff mode is the PR-gating workflow most users invoke; full mode is an explicit posture-check action.
- It does not enumerate untracked files. git ls-files is the source of truth; if a file isn't tracked, it isn't reviewed.
- It does not deduplicate findings against an earlier run. Each invocation is independent; trend tracking is the consumer's job.
Two customization knobs:
- Scope by path. Pass paths as arguments to limit the review to a subset of changed files (diff mode) or tracked files (full mode). Useful in monorepos. Composable with both modes.
- Extend the agent prompt. Fork or patch agents/security-reviewer.md to add organization-specific vulnerability classes or to tighten/loosen the false-positive filter. Keep the output schema stable so downstream tooling continues to parse correctly.
The skill at skills/security-review-essentials/SKILL.md documents the surface; the agent prompt documents the behavior. Customize the agent prompt for behavior changes; customize the skill description for trigger-phrase tuning.
The --json flag makes /stride-security-review:security-review pipeable. Examples:
- A Stride completion hook can run /stride-security-review:security-review --json and refuse to mark a task done if a critical finding is present.
- A CI gate can run /stride-security-review:security-review --full --json on a schedule to track the codebase-wide finding count over time without coupling to any one PR.
- A CI gate can call the agent directly (without the slash command) by importing the agent prompt and feeding it a diff from git diff origin/main.
- A dashboard can ingest the JSON across many runs and chart the per-class trend.
The agent does not call any external service, so composition is safe in repos with sensitive content.
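Two of the consumers listed above, the completion hook and the dashboard, reduce to a few lines once the JSON is parsed. The Python sketch below is illustrative only; `has_critical` and `per_class_trend` are hypothetical helper names, and how a hook or dashboard actually obtains the documents is left to the integrator.

```python
from collections import Counter


def has_critical(doc: dict) -> bool:
    """True when any finding is critical; a hook could refuse completion on this."""
    return any(f.get("severity") == "critical" for f in doc.get("findings", []))


def per_class_trend(runs: list[dict]) -> dict[str, list[int]]:
    """Map each vulnerability class to its finding count in each run, in run order."""
    per_run = [
        Counter(f["vulnerability_class"] for f in doc.get("findings", []))
        for doc in runs
    ]
    classes = sorted({cls for counts in per_run for cls in counts})
    return {cls: [counts.get(cls, 0) for counts in per_run] for cls in classes}
```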
The plugin ships a reference workflow at .github/workflows/security-review.yml. Drop it into your own repo's .github/workflows/, set an ANTHROPIC_API_KEY secret, and the slash command will run on every pull request — blocking the merge when a finding meets the configured severity threshold.
When the slash command is invoked with --fail-on <severity>, it exits non-zero if any finding at or above the threshold is present. Severities order as critical > high > medium > low > info.
| Exit | Meaning |
|---|---|
| 0 | No findings at/above <severity> (or --fail-on not set) |
| 1 | At least one finding at/above <severity> |
| 2 | Setup or usage error (invalid --fail-on value, missing ANTHROPIC_API_KEY, agent dispatch failure) |
Default behavior (no --fail-on flag) preserves exit 0 always — byte-identical exit behavior for callers that do not opt in.
claude -p exit-code propagation has varied across Claude Code CLI versions. The shipped workflow uses a belt-and-suspenders pattern: it runs the slash command with --json --fail-on <severity> AND post-checks the JSON output with jq. The gate fails if EITHER the slash command exited non-zero OR jq counts at least one finding at/above the threshold. Either signal is sufficient to block the merge.
# Snippet from .github/workflows/security-review.yml (full file in this repo)
- name: Run security review
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    set +e
    claude -p "/stride-security-review:security-review --json --fail-on critical" > review.json
    claude_exit=$?
    set -e
    gate_count=$(jq -r '[.findings[]? | select(.severity == "critical")] | length' review.json)
    if [ "$gate_count" -gt 0 ] || [ "$claude_exit" -eq 1 ]; then exit 1; fi
Stricter posture: --fail-on high blocks on any critical or high finding. --fail-on medium blocks on critical, high, OR medium. info-only findings never trip a threshold. Pair with --baseline to accept a known set of findings and gate only on new ones.
--sarif emits a SARIF v2.1.0 document on stdout, the de facto interchange format for security findings. It is mutually exclusive with --json; passing both flags produces exit 2.
# Upload findings to GitHub Code Scanning. The sarifs endpoint takes the
# SARIF payload gzipped then base64-encoded; gh api -f cannot read that
# directly from stdin, so capture into a variable first.
ENCODED=$(claude -p "/stride-security-review:security-review --sarif --full" \
  | gzip \
  | base64 -w0) # macOS base64 has no -w0; drop the flag there
gh api -X POST "repos/${GITHUB_REPOSITORY}/code-scanning/sarifs" \
  -f commit_sha="${GITHUB_SHA}" \
  -f ref="refs/heads/${GITHUB_REF_NAME}" \
  -f sarif="${ENCODED}"
Findings appear in the repository's Security → Code Scanning tab with severity badges, file:line links, and CWE/OWASP tags. Stable cross-run dedup uses the same fingerprint algorithm as --baseline (SHA-256 of vulnerability_class|file|line|first-80-chars-of-description, emitted under the stride/v1 key in partialFingerprints).
The full SARIF field mapping lives in schema/README.md; the schema itself is referenced from https://json.schemastore.org/sarif-2.1.0.json.
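The fingerprint described above can be approximated in a few lines of Python. Treat this as a sketch: the text gives the four inputs and the hash, but the literal `|` separator, UTF-8 encoding, and hex output are assumptions here, as are the helper names and the idea of a baseline represented as a set of fingerprints.

```python
import hashlib


def fingerprint(finding: dict) -> str:
    """SHA-256 over class|file|line|first 80 chars of description (encoding assumed)."""
    parts = [
        finding["vulnerability_class"],
        finding["file"],
        str(finding["line"]),
        finding["description"][:80],
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def apply_baseline(findings: list[dict], baseline: set[str]) -> tuple[list[dict], int]:
    """Drop findings whose fingerprint is in the baseline; return (kept, suppressed_count)."""
    kept = [f for f in findings if fingerprint(f) not in baseline]
    return kept, len(findings) - len(kept)
```

Truncating the description to 80 characters means a reworded tail does not change the fingerprint, while a moved line number does.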
By default, diff mode scans git diff HEAD — i.e., the working tree against the last commit on the current branch. That's the right scope for a local pre-commit check, but the wrong scope for a CI gate, which should review every change the branch introduces relative to its merge target.
The --base <ref> flag widens the diff to <ref>...HEAD (three-dot range). Three-dot scopes strictly to the changes the current branch introduced over the merge-base; two-dot would also include base-side commits since divergence, producing a noisier diff. On GitHub Actions, pass --base origin/${{ github.base_ref }}. On GitLab, pass the merge-request target. The shipped workflow does this automatically when github.base_ref is non-empty. The flag is a no-op under --full (full mode reads tracked files via git ls-files, which is ref-independent). Invalid refs produce exit 2 with a clear error; the command never silently falls back to HEAD.
The scripts/run_eval.sh runner dispatches the security-reviewer agent against every fixture in test/fixtures/ and asserts the findings documented in test/fixtures/EXPECTED.md. It is the same suite CI runs (.github/workflows/eval.yml).
- jq on $PATH
- The Claude Code CLI (claude) on $PATH; set CLAUDE_CLI if your binary lives elsewhere
- ANTHROPIC_API_KEY exported (unless using --dry-run)
bash scripts/run_eval.sh # all 23 expectations
bash scripts/run_eval.sh --fixture test/fixtures/sql_injection.py
bash scripts/run_eval.sh --dry-run # parser/comparator only; no API calls
bash scripts/run_eval.sh --verbose # echo prompts and raw agent JSON to stderr
Output is TAP 13: one ok N / not ok N line per fixture, followed by a trailing pass/fail summary. Exit code 0 means every expected vulnerability_class + severity was produced at least once on the expected file (the Bitbucket multi-finding fixture requires both expected findings).
Per fixture, two files land in logs/ (gitignored):
- logs/<sanitized-path>.json: the parsed JSON document the agent emitted, with any wrapping prose stripped.
- logs/<sanitized-path>.raw.txt: the full unmodified stdout from claude -p, useful when the parser can't fence-extract or when you want to see what surrounded the JSON.
The runner asserts (file, vulnerability_class, severity, count). CWE and OWASP mismatches surface as # warn: TAP comments but do not fail the run — they are advisory metadata, not the contract. EXPECTED.md is the spec; do not modify it to match the agent.
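The comparison the runner performs can be illustrated in Python. The real runner is a bash script, so this is only a sketch of the contract: the helper names are hypothetical, `count` is treated as a minimum number of matching findings (an assumption), and the trailing pass/fail summary is omitted.

```python
def expectation_met(doc: dict, file: str, vulnerability_class: str,
                    severity: str, count: int = 1) -> bool:
    """True when at least `count` findings match file, class, and severity."""
    hits = sum(
        1
        for f in doc.get("findings", [])
        if f.get("file") == file
        and f.get("vulnerability_class") == vulnerability_class
        and f.get("severity") == severity
    )
    return hits >= count


def tap_lines(results: list[tuple[str, bool]]) -> list[str]:
    """Render (fixture_name, passed) pairs as TAP 13 lines with a leading plan."""
    lines = ["TAP version 13", f"1..{len(results)}"]
    for number, (name, passed) in enumerate(results, start=1):
        status = "ok" if passed else "not ok"
        lines.append(f"{status} {number} - {name}")
    return lines
```

CWE and OWASP values are deliberately left out of `expectation_met`, mirroring their advisory status in the runner.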
Issues and PRs welcome at https://github.com/cheezy/stride-security-review. For prompt or filter changes, please include a smoke-test diff and the expected finding in your PR description so reviewers can verify the change does what you say.
MIT — see LICENSE.