An LLM-driven command-line agent for finding security vulnerabilities and "vibe-code" problems (AI-slop patterns: stub authentication, hallucinated imports, dummy crypto, prompt-injection sinks) in any repository, diff, or single file.
s0-cli runs a hybrid of classic static scanners (semgrep, bandit, ruff, gitleaks, trivy) and LLM detectors, then uses a multi-turn agent to triage, deduplicate, recalibrate severity, and explain each finding. The whole scanning agent is itself optimizable: s0 optimize runs a Meta-Harness outer loop that mutates the agent against a labeled benchmark with a held-out test set.
```console
$ uv run s0 scan ./my-app

hallucinated import       src/email.py:8       critical  CWE-829
  `import emailclient` — no such package on PyPI; nearest match is
  `emailclient-aws` (likely typosquat). Suggest pinning `email-validator`.

SQL injection (f-string)  src/api/users.py:42  critical  CWE-89
  `cur.execute(f"SELECT … {user_id}")`. Use `cur.execute("… ?", (user_id,))`.

weak password hashing     src/auth/hash.py:7   high      CWE-327
  `hashlib.md5(...)` for password storage. Use `argon2-cffi` or `bcrypt`.

3 findings (1 critical) — triage filtered out 6 false positives
```
```bash
git clone https://github.com/antonellof/s0-cli.git
cd s0-cli
uv sync                # Python 3.12+, uv >= 0.5
cp .env.example .env   # then fill in one provider key
```

Set one of `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY` and the matching `S0_MODEL` (default: `anthropic/claude-sonnet-4-5`). Everything in `.env` is loaded automatically.
System scanners are auto-discovered. Install whatever subset you want; missing ones are silently skipped:
```bash
brew install semgrep gitleaks trivy
uv tool install bandit ruff
uv run s0 doctor   # confirms which scanners + LLM keys are live
```

```bash
# Scan an entire repository
uv run s0 scan ./path/to/repo

# Scan only the diff against a branch (great for PRs)
uv run s0 scan ./path/to/repo --mode diff --diff main

# Scan a single file
uv run s0 scan ./path/to/repo/file.py --mode file

# Skip the LLM entirely; raw scanner findings only (zero-cost smoke test)
uv run s0 scan ./path/to/repo --no-llm --format sarif --out report.sarif

# Fail the build if any high-severity issue is found
uv run s0 scan . --fail-on high

# Stream every step the agent takes (scanners, LLM turns, tool calls)
uv run s0 scan ./path/to/repo -v

# Inspect what the agent did (full prompt + tool trace per scan)
uv run s0 runs list
uv run s0 runs show <run_id>
uv run s0 runs grep "CWE-89"
```

Output formats: `markdown` (default, human-readable), `json`, `sarif` (for GitHub code-scanning, GitLab SAST, etc.).
s0-cli ships two loops that work together:
| Loop | Command | What it does | When to run it |
|---|---|---|---|
| Inner (scan) | `s0 scan ./your-repo` | Finds vulnerabilities in your code | every PR, every nightly |
| Outer (optimize) | `s0 optimize -n N` | Makes the agent itself smarter against a labeled benchmark | when you want a sharper scanner |
The outer loop doesn't scan your repo — it rewrites the scanning agent (a single Python file under src/s0_cli/harnesses/) and scores each rewrite on bench/tasks_train/. The improved agent then gets used by the next s0 scan you run on your code. This is the Meta-Harness approach — see the design section below for the rationale.
```bash
# 3 proposer iterations on the training bench, then a held-out test pass
uv run s0 optimize -n 3

# Want to see it work without spending tokens? Stub the LLM (zero-cost smoke test)
uv run s0 optimize -n 1 --no-llm

# Or fan out 2 parallel proposals per iteration and keep the best
uv run s0 optimize -n 5 -k 2 --fresh --run-name exp1
```

Each iteration writes a new harness file under `src/s0_cli/harnesses/` plus a scored run under `runs/`. The session ends with a final pass on `bench/tasks_test/` (the held-out set) so you can see the train→test generalization gap.
Want it tuned to your codebase? The shipped bench is generic. To make the agent better at your repo's specific failure modes, add real vulns from your codebase as new bench tasks — see Optimize for your own codebase below.
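The outer loop's control flow can be sketched in a few lines. This is an illustrative reduction, not the real s0 internals: `propose_harness`, `evaluate`, and the `Candidate` fields are hypothetical stand-ins for the proposer and the bench scorer.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    harness_path: str
    train_f1: float

def optimize(n_iters, propose_harness, evaluate) -> Candidate:
    """Propose n rewrites, score each on the train bench, keep the best.

    propose_harness and evaluate are hypothetical callables standing in for
    the coding-agent proposer and the bench/tasks_train/ scorer.
    """
    seed = "harnesses/baseline_v0_agentic.py"
    best = Candidate(seed, evaluate(seed))
    for _ in range(n_iters):
        path = propose_harness(history=best)  # proposer reads prior runs, writes a new file
        f1 = evaluate(path)                   # scored on the training bench
        if f1 > best.train_f1:
            best = Candidate(path, f1)
    return best  # the winner is then scored once on the held-out test set
```

In the real tool each `evaluate` call also records tokens, turns, and per-task traces under `runs/`; only the F1 comparison is shown here.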
Three drop-in integrations ship with the repo so you don't have to assemble the runtime yourself. Pick whichever matches your workflow.
Wraps s0 scan and uploads the SARIF report to your repo's Security tab. Diff mode on PRs, full-repo on main/cron.
```yaml
# .github/workflows/s0-scan.yml
name: s0-cli scan

on:
  pull_request:
  push: { branches: [main] }

permissions:
  contents: read
  security-events: write

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: antonellof/s0-cli@v0
        with:
          mode: ${{ github.event_name == 'pull_request' && 'diff' || 'repo' }}
          fail-on: high
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```

Inputs are documented in `action.yml`. The reusable example in `.github/workflows/example-pr-scan.yml` is what we use to dogfood it on this repo.
Multi-arch image with every scanner pre-installed (semgrep, bandit, ruff, gitleaks, trivy, ripgrep). Reproducible — versions are pinned in the Dockerfile.
```bash
docker run --rm -v "$PWD:/work" -w /work \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  ghcr.io/antonellof/s0-cli:latest scan .
```

The published `:latest` tag tracks `main`; pin to a `vX.Y.Z` tag or a short SHA for reproducible CI.
Two hooks ship in .pre-commit-hooks.yaml. The fast one runs the deterministic scanners on staged files (no LLM, no API key); the slower one runs the full LLM agent on the diff at push time.
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonellof/s0-cli
    rev: v0.0.1
    hooks:
      - id: s0-scan-staged   # every commit, ~1-2s
      - id: s0-scan-diff     # pre-push, ~30-60s, needs an LLM key
        stages: [pre-push]
```

Running a single static scanner gives you a wall of JSON; you still have to read every alert, decide which are real, and hunt down the data flow by hand. s0-cli runs the scanners plus an LLM agent that does that triage for you — and writes down every step it took so you can audit the result.
| | Traditional SAST | s0-cli |
|---|---|---|
| Detection | one scanner | 5 classic scanners + 2 AI-slop detectors, deduped |
| Triage | manual (engineer reads each alert) | LLM agent reads source, traces taint, marks FPs |
| Output | rule_id + line | severity + why_real + fix_hint, in markdown / JSON / SARIF |
| Audit trail | none | full prompt + every tool call recorded under runs/ |
| Reproducibility | re-run and hope | replay any past scan from runs/<id>/ |
s0 scan runs every installed scanner on the target in parallel, deduplicates findings across them by (path, line, rule_id), and hands the result to the inner harness — a multi-turn LLM agent with a tightly scoped tool surface. The agent reads source, greps for taint, blames git history, re-runs scanners with tighter rules, then either accepts each finding (assigning a severity and a fix_hint) or marks it as a false positive. Everything it does — the prompt, every tool call, every LLM response — is recorded under runs/<timestamp>__<harness>__<id>/ so any scan is reproducible and auditable.
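The cross-scanner deduplication step keys findings on `(path, line, rule_id)`. A minimal sketch of that policy — the `Finding` fields and first-wins merge rule here are assumptions for illustration, not the real s0 data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    # Hypothetical shape; the real internal finding record has more fields.
    path: str
    line: int
    rule_id: str
    scanner: str

def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding seen for each (path, line, rule_id) key."""
    seen: set[tuple[str, int, str]] = set()
    out: list[Finding] = []
    for f in findings:
        key = (f.path, f.line, f.rule_id)
        if key not in seen:
            seen.add(key)
            out.append(f)
    return out
```

With this key, semgrep and bandit both flagging the same line under the same rule id collapse into one finding for the agent to triage.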
Two scanning agents ship out of the box:
| Harness | Turns | Use |
|---|---|---|
| `baseline_v0_agentic` (default) | ≤30 | full investigation (read source, taint, dedup) |
| `baseline_v0_singleshot` | 1 | cheap pre-filter / CI fast path |

Pick one with `--harness <name>` or set `S0_DEFAULT_HARNESS` in `.env`.
| Detector | Catches | Kind |
|---|---|---|
| `semgrep` | broad SAST patterns (`auto` + `p/security-audit` + `p/owasp-top-ten`) | classic |
| `bandit` | Python security smells (B-codes) | classic |
| `ruff` (S, B) | security + bugbear lints, with severity escalation | classic |
| `gitleaks` | secrets in source (matched values redacted in logs) | classic |
| `trivy fs` | filesystem vulns, secrets, misconfigurations | classic |
| `hallucinated_import` | imports that aren't stdlib, declared, or local | AST |
| `vibe` | stub auth, dummy crypto, hardcoded backdoors, … | LLM detector |
Findings from every detector flow into the same agent loop, which decides what to keep, what to flag as a false positive, and what severity to report. All raw scanner output, every LLM call, and every tool invocation is recorded under runs/ for replay and audit.
All settings live in .env (see .env.example). The most useful knobs:
| Variable | Default | Purpose |
|---|---|---|
| `S0_MODEL` | `anthropic/claude-sonnet-4-5` | Any litellm-compatible model string |
| `S0_DEFAULT_HARNESS` | `baseline_v0_agentic` | Which scanning agent `s0 scan` uses |
| `S0_MAX_TURNS` | `30` | Agent tool-loop budget per scan |
| `S0_TOKEN_BUDGET` | `200000` | Soft input-token cap per scan |
| `S0_OUTPUT_CAP_BYTES` | `30000` | Per-tool-observation byte cap |
| `S0_RUNS_DIR` | `./runs` | Where to write run artifacts |
| `S0_FAIL_ON` | `high` | Default `--fail-on` severity floor |
The repository ships with 11 labeled tasks under bench/ (7 training, 4 held-out for testing generalization). Every task has a ground_truth.json listing the real vulnerabilities; the scorer matches predictions by (path, line ± 5). Numbers below are reproducible — run uv run s0 eval and uv run s0 eval --split test yourself.
Two configurations on openai/gpt-4o-mini:
| Configuration | Split | TP | FP | FN | Precision | Recall | F1 | Cost (in/out tokens) |
|---|---|---|---|---|---|---|---|---|
| `--no-llm` (raw scanners only) | train | 8 | 25 | 0 | 0.24 | 1.00 | 0.39 | 0 / 0 |
| `--no-llm` (raw scanners only) | test | 5 | 10 | 0 | 0.33 | 1.00 | 0.50 | 0 / 0 |
| `baseline_v0_agentic` (LLM) | train | 8 | 23 | 0 | 0.26 | 1.00 | 0.41 | 97k / 6k |
| `baseline_v0_agentic` (LLM) | test | 5 | 7 | 0 | 0.42 | 1.00 | 0.59 | 60k / 2k |
What these numbers show:
- Recall = 1.00 in every configuration. Across all 13 ground-truth vulnerabilities (train + test) — SQL injection, command injection, hallucinated imports, path traversal, weak crypto, hardcoded secrets, JWT no-verify, pickle deserialization, stub auth, … — the deterministic scanner pipeline alone catches every one. The LLM never has to find anything; its job is purely to triage what was already found.
- LLM triage cuts false positives by 30% on the held-out set (10 → 7) without dropping a single true positive. Held-out F1 climbs from 0.50 → 0.59 (+18% relative).
- Every scan ends in a fixed turn budget (median 5 turns, max 11 in this run) and a fixed token budget. No runaway costs.
- The held-out test split was never seen by the LLM during any optimization run — generalization is measured, not assumed.
The --no-llm mode is a useful free anchor: you keep 100% recall and pay zero LLM cost, at the price of more false positives to skim through. Most CI pipelines will want the LLM mode on PR diffs (small target, low token cost, accurate triage) and the no-LLM mode on full-repo nightly scans.
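The table rows follow directly from their raw counts. Recomputing the held-out `baseline_v0_agentic` row (TP=5, FP=7, FN=0) with the standard definitions:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision / recall / F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Held-out baseline_v0_agentic row: TP=5, FP=7, FN=0
p, r, f1 = prf1(tp=5, fp=7, fn=0)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.42 1.0 0.59
```

FN=0 is what pins recall at 1.00 in every configuration; the LLM only moves the FP column, which is why its effect shows up entirely in precision and F1.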
For an end-to-end demonstration on a real codebase (OWASP PyGoat, ~50 modules of deliberate Django vulnerabilities) and a from-scratch optimize loop, see docs/results/REAL_WORLD_RESULTS.md. Headline numbers from that run:
- 252 raw scanner findings → 14 kept by the LLM agent (94% noise reduction). Every kept finding is a genuine OWASP-Top-10-class issue (pickle RCE, hallucinated imports, hardcoded credentials, Docker-as-root, command injection, broken auth, …) with a `why_real` and `fix_hint`.
- A 2-iteration `s0 optimize -n 2 -k 2` session (~$0.12, 174s wall-clock) produced a winning harness that lifted held-out test F1 from 0.59 → 0.67 (+14% relative). All four candidate harnesses (winners and broken alike), the Pareto frontier, and per-task traces are committed under `docs/results/` so the run is reproducible.
The bench is split into a train set (visible to the optimizer) and a held-out test set (only scored at the end of an optimize session). See bench/README.md for the full task list and how to add new ones.
```bash
# Score the default harness on the training tasks
uv run s0 eval

# Score on the held-out test set
uv run s0 eval --split test

# Just the deterministic scanners, no LLM
uv run s0 eval --no-llm
```

`s0 eval` writes a scored run to `runs/`, the same place `s0 scan` writes; everything is uniformly inspectable with `s0 runs`.
The scanning agent is a single Python file. Most security tools encode their heuristics either in scattered config (.semgrepignore, custom rule files, hand-tuned LLM prompts) or in undocumented engineer intuition. s0-cli encodes them in a versioned harness file that gets automatically rewritten by an outer optimization loop, based on real evaluation data — this is the Meta-Harness approach (Lee et al., 2026).
s0 optimize runs the loop: a coding-agent proposer reads runs/ (every prior agent, every score, every tool trace), forms a hypothesis about the worst current failure mode, writes a new harness file under src/s0_cli/harnesses/, and the runner validates and re-scores it on bench/tasks_train/. After all training iterations finish, the best-train-F1 candidate is scored once on the disjoint bench/tasks_test/ to measure generalization. The proposer's contract is in SKILL.md.
| | Hand-tuning prompts/rules | Meta-Harness loop |
|---|---|---|
| What changes | a string in a config file | a whole single-file Python agent (prompts + tools + scanner selection + dedup logic) |
| What measures progress | "feels better on my test repo" | a labeled bench scored by F1, precision, recall, tokens, turns |
| What guards overfitting | nothing | held-out `bench/tasks_test/` the proposer never sees |
| History | git log of edits, no scores attached | every attempt + score + full trace lives forever in `runs/<id>/` |
| Cost vs. accuracy | implicit; you pick one config | explicit Pareto frontier (F1 ↑ vs. tokens ↓) snapshotted to `runs/_frontier.json` |
| Reproducibility | rerun and hope | `s0 runs show <id>` replays the exact harness file, prompts, and tool calls |
| Rollback | manual revert | the prior harness file is still on disk; just point `S0_DEFAULT_HARNESS` at it |
- Search beats intuition. The proposer can try ideas a human wouldn't bother with — "lower confidence on bandit B608 inside `tests/` directories", "escalate to critical when `pickle.loads` is reachable from a Flask handler", "skip semgrep's `python.lang.security.audit.dangerous-subprocess-use` for `subprocess.run` calls whose first argument is a list literal" — and measure whether each one helps.
- Pareto, not point estimates. The real choice in CI isn't "best F1", it's "best F1 at the token budget I can afford on a PR". The frontier file gives you that menu directly.
- Generalization is enforced. The proposer can't see `tasks_test/`, and the loop refuses to start if the train and test paths resolve to the same directory. So a +0.1 F1 on train that comes with a −0.05 test gap shows up in the summary table — you can't cheat your own benchmark.
- Every iteration is auditable. Each attempt is one new file plus a `runs/<id>/` directory containing `harness.py`, `score.json`, `summary.md`, and per-task traces with the full prompt and every tool call. Disk-as-database; no schema migrations, just `grep`.
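The F1-versus-tokens frontier is a standard Pareto dominance check: a candidate survives unless some other candidate is at least as good on both axes and strictly better on one. A sketch with made-up candidate tuples (the real frontier is computed over run records, not these toy triples):

```python
def pareto_frontier(runs: list[tuple[str, float, int]]) -> list[str]:
    """Return names of runs not dominated on (F1 maximize, tokens minimize).

    A run is dominated if another run has >= F1 AND <= tokens,
    with at least one of the two strictly better.
    """
    frontier = []
    for name, f1, tokens in runs:
        dominated = any(
            (f2 >= f1 and t2 <= tokens) and (f2 > f1 or t2 < tokens)
            for _, f2, t2 in runs
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical candidates: (name, test F1, input tokens)
runs = [("a", 0.59, 60_000), ("b", 0.67, 90_000), ("c", 0.55, 80_000)]
print(pareto_frontier(runs))  # ['a', 'b'] — 'c' is dominated by 'a'
```

This is why the frontier is a menu rather than a single winner: `a` is the budget pick, `b` the accuracy pick, and dominated candidates like `c` never need to be considered.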
Pass -k N (or --candidates N) to fan out N parallel proposals per iteration, each with a different temperature, seed harness, and focus directive. The runner evaluates them concurrently and keeps the highest-F1 winner; losers are still recorded under runs/ so you can see what each design slot tried.
```bash
# 2 parallel proposals per iteration; pick the better one each time
uv run s0 optimize -n 5 -k 2 --run-name exp_multicand --fresh
```

Cost scales linearly with `k`, but wall-clock time stays roughly constant (the proposers run concurrently). The strategy ladder lives in `src/s0_cli/optimizer/strategies.py` and is deterministic — `k=2` always means slot 0 (greedy, exploit) plus slot 1 (warmer, "shrink token cost"), so reruns hit the same regions of design space.
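A deterministic strategy ladder can be as simple as a fixed table indexed by slot. The temperatures and directives below are invented for illustration — only slots 0 and 1 are described in this README, and the real values in `strategies.py` may differ:

```python
# Hypothetical ladder; slot 0 and slot 1 match the behavior described above,
# later slots are purely illustrative.
STRATEGY_LADDER = [
    {"slot": 0, "temperature": 0.2, "directive": "exploit: refine the current best harness"},
    {"slot": 1, "temperature": 0.7, "directive": "shrink token cost without losing recall"},
    {"slot": 2, "temperature": 1.0, "directive": "explore: try a structurally different agent"},
]

def strategies_for(k: int) -> list[dict]:
    """k candidates always map to the first k slots, so reruns are reproducible."""
    return STRATEGY_LADDER[:k]
```

Because the mapping is positional rather than random, `-k 2` on two different machines probes the same two regions of design space.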
```bash
uv run s0 optimize -n 3                          # 3 iterations on train, then held-out test pass
uv run s0 optimize -n 5 --fresh --run-name exp1  # isolate under runs/exp1/, clean slate
uv run s0 optimize -n 1 --no-llm                 # smoke test the loop, zero tokens spent
```

After every iteration the Pareto frontier (F1 vs. tokens) is snapshotted to `runs/_frontier.json`. The session ends with a final pass on `bench/tasks_test/` that prints the train→test generalization gap. Ctrl+C finishes the current iteration and exits cleanly; press it twice to abort.
Inspect what the agents are doing:
```bash
uv run s0 runs list                    # all runs, newest first
uv run s0 runs frontier                # only the Pareto frontier
uv run s0 runs show <run_id>           # score + summary + harness diff
uv run s0 runs diff <run_a> <run_b>    # side-by-side
uv run s0 runs grep "<regex>"          # ripgrep across all traces
uv run s0 runs tail-traces <run_id> <task_id>
```

Out of the box, `s0 optimize` improves the agent against the 11 generic tasks shipped under `bench/` (SQL injection, XSS, path traversal, hardcoded secrets, …). That gives you an agent that's good at common security bugs. To get one that's good at your codebase's specific failure modes, mix in your own labeled tasks — usually drawn from past CVEs, post-mortems, pentest findings, or alerts you've manually triaged.
A natural first instinct is s0 optimize ./my-repo — let the optimizer chew on a real codebase and "get better over time". It can't, because the optimizer needs to know whether each rewrite of the agent is better than the previous one. That requires labels: a ground_truth.json that says "these N findings are real, anything else is noise". Without labels every change is unfalsifiable — F1 is undefined and the loop has no signal to climb. Pointing the optimizer at raw source would be like training a model with no loss function.
The good news: a tiny amount of labeled data goes a long way. 3–5 in-house tasks are enough to start biasing the agent toward your stack.
Each task is two files: a target/ directory holding the offending source, and a ground_truth.json listing the expected findings. Bootstrap from a known issue in your repo:
```bash
# 1. Copy the offending file(s) into a new task directory
mkdir -p bench/tasks_train/myco_sqli_001/target
cp ./your-repo/api/users.py bench/tasks_train/myco_sqli_001/target/

# 2. Write the label (one entry per ground-truth finding)
cat > bench/tasks_train/myco_sqli_001/ground_truth.json <<'EOF'
[
  {
    "rule_id": "sqli-fstring",
    "severity": "critical",
    "path": "target/users.py",
    "line": 42,
    "cwe": ["CWE-89"],
    "note": "f-string user_id into cur.execute"
  }
]
EOF

# 3. Sanity-check that the deterministic scanners can at least see it
uv run s0 eval --only myco_sqli_001 --no-llm

# 4. Re-run the optimizer; the proposer is now scored on your task too
uv run s0 optimize -n 5

# 5. Use the resulting (improved) harness in CI on the full repo
uv run s0 scan ./your-repo --fail-on high
```

The matcher is forgiving: a prediction matches a label if `path` is identical and `|line - gt.line| <= 5`, so you don't have to be exact on line numbers. Severity isn't matched on; it's scored separately. The full task format (multi-file targets, optional task README, etc.) is documented in `bench/README.md`.
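The matching rule above fits in one function. A sketch using the same field names as the `ground_truth.json` example — the real scorer's internals are assumed, only the path-plus-±5-lines rule comes from this README:

```python
def matches(pred: dict, gt: dict, line_tolerance: int = 5) -> bool:
    """A prediction matches a label if the path is identical
    and the reported line is within ±line_tolerance of the labeled line."""
    return (
        pred["path"] == gt["path"]
        and abs(pred["line"] - gt["line"]) <= line_tolerance
    )

gt = {"path": "target/users.py", "line": 42}
print(matches({"path": "target/users.py", "line": 45}, gt))  # True  (3 lines off)
print(matches({"path": "target/users.py", "line": 50}, gt))  # False (8 lines off)
```

So a label at line 42 still credits a scanner that points at line 45, which keeps hand-written labels robust to small refactors of the target file.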
Once you have one in-house task the loop closes:
```
real vulns ──add as bench task──▶ bench/tasks_train/<myco_xxx>/
─────────────                                  │
• post-mortems                                 │  s0 optimize -n N
• pentest reports                              ▼
• dismissed alerts              src/s0_cli/harnesses/<new>.py
• prod incidents                               │
       ▲                                       │  s0 scan ./your-repo (CI)
       │                                       ▼
       │                          findings on prod code
       │                                       │
       └──── any confirmed miss / verified FP ─┘
             (promote to a new bench task)
```
Every confirmed miss or verified false positive in production becomes a new training task. Over months your bench grows, and the agent gets sharper at the failure modes that actually appear in your stack — that is the "security improvements over time" loop, and it's powered by labels rather than by pointing at a folder.
| In-house tasks | What you get |
|---|---|
| 0 | a general agent (the shipped baseline) |
| 1–3 | the proposer notices your task class and biases toward it |
| 5–10 | meaningful tilt toward your codebase's vuln distribution |
| 20+ | diminishing returns; consider moving 20–30% into tasks_test/ to measure your-repo generalization |
Mix your tasks with the shipped ones — never replace them, or the optimizer will overfit your distribution and lose general competence.
- Sources of labels. Past CVEs in your dependency tree, post-mortems, internal pentest reports, the security-team backlog, alerts you've already triaged in `runs/` (use `s0 runs grep` to find them).
- Keep tasks small. A few hundred lines of source per `target/`, not the whole module. The harness only needs enough context to triage one finding correctly.
- Redact secrets. `target/` is committed to git. If you're labeling a leaked-credential bug, replace the secret with `EXAMPLE_REDACTED` before committing — the scanners care about the shape, not the value.
- Hold some out. As you accumulate tasks, move 20–30% into `bench/tasks_test/` so generalization is measured, not assumed. The optimizer is sandboxed from `tasks_test/` automatically.
- Privacy. If you can't put real source in the repo, keep your `bench/tasks_train/myco_*/` directory outside git and point `s0 optimize` at it: `uv run s0 optimize -n 5 --bench /path/to/private-bench`.
| Stage | Detection | Reasoning | Decision |
|---|---|---|---|
| Static scan | classic scanners | — | — |
| Triage | — | LLM (single-shot or agent) | LLM |
| Investigation | LLM tool loop | LLM tool loop | LLM |
| Vibe-code detector | LLM as scanner | LLM (same call) | LLM |
| Optimizer (`optimize`) | — | proposer LLM | evaluator (code) |
You can run with --no-llm to use only the deterministic scanners and no LLM at all — useful as a free baseline and for CI.
```
src/s0_cli/
  cli.py        entrypoint (typer)
  config.py     pydantic-settings + .env loader
  harness/      base classes, native tool calling, agent loop
  harnesses/    scanning agents (single-file, swappable)
  scanners/     deterministic + LLM detectors
  targets/      repo / diff / file scan targets
  eval/         bench runner + scorer + static validator
  optimizer/    outer Meta-Harness loop + proposer
  runs/         run-store CLI + filesystem schema
  reporters/    markdown / json / sarif renderers
  prompts/      system prompts (per-harness)
bench/
  tasks_train/  7 labeled tasks, visible to the optimizer
  tasks_test/   4 held-out tasks for generalization scoring
SKILL.md        proposer contract (read by the outer loop)
```
- Lee et al. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052 (2026). paper · code · tbench2 artifact
- KRAFTON AI & Ludo Robotics. Terminus-KIRA. github.com/krafton-ai/KIRA
Apache-2.0. See LICENSE.


