v0.51.0 — first real corporate-share benchmark + Snaffler head-to-head
The first published head-to-head against upstream Snaffler on a
real Windows NTFS share, not LLM-curated paths.
The number
| Tool | Caught | Missed | FPs | F1 at Red+ |
|---|---|---|---|---|
| Upstream Snaffler | 16 | 59 | 4 | 0.337 |
| ShareSift v0.51 | 54 | 21 | 62 | 0.565 |
2525 files. 75 synthetic-but-format-shaped credentials across 16
categories. Operator triage policy (Red+).
ShareSift catches about 3.4× as many credentials as Snaffler
(54 vs. 16), at the cost of roughly 15× the false positives
(62 vs. 4). That is the genuine tradeoff: the path classifier is
aggressive on binary-extension noise (.msi/.iso/.psd). Run
Black-only for P=0.833 if you don't want those false positives;
run Red+ if you don't want 59 real credentials silently missed.
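The F1 column and the two ratios follow directly from the counts in the table. The sketch below recomputes them, assuming "Caught" is true positives, "Missed" is false negatives, and "FPs" is false positives; the `Counts` class is an illustrative helper and not part of ShareSift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one tool, taken from the table above."""
    caught: int  # true positives
    missed: int  # false negatives
    fps: int     # false positives

    @property
    def precision(self) -> float:
        return self.caught / (self.caught + self.fps)

    @property
    def recall(self) -> float:
        return self.caught / (self.caught + self.missed)

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


snaffler = Counts(caught=16, missed=59, fps=4)
sharesift = Counts(caught=54, missed=21, fps=62)

for name, c in (("Upstream Snaffler", snaffler), ("ShareSift v0.51", sharesift)):
    print(f"{name}: P={c.precision:.3f} R={c.recall:.3f} F1={c.f1:.3f}")

# Ratios quoted in the prose: credentials caught and false positives.
print(f"caught ratio: {sharesift.caught / snaffler.caught:.3f}")
print(f"FP ratio:     {sharesift.fps / snaffler.fps:.3f}")
```

The rounded F1 values match the table (0.337 and 0.565). The Black-only precision of 0.833 cannot be recomputed this way because the table only lists Red+ counts.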
Why this corpus exists
The v0.50 scorecard had one honesty caveat: the Windows precision
number (P=0.984 on snaffler-blind) came from LLM-labeled paths,
not real share content. v0.51 replaces it with:
- 2525 actual files on an NTFS partition built from a reproducible
JSON manifest via Stauffer's DiskForge
- 75 positives across 16 categories: one per ShareSift rule
generation v0.46→v0.50, plus the classic high-value categories
- 2420 corporate-share noise + 20 precision-stress filenames
- UNC backslash form (\\corp-fs01\…), which is what the rule
engine sees on real SMB shares
- One docker run from the committed seed → byte-identical corpus
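To illustrate why the UNC backslash form matters for path-based rules, here is a minimal sketch using only Python's standard library. The host, share, and file names are hypothetical, and this is not how ShareSift's rule engine parses paths; it only shows that the same string splits differently under Windows and POSIX path semantics.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical UNC path, invented for illustration only.
raw = r"\\fileserver\share\HR\passwords.txt"

win = PureWindowsPath(raw)
posix = PurePosixPath(raw)

# Windows semantics: host + share form the drive, the rest are components.
print(win.drive)
print(win.name)
print(win.suffix)

# POSIX semantics: backslashes are ordinary characters, so the whole
# string is a single component and directory-based rules see nothing.
print(len(posix.parts))
```

A corpus stored with forward slashes would exercise a different code path than what a scanner encounters on an SMB share, which is presumably why the corpus keeps the backslash form.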
Honest caveat
The 16 positive categories were authored to exercise ShareSift's
rule coverage. Snaffler's defaults don't ship with rules for
German cred filenames, CMD set "VAR=val", browser-creds
meta-coverage, etc. A neutrally curated corpus would show Snaffler
at maybe 40–50% recall, rather than the 21% (16 of 75) measured
here. The categories ShareSift covers are real
corporate-share shapes (operator-reported in Snaffler's own issue
tracker), not invented for benchmark-chasing — but the
operational gap is amplified by category selection. Full
disclosure in docs/diskforge_winshare_v1_results.md.
What didn't change
The 4-generation held-out discipline cycle is still the
methodology contribution. v3 still at 100%, v4 still at 70%
baseline. The benchmark adds the operational head-to-head story
on top.
Reproducing
git clone --branch v0.51.0 https://github.com/byevincent/ShareSift.git
cd ShareSift
uv sync --group pysnaffler-integration
bash tools/diskforge_winshare/build_corpus.sh
.venv/bin/python tools/run_full_sweep.py
Same seed = byte-identical corpus = same numbers.
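One way to check the byte-identical claim across two builds is to hash every file's relative path and contents into a single digest and compare the results. The sketch below assumes the corpus (or a mounted copy of it) is reachable as a directory tree; `corpus_digest` is a hypothetical helper and not part of the ShareSift tooling.

```python
import hashlib
from pathlib import Path


def corpus_digest(root: Path) -> str:
    """Return one SHA-256 covering every file's relative path and bytes."""
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by POSIX-style relative path so the order is stable across runs.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep path/content boundaries unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Calling `corpus_digest(Path(...))` on two independently built copies should return the same hex string if the builds are byte-identical. Note that it covers file names and contents only, not NTFS metadata such as timestamps or ACLs.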
Artifacts
- sharesift: 77MB single-file binary (Stage 1 + rule engine)
- Full source: git clone --branch v0.51.0
🤖 Generated with Claude Code