Local-first stratified-sample audit UI for ML classifiers and labeled datasets.
A keyboard-driven web UI for auditing a stratified sample of items from a dataset, with Wilson 95% confidence intervals turning your manual labels into a defensible pass/fail verdict.
Built for the workflow: "I have a classifier and a labeled dataset, and I want to audit a representative sample in an evening to know whether my labels are noisy or my classifier is wrong."
Sits between hand-editing JSON files and full annotation platforms like Label Studio / Prodigy / Argilla — no cloud, no project management overhead, no schema-design ritual.
You have a classifier (rule-based or learned) and a labeled corpus. You want to:
- Spot-check whether your dataset labels are correct
- Validate a classifier's verdicts against ground truth
- Build a small gold-standard set for evaluation
- Get statistically defensible pass/fail rates on a hand-audited sample
You don't want to:
- Build a full data-labeling pipeline (use Label Studio)
- Annotate hundreds of thousands of items (use Prodigy)
- Set up a cloud labeling service (use Encord/Lightly)
- Manage labeling teams (use anything else)
# Install (requires Bun ≥ 1.1.0)
bun add -d handlabel
# Scaffold a project — creates handlabel.config.json + one example item
bun run handlabel init my-audit/
# Start the review UI on http://localhost:7723
bun run handlabel serve --dir my-audit/
# After review, get a Wilson-CI pass/fail report
bun run handlabel report --dir my-audit/
You produce one <id>.parsed.json per item to review (your code, your data source). handlabel displays each item via a keyboard-driven UI and writes one <id>.label.json per decision.
my-audit/
├── handlabel.config.json # decision schema + thresholds (you author this)
├── item_001.parsed.json # read-only item view (your pipeline generates this)
├── item_001.label.json # writable decision (handlabel writes this)
├── item_002.parsed.json
├── item_002.label.json
└── ...
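Because each decision lives in its own `<id>.label.json` next to the matching `<id>.parsed.json`, review progress can be recomputed from file names alone. A minimal sketch, assuming only the naming convention above; `progressFromNames` and `reviewProgress` are hypothetical helpers, not part of handlabel:

```typescript
import { readdirSync } from "node:fs";

// Hypothetical helper: derive review progress from file names only.
// Assumes the <id>.parsed.json / <id>.label.json convention described above.
function progressFromNames(names: string[]): { reviewed: string[]; pending: string[] } {
  const PARSED = ".parsed.json";
  const LABEL = ".label.json";
  // Ids that already have a decision file.
  const labeled = new Set(
    names.filter((n) => n.endsWith(LABEL)).map((n) => n.slice(0, -LABEL.length)),
  );
  // Every item the pipeline produced, in a stable order.
  const ids = names
    .filter((n) => n.endsWith(PARSED))
    .map((n) => n.slice(0, -PARSED.length))
    .sort();
  return {
    reviewed: ids.filter((id) => labeled.has(id)),
    pending: ids.filter((id) => !labeled.has(id)),
  };
}

// Convenience wrapper that reads a directory such as "my-audit/".
function reviewProgress(dir: string): { reviewed: string[]; pending: string[] } {
  return progressFromNames(readdirSync(dir));
}
```

Calling `reviewProgress("my-audit/")` would then list which item ids still need a decision; other files such as the config are ignored.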
Example handlabel.config.json:
{
  "name": "phishing-audit",
  "description": "Stratified audit of 200 EPVME corpus samples",
  "decision": {
    "label": "Is this actually phishing?",
    "options": [
      { "key": "yes", "value": true, "shortcut": "1", "color": "danger" },
      { "key": "no", "value": false, "shortcut": "2", "color": "good" },
      { "key": "unclear", "value": "unclear", "shortcut": "3", "color": "warn" }
    ]
  },
  "fields": [
    {
      "type": "radio",
      "name": "completeness",
      "label": "Completeness",
      "options": [
        { "key": "full", "shortcut": "F" },
        { "key": "truncated", "shortcut": "T" },
        { "key": "fixture", "shortcut": "A" }
      ]
    },
    { "type": "text", "name": "brand_impersonated", "label": "Brand impersonated" },
    { "type": "textarea", "name": "notes", "label": "Notes" }
  ],
  "rubric": "Mark phishing emails (yes), spam/legit/fixtures (no), or ambiguous (unclear). Reject security test fixtures even if labeled positive.",
  "thresholds": [
    { "name": "actually_phishing", "field": "decision", "target": "yes", "min_ci_lower": 0.80 },
    { "name": "completeness_full", "field": "completeness", "target": "full", "min_ci_lower": 0.80 }
  ]
}
Example item_001.parsed.json:
{
  "id": "item_001",
  "stratum": "tp",
  "title": "Subject of the email or document",
  "meta": {
    "From": "attacker@evil.com",
    "To": "victim@example.com",
    "Date": "2026-05-11"
  },
  "body": "Main content text. HTML stripped server-side by your pipeline.",
  "annotations": [
    { "label": "URL", "tag": "evil.com", "value": "https://evil.com/path" }
  ],
  "verdict": {
    "label": "Low",
    "score": 20,
    "color": "warn",
    "signals": [
      { "name": "ReplyToMismatch", "weight": 15, "description": "Reply-To domain differs from sender" }
    ]
  }
}
Every field above is optional except id and stratum. The UI renders whatever fields are present.
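Since only `id` and `stratum` are required, a pipeline can run a cheap pre-flight check before writing each file. The sketch below is an illustration, not handlabel's own validator; `checkParsedItem` is a hypothetical name, and it assumes both required fields are non-empty strings, as in the example item:

```typescript
// Hypothetical pre-flight check for pipeline output; not handlabel's validator.
// Assumption: `id` and `stratum` are non-empty strings, as in the example item.
function checkParsedItem(value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["item must be a JSON object"];
  }
  const item = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["id", "stratum"]) {
    const v = item[key];
    if (typeof v !== "string" || v.length === 0) {
      problems.push(`"${key}" must be a non-empty string`);
    }
  }
  // All other fields are optional, so nothing else is checked here.
  return problems;
}
```

An empty array means the item satisfies the two required fields; anything else is a list of human-readable problems to log before skipping the item.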
Example item_001.label.json, as written by handlabel:
{
  "id": "item_001",
  "stratum": "tp",
  "decision": "yes",
  "fields": {
    "completeness": "full",
    "brand_impersonated": "office365",
    "notes": "Classic credential-harvest impersonation"
  },
  "reviewer": "djohnson",
  "reviewed_at": "2026-05-11T13:45:12.123Z"
}
- Three-column layout: sample list (left) | item content (center) | decision panel (right)
- Progress bar in the header with per-stratum counts ("tp: 12/50, fn_zero: 8/50, ...")
- Keyboard shortcuts for every decision and field option (configurable per project)
- Auto-saves on every click; auto-advances to next unreviewed item on Save+Next
- Rubric panel (collapsible) reminds reviewers of the rubric on every item
- Server binds to localhost only by default: no remote exposure.
- The UI escapes every string before rendering: no XSS from item content.
- Bodies and annotations are rendered as plain text, never as HTML or clickable links.
- Your pipeline is responsible for HTML stripping before producing the .parsed.json (handlabel does NOT parse raw emails or untrusted content itself).
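To make the "escapes every string" guarantee concrete, here is one common way to escape text before placing it in HTML. This is an illustration of the general technique only; handlabel's actual escaping code may differ, and `escapeHtml` is a hypothetical name:

```typescript
// Illustrative only: a conventional HTML text escaper.
// handlabel's own implementation is not shown in this README and may differ.
function escapeHtml(text: string): string {
  const map: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  // Replace each special character with its entity; leave everything else alone.
  return text.replace(/[&<>"']/g, (ch) => map[ch] ?? ch);
}
```

With escaping like this, a body containing `<script>` is shown to the reviewer as literal text instead of being executed.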
bun run handlabel report --dir my-audit/
Computes Wilson 95% CIs for every threshold in your config:
# handlabel report
Reviewed: 200 / 200
## Per-stratum
| Stratum | Total | Reviewed | Decisions |
|---|---|---|---|
| fn_signaled | 50 | 50 | yes=12, no=35, unclear=3 |
| fn_zero | 50 | 50 | yes=4, no=44, unclear=2 |
| random | 50 | 50 | yes=15, no=33, unclear=2 |
| tp | 50 | 50 | yes=42, no=6, unclear=2 |
## Thresholds (Wilson 95% CI)
| Name | Field=target | Rate | 95% CI | n | Min lower | Pass |
|---|---|---|---|---|---|---|
| actually_phishing | decision=yes | 36.5% | 30.1%–43.5% | 73/200 | 80.0% | ❌ FAIL |
| completeness_full | completeness=full | 71.0% | 64.4%–76.8% | 142/200 | 80.0% | ❌ FAIL |
**Overall: FAIL**
Exit code: 0 if all thresholds pass (or none defined), 1 otherwise. Use it in CI to gate releases on audit results.
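The counts in the threshold table and the exit-code rule can be sketched as follows. The shapes mirror the label and threshold JSON shown earlier, but the type and function names are hypothetical rather than handlabel exports. The sample report counts "unclear" decisions in n (73/200); treating items with a missing field the same way is an assumption:

```typescript
// Shapes follow the label and threshold JSON shown earlier.
// Names are hypothetical, not handlabel exports.
interface LabelRecord {
  decision: string;
  fields?: Record<string, string>;
}
interface Threshold {
  name: string;
  field: string; // "decision" or the name of an entry under "fields"
  target: string;
  min_ci_lower: number;
}

// Count hits (k) over all reviewed items (n) for one threshold.
// Assumption: items whose field is missing still count toward n.
function tally(labels: LabelRecord[], t: Threshold): { k: number; n: number } {
  const valueOf = (l: LabelRecord): string | undefined =>
    t.field === "decision" ? l.decision : l.fields?.[t.field];
  const k = labels.filter((l) => valueOf(l) === t.target).length;
  return { k, n: labels.length };
}

// Exit-code rule as stated above: 0 if every threshold passes (or none exist), else 1.
function exitCode(passes: boolean[]): number {
  return passes.every(Boolean) ? 0 : 1;
}
```

Each threshold's pass flag would come from comparing the Wilson lower bound for `k` out of `n` against `min_ci_lower`.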
handlabel init [DIR]                                  Scaffold a project (config + one example item)
handlabel serve [--dir DIR] [--port N] [--host HOST]  Start the review UI (default: localhost:7723)
handlabel report [--dir DIR] [--json PATH]            Print Wilson-CI pass/fail report
handlabel is data-source-agnostic. You write a small pipeline that emits <id>.parsed.json files into the data directory.
examples/spam-classifier-audit/ — a complete end-to-end example: 25 SMS-style messages in a CSV with classifier predictions and claimed labels, an ingest.ts script that buckets rows into TP/FP/TN/FN strata and writes parsed.json files, and a handlabel.config.json with two Wilson-CI gates ("was the classifier correct?" + "is the dataset label right?"). Clone the repo and run bun run examples/spam-classifier-audit/ingest.ts && bun run src/cli.ts serve --dir examples/spam-classifier-audit/audit-data to see it.
From CSV/Parquet — one row per .parsed.json. Use id as a stable key like row_<rownum> or a content hash. See the example above.
From JSONL — one JSON line per item; parse and emit one .parsed.json per line.
From email/.eml files — strip HTML server-side in your pipeline; put text in body, headers in meta, URLs in annotations. (handlabel does NOT parse raw emails itself for safety — your pipeline owns that.)
From a classifier output log — run your classifier, capture verdict + signals per item, drop into verdict.signals[] in the parsed file.
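As an illustration of the JSONL route with content-hash ids, the sketch below converts JSONL text into minimal parsed items. The input field names (`text`, `bucket`), the `MinimalParsedItem` type, and the choice of a 12-character SHA-256 prefix are all assumptions about your data, not handlabel requirements:

```typescript
import { createHash } from "node:crypto";

// Hypothetical input row; `text` and `bucket` are assumptions about your data.
interface Row {
  text: string;
  bucket: string;
}

// Minimal item shape: handlabel only requires `id` and `stratum`.
interface MinimalParsedItem {
  id: string;
  stratum: string;
  body?: string;
}

// Turn JSONL text into parsed items, using a content hash as a stable id.
function jsonlToItems(jsonl: string): MinimalParsedItem[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map((line) => {
      const row = JSON.parse(line) as Row;
      // Same text always yields the same id, so re-running the pipeline is stable.
      const id = createHash("sha256").update(row.text).digest("hex").slice(0, 12);
      return { id, stratum: row.bucket, body: row.text };
    });
}
```

Each returned item can then be written to `<dir>/<id>.parsed.json` with `writeFileSync`, following the file layout described earlier.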
Stratified sampling is your responsibility: handlabel doesn't sample for you, it audits whatever you point it at. A typical TS snippet:
import { writeFileSync } from "node:fs";
import type { ParsedItem } from "handlabel";
// `corpus` and `stratifiedSample` are your own data and code; handlabel does not provide them.
const samples = stratifiedSample(corpus, {
  tp: 50, fn_signaled: 50, fn_zero: 50, random: 50,
});
for (const s of samples) {
  // Map your own record shape onto the parsed-item fields the UI renders.
  const item: ParsedItem = {
    id: s.id,
    stratum: s.stratum,
    title: s.subject,
    meta: { From: s.from, Date: s.date },
    body: s.bodyText,
    verdict: { label: s.severity, score: s.score, signals: s.signals },
  };
  writeFileSync(`my-audit/${s.id}.parsed.json`, JSON.stringify(item, null, 2));
}
Recall and FP rate point estimates are misleading on small audit samples. handlabel reports Wilson score intervals (Wilson 1927) because:
- For k=40, n=50, the point estimate is 80%, but the 95% CI lower bound is 67%
- A change from 40/50 to 45/50 (rate 80% → 90%) has overlapping CIs, so the difference may be sampling noise rather than a real effect
- A real "above 80%" claim requires ≥46/50 (the lower bound for 45/50 is only about 78.6%) or a larger n
Threshold gates compare the lower bound of the CI against the target — passing means you can defensibly claim "the true rate is ≥X% with 95% confidence."
See src/stats.ts for the implementation.
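For readers who want to check the numbers above without opening the source, here is a sketch of the standard Wilson score interval. It is written from the textbook formula and is not a copy of src/stats.ts; the function name `wilsonInterval`, the default z = 1.96, and the handling of n = 0 are assumptions:

```typescript
// A sketch of the standard Wilson score interval; not a copy of src/stats.ts.
// z = 1.96 approximates the two-sided 95% normal quantile.
function wilsonInterval(k: number, n: number, z = 1.96): { lower: number; upper: number } {
  if (n <= 0) return { lower: 0, upper: 1 }; // assumption: no data, no information
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  // Center is the observed rate pulled toward 0.5; half is the interval half-width.
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Print the lower bounds for a few counts out of 50.
for (const k of [40, 45, 46]) {
  const { lower } = wilsonInterval(k, 50);
  console.log(`${k}/50: lower bound ${(lower * 100).toFixed(1)}%`);
}
```

Under these assumptions the lower bounds come out at roughly 67.0% for 40/50, 78.6% for 45/50, and 81.2% for 46/50, so a `min_ci_lower` of 0.80 is first cleared at 46/50 when n = 50.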
git clone https://github.com/djohnson68/handlabel
cd handlabel
bun install
# Run the example
bun run src/cli.ts init /tmp/handlabel-demo
bun run src/cli.ts serve --dir /tmp/handlabel-demo
# Run tests + typecheck
bun test
bun run typecheck
License: MIT