handlabel

Local-first stratified-sample audit UI for ML classifiers and labeled datasets.

A keyboard-driven web UI for auditing a stratified sample of items from a dataset, with Wilson 95% confidence intervals turning your manual labels into a defensible pass/fail verdict.

Built for the workflow: "I have a classifier and a labeled dataset, and I want to audit a representative sample in an evening to know whether my labels are noisy or my classifier is wrong."

Sits between hand-editing JSON files and full annotation platforms like Label Studio / Prodigy / Argilla — no cloud, no project management overhead, no schema-design ritual.

When to use this

You have a classifier (rule-based or learned) and a labeled corpus. You want to:

Spot-check whether your dataset labels are correct
Validate a classifier's verdicts against ground truth
Build a small gold-standard set for evaluation
Get statistically defensible pass/fail rates on a hand-audited sample

You don't want to:

Build a full data-labeling pipeline (use Label Studio)
Annotate hundreds of thousands of items (use Prodigy)
Set up a cloud labeling service (use Encord/Lightly)
Manage labeling teams (use anything else)

Quick start

# Install (requires Bun ≥ 1.1.0)
bun add -d handlabel

# Scaffold a project — creates handlabel.config.json + one example item
bun run handlabel init my-audit/

# Start the review UI on http://localhost:7723
bun run handlabel serve --dir my-audit/

# After review, get a Wilson-CI pass/fail report
bun run handlabel report --dir my-audit/

How it works

You produce one <id>.parsed.json per item to review (your code, your data source). handlabel displays each item via a keyboard-driven UI and writes one <id>.label.json per decision.

File layout

my-audit/
├── handlabel.config.json    # decision schema + thresholds (you author this)
├── item_001.parsed.json     # read-only item view (your pipeline generates this)
├── item_001.label.json      # writable decision (handlabel writes this)
├── item_002.parsed.json
├── item_002.label.json
└── ...
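
Because a reviewed item is simply one that has a matching .label.json, review progress can be derived from a directory listing alone. The following is a minimal sketch of that idea; pendingIds is a hypothetical helper written for illustration, not part of handlabel.

```typescript
// Hypothetical helper (not a handlabel API): given the file names in an audit
// directory, return the ids that have a .parsed.json but no .label.json yet.
function pendingIds(fileNames: string[]): string[] {
  const parsedSuffix = ".parsed.json";
  const labelSuffix = ".label.json";

  // Collect ids that already have a decision record.
  const labeled = new Set(
    fileNames
      .filter((f) => f.endsWith(labelSuffix))
      .map((f) => f.slice(0, -labelSuffix.length)),
  );

  // Keep item ids without a matching label file, sorted for stable output.
  return fileNames
    .filter((f) => f.endsWith(parsedSuffix))
    .map((f) => f.slice(0, -parsedSuffix.length))
    .filter((id) => !labeled.has(id))
    .sort();
}

const listing = [
  "handlabel.config.json",
  "item_001.parsed.json",
  "item_001.label.json",
  "item_002.parsed.json",
];
console.log(pendingIds(listing)); // only item_002 is still unreviewed
```

In a real script the listing would come from something like readdirSync("my-audit/") in node:fs.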

Project config (`handlabel.config.json`)

{
  "name": "phishing-audit",
  "description": "Stratified audit of 200 EPVME corpus samples",
  "decision": {
    "label": "Is this actually phishing?",
    "options": [
      { "key": "yes", "value": true, "shortcut": "1", "color": "danger" },
      { "key": "no", "value": false, "shortcut": "2", "color": "good" },
      { "key": "unclear", "value": "unclear", "shortcut": "3", "color": "warn" }
    ]
  },
  "fields": [
    {
      "type": "radio",
      "name": "completeness",
      "label": "Completeness",
      "options": [
        { "key": "full", "shortcut": "F" },
        { "key": "truncated", "shortcut": "T" },
        { "key": "fixture", "shortcut": "A" }
      ]
    },
    { "type": "text", "name": "brand_impersonated", "label": "Brand impersonated" },
    { "type": "textarea", "name": "notes", "label": "Notes" }
  ],
  "rubric": "Mark phishing emails (yes), spam/legit/fixtures (no), or ambiguous (unclear). Reject security test fixtures even if labeled positive.",
  "thresholds": [
    { "name": "actually_phishing", "field": "decision", "target": "yes", "min_ci_lower": 0.80 },
    { "name": "completeness_full", "field": "completeness", "target": "full", "min_ci_lower": 0.80 }
  ]
}
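
A config like the one above can be sanity-checked before a review session. The sketch below assumes that a threshold's field is either "decision" or the name of a field with options, and that target must be one of that field's option keys. The interfaces are inferred from the sample config and are not handlabel's exported types; checkThresholds is a hypothetical name.

```typescript
// Assumed shapes, inferred from the sample config (not handlabel's own types).
interface OptionSpec { key: string; shortcut?: string }
interface FieldSpec { type: string; name: string; options?: OptionSpec[] }
interface ThresholdSpec { name: string; field: string; target: string; min_ci_lower: number }
interface AuditConfig {
  decision: { options: OptionSpec[] };
  fields?: FieldSpec[];
  thresholds?: ThresholdSpec[];
}

// Return a list of human-readable problems; an empty list means no issue was found.
function checkThresholds(config: AuditConfig): string[] {
  const problems: string[] = [];

  // Map every gateable field name to the option keys it can take.
  const optionKeys = new Map<string, string[]>();
  optionKeys.set("decision", config.decision.options.map((o) => o.key));
  for (const f of config.fields ?? []) {
    if (f.options) optionKeys.set(f.name, f.options.map((o) => o.key));
  }

  for (const t of config.thresholds ?? []) {
    const keys = optionKeys.get(t.field);
    if (!keys) {
      problems.push(`${t.name}: unknown field "${t.field}"`);
    } else if (!keys.includes(t.target)) {
      problems.push(`${t.name}: "${t.target}" is not an option of "${t.field}"`);
    }
    // min_ci_lower is a proportion (0.80 in the sample), so 80 would be a mistake.
    if (!(t.min_ci_lower > 0 && t.min_ci_lower < 1)) {
      problems.push(`${t.name}: min_ci_lower should be a fraction between 0 and 1`);
    }
  }
  return problems;
}
```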

Item view (`<id>.parsed.json`)

{
  "id": "item_001",
  "stratum": "tp",
  "title": "Subject of the email or document",
  "meta": {
    "From": "attacker@evil.com",
    "To": "victim@example.com",
    "Date": "2026-05-11"
  },
  "body": "Main content text. HTML stripped server-side by your pipeline.",
  "annotations": [
    { "label": "URL", "tag": "evil.com", "value": "https://evil.com/path" }
  ],
  "verdict": {
    "label": "Low",
    "score": 20,
    "color": "warn",
    "signals": [
      { "name": "ReplyToMismatch", "weight": 15, "description": "Reply-To domain differs from sender" }
    ]
  }
}

Every field above is optional except id and stratum. The UI renders whatever fields are present.
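
Since only id and stratum are required, an ingest pipeline can guard against malformed items with a very small check. This is a sketch under that assumption; MinimalParsedItem and isParsedItem are hypothetical names and do not describe handlabel's own validation.

```typescript
// Hypothetical minimal shape: id and stratum required, anything else allowed.
interface MinimalParsedItem {
  id: string;
  stratum: string;
  [extra: string]: unknown;
}

// Type guard: true only when both required fields are non-empty strings.
function isParsedItem(value: unknown): value is MinimalParsedItem {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const id = v["id"];
  const stratum = v["stratum"];
  return (
    typeof id === "string" && id.length > 0 &&
    typeof stratum === "string" && stratum.length > 0
  );
}

const candidate: unknown = JSON.parse('{"id":"item_001","stratum":"tp","title":"Hello"}');
if (isParsedItem(candidate)) {
  console.log(`ok: ${candidate.id} in stratum ${candidate.stratum}`);
}
```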

Decision record (`<id>.label.json`) — written by handlabel

{
  "id": "item_001",
  "stratum": "tp",
  "decision": "yes",
  "fields": {
    "completeness": "full",
    "brand_impersonated": "office365",
    "notes": "Classic credential-harvest impersonation"
  },
  "reviewer": "djohnson",
  "reviewed_at": "2026-05-11T13:45:12.123Z"
}
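
Decision records are plain JSON, so they are easy to post-process outside handlabel. As an illustration, the sketch below tallies decisions per stratum from already-loaded records, similar in spirit to the per-stratum table in the report. LabelRecordSketch and tallyByStratum are hypothetical names; only the id, stratum and decision fields shown above are assumed.

```typescript
// Subset of the decision record shown above; other fields are ignored here.
interface LabelRecordSketch {
  id: string;
  stratum: string;
  decision: string;
}

// Count how often each decision occurs within each stratum.
function tallyByStratum(labels: LabelRecordSketch[]): Map<string, Map<string, number>> {
  const tally = new Map<string, Map<string, number>>();
  for (const { stratum, decision } of labels) {
    const counts = tally.get(stratum) ?? new Map<string, number>();
    counts.set(decision, (counts.get(decision) ?? 0) + 1);
    tally.set(stratum, counts);
  }
  return tally;
}

const sampleLabels: LabelRecordSketch[] = [
  { id: "item_001", stratum: "tp", decision: "yes" },
  { id: "item_002", stratum: "tp", decision: "no" },
  { id: "item_003", stratum: "fn_zero", decision: "yes" },
];
for (const [stratum, counts] of tallyByStratum(sampleLabels)) {
  console.log(stratum, Array.from(counts.entries()));
}
```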

The UI

Three-column layout: sample list (left) | item content (center) | decision panel (right)
Progress bar in the header with per-stratum counts ("tp: 12/50, fn_zero: 8/50, ...")
Keyboard shortcuts for every decision and field option (configurable per project)
Auto-saves on every click; auto-advances to next unreviewed item on Save+Next
Rubric panel (collapsible) reminds reviewers of the rubric on every item

Safety

Server binds to localhost only by default — no remote exposure.
The UI escapes every string before rendering — no XSS from item content.
Bodies and annotations are rendered as plain text, never as HTML or clickable links.
Your pipeline is responsible for HTML stripping before producing the .parsed.json (handlabel does NOT parse raw emails or untrusted content itself).
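
For readers unfamiliar with the escaping mentioned above, here is a generic sketch of HTML escaping. It illustrates the technique only and is not handlabel's actual code; escapeHtml is a hypothetical name.

```typescript
// Generic HTML escaping: replace the five characters that can change how a
// browser parses text, so untrusted strings render as literal text.
function escapeHtml(text: string): string {
  const replacements: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return text.replace(/[&<>"']/g, (ch) => replacements[ch] ?? ch);
}

console.log(escapeHtml('<img src=x onerror="alert(1)">'));
```

Escaping at render time is a second line of defense; it does not replace stripping HTML in your own pipeline.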

The report

bun run handlabel report --dir my-audit/

Computes Wilson 95% CIs for every threshold in your config:

# handlabel report

Reviewed: 200 / 200

## Per-stratum
| Stratum | Total | Reviewed | Decisions |
|---|---|---|---|
| fn_signaled | 50 | 50 | yes=12, no=35, unclear=3 |
| fn_zero | 50 | 50 | yes=4, no=44, unclear=2 |
| random | 50 | 50 | yes=15, no=33, unclear=2 |
| tp | 50 | 50 | yes=42, no=6, unclear=2 |

## Thresholds (Wilson 95% CI)
| Name | Field=target | Rate | 95% CI | n | Min lower | Pass |
|---|---|---|---|---|---|---|
| actually_phishing | decision=yes | 36.5% | 30.1%–43.5% | 73/200 | 80.0% | ❌ FAIL |
| completeness_full | completeness=full | 71.0% | 64.4%–76.8% | 142/200 | 80.0% | ❌ FAIL |

**Overall: FAIL**

Exit code: 0 if all thresholds pass (or none defined), 1 otherwise. Use it in CI to gate releases on audit results.

CLI

handlabel init [DIR]                   Scaffold a project (config + one example item)
handlabel serve [--dir DIR] [--port N] [--host HOST]
                                       Start the review UI (default: localhost:7723)
handlabel report [--dir DIR] [--json PATH]
                                       Print Wilson-CI pass/fail report

Producing `.parsed.json` files

handlabel is data-source-agnostic. You write a small pipeline that emits <id>.parsed.json files into the data directory.

Worked example

examples/spam-classifier-audit/ is a complete end-to-end example. It contains 25 SMS-style messages in a CSV with classifier predictions and claimed labels, an ingest.ts script that buckets rows into TP/FP/TN/FN strata and writes .parsed.json files, and a handlabel.config.json with two Wilson-CI gates ("was the classifier correct?" and "is the dataset label right?"). To see it, clone the repo and run bun run examples/spam-classifier-audit/ingest.ts && bun run src/cli.ts serve --dir examples/spam-classifier-audit/audit-data.

Common patterns

From CSV/Parquet: emit one .parsed.json per row. Set id to a stable key such as row_<rownum> or a content hash. See the worked example above.

From JSONL: one JSON line per item; parse each line and emit one .parsed.json for it.

From email/.eml files: strip HTML in your pipeline, then put the text in body, the headers in meta, and the URLs in annotations. (For safety, handlabel does NOT parse raw emails itself; your pipeline owns that.)

From a classifier output log: run your classifier, capture the verdict and signals per item, and write the signals into verdict.signals[] in the parsed file.
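
The JSONL pattern can be sketched as follows. The column names text, predicted and label are assumptions about your data, and ItemSketch, jsonlToItems and confusionStratum are hypothetical names; handlabel's real ParsedItem type has more optional fields than shown here.

```typescript
// Hypothetical minimal item shape for illustration.
interface ItemSketch {
  id: string;
  stratum: string;
  body?: string;
}

// Convert JSONL text into one item per non-blank line.
// `stratumOf` is supplied by you, since stratum assignment depends on your data.
function jsonlToItems(
  jsonl: string,
  stratumOf: (row: Record<string, unknown>) => string,
): ItemSketch[] {
  return jsonl
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line, index) => {
      const row = JSON.parse(line) as Record<string, unknown>;
      // id is the 1-based position among non-blank lines: row_1, row_2, ...
      const item: ItemSketch = { id: `row_${index + 1}`, stratum: stratumOf(row) };
      const text = row["text"]; // "text" is an assumed column name
      if (typeof text === "string") item.body = text;
      return item;
    });
}

// Example bucketing: compare an assumed `predicted` flag with an assumed `label` flag.
const confusionStratum = (row: Record<string, unknown>): string => {
  const predicted = row["predicted"] === true;
  const actual = row["label"] === true;
  if (predicted && actual) return "tp";
  if (predicted && !actual) return "fp";
  if (!predicted && actual) return "fn";
  return "tn";
};
```

Each returned item can then be written to <id>.parsed.json with writeFileSync and JSON.stringify, as in the ingest snippet in the next section.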

Type-safe ingest scripts


Stratified sampling is your responsibility — handlabel doesn't sample for you, it audits whatever you point it at. A typical TS snippet:

import { writeFileSync } from "node:fs";
import type { ParsedItem } from "handlabel";

const samples = stratifiedSample(corpus, {
  tp: 50, fn_signaled: 50, fn_zero: 50, random: 50,
});

for (const s of samples) {
  const item: ParsedItem = {
    id: s.id,
    stratum: s.stratum,
    title: s.subject,
    meta: { From: s.from, Date: s.date },
    body: s.bodyText,
    verdict: { label: s.severity, score: s.score, signals: s.signals },
  };
  writeFileSync(`my-audit/${s.id}.parsed.json`, JSON.stringify(item, null, 2));
}
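
The stratifiedSample call in the snippet above is left for you to supply. One possible implementation is sketched here: it assumes every corpus item already carries a stratum field, draws up to the requested quota from each stratum uniformly at random, and returns fewer items when a stratum is smaller than its quota. The signature is an assumption made for this sketch, not something handlabel provides.

```typescript
// Hypothetical implementation of a stratifiedSample helper.
// `random` is injectable so a seeded generator can make the sample reproducible.
function stratifiedSample<T extends { stratum: string }>(
  corpus: T[],
  quotas: Record<string, number>,
  random: () => number = Math.random,
): T[] {
  const picked: T[] = [];
  for (const stratum of Object.keys(quotas)) {
    const quota = quotas[stratum] ?? 0;
    // Copy the stratum's items so shuffling does not mutate the corpus.
    const pool = corpus.filter((item) => item.stratum === stratum);

    // Fisher-Yates shuffle: every ordering of the pool is equally likely.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      const tmp = pool[i] as T;
      pool[i] = pool[j] as T;
      pool[j] = tmp;
    }

    // Take the first `quota` items (or all of them if the stratum is small).
    picked.push(...pool.slice(0, quota));
  }
  return picked;
}
```

Record the seed or the sampled ids somewhere durable; an audit is only reproducible if the sample can be reconstructed.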

Statistical methodology

Recall and FP rate point-estimates are misleading on small audit samples. handlabel reports Wilson score intervals (Wilson 1927) because:

For k=40, n=50, point estimate is 80% — but the 95% CI lower bound is 67%
A change from 40/50 to 45/50 (rate 80% → 90%) has overlapping CIs — likely sampling noise, not a real effect
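A defensible "above 80%" claim at n=50 requires at least 46/50 (lower bound ≈ 81%); at 45/50 the lower bound is only about 79%. Otherwise, use a larger n.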

Threshold gates compare the lower bound of the CI against the threshold's min_ci_lower. Passing means you can defensibly claim that the true rate is at least that value at the 95% confidence level.
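
The numbers above can be reproduced with the standard Wilson score formula. The sketch below is an independent illustration using z = 1.96, not a copy of src/stats.ts; wilsonInterval and passesGate are hypothetical names.

```typescript
// Wilson score interval for k successes out of n trials.
// z = 1.96 corresponds to a two-sided 95% interval.
function wilsonInterval(k: number, n: number, z = 1.96): { lower: number; upper: number } {
  if (n <= 0) return { lower: 0, upper: 1 }; // no data: the rate is unconstrained
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// A gate passes when the interval's lower bound reaches the configured minimum.
function passesGate(k: number, n: number, minCiLower: number): boolean {
  return wilsonInterval(k, n).lower >= minCiLower;
}

for (const k of [40, 45, 46]) {
  const { lower, upper } = wilsonInterval(k, 50);
  console.log(
    `${k}/50: ${(lower * 100).toFixed(1)}% to ${(upper * 100).toFixed(1)}%,`,
    passesGate(k, 50, 0.8) ? "passes 0.80" : "fails 0.80",
  );
}
```

With n=50, 40/50 and 45/50 both fail a 0.80 gate while 46/50 passes, which is why small audits need either very high observed rates or more samples.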

See src/stats.ts for the implementation.

Development

git clone https://github.com/djohnson68/handlabel
cd handlabel
bun install

# Run the example
bun run src/cli.ts init /tmp/handlabel-demo
bun run src/cli.ts serve --dir /tmp/handlabel-demo

# Run tests + typecheck
bun test
bun run typecheck

License

MIT
