djohnson68/handlabel

handlabel

Local-first stratified-sample audit UI for ML classifiers and labeled datasets.

A keyboard-driven web UI for auditing a stratified sample of items from a dataset, with Wilson 95% confidence intervals turning your manual labels into a defensible pass/fail verdict.

Built for the workflow: "I have a classifier and a labeled dataset, and I want to audit a representative sample in an evening to know whether my labels are noisy or my classifier is wrong."

Sits between hand-editing JSON files and full annotation platforms like Label Studio / Prodigy / Argilla — no cloud, no project management overhead, no schema-design ritual.

When to use this

You have a classifier (rule-based or learned) and a labeled corpus. You want to:

  • Spot-check whether your dataset labels are correct
  • Validate a classifier's verdicts against ground truth
  • Build a small gold-standard set for evaluation
  • Get statistically defensible pass/fail rates on a hand-audited sample

You don't want to:

  • Build a full data-labeling pipeline (use Label Studio)
  • Annotate hundreds of thousands of items (use Prodigy)
  • Set up a cloud labeling service (use Encord/Lightly)
  • Manage labeling teams (use anything else)

Quick start

# Install (requires Bun ≥ 1.1.0)
bun add -d handlabel

# Scaffold a project — creates handlabel.config.json + one example item
bun run handlabel init my-audit/

# Start the review UI on http://localhost:7723
bun run handlabel serve --dir my-audit/

# After review, get a Wilson-CI pass/fail report
bun run handlabel report --dir my-audit/

How it works

You produce one <id>.parsed.json per item to review (your code, your data source). handlabel displays each item via a keyboard-driven UI and writes one <id>.label.json per decision.

File layout

my-audit/
├── handlabel.config.json    # decision schema + thresholds (you author this)
├── item_001.parsed.json     # read-only item view (your pipeline generates this)
├── item_001.label.json      # writable decision (handlabel writes this)
├── item_002.parsed.json
├── item_002.label.json
└── ...
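Because review state lives entirely in file names, progress can be checked without opening the UI. The following is a minimal sketch that assumes only the `<id>.parsed.json` / `<id>.label.json` naming shown above; `pendingIds` and `pendingInDir` are hypothetical helper names, not part of handlabel.

```typescript
import { readdirSync } from "node:fs";

const PARSED_SUFFIX = ".parsed.json";
const LABEL_SUFFIX = ".label.json";

// Hypothetical helper: ids that have a parsed file but no label file yet.
function pendingIds(fileNames: string[]): string[] {
  const parsed = new Set<string>();
  const labeled = new Set<string>();
  for (const name of fileNames) {
    if (name.endsWith(PARSED_SUFFIX)) {
      parsed.add(name.slice(0, -PARSED_SUFFIX.length));
    } else if (name.endsWith(LABEL_SUFFIX)) {
      labeled.add(name.slice(0, -LABEL_SUFFIX.length));
    }
  }
  // Other files, such as handlabel.config.json, are ignored.
  return Array.from(parsed).filter((id) => !labeled.has(id)).sort();
}

// Convenience wrapper that reads a data directory such as "my-audit".
function pendingInDir(dir: string): string[] {
  return pendingIds(readdirSync(dir));
}
```

Calling `pendingInDir("my-audit")` would list the items that still need a decision.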

Project config (handlabel.config.json)

{
  "name": "phishing-audit",
  "description": "Stratified audit of 200 EPVME corpus samples",
  "decision": {
    "label": "Is this actually phishing?",
    "options": [
      { "key": "yes", "value": true, "shortcut": "1", "color": "danger" },
      { "key": "no", "value": false, "shortcut": "2", "color": "good" },
      { "key": "unclear", "value": "unclear", "shortcut": "3", "color": "warn" }
    ]
  },
  "fields": [
    {
      "type": "radio",
      "name": "completeness",
      "label": "Completeness",
      "options": [
        { "key": "full", "shortcut": "F" },
        { "key": "truncated", "shortcut": "T" },
        { "key": "fixture", "shortcut": "A" }
      ]
    },
    { "type": "text", "name": "brand_impersonated", "label": "Brand impersonated" },
    { "type": "textarea", "name": "notes", "label": "Notes" }
  ],
  "rubric": "Mark phishing emails (yes), spam/legit/fixtures (no), or ambiguous (unclear). Reject security test fixtures even if labeled positive.",
  "thresholds": [
    { "name": "actually_phishing", "field": "decision", "target": "yes", "min_ci_lower": 0.80 },
    { "name": "completeness_full", "field": "completeness", "target": "full", "min_ci_lower": 0.80 }
  ]
}
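A config like the one above can be sanity-checked before a review session starts. The sketch below is a hypothetical lint, not handlabel's own validation. It assumes that shortcuts should be unique across the decision and all radio fields (compared case-insensitively), and that a threshold's `target` must be an option `key` of the field it names; both rules are inferred from the example rather than documented behavior. The interface and function names are illustrative.

```typescript
interface Option { key: string; shortcut?: string }
interface Field { type: string; name: string; options?: Option[] }
interface Threshold { name: string; field: string; target: string; min_ci_lower: number }
interface Config {
  decision: { options: Option[] };
  fields?: Field[];
  thresholds?: Threshold[];
}

// Hypothetical lint: returns human-readable problems, empty when none are found.
function checkConfig(config: Config): string[] {
  const problems: string[] = [];

  // Collect every group of options: the decision plus each field that has options.
  const groups: { owner: string; options: Option[] }[] = [
    { owner: "decision", options: config.decision.options },
  ];
  for (const field of config.fields ?? []) {
    if (field.options) groups.push({ owner: field.name, options: field.options });
  }

  const shortcutOwner = new Map<string, string>();
  const keysByOwner = new Map<string, Set<string>>();
  for (const { owner, options } of groups) {
    keysByOwner.set(owner, new Set(options.map((o) => o.key)));
    for (const option of options) {
      if (!option.shortcut) continue;
      const shortcut = option.shortcut.toLowerCase();
      const where = `${owner}.${option.key}`;
      const previous = shortcutOwner.get(shortcut);
      if (previous) {
        problems.push(`shortcut "${option.shortcut}" used by ${previous} and ${where}`);
      } else {
        shortcutOwner.set(shortcut, where);
      }
    }
  }

  // Each threshold must point at a known option group and one of its keys.
  for (const t of config.thresholds ?? []) {
    const keys = keysByOwner.get(t.field);
    if (!keys) {
      problems.push(`threshold ${t.name}: unknown field "${t.field}"`);
    } else if (!keys.has(t.target)) {
      problems.push(`threshold ${t.name}: target "${t.target}" is not an option of ${t.field}`);
    }
    if (!(t.min_ci_lower > 0 && t.min_ci_lower < 1)) {
      problems.push(`threshold ${t.name}: min_ci_lower should be between 0 and 1`);
    }
  }
  return problems;
}
```

Under these assumptions, the example config above produces no problems.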

Item view (<id>.parsed.json)

{
  "id": "item_001",
  "stratum": "tp",
  "title": "Subject of the email or document",
  "meta": {
    "From": "attacker@evil.com",
    "To": "victim@example.com",
    "Date": "2026-05-11"
  },
  "body": "Main content text. HTML stripped server-side by your pipeline.",
  "annotations": [
    { "label": "URL", "tag": "evil.com", "value": "https://evil.com/path" }
  ],
  "verdict": {
    "label": "Low",
    "score": 20,
    "color": "warn",
    "signals": [
      { "name": "ReplyToMismatch", "weight": 15, "description": "Reply-To domain differs from sender" }
    ]
  }
}

Every field above is optional except id and stratum. The UI renders whatever fields are present.
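Since only `id` and `stratum` are required, an ingest pipeline can cheaply guard against emitting unusable files. A minimal sketch of such a guard follows; `ParsedItemLike` and `isParsedItem` are hypothetical names, and treating empty strings as invalid is an assumption rather than documented handlabel behavior.

```typescript
// Hypothetical local shape: the two required fields plus anything else.
interface ParsedItemLike {
  id: string;
  stratum: string;
  [key: string]: unknown;
}

// Type guard: true when the value carries a non-empty string id and stratum.
function isParsedItem(value: unknown): value is ParsedItemLike {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.id === "string" && record.id.length > 0 &&
    typeof record.stratum === "string" && record.stratum.length > 0
  );
}
```

Running this over each object before writing it catches missing strata early, which matters because per-stratum counts depend on that field.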

Decision record (<id>.label.json) — written by handlabel

{
  "id": "item_001",
  "stratum": "tp",
  "decision": "yes",
  "fields": {
    "completeness": "full",
    "brand_impersonated": "office365",
    "notes": "Classic credential-harvest impersonation"
  },
  "reviewer": "djohnson",
  "reviewed_at": "2026-05-11T13:45:12.123Z"
}
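Decision records are plain JSON, so they can be consumed by other tooling as well as by the built-in report. The sketch below tallies decisions per stratum from records shaped like the one above. `LabelRecord`, `tallyByStratum`, and `loadLabels` are hypothetical names, and the code is an illustration rather than how handlabel itself computes its report.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical subset of the decision record shown above.
interface LabelRecord {
  id: string;
  stratum: string;
  decision: string;
}

// Counts decisions per stratum, e.g. { tp: { yes: 2, no: 1 } }.
function tallyByStratum(records: LabelRecord[]): Record<string, Record<string, number>> {
  const tally: Record<string, Record<string, number>> = {};
  for (const record of records) {
    if (!tally[record.stratum]) tally[record.stratum] = {};
    const counts = tally[record.stratum];
    counts[record.decision] = (counts[record.decision] ?? 0) + 1;
  }
  return tally;
}

// Reads every <id>.label.json in a data directory.
function loadLabels(dir: string): LabelRecord[] {
  return readdirSync(dir)
    .filter((name) => name.endsWith(".label.json"))
    .map((name) => JSON.parse(readFileSync(join(dir, name), "utf8")) as LabelRecord);
}
```

For example, `tallyByStratum(loadLabels("my-audit"))` would give counts per stratum for the directory used in the quick start.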

The UI

  • Three-column layout: sample list (left) | item content (center) | decision panel (right)
  • Progress bar in the header with per-stratum counts ("tp: 12/50, fn_zero: 8/50, ...")
  • Keyboard shortcuts for every decision and field option (configurable per project)
  • Auto-saves on every click; auto-advances to next unreviewed item on Save+Next
  • Rubric panel (collapsible) reminds reviewers of the rubric on every item

Safety

  • Server binds to localhost only by default — no remote exposure.
  • The UI escapes every string before rendering — no XSS from item content.
  • Bodies and annotations are rendered as plain text, never as HTML or clickable links.
  • Your pipeline is responsible for HTML stripping before producing the .parsed.json (handlabel does NOT parse raw emails or untrusted content itself).

The report

bun run handlabel report --dir my-audit/

Computes Wilson 95% CIs for every threshold in your config:

# handlabel report

Reviewed: 200 / 200

## Per-stratum
| Stratum | Total | Reviewed | Decisions |
|---|---|---|---|
| fn_signaled | 50 | 50 | yes=12, no=35, unclear=3 |
| fn_zero | 50 | 50 | yes=4, no=44, unclear=2 |
| random | 50 | 50 | yes=15, no=33, unclear=2 |
| tp | 50 | 50 | yes=42, no=6, unclear=2 |

## Thresholds (Wilson 95% CI)
| Name | Field=target | Rate | 95% CI | n | Min lower | Pass |
|---|---|---|---|---|---|---|
| actually_phishing | decision=yes | 36.5% | 30.1%–43.4% | 73/200 | 80.0% | ❌ FAIL |
| completeness_full | completeness=full | 71.0% | 64.4%–76.8% | 142/200 | 80.0% | ❌ FAIL |

**Overall: FAIL**

Exit code: 0 if all thresholds pass (or none defined), 1 otherwise. Use it in CI to gate releases on audit results.

CLI

handlabel init [DIR]                   Scaffold a project (config + one example item)
handlabel serve [--dir DIR] [--port N] [--host HOST]
                                       Start the review UI (default: localhost:7723)
handlabel report [--dir DIR] [--json PATH]
                                       Print Wilson-CI pass/fail report

Producing .parsed.json files

handlabel is data-source-agnostic. You write a small pipeline that emits <id>.parsed.json files into the data directory.

Worked example

examples/spam-classifier-audit/ — a complete end-to-end example: 25 SMS-style messages in a CSV with classifier predictions and claimed labels, an ingest.ts script that buckets rows into TP/FP/TN/FN strata and writes parsed.json files, and a handlabel.config.json with two Wilson-CI gates ("was the classifier correct?" + "is the dataset label right?"). Clone the repo and run bun run examples/spam-classifier-audit/ingest.ts && bun run src/cli.ts serve --dir examples/spam-classifier-audit/audit-data to see it.

Common patterns

From CSV/Parquet: emit one .parsed.json per row. Give each item a stable id, such as row_<rownum> or a content hash. See the worked example above.

From JSONL: one JSON line per item; parse each line and emit one .parsed.json for it.

From email/.eml files: strip HTML in your pipeline; put text in body, headers in meta, and URLs in annotations. (handlabel does NOT parse raw emails itself, for safety; your pipeline owns that.)

From a classifier output log: run your classifier, capture the verdict and signals per item, and write them into verdict and verdict.signals[] in the parsed file.
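As an illustration of the JSONL pattern above, here is a minimal sketch. The input row shape (`id`, `stratum`, `text`) and the names `SourceRow`, `jsonlToItems`, and `writeItems` are assumptions made for the example; adapt them to whatever your JSONL actually contains.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical input row; real JSONL will have its own field names.
interface SourceRow { id: string; stratum: string; text: string }

// Local stand-in for the parsed-item shape: required fields plus a body.
interface ParsedItemSketch { id: string; stratum: string; body: string }

// Converts JSONL text into parsed items, skipping blank lines.
function jsonlToItems(jsonl: string): ParsedItemSketch[] {
  return jsonl
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const row = JSON.parse(line) as SourceRow;
      return { id: row.id, stratum: row.stratum, body: row.text };
    });
}

// Writes one <id>.parsed.json per item into the data directory.
function writeItems(items: ParsedItemSketch[], dir: string): void {
  mkdirSync(dir, { recursive: true });
  for (const item of items) {
    writeFileSync(join(dir, `${item.id}.parsed.json`), JSON.stringify(item, null, 2));
  }
}
```

Keeping the parsing step separate from the writing step makes it easy to inspect or filter items before anything reaches the data directory.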

Type-safe ingest scripts

import { writeFileSync } from "node:fs";
import type { ParsedItem } from "handlabel";

Stratified sampling is your responsibility: handlabel doesn't sample for you; it audits whatever you point it at. A typical TypeScript ingest loop, in which stratifiedSample and corpus stand in for your own sampling function and data:

import { writeFileSync } from "node:fs";
import type { ParsedItem } from "handlabel";

const samples = stratifiedSample(corpus, {
  tp: 50, fn_signaled: 50, fn_zero: 50, random: 50,
});

for (const s of samples) {
  const item: ParsedItem = {
    id: s.id,
    stratum: s.stratum,
    title: s.subject,
    meta: { From: s.from, Date: s.date },
    body: s.bodyText,
    verdict: { label: s.severity, score: s.score, signals: s.signals },
  };
  writeFileSync(`my-audit/${s.id}.parsed.json`, JSON.stringify(item, null, 2));
}
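The `stratifiedSample` call in the snippet above is left to you. One possible sketch is shown below. It assumes each corpus row already carries a `stratum` property, and it uses a small seeded generator (mulberry32) so that a re-run selects the same items. The optional `seed` parameter and the choice of generator are illustrative, not something handlabel prescribes.

```typescript
// Small seeded PRNG (mulberry32) so that sampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draws up to quotas[stratum] items from each stratum, without replacement.
function stratifiedSample<T extends { stratum: string }>(
  corpus: T[],
  quotas: Record<string, number>,
  seed = 1,
): T[] {
  const random = mulberry32(seed);
  const sample: T[] = [];
  for (const [stratum, quota] of Object.entries(quotas)) {
    const pool = corpus.filter((item) => item.stratum === stratum);
    // Fisher-Yates shuffle of this stratum's pool.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
    // A stratum smaller than its quota contributes everything it has.
    sample.push(...pool.slice(0, quota));
  }
  return sample;
}
```

Fixing the seed keeps item ids stable between runs, so existing .label.json files continue to match the regenerated .parsed.json files.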

Statistical methodology

Recall and FP rate point-estimates are misleading on small audit samples. handlabel reports Wilson score intervals (Wilson 1927) because:

  • For k=40, n=50, point estimate is 80% — but the 95% CI lower bound is 67%
  • A change from 40/50 to 45/50 (rate 80% → 90%) has overlapping CIs (about 67%–89% vs. 79%–96%), so it may be sampling noise rather than a real effect
  • A real "above 80%" claim requires ≥46/50 (45/50 has a lower bound of only about 79%) or a larger n

Threshold gates compare the lower bound of the CI against the threshold's min_ci_lower, not the point estimate. Passing means the lower bound clears that minimum, so you can defensibly claim "the true rate is ≥X% with 95% confidence."
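For readers who want to check the numbers by hand, here is a minimal sketch of the standard Wilson score interval and a gate built on it. It is an independent illustration and not necessarily the code in src/stats.ts. The names `wilsonInterval` and `passesGate`, the fixed z = 1.96, and the use of `>=` in the gate comparison are assumptions.

```typescript
interface Interval { lower: number; upper: number }

// Wilson score interval for k successes out of n trials (z = 1.96 for 95%).
function wilsonInterval(k: number, n: number, z = 1.96): Interval {
  if (n <= 0) return { lower: 0, upper: 1 }; // no data: the interval is uninformative
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return {
    lower: Math.max(0, center - margin),
    upper: Math.min(1, center + margin),
  };
}

// Gate in the spirit of min_ci_lower: pass only if the lower bound reaches the minimum.
function passesGate(k: number, n: number, minCiLower: number): boolean {
  return wilsonInterval(k, n).lower >= minCiLower;
}

// wilsonInterval(40, 50).lower is about 0.67, matching the first bullet above,
// so passesGate(40, 50, 0.8) is false even though the point estimate is 80%.
```

The interval is centered slightly toward 50% rather than on the raw rate, which is what keeps it well behaved for small n and for rates near 0% or 100%.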

See src/stats.ts for the implementation.

Development

git clone https://github.com/djohnson68/handlabel
cd handlabel
bun install

# Run the example
bun run src/cli.ts init /tmp/handlabel-demo
bun run src/cli.ts serve --dir /tmp/handlabel-demo

# Run tests + typecheck
bun test
bun run typecheck

License

MIT
