chore: update base model #346

Merged
hanneshapke merged 18 commits into main from hanneshapke/pii-benchmark
Apr 24, 2026

Conversation

@hanneshapke (Collaborator)

No description provided.

hanneshapke and others added 18 commits April 23, 2026 17:04
Add a standalone benchmark that evaluates the ONNX model against
ai4privacy/pii-masking-300k with per-label F1 metrics and latency
reporting. Supports --language filtering and --verbose per-sample output.
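
The per-label F1 metrics mentioned above could be computed roughly as in this sketch. The function name and the `(label, start, end)` entity-tuple shape are assumptions for illustration, not the benchmark's actual API; an exact span-and-label match is counted as a true positive.

```python
from collections import defaultdict

def per_label_f1(gold, pred):
    """Per-label precision/recall/F1 over entity sets.

    gold, pred: iterables of (label, start, end) tuples; an exact
    (label, start, end) match counts as a true positive.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for ent in pred_set:
        if ent in gold_set:
            tp[ent[0]] += 1
        else:
            fp[ent[0]] += 1
    for ent in gold_set - pred_set:
        fn[ent[0]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```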

Fix _create_word_labels to emit proper BIO tags (B- for first word,
I- for continuation) instead of assigning B- to every word in a
multi-word entity. Also trim SentencePiece leading whitespace from
predicted spans in the benchmark to match Go backend behavior.
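
The corrected BIO assignment can be sketched as below; the function name and tuple shapes are illustrative, not the repository's actual signature. The key point is that only the first word overlapping an entity receives the `B-` prefix and later words receive `I-`.

```python
def create_word_labels(words, entities):
    """Assign BIO tags to words using character offsets.

    words: list of (word, start, end) character-offset tuples.
    entities: list of (label, start, end) spans.
    The first word inside an entity gets B-<label>; subsequent
    words inside the same entity get I-<label>.
    """
    labels = ["O"] * len(words)
    for label, ent_start, ent_end in entities:
        first = True
        for i, (_, w_start, w_end) in enumerate(words):
            if w_start >= ent_start and w_end <= ent_end:
                labels[i] = ("B-" if first else "I-") + label
                first = False
    return labels
```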

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 200k and 300k datasets have different label schemas (e.g.
FIRSTNAME vs GIVENNAME1, DOB vs BOD, PHONENUMBER vs TEL). The
benchmark evaluates against 300k, so training should use the same
dataset for consistent text styles and label coverage.

Also use the 300k dataset's character offsets directly instead of
falling back to text.find() for entity positions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ONNX model exports raw logits without the CRF layer. The benchmark
now loads crf_transitions.json and uses Viterbi decoding to enforce
valid BIO sequences, matching what the model learned during training.
This fixes fragmented multi-word entities and invalid B/I transitions.

Also relax early stopping from patience=3/threshold=1% to
patience=5/threshold=0.1% so the model can train longer through
plateaus.
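
A minimal Viterbi decoder over raw logits plus learned transition scores might look like the following sketch (names and array shapes are assumptions; the benchmark's actual loading of `crf_transitions.json` is not shown). It tracks the best-scoring predecessor per label at each step, then backtracks.

```python
import numpy as np

def viterbi_decode(emissions, transitions, start_transitions):
    """Viterbi decoding over per-token label logits.

    emissions: (seq_len, num_labels) raw logits from the model.
    transitions[i][j]: learned score of moving from label i to label j.
    start_transitions[j]: score of starting the sequence with label j.
    Returns the globally highest-scoring label-index sequence.
    """
    seq_len, _ = emissions.shape
    score = start_transitions + emissions[0]
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j], broadcast over all (i, j) pairs
        total = score[:, None] + transitions
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

Unlike per-token argmax, the transition scores let the decoder penalize invalid moves such as `O → I-EMAIL`, which is what repairs the fragmented multi-word entities described above.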

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ONNX model only exports raw logits without the CRF layer. The Go
backend now loads crf_transitions.json and uses Viterbi decoding to
find the globally optimal BIO label sequence, enforcing valid
transitions (e.g. I-EMAIL can only follow B-EMAIL or I-EMAIL).

This fixes fragmented multi-word entities like dates and phone numbers
that were previously split into separate single-word spans because
argmax decoded each token independently. Falls back to argmax if
crf_transitions.json is not present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The model (especially with CRF/Viterbi) may extend entity spans to
include trailing sentence punctuation like commas, periods, or
semicolons (e.g. "April 12, 1988," instead of "April 12, 1988").

Strip trailing ,.;:!? from entity spans in both the Python benchmark
and the Go backend's finalizeEntity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The blanket stripping of trailing .,;:!? broke entities that contain
dots (e.g. emails like "yahoo.com" became "yahoo"). Now only strip
when the punctuation is followed by whitespace or end-of-string,
preserving dots inside emails, URLs, and IP addresses.
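
The lookahead rule described above can be sketched like this (function name and span representation are illustrative): a trailing punctuation character is stripped only if the character after the span is whitespace or the string ends there.

```python
def trim_trailing_punct(text, start, end):
    """Trim trailing ,.;:!? from the span text[start:end], but only
    when the punctuation is followed by whitespace or end-of-string,
    preserving dots inside emails, URLs, and IP addresses."""
    while end > start and text[end - 1] in ",.;:!?":
        if end < len(text) and not text[end].isspace():
            break  # punctuation is interior to a token, e.g. "yahoo.com"
        end -= 1
    return text[start:end], end
```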

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for finalizeEntity punctuation trimming:
- Trailing comma/period stripped when followed by whitespace or EOF
- Dots preserved inside emails (yahoo.com) and URLs (www.example.com)
- Dots preserved when followed by digits (192.168.1.1)
- Leading whitespace from SentencePiece offsets trimmed

Tests for viterbiDecode:
- All-O sequence, single entity B-I sequence
- Start transitions prevent invalid I- at sequence start
- Transition matrix enforces B→I over B→B
- Empty input returns nil

Tests for softmaxConfidence:
- Clear winner returns >0.99
- Uniform logits return ~1/N
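
The confidence behavior those tests describe corresponds to a standard numerically stable softmax; a Python sketch (the Go function under test is not reproduced here):

```python
import math

def softmax_confidence(logits, label_idx):
    """Softmax probability of the chosen label, computed stably
    by subtracting the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[label_idx] / sum(exps)
```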

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_is_punctuation_in_entity used string containment to decide if a
punctuation token belongs to an entity. This caused false positives:
the trailing comma in "1988," matched because "," exists in the entity
value "April 12, 1988". Now uses character offsets to check whether
the punctuation falls within the entity's start/end span.
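
The offset-based check is a simple interval containment test, sketched below with illustrative names:

```python
def punctuation_in_entity(punct_start, punct_end, ent_start, ent_end):
    """True iff the punctuation token's character span lies entirely
    inside the entity's span -- avoids the substring false positives."""
    return punct_start >= ent_start and punct_end <= ent_end
```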

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Slice emissions to numLabels before the init loop so gosec can verify
the index is in range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of separate TP/FN/FP sections, show Expected and Predicted
lists with TP/FN/FP markers so it's easy to compare what the model
returned against the gold annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The first 1MB is mostly metadata/graph structure and doesn't change
between trainings, making different models appear identical.
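
Assuming the comparison previously digested only the file's first 1MB, a fix is to stream the whole file through the hash so weight changes anywhere affect the digest. This is a hedged sketch, not the repository's actual code:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 over the entire file, streamed in 1MB chunks, so two
    models differing only in their weights hash differently."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```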

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Hannes Hapke <hanneshapke@users.noreply.github.com>
@hanneshapke changed the title from "Updated model" to "chore: update base model" on Apr 24, 2026
@hanneshapke merged commit 0b116ef into main on Apr 24, 2026
7 of 8 checks passed