chore: update base model #346

Merged
hanneshapke merged 18 commits into main from hanneshapke/pii-benchmark
Apr 24, 2026

Conversation

@hanneshapke (Collaborator)

No description provided.

hanneshapke and others added 18 commits April 23, 2026 17:04
Add a standalone benchmark that evaluates the ONNX model against
ai4privacy/pii-masking-300k with per-label F1 metrics and latency
reporting. Supports --language filtering and --verbose per-sample output.
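
The per-label F1 metrics mentioned above could be computed roughly as in this sketch. The function name and the `(label, start, end)` entity-tuple shape are assumptions for illustration, not the benchmark's actual API; an exact span-and-label match is counted as a true positive.

```python
from collections import defaultdict

def per_label_f1(gold, pred):
    """Per-label precision/recall/F1 over entity sets.

    gold, pred: iterables of (label, start, end) tuples; an exact
    (label, start, end) match counts as a true positive.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for ent in pred_set:
        if ent in gold_set:
            tp[ent[0]] += 1
        else:
            fp[ent[0]] += 1
    for ent in gold_set - pred_set:
        fn[ent[0]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```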

Fix _create_word_labels to emit proper BIO tags (B- for first word,
I- for continuation) instead of assigning B- to every word in a
multi-word entity. Also trim SentencePiece leading whitespace from
predicted spans in the benchmark to match Go backend behavior.
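
The corrected BIO assignment can be sketched as below; the function name and tuple shapes are illustrative, not the repository's actual signature. The key point is that only the first word overlapping an entity receives the `B-` prefix and later words receive `I-`.

```python
def create_word_labels(words, entities):
    """Assign BIO tags to words using character offsets.

    words: list of (word, start, end) character-offset tuples.
    entities: list of (label, start, end) spans.
    The first word inside an entity gets B-<label>; subsequent
    words inside the same entity get I-<label>.
    """
    labels = ["O"] * len(words)
    for label, ent_start, ent_end in entities:
        first = True
        for i, (_, w_start, w_end) in enumerate(words):
            if w_start >= ent_start and w_end <= ent_end:
                labels[i] = ("B-" if first else "I-") + label
                first = False
    return labels
```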

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 200k and 300k datasets have different label schemas (e.g.
FIRSTNAME vs GIVENNAME1, DOB vs BOD, PHONENUMBER vs TEL). The
benchmark evaluates against 300k, so training should use the same
dataset for consistent text styles and label coverage.

Also use the 300k dataset's character offsets directly instead of
falling back to text.find() for entity positions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ONNX model exports raw logits without the CRF layer. The benchmark
now loads crf_transitions.json and uses Viterbi decoding to enforce
valid BIO sequences, matching what the model learned during training.
This fixes fragmented multi-word entities and invalid B/I transitions.

Also relax early stopping from patience=3/threshold=1% to
patience=5/threshold=0.1% so the model can train longer through
plateaus.
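
A minimal Viterbi decoder over raw logits plus learned transition scores might look like the following sketch (names and array shapes are assumptions; the benchmark's actual loading of `crf_transitions.json` is not shown). It tracks the best-scoring predecessor per label at each step, then backtracks.

```python
import numpy as np

def viterbi_decode(emissions, transitions, start_transitions):
    """Viterbi decoding over per-token label logits.

    emissions: (seq_len, num_labels) raw logits from the model.
    transitions[i][j]: learned score of moving from label i to label j.
    start_transitions[j]: score of starting the sequence with label j.
    Returns the globally highest-scoring label-index sequence.
    """
    seq_len, _ = emissions.shape
    score = start_transitions + emissions[0]
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j], broadcast over all (i, j) pairs
        total = score[:, None] + transitions
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

Unlike per-token argmax, the transition scores let the decoder penalize invalid moves such as `O → I-EMAIL`, which is what repairs the fragmented multi-word entities described above.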

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ONNX model only exports raw logits without the CRF layer. The Go
backend now loads crf_transitions.json and uses Viterbi decoding to
find the globally optimal BIO label sequence, enforcing valid
transitions (e.g. I-EMAIL can only follow B-EMAIL or I-EMAIL).

This fixes fragmented multi-word entities like dates and phone numbers
that were previously split into separate single-word spans because
argmax decoded each token independently. Falls back to argmax if
crf_transitions.json is not present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The model (especially with CRF/Viterbi) may extend entity spans to
include trailing sentence punctuation like commas, periods, or
semicolons (e.g. "April 12, 1988," instead of "April 12, 1988").

Strip trailing ,.;:!? from entity spans in both the Python benchmark
and the Go backend's finalizeEntity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The blanket stripping of trailing .,;:!? broke entities that contain
dots (e.g. emails like "yahoo.com" became "yahoo"). Now only strip
when the punctuation is followed by whitespace or end-of-string,
preserving dots inside emails, URLs, and IP addresses.
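
The lookahead rule described above can be sketched like this (function name and span representation are illustrative): a trailing punctuation character is stripped only if the character after the span is whitespace or the string ends there.

```python
def trim_trailing_punct(text, start, end):
    """Trim trailing ,.;:!? from the span text[start:end], but only
    when the punctuation is followed by whitespace or end-of-string,
    preserving dots inside emails, URLs, and IP addresses."""
    while end > start and text[end - 1] in ",.;:!?":
        if end < len(text) and not text[end].isspace():
            break  # punctuation is interior to a token, e.g. "yahoo.com"
        end -= 1
    return text[start:end], end
```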

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for finalizeEntity punctuation trimming:
- Trailing comma/period stripped when followed by whitespace or EOF
- Dots preserved inside emails (yahoo.com) and URLs (www.example.com)
- Dots preserved when followed by digits (192.168.1.1)
- Leading whitespace from SentencePiece offsets trimmed

Tests for viterbiDecode:
- All-O sequence, single entity B-I sequence
- Start transitions prevent invalid I- at sequence start
- Transition matrix enforces B→I over B→B
- Empty input returns nil

Tests for softmaxConfidence:
- Clear winner returns >0.99
- Uniform logits return ~1/N
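
The confidence behavior those tests describe corresponds to a standard numerically stable softmax; a Python sketch (the Go function under test is not reproduced here):

```python
import math

def softmax_confidence(logits, label_idx):
    """Softmax probability of the chosen label, computed stably
    by subtracting the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[label_idx] / sum(exps)
```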

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_is_punctuation_in_entity used string containment to decide if a
punctuation token belongs to an entity. This caused false positives:
the trailing comma in "1988," matched because "," exists in the entity
value "April 12, 1988". Now uses character offsets to check whether
the punctuation falls within the entity's start/end span.
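
The offset-based check is a simple interval containment test, sketched below with illustrative names:

```python
def punctuation_in_entity(punct_start, punct_end, ent_start, ent_end):
    """True iff the punctuation token's character span lies entirely
    inside the entity's span -- avoids the substring false positives."""
    return punct_start >= ent_start and punct_end <= ent_end
```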

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Slice emissions to numLabels before the init loop so gosec can verify
the index is in range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of separate TP/FN/FP sections, show Expected and Predicted
lists with TP/FN/FP markers so it's easy to compare what the model
returned against the gold annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The first 1MB is mostly metadata/graph structure and doesn't change
between trainings, making different models appear identical.
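
Assuming the comparison previously digested only the file's first 1MB, a fix is to stream the whole file through the hash so weight changes anywhere affect the digest. This is a hedged sketch, not the repository's actual code:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 over the entire file, streamed in 1MB chunks, so two
    models differing only in their weights hash differently."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```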

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Hannes Hapke <hanneshapke@users.noreply.github.com>
@hanneshapke changed the title from "Updated model" to "chore: update base model" on Apr 24, 2026
@hanneshapke merged commit 0b116ef into main on Apr 24, 2026
7 of 8 checks passed