chore: update base model #346
Merged
hanneshapke merged 18 commits into main on Apr 24, 2026
Conversation
Add a standalone benchmark that evaluates the ONNX model against ai4privacy/pii-masking-300k with per-label F1 metrics and latency reporting. Supports --language filtering and --verbose per-sample output.

Fix _create_word_labels to emit proper BIO tags (B- for the first word, I- for continuations) instead of assigning B- to every word in a multi-word entity. Also trim the SentencePiece leading whitespace from predicted spans in the benchmark to match the Go backend's behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
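The BIO fix above can be sketched as follows. This is an illustrative reconstruction, not the actual _create_word_labels implementation: the function name, signature, and span representation here are assumptions; only the B-/I- rule comes from the commit message.

```python
def bio_labels(num_words: int, entity_start: int, entity_len: int, label: str) -> list[str]:
    """Assign B-<label> to the entity's first word, I-<label> to the rest,
    and O elsewhere. The bug being fixed assigned B- to *every* word of a
    multi-word entity, which fragments it into single-word spans."""
    labels = ["O"] * num_words
    for i in range(entity_len):
        prefix = "B-" if i == 0 else "I-"
        labels[entity_start + i] = prefix + label
    return labels
```

For example, a three-word date starting at word 1 in a five-word sentence yields `["O", "B-DATE", "I-DATE", "I-DATE", "O"]`.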
The 200k and 300k datasets have different label schemas (e.g. FIRSTNAME vs GIVENNAME1, DOB vs BOD, PHONENUMBER vs TEL). The benchmark evaluates against 300k, so training should use the same dataset for consistent text styles and label coverage. Also use the 300k dataset's character offsets directly instead of falling back to text.find() for entity positions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
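If the two schemas ever had to coexist, a remap table would be needed. The label pairs below come from the commit message; the mapping direction, dict name, and helper are hypothetical illustration only.

```python
# Hypothetical 200k -> 300k label remap (pairs named in the commit;
# everything else about this snippet is assumed).
LABEL_200K_TO_300K = {
    "FIRSTNAME": "GIVENNAME1",
    "DOB": "BOD",
    "PHONENUMBER": "TEL",
}

def remap_label(label: str) -> str:
    """Translate a 200k-schema label to its 300k equivalent, passing
    through labels that are already shared between the schemas."""
    return LABEL_200K_TO_300K.get(label, label)
```

Training directly on the 300k dataset, as the commit does, sidesteps this remapping entirely.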
The ONNX model exports raw logits without the CRF layer. The benchmark now loads crf_transitions.json and uses Viterbi decoding to enforce valid BIO sequences, matching what the model learned during training. This fixes fragmented multi-word entities and invalid B/I transitions. Also relax early stopping from patience=3/threshold=1% to patience=5/threshold=0.1% so the model can train longer through plateaus. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ONNX model only exports raw logits without the CRF layer. The Go backend now loads crf_transitions.json and uses Viterbi decoding to find the globally optimal BIO label sequence, enforcing valid transitions (e.g. I-EMAIL can only follow B-EMAIL or I-EMAIL). This fixes fragmented multi-word entities like dates and phone numbers that were previously split into separate single-word spans because argmax decoded each token independently. Falls back to argmax if crf_transitions.json is not present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
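The Viterbi decoding described in the two commits above can be sketched like this. It is a generic max-score Viterbi over emission logits plus learned transition scores, not the project's actual viterbiDecode; the argument names and the use of NumPy are assumptions.

```python
import numpy as np

def viterbi_decode(emissions, transitions, start_transitions):
    """Return the highest-scoring label index sequence.

    emissions:         (seq_len, num_labels) raw logits per token
    transitions:       (num_labels, num_labels) score of label i -> label j
    start_transitions: (num_labels,) score of starting with each label

    Large negative transition scores make invalid BIO moves (e.g. O -> I-EMAIL,
    or I- at sequence start) effectively impossible, which is what repairs the
    fragmented multi-word entities that per-token argmax produced.
    """
    seq_len, num_labels = emissions.shape
    if seq_len == 0:
        return []
    # score[j] = best score of any path ending in label j at the current token
    score = start_transitions + emissions[0]
    backpointers = []
    for t in range(1, seq_len):
        # rows: previous label i, cols: next label j
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(np.argmax(total, axis=0))
        score = np.max(total, axis=0)
    # trace back from the best final label
    best = [int(np.argmax(score))]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]
```

With labels (O, B, I), forbidding a start in I and rewarding B→I steers the decoder to B, I even when the raw logits prefer I, I.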
The model (especially with CRF/Viterbi) may extend entity spans to include trailing sentence punctuation like commas, periods, or semicolons (e.g. "April 12, 1988," instead of "April 12, 1988"). Strip trailing ,.;:!? from entity spans in both the Python benchmark and the Go backend's finalizeEntity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The blanket stripping of trailing .,;:!? broke entities that contain dots (e.g. emails like "yahoo.com" became "yahoo"). Now only strip when the punctuation is followed by whitespace or end-of-string, preserving dots inside emails, URLs, and IP addresses. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
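The whitespace-or-end-of-string rule from the last two commits can be sketched as below. The function name and signature are illustrative assumptions; the character set and the look-ahead rule come from the commit messages.

```python
TRAILING_PUNCT = ",.;:!?"

def strip_trailing_punct(span: str, text: str, end: int) -> str:
    """Strip trailing punctuation from an entity span, but only when the
    punctuation is followed by whitespace or end-of-string in the source
    text. `end` is the span's exclusive end offset in `text`. Dots inside
    emails, URLs, and IPs are followed by non-whitespace, so they survive."""
    while span and span[-1] in TRAILING_PUNCT:
        if end < len(text) and not text[end].isspace():
            break  # punctuation is mid-token (e.g. the dot in "yahoo.com")
        span = span[:-1]
        end -= 1
    return span
```

So "April 12, 1988," followed by a space loses its trailing comma but keeps the internal one, while "x@yahoo." followed by "com" keeps its dot.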
Tests for finalizeEntity punctuation trimming:
- Trailing comma/period stripped when followed by whitespace or EOF
- Dots preserved inside emails (yahoo.com) and URLs (www.example.com)
- Dots preserved when followed by digits (192.168.1.1)
- Leading whitespace from SentencePiece offsets trimmed

Tests for viterbiDecode:
- All-O sequence; single-entity B-I sequence
- Start transitions prevent an invalid I- label at sequence start
- Transition matrix enforces B→I over B→B
- Empty input returns nil

Tests for softmaxConfidence:
- A clear winner returns >0.99
- Uniform logits return ~1/N

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
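The softmaxConfidence behavior the tests describe (clear winner >0.99, uniform logits ≈ 1/N) amounts to a standard softmax probability of the chosen label. A minimal Python sketch, with the function name borrowed from the Go test list but everything else assumed:

```python
import math

def softmax_confidence(logits: list[float], label_idx: int) -> float:
    """Softmax probability of the chosen label, used as a confidence score.
    Subtracting the max logit first keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[label_idx] / sum(exps)
```

A winning logit of 10 against zeros gives a confidence above 0.99; four equal logits give exactly 0.25.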
_is_punctuation_in_entity used string containment to decide if a punctuation token belongs to an entity. This caused false positives: the trailing comma in "1988," matched because "," exists in the entity value "April 12, 1988". Now uses character offsets to check whether the punctuation falls within the entity's start/end span. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
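The offset-based check is a simple interval containment test. This sketch uses an assumed signature (the real _is_punctuation_in_entity may take token/entity objects rather than raw offsets):

```python
def is_punctuation_in_entity(punct_start: int, punct_end: int,
                             entity_start: int, entity_end: int) -> bool:
    """True iff the punctuation token's character span lies inside the
    entity's span. Unlike substring containment, a trailing comma just
    past the entity's end can no longer match a comma *inside* the value."""
    return entity_start <= punct_start and punct_end <= entity_end
```

For "born April 12, 1988, in Paris" with the entity at offsets [5, 19), the internal comma at [13, 14) is inside, while the trailing comma at [19, 20) is not.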
Slice emissions to numLabels before the init loop so gosec can verify the index is in range. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of separate TP/FN/FP sections, show Expected and Predicted lists with TP/FN/FP markers so it's easy to compare what the model returned against the gold annotations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The first 1MB is mostly metadata/graph structure and doesn't change between trainings, making different models appear identical. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
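Since the first 1 MB of the exported file is static metadata, a digest must cover the whole file to distinguish retrained models. A hedged sketch of full-file hashing (the function name and use of SHA-256 are assumptions, not the project's actual code):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the entire file in 1 MB chunks. Hashing only the first 1 MB
    would miss weight changes deeper in the ONNX file and make different
    trainings appear identical, as the commit notes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Two files that differ only after the first megabyte now produce different digests.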
Signed-off-by: Hannes Hapke <hanneshapke@users.noreply.github.com>