feat(model): add CRF layer for valid BIO sequence decoding #285

Merged: hanneshapke merged 6 commits into main from feat/crf-layer-bio-decoding on Mar 31, 2026

@hanneshapke
Collaborator

Summary

  • Add pytorch-crf dependency and CRF layer on top of PII token classifier
  • CRF learns transition constraints between BIO labels during training
  • Add Viterbi decode() method for optimal sequence prediction at inference
  • Export CRF transition matrix as JSON alongside ONNX model for Go-side decoding
  • Update trainer to use CRF negative log-likelihood loss for PII task
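To illustrate what decoding from the exported transitions involves, here is a minimal pure-Python Viterbi sketch. The JSON keys (`labels`, `transitions`, `start_transitions`, `end_transitions`), the five-label subset, and the hard-coded scores are assumptions made for this example; the actual export format and the Go decoder in this PR may differ.

```python
import json

def viterbi_decode(emissions, transitions, start, end):
    """Return the highest-scoring tag index path for one sequence.

    emissions:   per-token score lists, shape [seq_len][num_tags]
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    start, end:  per-tag scores for starting / ending the sequence
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current token
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers = []
    for emit in emissions[1:]:
        next_score, pointers = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = next_score
        backpointers.append(pointers)
    # Pick the best final tag, then follow the backpointers to the first token.
    last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical export: a trained CRF would supply learned scores instead.
labels = ["O", "B-EMAIL", "I-EMAIL", "B-PHONENUMBER", "I-PHONENUMBER"]
BLOCKED = -1e4  # stand-in for a strongly penalized transition

def step_score(prev, cur):
    # I-X may only follow B-X or I-X; everything else is allowed.
    if cur.startswith("I-") and prev not in ("B-" + cur[2:], "I-" + cur[2:]):
        return BLOCKED
    return 0.0

exported = json.dumps({
    "labels": labels,
    "transitions": [[step_score(p, c) for c in labels] for p in labels],
    "start_transitions": [BLOCKED if l.startswith("I-") else 0.0 for l in labels],
    "end_transitions": [0.0] * len(labels),
})

crf = json.loads(exported)
# Per-token argmax over these emissions would give B-PHONENUMBER, I-EMAIL.
emissions = [
    [0.1, 0.2, 0.0, 2.0, 0.0],
    [0.0, 0.1, 1.5, 0.0, 1.2],
]
path = viterbi_decode(emissions, crf["transitions"],
                      crf["start_transitions"], crf["end_transitions"])
decoded = [crf["labels"][i] for i in path]
print(decoded)
```

In this toy input the second token's highest emission is I-EMAIL, but the transition penalty steers the decoder to I-PHONENUMBER, which is the behavior the CRF is meant to provide.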

Motivation

Independent per-token classification allows illegal BIO sequences (e.g., I-EMAIL after B-PHONENUMBER). With 24 entity types and synthetic training data, these edge cases leak PII. The CRF enforces valid sequences globally and typically improves entity-level F1 by 1-3 points on BERT-based NER.
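As a concrete reading of "illegal BIO sequence", the sketch below flags every I- tag that does not continue an entity of the same type. The function name and the rule that a sequence start behaves like O are assumptions for illustration, not code from this PR.

```python
def invalid_bio_transitions(tags):
    """Return (index, previous_tag, tag) for every illegal BIO step."""
    problems = []
    previous = "O"  # assumption: the sequence start behaves like coming from O
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            # I-X is only legal directly after B-X or I-X.
            if previous not in ("B-" + entity, "I-" + entity):
                problems.append((index, previous, tag))
        previous = tag
    return problems

# The example from the motivation: I-EMAIL directly after B-PHONENUMBER.
print(invalid_bio_transitions(["B-PHONENUMBER", "I-EMAIL", "O"]))
```

A check like this could also serve the "no invalid BIO sequences" item in the test plan by asserting that the returned list is empty for every decoded prediction.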

Test plan

  • Train model and compare entity-level F1 against baseline without CRF
  • Verify no invalid BIO sequences in predictions after Viterbi decoding
  • Validate ONNX export still works (CRF is not exported; transitions are separate JSON)
  • Benchmark inference latency overhead from Viterbi decoding
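For the F1 comparison, one possible way to compute entity-level F1 is to match exact (type, start, end) spans, as sketched below. The helper names are hypothetical, and the choice to ignore stray I- tags that open no entity is an assumption; the project's evaluation code may apply a different policy.

```python
def extract_entities(tags):
    """Collect (entity_type, start, end) spans from BIO tags; end is exclusive."""
    spans, start, current = set(), None, None
    for index, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = current is not None and tag == "I-" + current
        if not continues:
            if current is not None:
                spans.add((current, start, index))
            if tag.startswith("B-"):
                start, current = index, tag[2:]
            else:
                start, current = None, None  # O or a stray I- tag opens no entity
    return spans

def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1 over exact span matches across all sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [["B-EMAIL", "I-EMAIL", "O", "B-PHONENUMBER"]]
pred = [["B-EMAIL", "I-EMAIL", "O", "O"]]
f1 = entity_f1(gold, pred)
print(round(f1, 3))
```

Running the same function on baseline and CRF predictions against one gold set keeps the comparison consistent, since both are scored under the same span-matching rule.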

Closes #256

Add a Conditional Random Field layer on top of the PII token classifier
to learn transition constraints between BIO labels and enforce valid
sequences during inference via Viterbi decoding. The CRF transition
matrix is exported alongside the ONNX model for Go-side decoding.

Closes #256
# Conflicts:
#	model/src/model.py
#	model/src/quantitize.py
#	model/src/trainer.py
#	pyproject.toml
@hanneshapke merged commit d0c26a8 into main on Mar 31, 2026
6 checks passed
@hanneshapke deleted the feat/crf-layer-bio-decoding branch on March 31, 2026 19:01

Development

Successfully merging this pull request may close these issues.

feat(model): add CRF layer on top of token classifier for valid BIO sequence decoding
