feat(model): add CRF layer for valid BIO sequence decoding by hanneshapke · Pull Request #285 · dataiku/kiji-proxy

hanneshapke · 2026-03-29T23:45:44Z

Summary

Add pytorch-crf dependency and CRF layer on top of PII token classifier
CRF learns transition constraints between BIO labels during training
Add Viterbi decode() method for optimal sequence prediction at inference
Export CRF transition matrix as JSON alongside ONNX model for Go-side decoding
Update trainer to use CRF negative log-likelihood loss for PII task
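The decoding step that a non-Python consumer would run on the exported transitions can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the JSON keys (`labels`, `start_transitions`, `end_transitions`, `transitions`) are a hypothetical layout, the label set is trimmed to two entity types, and the transition scores are hand-written to block illegal moves, whereas a trained CRF would supply learned values.

```python
import json

LABELS = ["O", "B-EMAIL", "I-EMAIL", "B-PHONENUMBER", "I-PHONENUMBER"]
BLOCK = -1e4  # large negative score standing in for a "forbidden" transition


def allowed(prev, cur):
    # In BIO tagging, I-X may only follow B-X or I-X.
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def viterbi_decode(emissions, transitions, start, end):
    """Return the highest-scoring label index path.

    emissions:   per-token score lists, shape [seq_len][num_labels]
    transitions: transitions[i][j] scores a move from label i to label j
    start, end:  per-label scores for starting / ending a sequence
    """
    if not emissions:
        return []
    n = len(start)
    # score[j] = best score of any path that ends in label j at the current token
    score = [start[j] + emissions[0][j] for j in range(n)]
    backpointers = []
    for emit in emissions[1:]:
        next_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = next_score
        backpointers.append(pointers)
    # Pick the best final label, then walk the backpointers to recover the path.
    last = max(range(n), key=lambda j: score[j] + end[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical JSON layout; the real export schema is not shown in the PR.
exported = json.dumps({
    "labels": LABELS,
    "start_transitions": [BLOCK if l.startswith("I-") else 0.0 for l in LABELS],
    "end_transitions": [0.0] * len(LABELS),
    "transitions": [[0.0 if allowed(p, c) else BLOCK for c in LABELS] for p in LABELS],
})
crf = json.loads(exported)

# Two tokens: the second one slightly prefers I-EMAIL when scored on its own.
emissions = [
    [0.1, 0.2, 0.0, 2.0, 0.0],
    [0.0, 0.0, 1.5, 0.0, 1.2],
]
greedy = [LABELS[max(range(len(LABELS)), key=row.__getitem__)] for row in emissions]
path = viterbi_decode(
    emissions, crf["transitions"], crf["start_transitions"], crf["end_transitions"]
)
decoded = [crf["labels"][i] for i in path]
```

Here per-token argmax (`greedy`) picks I-EMAIL for the second token, while `decoded` stays on I-PHONENUMBER because the transition score outweighs the small emission gap.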

Motivation

Independent per-token classification allows illegal BIO sequences (e.g., I-EMAIL directly after B-PHONENUMBER). With 24 entity types and synthetic training data, these edge cases leak PII. The CRF scores whole label sequences rather than individual tokens, so it enforces valid sequences globally, and it typically improves entity-level F1 by 1-3 points on BERT-based NER.
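To make "illegal BIO sequence" concrete, here is a small checker, assuming the usual BIO convention that an I-X label may only follow B-X or I-X. The function name is hypothetical and not taken from the repository.

```python
def invalid_bio_transitions(labels):
    """Return (index, previous, current) for every illegal step in a BIO sequence."""
    errors = []
    prev = "O"  # the start of a sequence behaves like a preceding O
    for i, cur in enumerate(labels):
        if cur.startswith("I-"):
            entity = cur[2:]
            # I-X is only legal directly after B-X or I-X.
            if prev not in ("B-" + entity, "I-" + entity):
                errors.append((i, prev, cur))
        prev = cur
    return errors


bad = invalid_bio_transitions(["O", "B-PHONENUMBER", "I-EMAIL", "O"])
good = invalid_bio_transitions(["B-EMAIL", "I-EMAIL", "O"])
```

`bad` flags the I-EMAIL at index 2, while `good` is empty.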

Test plan

Train model and compare entity-level F1 against baseline without CRF
Verify no invalid BIO sequences in predictions after Viterbi decoding
Validate ONNX export still works (CRF is not exported; transitions are separate JSON)
Benchmark inference latency overhead from Viterbi decoding
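The entity-level F1 comparison in the first test-plan item could be computed along these lines. This is a sketch under stated assumptions: an entity counts as correct only when type and both boundaries match exactly, and a stray I- label with no matching B- is ignored. Evaluation libraries may count spans differently, and the function names here are hypothetical.

```python
def extract_entities(labels):
    """Collect (entity_type, start, end_exclusive) spans from a BIO label sequence."""
    spans, start, etype = [], None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        continues = label.startswith("I-") and label[2:] == etype
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return set(spans)


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged precision, recall and F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-EMAIL", "I-EMAIL", "O", "B-PHONENUMBER", "I-PHONENUMBER"]]
# A per-token prediction containing the illegal I-EMAIL after B-PHONENUMBER.
baseline_pred = [["B-EMAIL", "I-EMAIL", "O", "B-PHONENUMBER", "I-EMAIL"]]
baseline_scores = entity_f1(gold, baseline_pred)
perfect_scores = entity_f1(gold, gold)
```

In this toy case the illegal label truncates the phone-number span, so the baseline gets one of two entities right; the data is illustrative only and says nothing about the actual model results.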

Closes #256

Add a Conditional Random Field layer on top of the PII token classifier to learn transition constraints between BIO labels and enforce valid sequences during inference via Viterbi decoding. The CRF transition matrix is exported alongside the ONNX model for Go-side decoding. Closes #256

# Conflicts:
#   model/src/model.py
#   model/src/quantitize.py
#   model/src/trainer.py
#   pyproject.toml

hanneshapke added 6 commits March 29, 2026 16:44

linter fix

efe1066

linter fix

5ff9a05

fix

be09af4

updated model

2161906

Merge branch 'refs/heads/main' into feat/crf-layer-bio-decoding

b333520

# Conflicts:
#   model/src/model.py
#   model/src/quantitize.py
#   model/src/trainer.py
#   pyproject.toml

hanneshapke merged commit d0c26a8 into main Mar 31, 2026
6 checks passed

hanneshapke deleted the feat/crf-layer-bio-decoding branch March 31, 2026 19:01
