fix(tokenization): replace substring matching with character offsets from annotations #277

Merged
hanneshapke merged 4 commits into main from fix/tokenization-character-offsets
Mar 31, 2026
Conversation

@hanneshapke hanneshapke commented Mar 29, 2026

Summary

  • Preserve character offsets from Label Studio annotations through the preprocessing pipeline (preprocessing.py) — start/end fields are now included in privacy_mask entries instead of being discarded
  • Use offsets directly in _find_privacy_mask_positions (tokenization.py) instead of re-discovering positions via text.find() substring search
  • Validate offset integrity by checking text[start:end] == value and logging a warning on mismatch (entry is skipped to avoid corrupt labels)
  • Error on missing offsets — if a privacy mask item lacks start/end, a ValueError is raised instead of silently falling back to fragile substring matching
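The behavior described in these bullets can be sketched roughly as follows. This is a minimal illustration, not the code from `tokenization.py`: the function name `positions_from_offsets`, the `(start, end, label)` return shape, and the `label` key are assumptions made for the example. Only the `start`/`end`/`value` fields, the `text[start:end] == value` check, the warning on mismatch, and the `ValueError` on missing offsets come from the description above.

```python
import logging

logger = logging.getLogger(__name__)


def positions_from_offsets(text, privacy_mask):
    """Return (start, end, label) spans taken directly from annotation offsets.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    positions = []
    for item in privacy_mask:
        # Missing offsets are a hard error; there is no fallback to text.find().
        if "start" not in item or "end" not in item:
            raise ValueError(f"privacy mask item lacks start/end: {item!r}")

        start, end = item["start"], item["end"]

        # Skip entries whose offsets no longer point at the annotated value,
        # so a stale or corrupted annotation cannot produce wrong labels.
        if text[start:end] != item["value"]:
            logger.warning(
                "Offset mismatch: text[%d:%d]=%r but value=%r; skipping entry",
                start, end, text[start:end], item["value"],
            )
            continue

        positions.append((start, end, item["label"]))
    return positions
```

With the text `"John met John in Alexandria."` and a single annotation at offsets 9 to 13, only the second "John" is returned, regardless of how many times the same string occurs elsewhere.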

Problem

The previous text.find(value) approach caused three classes of silent data quality bugs:

  1. Partial matches: "Alex" matched inside "Alexandria", "Park" inside "Parking" — non-PII words got labeled as entities
  2. Duplicate over-matching: If "John" appeared twice but only one was annotated, both got labeled — injecting noise into training data
  3. Overlapping collisions: Short entity values (e.g. "Dan") matched inside longer ones (e.g. "DanTheMan"), creating conflicting annotations

These bugs silently degraded training signal quality across the entire dataset.
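The three failure modes can be reproduced with plain string search. The snippet below is only an illustration of how substring matching behaves on small made-up sentences; `find_all` is a hypothetical helper written for this example and is not the removed code.

```python
def find_all(text, value):
    """Return the start index of every occurrence of value in text."""
    starts = []
    idx = text.find(value)
    while idx != -1:
        starts.append(idx)
        idx = text.find(value, idx + 1)
    return starts


# 1. Partial match: the first hit for "Alex" is inside "Alexandria",
#    not the standalone name that was actually annotated at offset 20.
partial = "Alexandria is where Alex lives."
first_hit = partial.find("Alex")

# 2. Duplicate over-matching: both occurrences are found even if only
#    the second one (offset 9) was annotated.
duplicate_hits = find_all("John met John", "John")

# 3. Overlapping collision: "Dan" is also found inside "DanTheMan",
#    which may carry its own, different annotation.
overlap_hits = find_all("DanTheMan met Dan", "Dan")

print(first_hit, duplicate_hits, overlap_hits)
```

In each case the string search cannot tell which occurrence the annotator marked; the annotation offsets can.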

Changes

| File | Change |
| --- | --- |
| model/src/preprocessing.py | Include start and end in both privacy_mask.append() calls (coreference entities and standalone entities) |
| model/dataset/tokenization.py | Rewrite _find_privacy_mask_positions to use annotation offsets directly, with offset validation and ValueError on missing offsets |
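As a rough sketch of the preprocessing side, the snippet below carries `start` and `end` through when building `privacy_mask` entries. It assumes the common Label Studio span-annotation shape, where each result has a `value` dict with `start`, `end`, `text`, and `labels`. The function name `build_privacy_mask` and the output key names other than `start`/`end` are assumptions for illustration, and the coreference handling mentioned in the table is not modeled here.

```python
def build_privacy_mask(results):
    """Build privacy_mask entries that keep the annotation's character offsets.

    Hypothetical helper; assumes Label Studio style span results.
    """
    privacy_mask = []
    for result in results:
        value = result.get("value", {})
        labels = value.get("labels") or []
        # Ignore results that are not labeled text spans.
        if not labels or "start" not in value or "end" not in value:
            continue
        privacy_mask.append(
            {
                "value": value.get("text", ""),
                "label": labels[0],
                # Previously discarded; now preserved for tokenization.
                "start": value["start"],
                "end": value["end"],
            }
        )
    return privacy_mask


sample_results = [
    {"type": "labels",
     "value": {"start": 9, "end": 13, "text": "John", "labels": ["FIRSTNAME"]}},
    {"type": "choices", "value": {"choices": ["reviewed"]}},
]
mask = build_privacy_mask(sample_results)
```

Keeping the offsets at this stage is what allows the tokenization step to avoid searching the text again.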

Test plan

  • Run training pipeline on existing Label Studio dataset and verify no ValueError is raised (confirms all annotations have offsets)
  • Compare token-level labels before and after on a sample with duplicate PII values (e.g. same first name appearing twice) to confirm only annotated instances are labeled
  • Verify offset mismatch warning fires correctly by testing with a synthetically corrupted annotation

Closes #264

🤖 Generated with Claude Code

hanneshapke and others added 4 commits March 28, 2026 20:22
…from annotations

The `_find_privacy_mask_positions` method used `text.find(value)` to
re-discover entity positions by substring matching. This caused silent
data quality bugs: partial matches (e.g. "Alex" inside "Alexandria"),
duplicate over-matching, and overlapping entity collisions.

The original character offsets from Label Studio annotations were
available but discarded during preprocessing. This commit preserves
them through the pipeline and uses them directly in tokenization.

Closes #264

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* deps(deps-dev): bump electron from 40.1.0 to 40.6.1

Bumps [electron](https://github.com/electron/electron) from 40.1.0 to 40.6.1.
- [Release notes](https://github.com/electron/electron/releases)
- [Commits](electron/electron@v40.1.0...v40.6.1)

---
updated-dependencies:
- dependency-name: electron
  dependency-version: 40.6.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps-dev): bump @typescript-eslint/parser from 8.56.0 to 8.56.1

Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 8.56.0 to 8.56.1.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.56.1/packages/parser)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
  dependency-version: 8.56.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps-dev): bump @typescript-eslint/eslint-plugin

Bumps [@typescript-eslint/eslint-plugin](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/eslint-plugin) from 8.56.0 to 8.56.1.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/eslint-plugin/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.56.1/packages/eslint-plugin)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/eslint-plugin"
  dependency-version: 8.56.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps): bump react and @types/react

Bumps [react](https://github.com/facebook/react/tree/HEAD/packages/react) and [@types/react](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react). These dependencies needed to be updated together.

Updates `react` from 18.3.1 to 19.2.4
- [Release notes](https://github.com/facebook/react/releases)
- [Changelog](https://github.com/facebook/react/blob/main/CHANGELOG.md)
- [Commits](https://github.com/facebook/react/commits/v19.2.4/packages/react)

Updates `@types/react` from 18.3.27 to 19.2.14
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react)

---
updated-dependencies:
- dependency-name: react
  dependency-version: 19.2.4
  dependency-type: direct:production
  update-type: version-update:semver-major
- dependency-name: "@types/react"
  dependency-version: 19.2.14
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps): bump @sentry/electron from 7.8.0 to 7.9.0

Bumps [@sentry/electron](https://github.com/getsentry/sentry-electron) from 7.8.0 to 7.9.0.
- [Release notes](https://github.com/getsentry/sentry-electron/releases)
- [Changelog](https://github.com/getsentry/sentry-electron/blob/master/CHANGELOG.md)
- [Commits](getsentry/sentry-electron@7.8.0...7.9.0)

---
updated-dependencies:
- dependency-name: "@sentry/electron"
  dependency-version: 7.9.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps): bump lucide-react

Bumps the production-dependencies group in /src/frontend with 1 update: [lucide-react](https://github.com/lucide-icons/lucide/tree/HEAD/packages/lucide-react).


Updates `lucide-react` from 0.574.0 to 0.576.0
- [Release notes](https://github.com/lucide-icons/lucide/releases)
- [Commits](https://github.com/lucide-icons/lucide/commits/0.576.0/packages/lucide-react)

---
updated-dependencies:
- dependency-name: lucide-react
  dependency-version: 0.576.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: production-dependencies
...

Signed-off-by: dependabot[bot] <support@github.com>

* ci(deps): bump the github-actions group across 1 directory with 5 updates

Bumps the github-actions group with 5 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) | `7.3.0` | `7.5.0` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `6` | `7` |
| [actions/download-artifact](https://github.com/actions/download-artifact) | `6` | `8` |
| [amannn/action-semantic-pull-request](https://github.com/amannn/action-semantic-pull-request) | `5` | `6` |
| [marocchino/sticky-pull-request-comment](https://github.com/marocchino/sticky-pull-request-comment) | `2.9.4` | `3.0.2` |



Updates `astral-sh/setup-uv` from 7.3.0 to 7.5.0
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](astral-sh/setup-uv@eac588a...e06108d)

Updates `actions/upload-artifact` from 6 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v6...v7)

Updates `actions/download-artifact` from 6 to 8
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@v6...v8)

Updates `amannn/action-semantic-pull-request` from 5 to 6
- [Release notes](https://github.com/amannn/action-semantic-pull-request/releases)
- [Changelog](https://github.com/amannn/action-semantic-pull-request/blob/main/CHANGELOG.md)
- [Commits](amannn/action-semantic-pull-request@e32d7e6...48f2562)

Updates `marocchino/sticky-pull-request-comment` from 2.9.4 to 3.0.2
- [Release notes](https://github.com/marocchino/sticky-pull-request-comment/releases)
- [Commits](marocchino/sticky-pull-request-comment@7737449...70d2764)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: 7.5.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: amannn/action-semantic-pull-request
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: marocchino/sticky-pull-request-comment
  dependency-version: 3.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps(deps): bump the go-dependencies group across 1 directory with 5 updates

Bumps the go-dependencies group with 5 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [github.com/daulet/tokenizers](https://github.com/daulet/tokenizers) | `1.25.0` | `1.26.0` |
| [github.com/yalue/onnxruntime_go](https://github.com/yalue/onnxruntime_go) | `1.26.0` | `1.27.0` |
| [modernc.org/sqlite](https://gitlab.com/cznic/sqlite) | `1.46.1` | `1.47.0` |
| [golang.org/x/time](https://github.com/golang/time) | `0.14.0` | `0.15.0` |
| [github.com/getsentry/sentry-go](https://github.com/getsentry/sentry-go) | `0.42.0` | `0.43.0` |



Updates `github.com/daulet/tokenizers` from 1.25.0 to 1.26.0
- [Release notes](https://github.com/daulet/tokenizers/releases)
- [Commits](daulet/tokenizers@v1.25.0...v1.26.0)

Updates `github.com/yalue/onnxruntime_go` from 1.26.0 to 1.27.0
- [Commits](yalue/onnxruntime_go@v1.26.0...v1.27.0)

Updates `modernc.org/sqlite` from 1.46.1 to 1.47.0
- [Changelog](https://gitlab.com/cznic/sqlite/blob/master/CHANGELOG.md)
- [Commits](https://gitlab.com/cznic/sqlite/compare/v1.46.1...v1.47.0)

Updates `golang.org/x/time` from 0.14.0 to 0.15.0
- [Commits](golang/time@v0.14.0...v0.15.0)

Updates `github.com/getsentry/sentry-go` from 0.42.0 to 0.43.0
- [Release notes](https://github.com/getsentry/sentry-go/releases)
- [Changelog](https://github.com/getsentry/sentry-go/blob/master/CHANGELOG.md)
- [Commits](getsentry/sentry-go@v0.42.0...v0.43.0)

---
updated-dependencies:
- dependency-name: github.com/daulet/tokenizers
  dependency-version: 1.26.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/yalue/onnxruntime_go
  dependency-version: 1.27.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: modernc.org/sqlite
  dependency-version: 1.47.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: golang.org/x/time
  dependency-version: 0.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/getsentry/sentry-go
  dependency-version: 0.43.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: bump react-dom and @types/react-dom to v19 to match react v19 upgrade

* ci: update tokenizers library to v1.26.0 to match Go dependency bump

* fix: classify nested package.json files as chore in PR scope check

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@hanneshapke hanneshapke merged commit 28f1fc7 into main Mar 31, 2026
6 checks passed
@hanneshapke hanneshapke deleted the fix/tokenization-character-offsets branch March 31, 2026 19:11
Development

Successfully merging this pull request may close these issues.

fix(tokenization): replace fragile substring matching with character offsets from annotations