fix(tokenization): replace substring matching with character offsets from annotations #277
Merged
hanneshapke merged 4 commits into main on Mar 31, 2026
…from annotations

The `_find_privacy_mask_positions` method used `text.find(value)` to re-discover entity positions by substring matching. This caused silent data quality bugs: partial matches (e.g. "Alex" inside "Alexandria"), duplicate over-matching, and overlapping entity collisions.

The original character offsets from Label Studio annotations were available but discarded during preprocessing. This commit preserves them through the pipeline and uses them directly in tokenization.

Closes #264

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
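As a rough illustration of "preserving offsets through the pipeline", the sketch below converts span annotations into `privacy_mask` entries that keep `start` and `end`. The function name `build_privacy_mask`, the `label` key, and the annotation shape (a `value` dict with `start`, `end`, `text`, `labels`) are assumptions made for this example. The actual code in `preprocessing.py` is not shown in this PR page and also handles coreference entities.

```python
def build_privacy_mask(annotations):
    """Turn span annotations into privacy_mask entries that keep offsets.

    Hypothetical helper: the annotation shape used here is an assumption.
    """
    privacy_mask = []
    for result in annotations:
        value = result["value"]
        privacy_mask.append(
            {
                "value": value["text"],
                "label": value["labels"][0],
                # Keep the annotator's character offsets instead of dropping them.
                "start": value["start"],
                "end": value["end"],
            }
        )
    return privacy_mask


text = "Alex lives in Alexandria."
annotations = [
    {"value": {"start": 0, "end": 4, "text": "Alex", "labels": ["FIRSTNAME"]}},
    {"value": {"start": 14, "end": 24, "text": "Alexandria", "labels": ["CITY"]}},
]
mask = build_privacy_mask(annotations)
for entry in mask:
    print(entry["label"], entry["start"], entry["end"], text[entry["start"]:entry["end"]])
```

With the offsets carried along, later stages never have to search the text for the entity value again.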
* deps(deps-dev): bump electron from 40.1.0 to 40.6.1 Bumps [electron](https://github.com/electron/electron) from 40.1.0 to 40.6.1. - [Release notes](https://github.com/electron/electron/releases) - [Commits](electron/electron@v40.1.0...v40.6.1) --- updated-dependencies: - dependency-name: electron dependency-version: 40.6.1 dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * deps(deps-dev): bump @typescript-eslint/parser from 8.56.0 to 8.56.1 Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 8.56.0 to 8.56.1. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.56.1/packages/parser) --- updated-dependencies: - dependency-name: "@typescript-eslint/parser" dependency-version: 8.56.1 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps(deps-dev): bump @typescript-eslint/eslint-plugin Bumps [@typescript-eslint/eslint-plugin](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/eslint-plugin) from 8.56.0 to 8.56.1. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/eslint-plugin/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.56.1/packages/eslint-plugin) --- updated-dependencies: - dependency-name: "@typescript-eslint/eslint-plugin" dependency-version: 8.56.1 dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> * deps(deps): bump react and @types/react Bumps [react](https://github.com/facebook/react/tree/HEAD/packages/react) and [@types/react](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react). These dependencies needed to be updated together. Updates `react` from 18.3.1 to 19.2.4 - [Release notes](https://github.com/facebook/react/releases) - [Changelog](https://github.com/facebook/react/blob/main/CHANGELOG.md) - [Commits](https://github.com/facebook/react/commits/v19.2.4/packages/react) Updates `@types/react` from 18.3.27 to 19.2.14 - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react) --- updated-dependencies: - dependency-name: react dependency-version: 19.2.4 dependency-type: direct:production update-type: version-update:semver-major - dependency-name: "@types/react" dependency-version: 19.2.14 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * deps(deps): bump @sentry/electron from 7.8.0 to 7.9.0 Bumps [@sentry/electron](https://github.com/getsentry/sentry-electron) from 7.8.0 to 7.9.0. - [Release notes](https://github.com/getsentry/sentry-electron/releases) - [Changelog](https://github.com/getsentry/sentry-electron/blob/master/CHANGELOG.md) - [Commits](getsentry/sentry-electron@7.8.0...7.9.0) --- updated-dependencies: - dependency-name: "@sentry/electron" dependency-version: 7.9.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * deps(deps): bump lucide-react Bumps the production-dependencies group in /src/frontend with 1 update: [lucide-react](https://github.com/lucide-icons/lucide/tree/HEAD/packages/lucide-react). 
Updates `lucide-react` from 0.574.0 to 0.576.0 - [Release notes](https://github.com/lucide-icons/lucide/releases) - [Commits](https://github.com/lucide-icons/lucide/commits/0.576.0/packages/lucide-react) --- updated-dependencies: - dependency-name: lucide-react dependency-version: 0.576.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: production-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> * ci(deps): bump the github-actions group across 1 directory with 5 updates Bumps the github-actions group with 5 updates in the / directory: | Package | From | To | | --- | --- | --- | | [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) | `7.3.0` | `7.5.0` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `6` | `7` | | [actions/download-artifact](https://github.com/actions/download-artifact) | `6` | `8` | | [amannn/action-semantic-pull-request](https://github.com/amannn/action-semantic-pull-request) | `5` | `6` | | [marocchino/sticky-pull-request-comment](https://github.com/marocchino/sticky-pull-request-comment) | `2.9.4` | `3.0.2` | Updates `astral-sh/setup-uv` from 7.3.0 to 7.5.0 - [Release notes](https://github.com/astral-sh/setup-uv/releases) - [Commits](astral-sh/setup-uv@eac588a...e06108d) Updates `actions/upload-artifact` from 6 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v6...v7) Updates `actions/download-artifact` from 6 to 8 - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@v6...v8) Updates `amannn/action-semantic-pull-request` from 5 to 6 - [Release notes](https://github.com/amannn/action-semantic-pull-request/releases) - [Changelog](https://github.com/amannn/action-semantic-pull-request/blob/main/CHANGELOG.md) - [Commits](amannn/action-semantic-pull-request@e32d7e6...48f2562) Updates `marocchino/sticky-pull-request-comment` from 2.9.4 
to 3.0.2 - [Release notes](https://github.com/marocchino/sticky-pull-request-comment/releases) - [Commits](marocchino/sticky-pull-request-comment@7737449...70d2764) --- updated-dependencies: - dependency-name: astral-sh/setup-uv dependency-version: 7.5.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/download-artifact dependency-version: '8' dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: amannn/action-semantic-pull-request dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: marocchino/sticky-pull-request-comment dependency-version: 3.0.2 dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions ... 
Signed-off-by: dependabot[bot] <support@github.com> * deps(deps): bump the go-dependencies group across 1 directory with 5 updates Bumps the go-dependencies group with 5 updates in the / directory: | Package | From | To | | --- | --- | --- | | [github.com/daulet/tokenizers](https://github.com/daulet/tokenizers) | `1.25.0` | `1.26.0` | | [github.com/yalue/onnxruntime_go](https://github.com/yalue/onnxruntime_go) | `1.26.0` | `1.27.0` | | [modernc.org/sqlite](https://gitlab.com/cznic/sqlite) | `1.46.1` | `1.47.0` | | [golang.org/x/time](https://github.com/golang/time) | `0.14.0` | `0.15.0` | | [github.com/getsentry/sentry-go](https://github.com/getsentry/sentry-go) | `0.42.0` | `0.43.0` | Updates `github.com/daulet/tokenizers` from 1.25.0 to 1.26.0 - [Release notes](https://github.com/daulet/tokenizers/releases) - [Commits](daulet/tokenizers@v1.25.0...v1.26.0) Updates `github.com/yalue/onnxruntime_go` from 1.26.0 to 1.27.0 - [Commits](yalue/onnxruntime_go@v1.26.0...v1.27.0) Updates `modernc.org/sqlite` from 1.46.1 to 1.47.0 - [Changelog](https://gitlab.com/cznic/sqlite/blob/master/CHANGELOG.md) - [Commits](https://gitlab.com/cznic/sqlite/compare/v1.46.1...v1.47.0) Updates `golang.org/x/time` from 0.14.0 to 0.15.0 - [Commits](golang/time@v0.14.0...v0.15.0) Updates `github.com/getsentry/sentry-go` from 0.42.0 to 0.43.0 - [Release notes](https://github.com/getsentry/sentry-go/releases) - [Changelog](https://github.com/getsentry/sentry-go/blob/master/CHANGELOG.md) - [Commits](getsentry/sentry-go@v0.42.0...v0.43.0) --- updated-dependencies: - dependency-name: github.com/daulet/tokenizers dependency-version: 1.26.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/yalue/onnxruntime_go dependency-version: 1.27.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: modernc.org/sqlite dependency-version: 1.47.0 
dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: golang.org/x/time dependency-version: 0.15.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/getsentry/sentry-go dependency-version: 0.43.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> * fix: bump react-dom and @types/react-dom to v19 to match react v19 upgrade * ci: update tokenizers library to v1.26.0 to match Go dependency bump * fix: classify nested package.json files as chore in PR scope check --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary
- Offsets are preserved in preprocessing (`preprocessing.py`): `start`/`end` fields are now included in `privacy_mask` entries instead of being discarded
- Offsets are used directly in `_find_privacy_mask_positions` (`tokenization.py`) instead of re-discovering positions via `text.find()` substring search
- Offsets are validated by checking `text[start:end] == value` and logging a warning on mismatch (the entry is skipped to avoid corrupt labels)
- If an entry is missing `start`/`end`, a `ValueError` is raised instead of silently falling back to fragile substring matching

Problem
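The previous `text.find(value)` approach caused three classes of silent data quality bugs:

- Partial matches (e.g. "Alex" inside "Alexandria")
- Duplicate over-matching
- Overlapping entity collisions

These bugs degraded training signal quality across the entire dataset.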
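The failure modes can be reproduced with plain `str.find`. The sketch below is illustrative only: `find_by_substring` is a hypothetical stand-in for the old lookup, and the sentences and offsets are made up for the example.

```python
def find_by_substring(text, value):
    """Hypothetical stand-in for the old lookup: first occurrence via str.find."""
    start = text.find(value)
    return (start, start + len(value)) if start != -1 else None


# Partial match: the annotated "Alex" is at 22..26, but find() hits "Alexandria".
text_a = "Alexandria is home to Alex."
partial = find_by_substring(text_a, "Alex")
print(partial)  # (0, 4), not the annotated (22, 26)

# Duplicates: two annotated "Sam" spans (0..3 and 11..14) collapse to one position.
text_b = "Sam called Sam."
duplicates = [find_by_substring(text_b, "Sam") for _ in range(2)]
print(duplicates)  # [(0, 3), (0, 3)]

# Overlap: the CITY "Paris" is annotated at 21..26, but find() lands inside
# the span of the NAME "Paris Hilton" at 0..12.
text_c = "Paris Hilton visited Paris."
overlap = find_by_substring(text_c, "Paris")
print(overlap)  # (0, 5)
```

In each case the lookup returns a valid-looking span, so nothing fails at runtime and the wrong tokens are labeled.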
Changes
- `model/src/preprocessing.py`: include `start` and `end` in both `privacy_mask.append()` calls (coreference entities and standalone entities)
- `model/dataset/tokenization.py`: rewrite `_find_privacy_mask_positions` to use annotation offsets directly, with offset validation and `ValueError` on missing offsets

Test plan
- No `ValueError` is raised (confirms all annotations have offsets)

Closes #264
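A minimal sketch of the offset-based lookup described under Changes, assuming `privacy_mask` entries are dicts with `value`, `label`, `start`, and `end` keys. The function name `positions_from_offsets` and the returned tuple shape are hypothetical. The real `_find_privacy_mask_positions` in `tokenization.py` may differ in both.

```python
import logging

logger = logging.getLogger(__name__)


def positions_from_offsets(text, privacy_mask):
    """Return (start, end, label) spans taken directly from annotation offsets.

    Hypothetical sketch: entry keys and return shape are assumptions.
    """
    positions = []
    for entry in privacy_mask:
        # Fail loudly rather than falling back to substring search.
        if "start" not in entry or "end" not in entry:
            raise ValueError(f"privacy_mask entry is missing start/end: {entry!r}")
        start, end = entry["start"], entry["end"]
        # Validate that the offsets still point at the annotated value.
        if text[start:end] != entry["value"]:
            logger.warning(
                "Offset mismatch for %r: text[%d:%d] is %r; skipping entry",
                entry["value"], start, end, text[start:end],
            )
            continue
        positions.append((start, end, entry["label"]))
    return positions


text = "Paris Hilton visited Paris."
privacy_mask = [
    {"value": "Paris Hilton", "label": "NAME", "start": 0, "end": 12},
    {"value": "Paris", "label": "CITY", "start": 21, "end": 26},
    {"value": "Paris", "label": "CITY", "start": 1, "end": 6},  # stale offsets
]
positions = positions_from_offsets(text, privacy_mask)
print(positions)  # [(0, 12, 'NAME'), (21, 26, 'CITY')]
```

Skipping a mismatched entry drops one label but keeps the rest of the example usable, while a missing offset stops the run, since it suggests the preprocessing step did not carry offsets through.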
🤖 Generated with Claude Code