Make Python Tokenizer concurrent-use errors explicit #331
Merged
hayashi-mas-wap merged 1 commit on May 12, 2026
hayashi-mas-wap approved these changes on May 12, 2026
Fixes #205.
Summary
This makes concurrent use of a single Python `Tokenizer` fail with an explicit Sudachi error instead of leaking PyO3's generic runtime borrow error.
The intended concurrent usage model is unchanged: share a `Dictionary`, but create one `Tokenizer` per worker/thread.
Root Cause
`Tokenizer.tokenize()` releases the GIL while Rust performs analysis, which allows another Python thread to enter the binding during an ongoing tokenization.
Before this change, `tokenize()` used `&mut self`, so PyO3 attempted to borrow the `Tokenizer` mutably before the method body was entered. A concurrent call therefore failed at the generated PyO3 wrapper layer with the generic `Already borrowed` borrow error.
That failure protected the mutable `StatefulTokenizer`, but the error was generic and did not explain the supported usage model.
Changes
This change keeps using PyO3's runtime borrow checking as the concurrency guard, but moves the borrow attempt into the method body:
- `tokenize()` now receives `&Bound<Self>` instead of `&mut self`;
- `mode` before borrowing the tokenizer;
- `try_borrow_mut()`;
- `py.detach(...)` remains around the Rust analysis path, so GIL-free tokenization is preserved.
New error message:
Behavior
Before, with two threads using the same `Tokenizer`:
After:
Using separate tokenizers from the same dictionary continues to work across threads.
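The behavior described above can be modeled in plain Python. This is only a sketch under stated assumptions: `ModelTokenizer`, `ConcurrentUseError`, and the error text are hypothetical stand-ins, not SudachiPy names, and a non-blocking `threading.Lock` is used here to imitate the fail-fast borrow attempt that the PR performs with `try_borrow_mut()`.

```python
import threading


class ConcurrentUseError(RuntimeError):
    """Hypothetical stand-in for the explicit error; not a SudachiPy class."""


class ModelTokenizer:
    """Toy stateful tokenizer used only to model the guard."""

    def __init__(self, analyze=str.split):
        self._analyze = analyze          # stand-in for the Rust analysis path
        self._busy = threading.Lock()    # plays the role of the borrow flag

    def tokenize(self, text):
        # A non-blocking acquire fails fast instead of waiting, which is the
        # behavior a failed mutable borrow has in this model.
        if not self._busy.acquire(blocking=False):
            raise ConcurrentUseError(
                "tokenizer is already in use; create one tokenizer per thread"
            )
        try:
            return self._analyze(text)
        finally:
            self._busy.release()


def shared_use_fails():
    """Return True if a second thread is rejected while the first is busy."""
    started, release = threading.Event(), threading.Event()

    def slow_analyze(text):
        started.set()                # analysis has begun, guard is held
        release.wait(timeout=5)      # keep the guard held until released
        return text.split()

    tok = ModelTokenizer(slow_analyze)
    worker = threading.Thread(target=tok.tokenize, args=("a b",))
    worker.start()
    started.wait(timeout=5)
    try:
        tok.tokenize("c d")          # concurrent use of the same instance
    except ConcurrentUseError:
        return True
    else:
        return False
    finally:
        release.set()
        worker.join()


def per_thread_use_works(n_workers=4):
    """Each worker builds its own tokenizer, so no call is rejected."""
    results = [None] * n_workers

    def work(i):
        tok = ModelTokenizer()       # one tokenizer per thread
        results[i] = tok.tokenize(f"worker {i}")

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this model, `shared_use_fails()` exercises the shared-instance case and `per_thread_use_works()` exercises the one-tokenizer-per-thread case. The real binding differs in that the guard is PyO3's borrow tracking rather than a lock, and the shared object between threads is the `Dictionary`.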
Concurrency Measurement
Environment:
- `python/tests/resources`
- `Tokenizer`
Reference measurement:
[Table: `py.detach` versus baseline rows; measured values not preserved]
The current `develop` branch already releases the GIL during analysis; this PR preserves that behavior. The table shows why keeping the analysis path inside `py.detach` matters: with the GIL held, throughput stays flat as threads serialize; with the GIL released, aggregate throughput scales with worker count.
Tests
Added coverage for:
- concurrent calls on a single `Tokenizer` raise the clear SudachiError;
- the `Already borrowed` message does not leak for concurrent `tokenize()` calls;
Validation run:
Full Python suite result: