Make Python Tokenizer concurrent-use errors explicit #331

Merged: hayashi-mas-wap merged 1 commit into WorksApplications:develop from gulldan:fix/tokenizer-concurrency-guard on May 12, 2026

gulldan (Contributor) commented May 11, 2026

Fixes #205.

Summary

This makes concurrent use of a single Python Tokenizer fail with an explicit Sudachi error instead of leaking PyO3's generic runtime borrow error.

The intended concurrent usage model is unchanged: share a Dictionary, but create one Tokenizer per worker/thread.
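A minimal sketch of that usage model, assuming a shared dictionary object whose `create()` method returns a fresh tokenizer (as SudachiPy's `Dictionary.create()` does). The helper names `tokenize_documents` and `_worker_tokenizer` are hypothetical, and `FakeDictionary` / `FakeTokenizer` are stand-ins so the sketch runs without a dictionary installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def tokenize_documents(dictionary, documents, workers=4):
    """Tokenize documents in a thread pool, one tokenizer per worker thread.

    `dictionary` is shared; each thread lazily creates its own tokenizer
    through `dictionary.create()` and never hands it to another thread.
    """
    local = threading.local()

    def _worker_tokenizer():
        # Hypothetical helper: create the tokenizer on first use in this thread.
        if not hasattr(local, "tokenizer"):
            local.tokenizer = dictionary.create()
        return local.tokenizer

    def _count_tokens(text):
        return len(_worker_tokenizer().tokenize(text))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_count_tokens, documents))


# Stand-ins (assumptions) so the sketch is runnable on its own. With the real
# library the shared object would be a sudachipy Dictionary instance.
class FakeTokenizer:
    def tokenize(self, text):
        return text.split()


class FakeDictionary:
    def __init__(self):
        self.created = 0
        self._lock = threading.Lock()

    def create(self):
        with self._lock:
            self.created += 1
        return FakeTokenizer()


shared = FakeDictionary()
counts = tokenize_documents(shared, ["a b c", "d e"] * 10, workers=4)
```

Keeping the tokenizer in `threading.local` storage means no tokenizer is ever visible to two threads, so the guard described in this PR should never trigger in this layout.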

Root Cause

Tokenizer.tokenize() releases the GIL while Rust performs analysis, which allows another Python thread to enter the binding during an ongoing tokenization.

Before this change, tokenize() used &mut self, so PyO3 attempted to borrow the Tokenizer mutably before the method body was entered. A concurrent call failed at the generated PyO3 wrapper layer with:

RuntimeError: Already borrowed

That protected the mutable StatefulTokenizer, but the error was generic and did not explain the supported usage model.

Changes

This change keeps using PyO3's runtime borrow checking as the concurrency guard, but moves the borrow attempt into the method body:

  • tokenize() now receives &Bound<Self> instead of &mut self;
  • the method validates mode before borrowing the tokenizer;
  • the method explicitly calls try_borrow_mut();
  • concurrent use returns a SudachiError with an actionable message;
  • py.detach(...) remains around the Rust analysis path, so GIL-free tokenization is preserved.
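The control flow in the list above can be modeled in pure Python. This is an analogy only, not the Rust binding: a non-blocking `threading.Lock` plays the role of `try_borrow_mut()`, `TokenizerBusyError` is a stand-in for `SudachiError`, and the string-based mode check and `GuardedTokenizer` class are hypothetical.

```python
import threading


class TokenizerBusyError(Exception):
    """Stand-in for SudachiError in this sketch."""


BUSY_MESSAGE = (
    "Tokenizer is already in use. A Tokenizer instance cannot be used "
    "concurrently; create a separate Tokenizer per thread or guard calls externally"
)


class GuardedTokenizer:
    """Pure-Python analogy of the guard; not the actual binding code."""

    VALID_MODES = {"A", "B", "C"}

    def __init__(self):
        # A lock acquired without blocking mimics a runtime mutable borrow.
        self._borrow = threading.Lock()

    def tokenize(self, text, mode="C"):
        # 1. Validate arguments before attempting the exclusive borrow.
        if mode not in self.VALID_MODES:
            raise ValueError(f"unknown split mode: {mode!r}")
        # 2. Try to take exclusive access; fail immediately instead of waiting.
        if not self._borrow.acquire(blocking=False):
            raise TokenizerBusyError(BUSY_MESSAGE)
        try:
            # 3. Placeholder for the analysis step (whitespace split here).
            return text.split()
        finally:
            # 4. Release the borrow even if the analysis step raised.
            self._borrow.release()
```

Validating `mode` first means a bad argument is reported as an argument error even while another thread holds the tokenizer, rather than being masked by the busy error.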

New error message:

Tokenizer is already in use. A Tokenizer instance cannot be used concurrently; create a separate Tokenizer per thread or guard calls externally

Behavior

Before, with two threads using the same Tokenizer:

[('err', 'RuntimeError', 'Already borrowed'), ('ok', 4756)]

After:

[('err', 'SudachiError', 'Tokenizer is already in use. A Tokenizer instance cannot be used concurrently; create a separate Tokenizer per thread or guard calls externally'), ('ok', 4756)]

Using separate tokenizers from the same dictionary continues to work across threads.
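Outcome lists in the shape shown above can be collected with a small harness like the following sketch. `run_concurrently` is a hypothetical helper, and `OverlappingFake` is a stand-in that forces the two calls to overlap and mimics the old generic error; against the real binding one would pass `tokenizer.tokenize` and a sufficiently long text, and whether the calls overlap then depends on timing.

```python
import threading


def run_concurrently(tokenize, texts):
    """Call `tokenize` once per text, each in its own thread; collect outcomes."""
    outcomes = [None] * len(texts)

    def worker(index, text):
        try:
            outcomes[index] = ("ok", len(tokenize(text)))
        except Exception as exc:
            outcomes[index] = ("err", type(exc).__name__, str(exc))

    threads = [
        threading.Thread(target=worker, args=(i, text))
        for i, text in enumerate(texts)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


class OverlappingFake:
    """Stand-in tokenizer that guarantees the second call arrives mid-analysis."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rejected = threading.Event()

    def tokenize(self, text):
        if not self._lock.acquire(blocking=False):
            self._rejected.set()
            raise RuntimeError("Already borrowed")
        try:
            # Hold the tokenizer until the other thread has been rejected.
            self._rejected.wait(timeout=5)
            return text.split()
        finally:
            self._lock.release()


outcomes = run_concurrently(OverlappingFake().tokenize, ["a b c", "d e f"])
```

Which thread wins is not deterministic, so the position of the `'err'` and `'ok'` entries can swap between runs.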

Concurrency Measurement

Environment:

  • macOS 26.4.1 arm64, Mac16,5, 16 CPU cores
  • Python 3.14.4
  • test dictionary from python/tests/resources
  • synthetic Japanese corpus, 20,008 UTF-8 bytes per document
  • 80 documents per worker, median of 2 runs
  • each worker uses its own Tokenizer

Reference measurement:

| variant | workers | elapsed (s) | MB/s |
|---|---|---|---|
| local no-py.detach baseline | 1 | 0.830 | 1.9 |
| local no-py.detach baseline | 2 | 1.686 | 1.9 |
| local no-py.detach baseline | 4 | 3.331 | 1.9 |
| local no-py.detach baseline | 8 | 6.683 | 1.9 |
| current branch, GIL released | 1 | 0.819 | 2.0 |
| current branch, GIL released | 2 | 0.847 | 3.8 |
| current branch, GIL released | 4 | 0.850 | 7.5 |
| current branch, GIL released | 8 | 0.867 | 14.8 |

The current develop branch already releases the GIL during analysis; this PR preserves that behavior. The table shows why keeping the analysis path inside py.detach matters: with the GIL held, threads serialize, so elapsed time grows roughly linearly with worker count and aggregate throughput stays flat at 1.9 MB/s; with the GIL released, elapsed time stays near 0.85 s and aggregate throughput scales almost linearly, reaching 14.8 MB/s at 8 workers.
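The MB/s column can be cross-checked from the stated environment. The sketch below assumes the column is aggregate throughput (all workers' bytes divided by elapsed time) and that 1 MB means 10^6 bytes; both are assumptions, though they reproduce the reported figures. The function name `aggregate_mb_per_s` is hypothetical.

```python
DOC_BYTES = 20_008       # UTF-8 bytes per document, from the environment list
DOCS_PER_WORKER = 80     # documents per worker, from the environment list


def aggregate_mb_per_s(workers, elapsed_s):
    """Total bytes processed by all workers per second, in MB (10**6 bytes)."""
    total_bytes = workers * DOCS_PER_WORKER * DOC_BYTES
    return total_bytes / elapsed_s / 1e6


# (variant, workers, elapsed seconds) copied from the reference measurement.
rows = [
    ("no-py.detach baseline", 1, 0.830),
    ("no-py.detach baseline", 2, 1.686),
    ("no-py.detach baseline", 4, 3.331),
    ("no-py.detach baseline", 8, 6.683),
    ("GIL released", 1, 0.819),
    ("GIL released", 2, 0.847),
    ("GIL released", 4, 0.850),
    ("GIL released", 8, 0.867),
]

computed = [round(aggregate_mb_per_s(w, t), 1) for _, w, t in rows]

for (variant, workers, elapsed), mbps in zip(rows, computed):
    print(f"{variant:24s} {workers:2d} {elapsed:6.3f} s {mbps:5.1f} MB/s")
```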

Tests

Added coverage for:

  • concurrent calls on one Tokenizer raise the clear SudachiError;
  • the generic PyO3 Already borrowed message does not leak for concurrent tokenize() calls;
  • separate tokenizers from one dictionary work across threads;
  • tokenizer state is reusable after an internal tokenization error;
  • temporary split mode is restored after an internal tokenization error.
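The last two properties amount to cleanup that runs on both the success and the error path. A minimal Python illustration of that property follows, assuming a hypothetical `ModeRestoringTokenizer` stand-in and a made-up failure condition (empty input); it does not reflect how the binding or its tests are actually written.

```python
class ModeRestoringTokenizer:
    """Stand-in illustrating restore-on-error; not the SudachiPy implementation."""

    def __init__(self, mode="C"):
        self.mode = mode

    def tokenize(self, text, mode=None):
        previous = self.mode
        if mode is not None:
            self.mode = mode  # temporary override for this call only
        try:
            if not text:
                # Hypothetical failure standing in for an internal analysis error.
                raise RuntimeError("analysis failed")
            return [(self.mode, word) for word in text.split()]
        finally:
            # Runs on success and on error, so the override never leaks.
            self.mode = previous


tok = ModeRestoringTokenizer(mode="C")
try:
    tok.tokenize("", mode="A")
except RuntimeError:
    pass

mode_after_error = tok.mode               # back to the original mode
tokens_after_error = tok.tokenize("a b")  # the tokenizer is still usable
```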

Validation run:

cargo fmt --check
cargo test -p sudachipy --no-run
.env/bin/pip install -q -e .
.env/bin/python -m unittest tests.test_tokenizer.TestTokenizer.test_concurrent_tokenize_on_same_tokenizer_fails tests.test_tokenizer.TestTokenizer.test_separate_tokenizers_work_in_threads tests.test_tokenizer.TestTokenizer.test_tokenizer_is_released_after_internal_error tests.test_tokenizer.TestTokenizer.test_temporary_mode_is_restored_after_internal_error
.env/bin/python -m unittest

Full Python suite result:

Ran 78 tests in 3.938s

OK

gulldan force-pushed the fix/tokenizer-concurrency-guard branch from c7c8745 to 9dece63 (May 11, 2026 22:18)
gulldan marked this pull request as ready for review (May 11, 2026 22:20)
hayashi-mas-wap self-requested a review (May 12, 2026 02:38)
hayashi-mas-wap merged commit 4f934c4 into WorksApplications:develop (May 12, 2026)
8 checks passed

Development

Successfully merging this pull request may close these issues.

Release GIL during analysis operation

2 participants