OPENNLP-1816: Make ME classes thread-safe by eliminating shared mutable instance state#1003
Conversation
Force-pushed from 2a904dd to 729b9c1
|
There were 3 checkstyle violations - fixed those. |
|
@krickert Thanks for the PR! |
|
I've always been a fan of OpenNLP. What I love about finally contributing is that before this patch, I had to create pools of ME objects or create new ones every time; this change gets rid of all that scaffolding. If you would like me to create any more tests, let me know. I think the new tests cover the concurrency and recall use cases well, and the speed tests show there's no concern about performance. I was excited to see the > 1.5x speedup with POSTagger; it's the single reason I decided to work on this. |
|
Hi, Thanks for the contribution! Overall, I like the idea of looking into built-in thread safety rather than relying on ThreadLocal-based wrappers, which have known issues in Jakarta EE and other long-lived thread environments. A few concerns I'd like to discuss before this can move forward (imho):
1. The benchmarks are hand-rolled System.nanoTime() + ExecutorService loops. Without JMH, the results are susceptible to JIT warmup, GC pauses, and profile pollution, i.e. there's no fork isolation, no warmup iterations, and no statistical variance reporting. For a change that removes multiple caching layers, stronger evidence is needed.
2. Three layers of caching were removed as a shortcut to thread safety.
3. The regression benchmark reports "performance within noise," but without JMH-level statistical rigor that's hard to verify. More importantly, the benchmark uses a small set of short sentences: a benchmark against a real-world dataset (e.g., from the eval/test corpora: https://nightlies.apache.org/opennlp/) would be far more convincing, particularly for POS tagging, where the feature-generation cache had the most impact under larger workloads. A thread-safe alternative would be making the caches method-local rather than removing them entirely.
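The method-local alternative suggested here could look roughly like the following sketch. This is illustrative code, not OpenNLP source: the class and method names are hypothetical. The point is that the cache is allocated inside the call, so it is thread-safe by construction while still de-duplicating repeated tokens within a single sentence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a method-local cache: no shared mutable state,
// so concurrent callers are isolated, yet repeated lookups within one
// call (e.g. one sentence) still hit the cache.
public class MethodLocalCacheSketch {

    // Stand-in for an expensive per-token feature computation.
    private static String computeFeatures(int token) {
        return "feat-" + token;
    }

    // Context generation over one "sentence": the cache lives and dies with the call.
    public static String[] contexts(int[] tokens) {
        Map<Integer, String> cache = new HashMap<>(); // method-local: thread-safe by construction
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = cache.computeIfAbsent(tokens[i], MethodLocalCacheSketch::computeFeatures);
        }
        return out;
    }
}
```

The trade-off versus an instance-level cache is that nothing is reused across calls, which is exactly why measuring the cache's real impact (as discussed above) matters.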
|
|
Regardless of my comment, I am going to trigger an Eval build for this: https://ci-builds.apache.org/job/OpenNLP/job/eval-tests-configurable/39/ |
|
@rzo1 working on addressing all of your concerns right now - it'll be done in a moment. I'm restoring the caches and running tests with and without them, using proper benchmarks. All great points, and thanks for the feedback. |
Force-pushed from 729b9c1 to 94ca28d
|
I'm going to make the caches optional and configurable. This way we can run tests against all scenarios and come up with as many use cases as needed to measure the impact. The last commit was premature; I'm still working on this. |
@krickert Thanks, Kristian, for tackling this complex topic with so much energy! Much appreciated! Happy to review this PR more deeply, and especially looking forward to the JMH analyses. Richard has already given deep feedback in the first round; I'll share my two cents later on code-style nuances, seeking an optimal result from the devs' perspective. For the moment, completing the 3.0.0-M2 release process is on my list… |
|
@mawiesne no problem... I've been thinking about this for a while now. @rzo1 you were right about CachedFeatureGenerator: the data shows it clearly, and it helps. That particular cache brings a 1.6x boost in both the old and new instances. Combined with the thread-safety feature and instance reuse, we now see over a 2x increase. Thanks for pointing that out. But don't take my word for it; I'll update the tests shortly to show it (I would love to see it on another machine too). |
Force-pushed from bfc8fdf to d31aaa6
|
Thanks for the detailed feedback. We've addressed all four points made by @rzo1. Here's a summary of what changed and the JMH data behind each decision.
1. Benchmarks (JMH)
Replaced all hand-rolled System.nanoTime() loops with proper JMH benchmarks.
Also fixed the existing JMH profile - the annotation processor wasn't wired into the compiler plugin.
Approaches measured
JMH Results (32 threads, all cores)
Tokenizer and SentenceDetector: all approaches within error bars (lightweight constructors).
2. Caches
We restored all caches as ThreadLocal (per-thread, not shared). Same behavior as the originals in single-threaded use, safe under concurrency. We also added a JMH benchmark that measures each cache with the cache enabled and disabled.
Cache Impact Results (POSTagger, 32 threads)
This told us which caches matter and which don't:
Regarding the BeamSearch cache specifically
We restored it as ThreadLocal with per-thread buffers.
3. Thread-safety tests
Addressed all sub-points:
4. Missing ME classes
All 7 ME classes are now covered:
All 7 ME classes are annotated @ThreadSafe.
5. ThreadSafe*ME wrappers deprecated
Since the ME classes are now themselves thread-safe, the ThreadSafe*ME wrappers are deprecated.
We also replaced all internal usages of the ThreadSafe*ME wrappers with direct ME usage.
No internal code uses the wrappers anymore.
Open item
Agreed - this would strengthen the perf claims. The JMH benchmarks currently use the project's test data. Do you have any real-world dataset tests around that we can run it against quickly? It's the only way I'd feel confident as well. |
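The CyclicBarrier-based correctness tests mentioned in point 3 above follow a common shape: all worker threads are released at the same instant to maximize interleaving, and every concurrent result must match a single-threaded baseline. A minimal self-contained sketch (a stand-in component, not the actual ThreadSafetyBenchmarkTest code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of a barrier-aligned thread-safety test. The process()
// method stands in for a shared, supposedly thread-safe component
// (e.g. one ME instance shared by all threads).
public class BarrierConcurrencyTestSketch {

    static String process(String input) {
        return input.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        int threads = 8;
        String input = "some sentence";
        String baseline = process(input);            // single-threaded reference result

        CyclicBarrier barrier = new CyclicBarrier(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Callable<String> task = () -> {
                barrier.await();                     // align all threads on the same start line
                return process(input);
            };
            results.add(pool.submit(task));
        }
        for (Future<String> f : results) {
            if (!baseline.equals(f.get())) {
                throw new AssertionError("concurrent result diverged from baseline");
            }
        }
        pool.shutdown();
        System.out.println("all " + threads + " concurrent results match baseline");
    }
}
```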
|
Summary since first review: Made all 7 ME classes thread-safe by eliminating shared mutable instance state, and deprecated the ThreadSafe*ME wrappers.
Motivation
ME classes were documented as not thread-safe due to mutable instance fields that corrupt under concurrent access. The workarounds were creating a new ME instance per call (expensive) or using the ThreadSafe*ME wrappers.
Approach
Mutable state moved to method-local variables or per-thread caches (ThreadLocal) at every layer:
Files changed (30 total)
Source (13 files): TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME, ChunkerME, NameFinderME, LanguageDetectorME, BeamSearch, CachedFeatureGenerator, ConfigurablePOSContextGenerator, DefaultPOSContextGenerator, DefaultSDContextGenerator, SentenceContextGenerator (Thai)
Deprecated (7 files): ThreadSafeTokenizerME, ThreadSafeSentenceDetectorME, ThreadSafePOSTaggerME, ThreadSafeLemmatizerME, ThreadSafeChunkerME, ThreadSafeNameFinderME, ThreadSafeLanguageDetectorME
Internal usage swaps (3 files): Muc6NameSampleStreamFactory, TwentyNewsgroupSampleStreamFactory, POSTaggerMEIT - replaced ThreadSafe*ME usage with the direct ME classes
Tests/benchmarks (5 files): ThreadSafetyBenchmarkTest (8 JUnit tests), 3 JMH benchmarks, CachedFeatureGeneratorTest update
Build (1 file): pom.xml - fixed JMH annotation processor wiring |
Force-pushed from d31aaa6 to b02c2eb
|
@mawiesne - I pushed again to make the code match the style better. The problem I had was that your CI/CD failed linting and forced me to write 80-column code, which makes part of the code look ugly outside my IDE. Can you ease up on the linting to allow 120 or 140 columns, or is that too much? I don't care either way, it's just a setting in my IDE, but the codebase has 3000+ violations, so I don't suspect it's really been enforced for a long time. |
|
Note: You can use the OpenNLP Formatting XML, which is provided as a download. In addition, you only have a few fixes left: |
Oh cool! Thanks. I'll fix those today |
…le state

All 7 ME classes (TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME, ChunkerME, NameFinderME, LanguageDetectorME) are now safe for concurrent use from multiple threads. The ThreadSafe*ME wrappers are deprecated — use the ME classes directly.

Thread-safety approach:
- ME instance fields (bestSequence, tokProbs, newTokens, sentProbs) changed to volatile with method-local processing, atomic swap at end
- BeamSearch: probs[] buffer and contextsCache moved to per-thread state via ThreadLocal
- CachedFeatureGenerator: cache moved to per-thread state via ThreadLocal (JMH confirms 1.62x benefit from this cache)
- ConfigurablePOSContextGenerator: cache moved to per-thread state via ThreadLocal
- DefaultSDContextGenerator: buf/collectFeats moved to method-local

JMH benchmark results (32 threads):
- POSTagger instancePerThread: 2.52x faster than newInstancePerCall
- POSTagger cache on vs off: no measurable difference for the context generator cache; CachedFeatureGenerator provides a 1.62x benefit
- Tokenizer/SentenceDetector: all approaches within error bars

API changes:
- All 7 ME classes annotated @ThreadSafe
- All 7 ThreadSafe*ME wrappers annotated @Deprecated(since = "3.0.0")
- POSTaggerME: added constructor with contextCacheSize parameter
- CachedFeatureGenerator: added DISABLE_CACHE_PROPERTY for benchmarking
- Internal usages of ThreadSafe*ME replaced with direct ME usage

Tests:
- ThreadSafetyBenchmarkTest: 8 JUnit tests with CyclicBarrier (all 7 ME classes + probs() concurrency test)
- JMH benchmarks for Tokenizer, SentenceDetector, POSTagger
- Fixed JMH annotation processor config in pom.xml
- All 680 runtime + 352 formats tests pass
Force-pushed from b02c2eb to 178386f
|
Fixed. Let me know if there are more tests you'd like me to run. I think the benchmarks, passing tests, and harness together make a strong case. |
|
@atarora @jzonthemtn @rzo1 @mawiesne Just curious: what are the next steps? I'm not in a rush, just not familiar with the review cadence for this repo. I'm most curious whether I ran enough tests to convince others of the advantages, or whether more testing is required to clarify the speedup factor and the correctness of the solution. Suggest anything to help ease the review. I'm also open to a Discord chat (or any other platform) if that makes it easier. Moving to a thread-safe approach should make the library a lot easier to code against: over the years I frequently forgot that it's not thread-safe and had to redo the same strategies, and seeing the speedup has made me excited to contribute more in the future. I'll also update the documentation once it's approved. |
The updated PR addresses my initial review concerns well. The ThreadLocal leak trade-off (1.) and the missing null guard in LemmatizerME (2.) are the most important items to fix. The rest are suggestions for polish from my side.
- ThreadLocal: The PR description mentions that ThreadSafe*ME wrappers "leak in Jakarta EE / long-lived thread environments" due to ThreadLocal. This is better now, but not leak-free: in container environments with classloader isolation, any ThreadLocal holding objects from the app classloader can pin the classloader. Worth a Javadoc note on cleanup expectations.
- LemmatizerME.predictSES() missing null guard: POSTaggerME.tag(), ChunkerME.chunk(), and NameFinderME.find() all add if (seq == null) guards after model.bestSequence(), but LemmatizerME.predictSES() does not. Should be consistent.
- DefaultSDContextGenerator.collectFeatures() signature change is API-breaking. The protected method collectFeatures() now takes two additional parameters (List collectFeats, StringBuilder buf). The SentenceContextGenerator subclass is updated, but any external subclass would break. Since this is targeting 3.0.0, this is probably acceptable, but worth calling out in migration notes.
- DefaultPOSContextGenerator cache removal: Unlike ConfigurablePOSContextGenerator which moved to ThreadLocal caches, DefaultPOSContextGenerator simply removes the cache entirely. Why?
- CachedFeatureGenerator.DISABLE_CACHE_PROPERTY: Using a system property (opennlp.cache.disabled) as a global toggle is a bit coarse. It's read once at construction time, so at least it's not checked per-call. Acceptable for benchmarking purposes, but the property name is generic: consider opennlp.featuregen.cache.disabled to avoid confusion with other caches in the system.
- Removed toString() from CachedFeatureGenerator: The old toString() included cache hit/miss statistics. Since stats are gone, toString() was removed entirely. This is fine, but if anyone was logging these instances, they'll now get the default Object.toString().
- The main() convenience runners in each benchmark class set .forks(0), which runs benchmarks in the same JVM (no fork isolation). The class-level @Fork(2) annotation is correct for mvn exec:java invocations, but someone running main() directly will get non-isolated results. Consider a comment explaining this is for quick iteration only.
- The current JMH benchmarks train on the small bundled test corpora, which is fine for correctness and relative comparison. I'd love to see some JMH numbers from a run against a real-world dataset (e.g., the pre-trained en models from the OpenNLP website) to get a sense of the absolute throughput characteristics and how the speedup scales with larger, production-representative models. Maybe you can post a related JMH result?
I have also triggered an eval build: https://ci-builds.apache.org/job/OpenNLP/job/eval-tests-configurable/41/
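As background to the null-guard point above: the "method-local processing, publish at the end" pattern the ME classes use could be sketched like this. This is an illustrative stand-in, not the actual POSTaggerME code; the field and method names are hypothetical.

```java
import java.util.Arrays;

// Hedged sketch of the volatile publish pattern: all work happens on
// method-local state, the volatile field is written once at the end
// (last-writer-wins for the legacy probs() accessor), and a guard
// handles the empty/absent-result case consistently.
public class VolatilePublishSketch {

    private volatile double[] lastProbs;   // kept only for legacy probs() compatibility

    public String[] tag(String[] tokens) {
        double[] localProbs = new double[tokens.length];  // method-local working state
        String[] outcomes = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            localProbs[i] = 1.0;                          // stand-in for real decoding
            outcomes[i] = "TAG";
        }
        if (outcomes.length == 0) {                       // guard analogue to the null checks
            lastProbs = new double[0];
            return new String[0];
        }
        lastProbs = localProbs;                           // single atomic publish at the end
        return outcomes;
    }

    /** Legacy-style accessor: last-writer-wins under concurrency. */
    public double[] probs() {
        double[] p = lastProbs;                           // read the volatile once into a local
        return p == null ? new double[0] : Arrays.copyOf(p, p.length);
    }
}
```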
|
OK - the tests didn't catch it, but there was one more part that needed extra thread-safety work:
I'll post the fix and describe the solution. |
Fix description
The automated tests still looked green, but there was one more thread-safety gap we needed to close.
The fix has been pushed to the PR. The important part for performance: we did not want to pay the full ThreadLocal cost in the common single-threaded case. |
…Local

Track ownership by Thread#threadId() (long) instead of holding a volatile Thread reference, matching OwnerOrPerThreadState. Holding a strong reference to a worker thread in a long-lived component pins the thread's context classloader in container environments (Jakarta EE), exactly the leak this class is designed to avoid. Worst case with ID-based tracking is that a recycled-id thread sees a stale ownerValue from a previous owner instead of null, which is no worse than the documented contract for get().

Also adds focused unit tests for both LastResultOwnerOrThreadLocal and OwnerOrPerThreadState (owner fast path, second-thread isolation, clear-and-reclaim, one-way multi-threaded transition, concurrent stress) and expands the class Javadoc to document the three thread-safety strategies used across the seven ME classes.
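The ownership idea in that commit can be sketched as follows. This is an illustration under stated assumptions: the class and field names are hypothetical (not the actual LastResultOwnerOrThreadLocal code), and Thread.getId() stands in for Thread#threadId() so the sketch compiles on older JDKs; both return the same numeric id.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: remember only the owner thread's numeric id (a long),
// never a strong Thread reference, so a long-lived holder cannot pin a
// worker thread or its context classloader. The first thread to write
// claims the fast path; any later thread falls back to a ThreadLocal.
public class OwnerIdSketch<T> {

    private static final long NO_OWNER = -1L;
    private final AtomicLong ownerId = new AtomicLong(NO_OWNER);
    private volatile T ownerValue;                 // fast path for the single-thread case
    private final ThreadLocal<T> fallback = new ThreadLocal<>();

    public void set(T value) {
        long me = Thread.currentThread().getId();  // real code: Thread#threadId() (Java 19+)
        long owner = ownerId.get();
        if (owner == me || (owner == NO_OWNER && ownerId.compareAndSet(NO_OWNER, me))) {
            ownerValue = value;                    // owner fast path: no ThreadLocal entry at all
        } else {
            fallback.set(value);                   // second thread onward: per-thread isolation
        }
    }

    public T get() {
        return ownerId.get() == Thread.currentThread().getId() ? ownerValue : fallback.get();
    }
}
```

As the commit notes, id recycling means a new thread with a reused id may observe the previous owner's value; that matches the documented last-result contract.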
…olish

- BeamSearch CacheState now stashes a per-thread tempScores buffer (length numOutcomes) instead of allocating a fresh double[] inside the inner beam loop on every cache hit/miss. Functionally equivalent; removes one allocation per beam step per token.
- BeamSearch.close() Javadoc now spells out the per-thread cleanup contract: a single close() releases only the calling thread's CacheState, not every per-thread slot held by long-lived BeamSearch instances shared across pool threads.
- POSTaggerME class Javadoc clarifies that sharing one tagger saves both memory and model load time (the dominant startup cost).
- CachedFeatureGenerator class Javadoc calls out that the "is this still the same sentence?" check uses reference identity (tokens == prevTokens), so a freshly allocated String[] with the same contents is treated as a new sentence and triggers a cache miss + clear.
|
@mawiesne @rzo1 — pushed two review-follow-up commits; the branch is now at the new tip.
The full local test suite passes. @rzo1 — would you mind re-triggering eval-tests-configurable against the new tip? |
rzo1
left a comment
Please double check
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/featuregen/POSTaggerNameFeatureGenerator.java:40-69
private String[] cachedTokens; // plain instance fields – no volatile, no ThreadLocal
private String[] cachedTags;
public void createFeatures(...) {
  if (!Arrays.equals(this.cachedTokens, toks)) {
    this.cachedTokens = toks;
    this.cachedTags = this.posTagger.tag(toks);
  }
  feats.add("pos=" + this.cachedTags[index]); // race + potential AIOOBE
}
NameFinderME is annotated @ThreadSafe, but any model trained with the shipped feature template opennlp/tools/namefind/ner-pos-features.xml (and ner-pos-features-v15.xml, ner-en_pos-features.xml) pulls this generator into the pipeline. Under concurrent find() calls:
- Thread A: cache miss, writes cachedTokens=toksA, cachedTags=tagsA (len=N).
- Thread B preempts: writes cachedTokens=toksB, cachedTags=tagsB (len=M where M<N).
- Thread A resumes: reads cachedTags[index] with index>=M → ArrayIndexOutOfBoundsException or wrong tag.
Could you check on DictionaryFeatureGenerator as well? It's most likely write-once, so not an actual issue, I guess.
NameFinderME:85-89 stores a fresh AdditionalContextFeatureGenerator per thread inside NameFinderState. That AFG has its own ThreadLocal<String[][]> (AdditionalContextFeatureGenerator:34) — so each AFG instance is only ever used by one thread, making its inner ThreadLocal pure overhead. Functionally fine, but worth simplifying to a plain field.
Have triggered the eval build, but I think the first one needs a fix.
The cachedTokens / cachedTags fields were plain instance fields, which raced under concurrent NameFinderME.find() calls when the enclosing NameFinderME (now @ThreadSafe) was shared across threads. With models trained from the shipped feature templates (ner-pos-features.xml, ner-pos-features-v15.xml, ner-en_pos-features.xml) this generator is on the find() critical path, so the race could either return wrong tags or, on length-mismatched interleavings, throw ArrayIndexOutOfBoundsException (thread A stashes a longer cachedTags, thread B replaces it with a shorter one, thread A reads cachedTags[index] past the new bounds).

Fix:
- Move the per-sentence cache into a per-thread CacheState held in a ThreadLocal. Each thread now sees its own cachedTokens/cachedTags pair and indexes into a tag array that always belongs to the same sentence it just tagged.
- Annotate the class @ThreadSafe and document the per-thread cache so the reader can see the contract at a glance.
- Preserve the original Arrays.equals(cachedTokens, toks) cache-hit semantics; the only change is that the cache is now per-thread.

Test: testConcurrentCreateFeaturesIsThreadSafe stress-tests the original failure shape: many threads, sentences of differing lengths, hundreds of iterations on one shared generator. Verified that this test fails on the unfixed class (1 error: AIOOBE inside createFeatures) and passes after the fix.
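The fix shape described above can be reduced to a small sketch. This is a stand-in generator, not the actual POSTaggerNameFeatureGenerator; the tag() body is faked so the example stays self-contained.

```java
import java.util.Arrays;

// Hedged sketch: the per-sentence tokens/tags cache becomes per-thread
// state held in a ThreadLocal, so no thread can ever index into a tag
// array produced for a different thread's (differently sized) sentence.
public class PerThreadSentenceCacheSketch {

    private static final class CacheState {
        String[] cachedTokens;
        String[] cachedTags;
    }

    private final ThreadLocal<CacheState> state = ThreadLocal.withInitial(CacheState::new);

    // Stand-in for the expensive posTagger.tag(tokens) call.
    private String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, "NN");
        return tags;
    }

    public String featureAt(String[] tokens, int index) {
        CacheState s = state.get();                       // this thread's cache only
        if (!Arrays.equals(s.cachedTokens, tokens)) {     // same cache-hit semantics as before
            s.cachedTokens = tokens;
            s.cachedTags = tag(tokens);
        }
        return "pos=" + s.cachedTags[index];              // tags always match this sentence
    }
}
```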
…safe

The isg field was a plain (non-final, non-volatile) reference. In normal use it is set once at construction time via setDictionary() and never replaced, but both the constructor write and any later setDictionary() write needed a synchronization edge to be visible to other threads on the createFeatures() read side.

- Mark isg volatile so both the one-shot constructor write and any later setDictionary() call publish safely.
- Annotate the class @ThreadSafe; the underlying InSpanGenerator is already @ThreadSafe, so the delegating createFeatures() is now concurrent-safe.
- createFeatures() reads isg into a local before delegating, so it is immune to a setDictionary() racing with an in-flight call (the call finishes against whichever dictionary it observed first).
- Document that setDictionary() is intended for setup time, not the hot path: it does not coordinate with in-flight reads beyond the volatile publish, so callers swapping dictionaries while createFeatures() runs on other threads may observe either the old or the new dictionary's features.
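The volatile-plus-local-read pattern from that commit, reduced to a minimal stand-alone sketch (the delegate type is a stand-in for InSpanGenerator; names are illustrative):

```java
import java.util.function.UnaryOperator;

// Hedged sketch: a rarely written configuration reference is volatile,
// and the hot-path read copies it into a local exactly once, so an
// in-flight call finishes against whichever delegate it observed even
// if a setter races with it on another thread.
public class VolatileDelegateSketch {

    private volatile UnaryOperator<String> delegate;   // stand-in for the isg field

    public void setDelegate(UnaryOperator<String> d) { // setup-time write, safely published
        this.delegate = d;
    }

    public String createFeatures(String token) {
        UnaryOperator<String> d = delegate;            // read the volatile once into a local
        return d == null ? "" : d.apply(token);        // immune to a concurrent swap mid-call
    }
}
```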
Previously NameFinderME stored a ThreadLocal<NameFinderState>, where each per-thread NameFinderState held its own freshly allocated AdditionalContextFeatureGenerator (AFG). AFG itself keeps the per-thread additional-context array via its own ThreadLocal<String[][]>, so each per-thread AFG instance was only ever touched by one thread, making the inner ThreadLocal pure overhead and the outer per-thread allocation redundant.

Refactor:
- Replace ThreadLocal<NameFinderState> with one shared AdditionalContextFeatureGenerator field on NameFinderME (one per instance, not per thread). The AFG's existing internal ThreadLocal handles per-thread context just as before, with no nesting.
- Replace the bestSequence slot in NameFinderState with LastResultOwnerOrThreadLocal<Sequence>, matching the pattern POSTaggerME / SentenceDetectorME / TokenizerME already use. This gives single-threaded, short-lived NameFinderME instances the owner fast path (no ThreadLocal map entry at all until a second thread shows up) and keeps multi-thread callers correct.
- Drop the anonymous AdaptiveFeatureGenerator wrapper that delegated each call to the per-thread AFG; the WindowFeatureGenerator now wraps the shared AFG directly.
- AdditionalContextFeatureGenerator: add clearForCurrentThread() so NameFinderME.clearThreadLocalState() can also release the AFG's per-thread slot, completing the per-thread cleanup contract used elsewhere in the PR.
- NameFinderME.clearThreadLocalState() Javadoc rewritten to spell out that this is a per-thread, not per-instance, operation (same lifecycle contract as on the other ME classes) and that it does not reach into BeamSearch or other feature-generator per-thread state.
|
@mawiesne @rzo1 — thanks for the review; all three concerns are addressed in three follow-up commits, and the branch tip has been updated.
|
rzo1
left a comment
I have triggered the eval build from the current branch head (again).
From my side, the changes look OK, although I think this cannot be backported to 2.x due to usage of Thread API introduced in Java 17+.
Since this is a substantial contribution, it might require an ICLA before we can move forward. WDYT @jzonthemtn ?
atarora
left a comment
Thanks @krickert for the great contribution and the thorough iteration, and to @rzo1 and @mawiesne for the detailed reviews that made this stronger.
Looks good to me too! :)
Before we merge:
- echoing @rzo1's ICLA question
- a doc update would be great, especially around the ThreadSafe*ME deprecation and the new shared-usage pattern
|
@atarora let me know where you'd want the documentation update - I can work on that too. |
Perhaps just update the existing docs within this PR - so we are complete :) |
- Replace the outdated "NameFinderME is not thread safe" paragraph with positive guidance for sharing a single instance across threads. - Add a one-line thread-safety note next to each *ME constructor in the per-component docs (POSTaggerME, ChunkerME, SentenceDetectorME, TokenizerME, LemmatizerME, NameFinderME), and note that the legacy ThreadSafe*ME wrappers are retained-but-deprecated. - Clarify that POSTaggerME.probs() and ChunkerME.probs() are now per-thread "last result" calls when the tagger is shared. - Bump the model-loading.xml note from "Java 17+" to "Java 21+ (the minimum supported version since OpenNLP 3.0.0)". - Add a short "Thread safety" subsection to the README's "Migrating from 2.x to 3.x" block.
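The "share a single instance across threads" guidance in that doc update boils down to the following shape. The Analyzer class here is a stand-in so the sketch stays self-contained; with OpenNLP 3.x the same pattern would apply to one shared POSTaggerME or TokenizerME.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of the documented usage pattern: construct one (now
// thread-safe) component up front and hand the same reference to every
// worker, instead of one instance per call or per thread.
public class SharedInstanceUsageSketch {

    static final class Analyzer {                        // stand-in for a thread-safe ME class
        String process(String sentence) { return sentence.trim().toLowerCase(); }
    }

    public static void main(String[] args) throws Exception {
        Analyzer shared = new Analyzer();                // built once, shared by all threads
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> futures = new ArrayList<>();
        for (String s : List.of(" One ", " Two ", " Three ")) {
            Callable<String> task = () -> shared.process(s); // every worker uses the same instance
            futures.add(pool.submit(task));
        }
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```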
Updated the documentation and got a receipt for the ICLA. So pending further reviews, we're good to go. |
|
Thanks @krickert - this is now in good shape, and the 3.x line will include it. Happy to see additional contributions from your side! |
|
@rzo1 thanks!! So happy to see it made it! Can you get the ICLA request going for me? I would love to get my name on the contributors list that the ASF maintains. I'm certainly going to add some more bells and whistles. Looking forward to the next 3.x-M release. |
Summary
Make ME classes (TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME) safe for concurrent use by eliminating shared mutable instance state. This enables reusing ME instances across threads instead of allocating a new instance per call, reducing allocation overhead in high-throughput pipelines.
The old pattern (new TokenizerME(model) per call) continues to work identically — zero regressions in correctness or performance.

Motivation
ME classes were documented as not thread-safe due to mutable instance fields (bestSequence, tokProbs, newTokens, sentProbs) that corrupt under concurrent access. The recommended workaround was either creating a new ME instance per call (expensive for high-throughput pipelines processing thousands of sentences in parallel) or using the ThreadSafe*ME wrappers (which use ThreadLocal and leak in Jakarta EE / long-running thread environments).

The root cause was mutable state at four layers:
- ME classes: instance fields (bestSequence, tokProbs, newTokens, sentProbs)
- Context generators: contextsCache, wordsKey, buf, collectFeats
- CachedFeatureGenerator: mutable prevTokens and cache
- BeamSearch: probs[] output buffer and a contextsCache that stored references to the reused buffer (cached values were always stale)

Approach
Move mutable state to method-local variables at every layer. ME instance fields are preserved as volatile for backward-compatible probs() access (last-writer-wins under concurrency). Caches are removed entirely — they were small (typically size 3), not thread-safe, and in BeamSearch's case, buggy.

Files changed (10 source, 5 test)
- BeamSearch.java: removed shared probs[] and buggy contextsCache; added @ThreadSafe
- DefaultSDContextGenerator.java: buf/collectFeats moved to method-local; collectFeatures() signature updated
- SentenceContextGenerator.java (Thai): updated collectFeatures() signature
- DefaultPOSContextGenerator.java: removed contextsCache and wordsKey
- ConfigurablePOSContextGenerator.java: removed contextsCache and wordsKey
- CachedFeatureGenerator.java: removed prevTokens, contextsCache, counters; delegates directly
- TokenizerME.java: newTokens/tokProbs volatile; tokenizePos() uses local lists
- SentenceDetectorME.java: sentProbs volatile; sentPosDetect() uses local list
- POSTaggerME.java: bestSequence volatile; tag() uses local var; added null guard
- LemmatizerME.java: bestSequence volatile; predictSES() uses local var

Backward compatibility
- Old usage pattern (new ME(model) per call) is unchanged — verified by regression benchmark
- probs() methods preserved (deprecated behavior under concurrency, correct single-threaded)
- cacheSize params accepted but ignored, marked @Deprecated(since = "3.0.0")

Test plan
- All tests pass (mvn test on opennlp-runtime)
- ThreadSafetyBenchmarkTest — JUnit correctness test: shared ME instances produce identical results to single-threaded baseline across all CPU cores
- RegressionBenchmark — head-to-head stock vs patched, new-instance-per-call only: zero mismatches, zero errors, performance within noise on both builds
- ThreadSafetyBenchmark — three-way comparison (new-instance-per-call / instance-per-thread / shared-single-instance)
- CachedFeatureGeneratorTest — updated for removed cache behavior
- mvn clean install at root (checkstyle must be skipped — 9,446 pre-existing violations on main)
Proves zero regression — stock vs patched, same API pattern:
Speedup benchmark results (32 threads, three-way comparison)
Approaches
The benchmark compares three strategies for using ME classes in a multi-threaded environment. All three produce identical output for a given input — the difference is how ME instances are allocated and shared.
1. new-instance-per-call: String[] tags = new POSTaggerME(model).tag(tokens);
2. instance-per-thread: POSTaggerME tagger = new POSTaggerME(model); for (String[] t : sentences) tagger.tag(t);
3. shared-single-instance: POSTaggerME shared = new POSTaggerME(model); // pass shared to all threads
POSTagger sees the largest gain because its constructor is the heaviest — it builds a BeamSearch, a ConfigurablePOSContextGenerator, and a full AdaptiveFeatureGenerator chain on every instantiation. Reusing one instance per thread eliminates that allocation on every call, yielding a 1.67x speedup with zero correctness impact.
Tokenizer and SentenceDetector constructors are lighter, so the per-call overhead is smaller and all three approaches perform similarly.
See opennlp-core/opennlp-runtime/BENCHMARKS.md for full benchmark instructions.

Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
https://issues.apache.org/jira/browse/OPENNLP-1816
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.