Migrate away from per-segment-per-threadlocals on SegmentReader #11998

rmuir · 2022-12-06T03:14:41Z

Background

Currently the stored fields and term vectors APIs on the index are "stateless".
Unlike the other parts of the API, users can't obtain any iterators/enumerators; they can only make calls such as:

indexReader.document(0);
indexReader.document(1);
// up to potentially thousands of docs because lusers do that

Instead of adding any real iterator, ThreadLocals were added on each segment behind the scenes to avoid having to clone() the stored fields or term vectors reader on every document. For example, this caching could reduce the number of NIOFSDirectory buffer refills and other associated overhead.

But this old API comes from an earlier time, and it only seems to get worse these days, because the implementations are more complicated and do block compression, dictionaries, etc. In some cases, the cached resources can grow extremely large (see @luyuncheng's writeup in #11987 as an example).

These per-segment threadlocals can cause memory issues if you have tons of segments, tons of threads, or especially both. It seems plenty of Java developers can't help but run into it.
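To make the "segments times threads" growth concrete, here is a minimal toy model. It is not Lucene code: ToySegment, ThreadLocalGrowthDemo and clonesCreated are hypothetical names, and a plain Object stands in for a cloned stored fields reader. It only illustrates that a per-segment ThreadLocal ends up holding one cached clone for every (thread, segment) pair that was ever touched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model, not Lucene code: a segment that lazily caches one reader clone per thread. */
final class ToySegment {
  private final ThreadLocal<Object> cachedClone;

  ToySegment(AtomicInteger cloneCounter) {
    // The initializer runs once per thread, on that thread's first call to reader().
    this.cachedClone = ThreadLocal.withInitial(() -> {
      cloneCounter.incrementAndGet();
      return new Object(); // stands in for a cloned stored fields reader
    });
  }

  Object reader() {
    return cachedClone.get();
  }
}

final class ThreadLocalGrowthDemo {
  /** Touches every segment from every thread and returns how many clones were created. */
  static int clonesCreated(int segments, int threads) {
    AtomicInteger cloneCounter = new AtomicInteger();
    List<ToySegment> index = new ArrayList<>();
    for (int i = 0; i < segments; i++) {
      index.add(new ToySegment(cloneCounter));
    }
    List<Thread> workers = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      Thread worker = new Thread(() -> {
        for (ToySegment segment : index) {
          segment.reader(); // first access on this thread creates and caches a clone
          segment.reader(); // second access reuses the cached clone
        }
      });
      workers.add(worker);
      worker.start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return cloneCounter.get();
  }

  public static void main(String[] args) {
    // One clone per (thread, segment) pair.
    System.out.println(clonesCreated(50, 8));
  }
}
```

In this model the clone count is always segments multiplied by threads, no matter how few documents each thread actually reads.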

New API

I propose we deprecate these APIs:

IndexReader.document()
IndexReader.getTermVectors()
IndexSearcher.doc()

And replace the functionality with these APIs:

IndexReader.storedFields()
IndexReader.termVectors()
IndexSearcher.storedFields()

Instead of Lucene internally caching a reader per-thread-per-segment, a user can get one themselves, e.g. per-search:

TopDocs hits = searcher.search(query, 10);
StoredFields storedFields = searcher.storedFields();
for (ScoreDoc hit : hits.scoreDocs) {
  Document doc = storedFields.document(hit.doc);
}
// now the StoredFields instance can be gc'd normally

This re-uses the data structures across a search if someone has thousands and thousands of hits, but avoids the ThreadLocal pain.

Deprecated API

The deprecated APIs still work the same way, and still use ThreadLocals. This way apps can safely migrate to the new APIs at their convenience. Once all the deprecations are fixed, then no ThreadLocals will be created any more, and the app can enjoy the RAM savings.

All code and tests have been moved to the new API in this PR. After backporting to 9.x, we can commit this patch to remove the deprecated APIs and ThreadLocal support completely: nukeDeprecated.patch.txt

jpountz

I like the suggested direction.

For the record, another direction I've been contemplating consisted of passing a set of doc IDs to the IndexReader#document API, but I think it's a worse option because it doesn't really work for merging. I like this one better.

As far as fixing tests is concerned, would you be ok with fixing them in a suboptimal mechanical way, e.g. replacing every call to indexReader.document(doc) with indexReader.storedFields().document(doc), without taking care of reusing the same StoredFields instance across multiple docs?
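The difference between the mechanical replacement and reusing one instance can be sketched with a toy model. This is not Lucene code: ToyReader, ToyStoredFields and HoistDemo are hypothetical names, and the sketch assumes that every storedFields() call hands out a fresh accessor (a later comment in this thread describes it as cloning the reader each time).

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for an index reader; not Lucene code. */
final class ToyReader {
  final AtomicInteger accessorsCreated = new AtomicInteger();

  /** Assumption: each call creates a fresh accessor, modelling a per-call clone. */
  ToyStoredFields storedFields() {
    accessorsCreated.incrementAndGet();
    return new ToyStoredFields();
  }
}

/** Toy stand-in for the per-caller stored fields accessor. */
final class ToyStoredFields {
  String document(int docId) {
    return "doc-" + docId;
  }
}

final class HoistDemo {
  /** Mechanical replacement: a new accessor is requested for every document. */
  static int mechanical(int[] docIds) {
    ToyReader reader = new ToyReader();
    for (int docId : docIds) {
      reader.storedFields().document(docId);
    }
    return reader.accessorsCreated.get();
  }

  /** Hoisted: one accessor is requested up front and reused for the whole loop. */
  static int hoisted(int[] docIds) {
    ToyReader reader = new ToyReader();
    ToyStoredFields storedFields = reader.storedFields();
    for (int docId : docIds) {
      storedFields.document(docId);
    }
    return reader.accessorsCreated.get();
  }

  public static void main(String[] args) {
    int[] hits = {0, 1, 2, 3, 4};
    System.out.println("mechanical: " + mechanical(hits));
    System.out.println("hoisted: " + hoisted(hits));
  }
}
```

Both variants return the same documents. The mechanical form creates one accessor per document, while the hoisted form creates a single accessor however many documents are read.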

rmuir · 2022-12-06T18:13:17Z

That's fine. Or we could fix newSearcher to not wrap with crazy CodecReaders. Or we could fix said CodecReaders (since they are only used for tests) to implement the deprecated document APIs, like SegmentReader does. Or we could give CodecReader a default impl other than UOE (UnsupportedOperationException), even if it isn't very performant.

I wanted to throw the UOE, at least at first, to be sure I knew exactly what was calling the old .document API (e.g. in case I forgot to fix a filter-reader). But it doesn't have to stay.

rmuir · 2022-12-06T18:50:02Z

I got the tests happy for now with 12a5dfa

Maybe not the right solution in the end, but makes it easier to iterate when you have passing tests at least.

jpountz · 2022-12-08T17:16:23Z

I pushed a fix for most call sites I think. The main remaining ones are in lucene/highlighter, which require a bit more changes.

rmuir · 2022-12-09T17:08:58Z

I cut over all the remaining stuff by nuking the old API locally and making sure a full gradle check passes.

Attached is the patch I used to nuke the old APIs:
nukeDeprecated.patch.txt

It reveals two more things to fix:

Fix test-framework's FieldFilterLeafReader to implement new APIs
Look at test-framework's AssertingLeafReader to see if we can add safety checks to new APIs

jpountz

It looks good. Are you good with me adding some notes on thread confinement to MIGRATE.txt and top-level javadocs of StoredFields and TermVectors?

rmuir · 2022-12-11T21:47:04Z

+1 to improve docs

Add new stored fields and termvectors interfaces: IndexReader.storedFields() and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector(). The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>

See apache/lucene#11998 Only the StoredFieldsBenchmark is impacted. Otherwise luceneutil doesn't mess with stored fields/term vectors

dsmiley · 2023-03-04T20:17:27Z

Love the change here to migrate away from per-segment-per-threadlocals!

When porting this to Solr, I noticed a new assertion that previously didn't exist. AssertingLeafReader now ensures that a Terms instance is only ever accessed by the thread that created it. Terms does not document whether it's thread-safe or not, but I don't believe I've ever encountered one that wasn't thread-safe. Furthermore, the result of MultiTerms.getTerms is definitely thread-safe and can be useful to cache, as there is some overhead. Thoughts on this?

rmuir · 2023-03-04T20:50:28Z

This was done on purpose, as there's no such guarantee on Terms instances. Of particular concern are the ones returned from the TermVectors class.

You can see above in the iterations on this PR that we beefed up thread-confinement tests and documentation for both StoredFields and TermVectors: these data structures were always thread-confined before by ThreadLocal variables. Now it is on the user and (unfortunately) it opens up the possibility of mistakes.
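As an illustration of what a thread-confinement check can look like, here is a minimal sketch. It is not the AssertingLeafReader implementation: ThreadConfined and ThreadConfinedDemo are hypothetical names, and the wrapper simply remembers the creating thread and rejects access from any other thread.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical wrapper: remembers its creating thread and rejects access from any other. */
final class ThreadConfined<T> {
  private final T delegate;
  private final Thread creationThread = Thread.currentThread();

  ThreadConfined(T delegate) {
    this.delegate = delegate;
  }

  /** Returns the wrapped object, or fails if the caller is not the creating thread. */
  T get() {
    if (Thread.currentThread() != creationThread) {
      throw new IllegalStateException(
          "created by " + creationThread.getName()
              + " but accessed by " + Thread.currentThread().getName());
    }
    return delegate;
  }
}

final class ThreadConfinedDemo {
  /** Accesses the wrapper from a second thread and reports whether that access was rejected. */
  static boolean rejectedFromOtherThread(ThreadConfined<?> confined) {
    AtomicReference<Throwable> failure = new AtomicReference<>();
    Thread other = new Thread(() -> {
      try {
        confined.get();
      } catch (IllegalStateException e) {
        failure.set(e);
      }
    });
    other.start();
    try {
      other.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return failure.get() != null;
  }

  public static void main(String[] args) {
    ThreadConfined<String> terms = new ThreadConfined<>("terms for field 'body'");
    System.out.println(terms.get());                    // same thread: allowed
    System.out.println(rejectedFromOtherThread(terms)); // other thread: rejected
  }
}
```

A check like this catches accidental sharing only while the wrapper is in use. It says nothing about whether the wrapped object would actually misbehave under concurrent access.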

We no longer "hand-hold" users with ThreadLocal variables, but at the same time, it allows for large memory reduction for applications that don't need them.

I don't think MultiTerms/Terms should be accessed by multiple threads either. There's no guarantee here that there isn't some unsafe publishing of updated variables or something like that: please don't cache them. If you insist on caching them, maybe consider using your own ThreadLocal for that purpose.
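An application-owned ThreadLocal cache, as suggested above, might look roughly like the following sketch. PerThreadCache and PerThreadCacheDemo are hypothetical names, and a plain Object stands in for whatever is expensive to build (for example a merged Terms view). Only java.lang.ThreadLocal is a real API here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical application-side cache: one instance per thread, owned by the application. */
final class PerThreadCache<T> {
  private final ThreadLocal<T> cache;

  PerThreadCache(Supplier<? extends T> loader) {
    // The loader runs on a thread's first get(), and again after that thread calls release().
    this.cache = ThreadLocal.withInitial(loader);
  }

  /** Returns the calling thread's instance, building it on first use. */
  T get() {
    return cache.get();
  }

  /** Drops the calling thread's instance so it can be garbage collected. */
  void release() {
    cache.remove();
  }
}

final class PerThreadCacheDemo {
  /** Runs the given number of threads, each reading twice, and returns how often the loader ran. */
  static int loadsFor(int threads) {
    AtomicInteger loads = new AtomicInteger();
    PerThreadCache<Object> cache = new PerThreadCache<>(() -> {
      loads.incrementAndGet();
      return new Object(); // stands in for an expensive-to-build value
    });
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        Object first = cache.get();
        Object second = cache.get();
        if (first != second) {
          throw new IllegalStateException("expected reuse within one thread");
        }
        cache.release(); // the application decides when the cached value goes away
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return loads.get();
  }

  public static void main(String[] args) {
    System.out.println(loadsFor(4));
  }
}
```

The cached value never crosses threads, and the application decides how many of these caches exist and when each thread's entry is released.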

hydrogen666 · 2023-05-22T10:23:26Z

In previous versions, StoredFieldsReader was cached in a ThreadLocal, but now we need to clone StoredFieldsReader every time we need to visit stored fields. Will this PR cause any performance issues?

hydrogen666 · 2023-05-22T12:16:41Z

In previous versions, StoredFieldsReader was cached in a ThreadLocal, but now we need to clone StoredFieldsReader every time we need to visit stored fields. Will this PR cause any performance issues?

It is reasonable to clone StoredFieldsReader once per IndexSearcher context, because one search request may hit many docs, but in some circumstances, such as get by _id in Elasticsearch, cloning StoredFieldsReader every time may cause performance issues. Does this concern make sense?

rmuir added 2 commits December 5, 2022 18:26

Factor out user API from o.a.l.codecs.StoredFieldsReader/TermVectorsR…

213bf04

…eader into o.a.l.index These apis "StoredFields" and "TermVectors" act like the other enums on fields/postings. The codec *Readers retain all the low-level details such as checkIntegrity/clone/etc

add new api, start fixing tests

e298870

jpountz reviewed Dec 6, 2022

delegate the deprecated APIs in FilterCodecReader.

12a5dfa

there's no longer a final default implementation in CodecReader (it throws UOE), but as a practical matter, it seems to make tests happy and isn't too invasive.

rmuir marked this pull request as ready for review December 7, 2022 13:04

jpountz approved these changes Dec 8, 2022

...ne/test-framework/src/java/org/apache/lucene/tests/index/BaseStoredFieldsFormatTestCase.java

jpountz added 5 commits December 8, 2022 11:19

Fix mismatched test.

9e1c254

Add more checks on top of stored fields readers.

a8eb079

Fix more tests to call the proper stored fields API (mechanical change).

8e995aa

Fix more tests to call the proper term vectors API (mechanical change).

ed5b578

Fix more call sites in non-test sources.

0b239c5

rmuir commented Dec 8, 2022

lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java

jpountz and others added 9 commits December 8, 2022 18:52

Thread safety.

d882b67

Fix code sample for vectors

3443f04

hoist ir.storedFields() out of loops (mostly core tests)

2b0bb3b

Makes tests look more natural, perform better, and give better coverage (e.g. exercise reuse of the same instance)

Merge branch 'main' into thread_local_annihilation

4e065bb

fix merging bugs

ce3a08c

Fix remaining javadocs refs to deprecated APIs in lucene/core

4e8d234

tidy

2e91ef2

hoist more instances out of loops

6c791e2

cut over remaining stragglers

5ccdeb6

rmuir added 2 commits December 9, 2022 13:24

implement new api on FieldFilterLeafReader

c2da550

AssertingLeafReader: thread-safety checks for StoredFields,TermVector…

f266a81

…s,Terms,Fields instances These should not be shared across threads, so add checks.

dweiss reviewed Dec 10, 2022

lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java

docs

270423c

rmuir added this to the 9.5.0 milestone Dec 11, 2022

jpountz approved these changes Dec 11, 2022

Add notes on thread safety.

6265691

rmuir mentioned this pull request Dec 12, 2022

Clear thread local values on UTF8TaxonomyWriterCache.close() #12013

Closed

rmuir merged commit 47f8c1b into apache:main Dec 13, 2022

asfgit pushed a commit that referenced this pull request Dec 13, 2022

Remove deprecated API in 10.x (#11998)

9eeab8c

asfgit pushed a commit that referenced this pull request Dec 13, 2022

Remove deprecated API usages in 9.x-only code (#11998)

4d7b0e0

magibney mentioned this pull request Jan 31, 2023

Decouple QTP idleTimeout from pool shrink rate jetty/jetty.project#9237

Closed

rmuir mentioned this pull request Feb 10, 2023

Separate index/store document APIs take 2? #12142

Open

javanna mentioned this pull request Feb 22, 2023

Move to new stored fields and term vector interfaces elastic/elasticsearch#94005

Closed

dsmiley mentioned this pull request Mar 4, 2023

SOLR 16642 : upgrade Solr to use Lucene 9.5.0 apache/solr#1360

Merged

steffenvan pushed a commit to steffenvan/jackrabbit-oak that referenced this pull request Aug 27, 2023

refactor: used StoredFields recommended in apache/lucene#11998

90b02d0

jpountz mentioned this pull request Jan 8, 2024

LUCENE-10519: Improvement for CloseableThreadLocal #816

Closed
