Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate away from per-segment-per-threadlocals on SegmentReader #11998

Merged
merged 21 commits into from
Dec 13, 2022

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Dec 6, 2022

Background

Currently the stored fields and term vectors apis on the index are "stateless".
Unlike the other parts of the APIs, users can't call any iterators/enumerators, only:

indexReader.document(0);
indexReader.document(1);
// up to potentially thousands of docs because lusers do that

Instead of adding any real iterator, ThreadLocals were added on each segment behind-the-scenes to prevent from having to clone() the stored fields or term vectors reader on every document. For example, this caching could reduce the amount of NIOFSDirectory buffer refills and other associated overhead.

But the old API from a previous time, seems to only gets worse these days, because the implementations are more complicated and do block-compression, dictionaries, etc. In some cases, the cached resources can grow to extremely large amounts (see @luyuncheng writeup in #11987 as an example)

These per-segment threadlocals can cause memory issues if you have tons of segments, tons of threads, or especially both. Seems plenty of java developers can't help but run into it.

New API

I propose we deprecate these APIs:

  • IndexReader.document()
  • IndexReader.termVectors()
  • IndexSearcher.doc()

And replace the functionality with these APIs:

  • IndexReader.storedFields()
  • IndexReader.termVectors()
  • IndexSearcher.storedFields()

Instead of lucene internally caching a reader per-thread-per-segment, a user can get one themselves e.g. per-search:

TopDocs hits = searcher.search(query, 10);
StoredFields storedFields = searcher.storedFields();
for (ScoreDoc hit : hits.scoreDocs) {
  Document doc = storedFields.document(hit.doc);
}
// now the StoredFields instance can be gc'd normally

It will re-use the datastructures across a search if someone has thousands and thousands of hits, but avoid the ThreadLocal pain.

Deprecated API

The deprecated APIs still work the same way, and still use ThreadLocals. This way apps can safely migrate to the new APIs at their convenience. Once all the deprecations are fixed, then no ThreadLocals will be created any more, and the app can enjoy the RAM savings.

All code and tests have been moved to the new API in this PR, after backporting to 9.x, we can commit this patch to remove the deprecated apis / threadlocal support completely: nukeDeprecated.patch.txt

…eader into o.a.l.index

These apis "StoredFields" and "TermVectors" act like the other enums on fields/postings. The codec *Readers retain all the low-level details such as checkIntegrity/clone/etc
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggested direction.

For the record, another direction I've been contemplating consisted of passing a set of doc IDs to the IndexReader#document API, but I think it's a worse option because it doesn't really work for merging. I like this one better.

As far as fixing tests is concerned, would you be ok with fixing them in a suboptimal mechanical way, e.g. replacing every call to indexReader.document(doc) with indexreader.storedFields().document(doc), without taking care of reusing the same StoredFields instance across multiple docs?

@rmuir
Copy link
Member Author

rmuir commented Dec 6, 2022

That's fine. or we could fix newSearcher to not wrap with crazy CodecReader's. or we could fix said CodecReaders (since they are only used for tests) to implement the deprecated document apis, like SegmentReader does. Or we could give CodecReader a default impl that isn't very performant other than UOE.

I wanted to throw the UOE, at least at first, to be sure i knew exactly what was calling old .document API (e.g. in case i forgot to fix a filter-reader). But it doesn't have to stay.

there's no longer a final default implementation in CodecReader (it
throws UOE), but as a practical matter, it seems to make tests happy and
isn't too invasive.
@rmuir
Copy link
Member Author

rmuir commented Dec 6, 2022

I got the tests happy for now with 12a5dfa

Maybe not the right solution in the end, but makes it easier to iterate when you have passing tests at least.

@rmuir rmuir marked this pull request as ready for review December 7, 2022 13:04
@jpountz
Copy link
Contributor

jpountz commented Dec 8, 2022

I pushed a fix for most call sites I think. The main remaining ones are in lucene/highlighter, which require a bit more changes.

@rmuir
Copy link
Member Author

rmuir commented Dec 9, 2022

I cutover all the remaining stuff, by nuking the old api locally and making sure full gradle check passes.

attached is the patch i used to nuke the old APIs.
nukeDeprecated.patch.txt

It reveals two more things to fix:

  • Fix test-framework's FieldFilterLeafReader to implement new APIs
  • Look at test-framework's AssertingLeafReader to see if we can add safety checks to new APIs

…s,Terms,Fields instances

These should not be shared across threads, so add checks.
@rmuir rmuir added this to the 9.5.0 milestone Dec 11, 2022
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Are you good with me adding some notes on thread confinement to MIGRATE.txt and top-level javadocs of StoredFields and TermVectors?

@rmuir
Copy link
Member Author

rmuir commented Dec 11, 2022

+1 to improve docs

@rmuir rmuir merged commit 47f8c1b into apache:main Dec 13, 2022
asfgit pushed a commit that referenced this pull request Dec 13, 2022
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
asfgit pushed a commit that referenced this pull request Dec 13, 2022
rmuir added a commit to mikemccand/luceneutil that referenced this pull request Dec 13, 2022
See apache/lucene#11998

Only the StoredFieldsBenchmark is impacted. Otherwise luceneutil
doesn't mess with stored fields/term vectors
@dsmiley
Copy link
Contributor

dsmiley commented Mar 4, 2023

Love the change here to migrate away from per-segment-per-threadlocals!

When porting this to Solr, I noticed a new assertion that previously didn't exist. AssertingLeafReader now ensures that a Terms instance is only ever accessed by the thread that created it. Terms does not document wether it's thread-safe or not but I don't believe I've ever encountered one that wasn't thread-safe. Furthermore, the result of MultiTerms.getTerms is definitely thread-safe and can be useful to cache as there is some overhead. Thoughts on this?

@rmuir
Copy link
Member Author

rmuir commented Mar 4, 2023

This was done on purpose as there's no such guarantee on Terms instances. Of particular concern are the ones returned back from TermVectors class.

You can see above in the iterations on this PR that we beefed up thread-confinement tests and documentation for both StoredFields and TermVectors: these data structures were always thread-confined before by ThreadLocal variables. Now it is on the user and (unfortunately) it opens up the possibility of mistakes.

We no longer "hand-hold" users with ThreadLocal variables, but at the same time, it allows for large memory reduction for applications that don't need them.

I don't think MultiTerms/Terms should be accessed by multiple threads either, there's no guarantee that here, that there isnt some unsafe publishing of variables updated or something like that: please don't cache them. If you insist on caching them, maybe consider using your own ThreadLocal for that purpose.

@hydrogen666
Copy link

hydrogen666 commented May 22, 2023

In previous version, StoredFieldsReader is cached in ThreadLocal, but now we need to clone StoredFieldsReader every time if we need to visit store fields. Will this PR cause any performance issue?

@hydrogen666
Copy link

hydrogen666 commented May 22, 2023

In previous version, StoredFieldsReader is cached in ThreadLocal, but now we need to clone StoredFieldsReader every time if we need to visit store fields. Will this PR cause any performance issue?

It is reasonable to clone StoredFieldsReader in one IndexSearcher context because one search request may hit many docs, but in some circumstances such as get by _id in Elasticsearch, cloning StoredFieldsReader every time may cause performance issue. Does my aforementioned concern make sense?

steffenvan pushed a commit to steffenvan/jackrabbit-oak that referenced this pull request Aug 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants