Migrate away from per-segment-per-threadlocals on SegmentReader #11998
…eader into o.a.l.index These apis "StoredFields" and "TermVectors" act like the other enums on fields/postings. The codec *Readers retain all the low-level details such as checkIntegrity/clone/etc
I like the suggested direction.
For the record, another direction I've been contemplating consisted of passing a set of doc IDs to the IndexReader#document API, but I think it's a worse option because it doesn't really work for merging. I like this one better.
As far as fixing tests is concerned, would you be ok with fixing them in a suboptimal mechanical way, e.g. replacing every call to indexReader.document(doc) with indexReader.storedFields().document(doc), without taking care of reusing the same StoredFields instance across multiple docs?
That's fine. or we could fix I wanted to throw the UOE, at least at first, to be sure i knew exactly what was calling old .document API (e.g. in case i forgot to fix a filter-reader). But it doesn't have to stay.
there's no longer a final default implementation in CodecReader (it throws UOE), but as a practical matter, it seems to make tests happy and isn't too invasive.
I got the tests happy for now with 12a5dfa. Maybe not the right solution in the end, but makes it easier to iterate when you have passing tests at least.
I pushed a fix for most call sites I think. The main remaining ones are in lucene/highlighter, which require a bit more changes.
Makes tests look more natural, perform better, and give better coverage (e.g. exercise reuse of the same instance)
I cutover all the remaining stuff, by nuking the old api locally and making sure full gradle check passes. attached is the patch i used to nuke the old APIs. It reveals two more things to fix:
…s,Terms,Fields instances These should not be shared across threads, so add checks.
It looks good. Are you good with me adding some notes on thread confinement to MIGRATE.txt and top-level javadocs of StoredFields and TermVectors?
+1 to improve docs
Add new stored fields and termvectors interfaces: IndexReader.storedFields() and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector(). The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments. Co-authored-by: Adrien Grand <jpountz@gmail.com>
See apache/lucene#11998 Only the StoredFieldsBenchmark is impacted. Otherwise luceneutil doesn't mess with stored fields/term vectors
Love the change here to migrate away from per-segment-per-threadlocals! When porting this to Solr, I noticed a new assertion that previously didn't exist. AssertingLeafReader now ensures that a Terms instance is only ever accessed by the thread that created it. Terms does not document whether it's thread-safe or not, but I don't believe I've ever encountered one that wasn't thread-safe. Furthermore, the result of MultiTerms.getTerms is definitely thread-safe and can be useful to cache as there is some overhead. Thoughts on this?
This was done on purpose as there's no such guarantee on You can see above in the iterations on this PR that we beefed up thread-confinement tests and documentation for both We no longer "hand-hold" users with I don't think MultiTerms/Terms should be accessed by multiple threads either: there's no guarantee here that there isn't some unsafe publishing of updated variables or something like that. Please don't cache them. If you insist on caching them, maybe consider using your own ThreadLocal for that purpose.
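As a rough illustration of the kind of check discussed above (an instance may only be used by the thread that created it), here is a minimal self-contained sketch. Confined is a hypothetical wrapper invented for this example; it is not AssertingLeafReader and does not use any Lucene class.

```java
final class ThreadConfinementSketch {
    /** Hypothetical wrapper that remembers its creating thread; not a Lucene class. */
    static final class Confined<T> {
        private final T delegate;
        private final Thread owner = Thread.currentThread();

        Confined(T delegate) {
            this.delegate = delegate;
        }

        /** Returns the wrapped object, but only to the thread that created the wrapper. */
        T get() {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("accessed from a thread other than the creator");
            }
            return delegate;
        }
    }

    /** Returns true if reading the wrapper from a second thread is rejected. */
    static boolean rejectedFromOtherThread(Confined<?> confined) {
        boolean[] rejected = new boolean[1];
        Thread other = new Thread(() -> {
            try {
                confined.get();
            } catch (IllegalStateException e) {
                rejected[0] = true;
            }
        });
        other.start();
        try {
            other.join(); // join() makes the write to rejected[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rejected[0];
    }
}
```

A cache shared across threads would trip a check like this on its first cross-thread read, which is why the suggestion above is to keep any cache per thread.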
In previous version,
It is reasonable to clone
Background
Currently the stored fields and term vectors apis on the index are "stateless".
Unlike the other parts of the APIs, users can't call any iterators/enumerators, only:
Instead of adding any real iterator, ThreadLocals were added on each segment behind the scenes to avoid having to clone() the stored fields or term vectors reader on every document. For example, this caching could reduce the amount of NIOFSDirectory buffer refills and other associated overhead.
But the old API dates from an earlier time, and the problem only seems to get worse these days, because the implementations are more complicated and do block compression, dictionaries, etc. In some cases, the cached resources can grow to extremely large amounts (see @luyuncheng's writeup in #11987 as an example).
These per-segment threadlocals can cause memory issues if you have tons of segments, tons of threads, or especially both. Seems plenty of java developers can't help but run into it.
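To make the threads-times-segments growth concrete, here is a small self-contained model. ToySegment is a hypothetical stand-in, not Lucene's SegmentReader: it only mimics the described behavior of keeping one cached reader clone per thread in a ThreadLocal, and the cached clone is just a plain Object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class PerSegmentThreadLocalSketch {
    /** Hypothetical stand-in for a segment that caches one reader clone per thread. */
    static final class ToySegment {
        private final ThreadLocal<Object> cachedClone;

        ToySegment(AtomicInteger clones) {
            this.cachedClone = ThreadLocal.withInitial(() -> {
                clones.incrementAndGet(); // first access by this thread creates a clone
                return new Object();
            });
        }

        void document(int docId) {
            cachedClone.get(); // later accesses by the same thread reuse its clone
        }
    }

    /** Counts the clones created once every thread has read from every segment. */
    static int clonesCreated(int segments, int threads, int docsPerThread) {
        AtomicInteger clones = new AtomicInteger();
        List<ToySegment> index = new ArrayList<>();
        for (int i = 0; i < segments; i++) {
            index.add(new ToySegment(clones));
        }
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int d = 0; d < docsPerThread; d++) {
                    for (ToySegment segment : index) {
                        segment.document(d);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return clones.get();
    }
}
```

In this model the count depends only on segments times threads, not on how many documents are read: 5 segments and 4 threads give 20 clones. With long-lived pool threads, each of those cached objects would stay referenced by its thread rather than being released after a request.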
New API
I propose we deprecate these APIs:
IndexReader.document()
IndexReader.getTermVectors() / IndexReader.getTermVector()
IndexSearcher.doc()
And replace the functionality with these APIs:
IndexReader.storedFields()
IndexReader.termVectors()
IndexSearcher.storedFields()
Instead of lucene internally caching a reader per-thread-per-segment, a user can get one themselves e.g. per-search:
It will re-use the datastructures across a search if someone has thousands and thousands of hits, but avoid the ThreadLocal pain.
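The per-search pattern can be sketched without any Lucene dependency as follows. ToyReader and ToyStoredFields are hypothetical stand-ins that only mirror the shape of the calls named above (storedFields() on the reader, then document(docId) on the returned instance); the string "documents" and the instance counter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PerSearchStoredFieldsSketch {
    /** Hypothetical stand-in for the per-caller stored fields view; not Lucene's class. */
    interface ToyStoredFields {
        String document(int docId);
    }

    /** Hypothetical stand-in for a reader that counts how many views it hands out. */
    static final class ToyReader {
        int instancesCreated = 0;

        ToyStoredFields storedFields() {
            instancesCreated++;
            return docId -> "doc-" + docId;
        }
    }

    /** Fetches one view for the whole search and reuses it for every hit. */
    static List<String> loadHits(ToyReader reader, int[] hits) {
        ToyStoredFields storedFields = reader.storedFields();
        List<String> docs = new ArrayList<>();
        for (int hit : hits) {
            docs.add(storedFields.document(hit));
        }
        return docs;
    }
}
```

Calling reader.storedFields() once per hit would return the same documents, but it gives up the reuse across hits that this pattern is meant to provide.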
Deprecated API
The deprecated APIs still work the same way, and still use ThreadLocals. This way apps can safely migrate to the new APIs at their convenience. Once all the deprecations are fixed, then no ThreadLocals will be created any more, and the app can enjoy the RAM savings.
All code and tests have been moved to the new API in this PR. After backporting to 9.x, we can commit this patch to remove the deprecated APIs / ThreadLocal support completely: nukeDeprecated.patch.txt