LUCENE-10280: Store BKD blocks with continuous ids more efficiently #510
Conversation
lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java
I'll wait before merging to give @iverase a chance to give his opinion on this change but it looks good to me.
out.writeVInt(docIds[start]);
  return;
} else if (min2max <= (count << 4)) {
  // Only trigger the bitset optimization when max - min + 1 <= 16 * count, in order to avoid
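To illustrate the heuristic in the condition above, here is a hypothetical sketch (not the actual Lucene code, and the class and method names are invented): a bitset over the range `[min, max]` costs `max - min + 1` bits, so requiring `max - min + 1 <= 16 * count` caps the bitset cost at 16 bits (2 bytes) per document, which is cheaper than a 24-bit-per-doc encoding.

```java
public class BitSetHeuristicSketch {
  /**
   * Returns true when encoding the sorted ids in [start, start + count) as a
   * bitset over [min, max] would cost at most 16 bits (2 bytes) per document,
   * i.e. when max - min + 1 <= 16 * count.
   */
  public static boolean shouldUseBitSet(int[] docIds, int start, int count) {
    int min2max = docIds[start + count - 1] - docIds[start] + 1;
    return min2max <= (count << 4);
  }

  public static void main(String[] args) {
    // Dense block: range of 32 values for 4 docs -> 32 <= 64, bitset is cheap.
    assert shouldUseBitSet(new int[] {100, 110, 120, 131}, 0, 4);
    // Sparse block: range of 1001 values for 4 docs -> 1001 > 64, skip the bitset.
    assert !shouldUseBitSet(new int[] {100, 200, 300, 1100}, 0, 4);
    System.out.println("ok");
  }
}
```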
maybe assert that min2max > count?
+1, fixed
Glad we didn't introduce another implementation of DocIdSetIterator and all the dance around it.
Hi @jpountz ! Just a reminder: maybe we can merge this now? :) By the way, I found that there is a PR about using readLELongs in BKD: apache/lucene-solr#1538. The discussion of that issue has stalled since last year. It looks promising and I'd like to play with it, but I wonder why it stopped, whether there were problems with the idea, or whether someone is already working on it?
I will merge soon if Adrien does not beat me to it. I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity. Maybe now that we have little-endian codecs it might make more sense. I am not planning to continue that work, so please feel free to have a go.
@iverase Thanks for your explanation!
I find that we were trying to use #readLELongs to speed up the 24/32-bit cases. My current thoughts are about using readLELongs to speed up the sorted-ids situation (i.e. low- or medium-cardinality fields), whose bottleneck is reading docIds. For sorted arrays, we can compute the deltas of the sorted ids and encode/decode those deltas. I raised an issue based on this idea; the benchmark results I posted there look promising. Would you like to help take a look when you have free time? Thanks!
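The delta idea above can be sketched as follows. This is a hypothetical illustration, not the proposed patch: the class and method names are invented, and the actual proposal would pack the deltas into fewer bits and read them back in bulk (e.g. with a readLELongs-style bulk read) rather than keep them as an `int[]`.

```java
import java.util.Arrays;

public class SortedDeltaSketch {
  /** Delta-encode a sorted (non-decreasing) array of doc ids. */
  public static int[] encode(int[] sortedDocIds) {
    int[] deltas = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      deltas[i] = sortedDocIds[i] - prev; // small values when ids are dense
      prev = sortedDocIds[i];
    }
    return deltas;
  }

  /** Rebuild the original ids with a prefix sum over the deltas. */
  public static int[] decode(int[] deltas) {
    int[] docIds = new int[deltas.length];
    int acc = 0;
    for (int i = 0; i < deltas.length; i++) {
      acc += deltas[i];
      docIds[i] = acc;
    }
    return docIds;
  }

  public static void main(String[] args) {
    int[] ids = {3, 7, 8, 42, 43};
    // Round trip: decoding the deltas restores the original sorted ids.
    assert Arrays.equals(decode(encode(ids)), ids);
    System.out.println("ok");
  }
}
```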
For cases where the index is sorted on the field, blocks with continuous ids may be a common situation, and we can handle it more efficiently: we just need to check whether the ids in a block are continuous, and if so, we only need to write the first id of the block.
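The continuity check described above can be sketched like this (a hypothetical illustration with invented names, not the PR's code): for a sorted block of distinct ids, the ids are continuous exactly when `last - first == count - 1`, in which case writing the first id is enough to reconstruct the whole block.

```java
public class ContinuousIdsSketch {
  /**
   * Assumes docIds[start, start + count) is sorted and contains distinct ids;
   * then the block is continuous iff last - first == count - 1.
   */
  public static boolean isContinuous(int[] docIds, int start, int count) {
    return docIds[start + count - 1] - docIds[start] == count - 1;
  }

  public static void main(String[] args) {
    assert isContinuous(new int[] {5, 6, 7, 8}, 0, 4);
    // A gap (7 is missing) breaks continuity: 9 - 5 != 4 - 1.
    assert !isContinuous(new int[] {5, 6, 8, 9}, 0, 4);
    System.out.println("ok");
  }
}
```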