LUCENE-10280: Store BKD blocks with continuous ids more efficiently #510

gf2121 · 2021-12-03T07:31:28Z

When the index is sorted on the field, blocks with continuous doc ids are likely to be common, and we can handle them more efficiently. We just need to check

strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count

to determine whether the ids are continuous. If they are, we only need to write the first id of the block.
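To make the check above concrete, here is a minimal standalone sketch. It is not the code from this PR: the class name `ContinuousBlockSketch` and the helpers `isStrictlySorted`, `isContinuous`, and `expand` are hypothetical and work on plain arrays rather than Lucene's output streams.

```java
import java.util.Arrays;

// Hypothetical illustration only; names are not taken from Lucene.
final class ContinuousBlockSketch {

  /** True if docIds[start .. start+count) is strictly increasing. */
  static boolean isStrictlySorted(int[] docIds, int start, int count) {
    for (int i = 1; i < count; i++) {
      if (docIds[start + i - 1] >= docIds[start + i]) {
        return false;
      }
    }
    return true;
  }

  /** True if the block is strictly sorted and covers every id between its first and last id. */
  static boolean isContinuous(int[] docIds, int start, int count) {
    if (count == 0) {
      return false;
    }
    return isStrictlySorted(docIds, start, count)
        && (docIds[start + count - 1] - docIds[start] + 1) == count;
  }

  /** Rebuilds a continuous block from its first id, which is all a writer would need to store. */
  static int[] expand(int firstId, int count) {
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = firstId + i;
    }
    return out;
  }

  public static void main(String[] args) {
    int[] docIds = {7, 100, 101, 102, 103, 9};
    System.out.println(isContinuous(docIds, 1, 4));
    System.out.println(Arrays.toString(expand(docIds[1], 4)));
  }
}
```

The strict-sortedness test matters: an unsorted block such as 1, 3, 2, 4 also satisfies last - first + 1 == count, so the range comparison alone is not sufficient.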

lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java

jpountz

I'll wait before merging to give @iverase a chance to give his opinion on this change but it looks good to me.

jpountz · 2021-12-03T14:19:42Z

lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java

+        out.writeVInt(docIds[start]);
+        return;
+      } else if (min2max <= (count << 4)) {
+        // Only trigger bitset optimization when max - min + 1 <= 16 * count in order to avoid


maybe assert that min2max > count?
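A rough sketch of why that assertion should hold, assuming the branch is only reached for strictly sorted blocks: strictly increasing ids always span at least `count` values, and the equality case is already taken by the continuous branch. The class `EncodingChoiceSketch`, the `Encoding` enum, and `choose` are hypothetical names; `OTHER` is a placeholder for whatever encodings the real writer falls back to.

```java
// Hypothetical illustration only; this is not DocIdsWriter.
final class EncodingChoiceSketch {

  enum Encoding { CONTINUOUS, BITSET, OTHER }

  static Encoding choose(int[] docIds, int start, int count) {
    boolean strictlySorted = count > 0;
    for (int i = 1; i < count && strictlySorted; i++) {
      strictlySorted = docIds[start + i - 1] < docIds[start + i];
    }
    if (strictlySorted) {
      int min2max = docIds[start + count - 1] - docIds[start] + 1;
      if (min2max == count) {
        // Every id in [min, max] is present: only the first id needs to be stored.
        return Encoding.CONTINUOUS;
      } else if (min2max <= (count << 4)) {
        // Strictly increasing ids span at least count values, and equality was
        // handled above, so the span must be larger here (checked when run with -ea).
        assert min2max > count;
        return Encoding.BITSET;
      }
    }
    return Encoding.OTHER;
  }

  public static void main(String[] args) {
    System.out.println(choose(new int[] {5, 6, 7, 8}, 0, 4));
    System.out.println(choose(new int[] {0, 3, 9, 20}, 0, 4));
    System.out.println(choose(new int[] {0, 1000, 2000, 3000}, 0, 4));
  }
}
```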

iverase

Glad we didn't introduce another implementation of DocIdSetIterator and all the dance around it.

gf2121 · 2021-12-06T16:43:16Z

Hi @jpountz! Just a reminder: maybe we can merge this now? :)

By the way, I found a PR about using readLELongs in BKD: apache/lucene-solr#1538. The discussion there stopped last year. The idea looks promising and I'd like to play with it, but I wonder why it stopped: are there problems with the idea, or is someone already working on it?

iverase · 2021-12-06T18:22:20Z

I will merge soon if Adrien does not beat me to it.

I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity. Maybe now that we have little-endian codecs it makes more sense. I am not planning to continue that work, so please feel free to have a go.

gf2121 · 2021-12-08T18:58:01Z

@iverase Thanks for your explanation!

> I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity.

I find that we were trying to use #readLELongs to speed up the 24/32-bit case in DocIdsWriter, where the ids in the block are unsorted, which typically happens with high-cardinality fields. I think queries on high-cardinality fields spend most of their time in visitDocValues rather than readDocIds, so maybe this is why we could not see an obvious gain in end-to-end took time?

My current idea is to use readLELongs to speed up the sorted-ids case (low- or medium-cardinality fields), whose bottleneck is reading docIds. For sorted arrays, we can compute the deltas of the sorted ids and encode/decode them the way we do in StoredFieldsInts.

I raised an issue based on this idea. The benchmark results I posted in the issue look promising. Would you mind taking a look when you have some free time? Thanks!
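As a rough illustration of the delta idea for sorted ids, the sketch below uses plain arrays. It does not use readLELongs or StoredFieldsInts, and the names `DeltaIdsSketch`, `encode`, `decode`, and `bitsPerDelta` are hypothetical. It only shows that deltas of a sorted block are small and that a prefix sum restores the original ids.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the encoding proposed in the follow-up issue.
final class DeltaIdsSketch {

  /** out[0] is the first id; out[i] is sortedIds[i] - sortedIds[i - 1] for i >= 1. */
  static int[] encode(int[] sortedIds) {
    int[] out = new int[sortedIds.length];
    for (int i = 0; i < sortedIds.length; i++) {
      out[i] = (i == 0) ? sortedIds[0] : sortedIds[i] - sortedIds[i - 1];
    }
    return out;
  }

  /** Inverse of encode: a running (prefix) sum over the deltas. */
  static int[] decode(int[] encoded) {
    int[] out = new int[encoded.length];
    int current = 0;
    for (int i = 0; i < encoded.length; i++) {
      current += encoded[i];
      out[i] = current;
    }
    return out;
  }

  /** Bits needed to represent the largest delta, ignoring the first id at index 0. */
  static int bitsPerDelta(int[] encoded) {
    int max = 0;
    for (int i = 1; i < encoded.length; i++) {
      max = Math.max(max, encoded[i]);
    }
    return 32 - Integer.numberOfLeadingZeros(max);
  }

  public static void main(String[] args) {
    int[] ids = {10, 12, 13, 20};
    int[] encoded = encode(ids);
    System.out.println(Arrays.toString(encoded));
    System.out.println(bitsPerDelta(encoded));
    System.out.println(Arrays.toString(decode(encoded)));
  }
}
```

With a small fixed width per delta, many deltas fit in one long, which is what would make a bulk long-reading call attractive for this case.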

gf2121 added 8 commits December 3, 2021 15:18

init

d839923

init

1778a4c

fix

95489b3

Merge remote-tracking branch 'origin/main' into LUCENE-10280

aa47f19

and not

b8ff066

format

fd25e4d

CHANGES

5942530

iter

724099e

sonatype-lift bot reviewed Dec 3, 2021

lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java (outdated)

gf2121 added 3 commits December 3, 2021 20:18

fix

1f7f317

format

5756acd

fix

3610091

jpountz approved these changes Dec 3, 2021

assert

10e1478

iverase approved these changes Dec 3, 2021

iverase merged commit 8525356 into apache:main Dec 7, 2021

iverase pushed a commit that referenced this pull request Dec 7, 2021

LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#…

892e324

…510)

easyice mentioned this pull request Jul 6, 2023

Optimize DocIdsWriter for BKD in reverse case with index sorting #12420

Closed
