
LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues #557

Open · gf2121 wants to merge 2 commits into main

Conversation

@gf2121 (Contributor, Author) commented Dec 21, 2021

See https://issues.apache.org/jira/browse/LUCENE-10333
#11369

PS: I'm pushing this code to quickly see whether the approach makes sense to you; I'll add more tests if it passes the first round of review :)

@rmuir (Member) commented Dec 21, 2021

Personally, I'm not a fan of the API change on LongValues and the additional complexity in DirectReader just to read blocks of 2 values at a time.

I'd be more interested in a more general solution, e.g. a new DocValuesFormat that compresses the values in blocks with FOR/PFOR, similar to the compression of the postings lists? It should be possible to do now that DocValues has a forward-only next/advance API (again similar to postings).
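For readers less familiar with the encoding being discussed, below is a minimal, self-contained sketch of frame-of-reference (FOR) packing on a single block of values. It is illustrative only, not Lucene's actual ForUtil; PFOR additionally stores a few oversized values as exceptions so the rest of the block can use a smaller bit width.

```java
// Illustrative FOR encoding of one block: subtract the block minimum and bit-pack the
// deltas at the smallest width that fits. Not Lucene's ForUtil, just a sketch of the idea.
final class ForBlockSketch {
  final long min;        // reference value subtracted from every entry in the block
  final int bpv;         // bits per packed delta
  final long[] packed;   // bit-packed deltas

  ForBlockSketch(long[] block) {
    long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
    for (long v : block) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
    this.min = lo;
    this.bpv = Math.max(1, 64 - Long.numberOfLeadingZeros(hi - lo));
    this.packed = new long[(block.length * bpv + 63) / 64];
    for (int i = 0; i < block.length; i++) {
      long delta = block[i] - lo;
      int bit = i * bpv, word = bit >>> 6, shift = bit & 63;
      packed[word] |= delta << shift;
      if (shift + bpv > 64) {                 // delta spans two packed longs
        packed[word + 1] |= delta >>> (64 - shift);
      }
    }
  }

  /** Decodes the value at {@code offset} within this block. */
  long get(int offset) {
    int bit = offset * bpv, word = bit >>> 6, shift = bit & 63;
    long delta = packed[word] >>> shift;
    if (shift + bpv > 64) {
      delta |= packed[word + 1] << (64 - shift);
    }
    long mask = bpv == 64 ? -1L : (1L << bpv) - 1;
    return (delta & mask) + min;
  }
}
```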

@gf2121 (Contributor, Author) commented Dec 21, 2021

Thanks a lot for the quick feedback! Using FOR/PFOR sounds promising, I'll give it a try.

@rmuir (Member) commented Dec 21, 2021

Might be easiest to prototype a new DocValuesFormat (I would fork the existing Lucene90 as a start)? That's how I would attack the problem. Then I'd try to remove (at least some) usages of stuff like DirectReader and instead use iterators that decompress FOR/PFOR blocks, much like the postings lists?

It's a heavy-duty investigation / big task, as there are two cases (dense and sparse). For the dense case, if we want docid 500, we know how to get to its value (its offset within the block is 500 % BLOCKSIZE). For the sparse case, some docs may be missing: perhaps we also need skip data, again like the postings do.
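A small sketch of the two lookup paths described above, assuming values live in fixed-size compressed blocks that have been decoded into long[] arrays. The class name, BLOCK_SIZE, and the docID-to-index mapping are illustrative stand-ins; in Lucene the sparse mapping is what IndexedDISI maintains while iterating.

```java
import java.util.function.IntToLongFunction;

// Illustration of dense vs. sparse addressing over fixed-size value blocks.
final class BlockAddressingSketch {
  static final int BLOCK_SIZE = 128;

  final long[][] blocks;                      // decoded value blocks (stand-in for FOR/PFOR blocks)
  final IntToLongFunction docIdToValueIndex;  // sparse-case docID -> dense value index mapping

  BlockAddressingSketch(long[][] blocks, IntToLongFunction docIdToValueIndex) {
    this.blocks = blocks;
    this.docIdToValueIndex = docIdToValueIndex;
  }

  // Dense case: every doc has a value, so the docID is the value index itself;
  // "500 % BLOCKSIZE" is the offset inside block 500 / BLOCKSIZE.
  long denseValue(int docID) {
    return blocks[docID / BLOCK_SIZE][docID % BLOCK_SIZE];
  }

  // Sparse case: first map the docID to its dense value index, then address the block
  // the same way; without such a mapping, skip data over the blocks would be needed.
  long sparseValue(int docID) {
    long index = docIdToValueIndex.applyAsLong(docID);
    return blocks[(int) (index / BLOCK_SIZE)][(int) (index % BLOCK_SIZE)];
  }
}
```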

@gf2121 (Contributor, Author) commented Dec 21, 2021

Thank you very much for your guidance! This is of great help to me. I will fork a new DocValuesFormat for this experiment.

Actually, what I thought of at first was to only change the structure of the addresses: implement a new LongValues, e.g. a ForUtilLongValues, to replace the DirectReader or DirectMonotonicReader used to read addresses. When a caller gets a long by index, it would use ForUtil to decompress the required block (caching the block, so that if the next call lands in the same block we can reuse it). Then it seems I would not need to care about skip data for the sparse case, since IndexedDISI already maps the current docid to its value index. That way, all I need to take care of is the new LongValues implementation. Does this approach make sense?

Thanks again for the quick feedback and your patience!
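A rough sketch of what such a block-caching LongValues could look like. Here decodeBlock(...) is a hypothetical stand-in for ForUtil-based decompression and the block size is arbitrary, so this is only an outline of the idea described above, not the actual patch.

```java
import org.apache.lucene.util.LongValues;

// Sketch: a LongValues that decompresses one block at a time and caches it, so
// consecutive lookups within the same block only decode once.
abstract class CachingBlockLongValues extends LongValues {
  private static final int BLOCK_SHIFT = 7;               // 128 values per block
  private static final int BLOCK_MASK = (1 << BLOCK_SHIFT) - 1;

  private final long[] buffer = new long[1 << BLOCK_SHIFT];
  private long cachedBlock = -1;                          // no block decoded yet

  /** Decompress block {@code block} into {@code buffer}; hypothetical ForUtil-based helper. */
  protected abstract void decodeBlock(long block, long[] buffer);

  @Override
  public long get(long index) {
    final long block = index >>> BLOCK_SHIFT;
    if (block != cachedBlock) {                           // cache miss: decode the whole block
      decodeBlock(block, buffer);
      cachedBlock = block;
    }
    return buffer[(int) (index & BLOCK_MASK)];
  }
}
```

Because IndexedDISI has already mapped the docID to a dense value index before this lookup happens, the sparse case needs no extra skip data here, which is the point made in the comment above.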

@rmuir (Member) commented Dec 21, 2021

This seems like it is worth investigating! We would just try to speed up the "addressing" without dealing with all the rest of doc values (which may be a much more complex change).

@jpountz (Contributor) commented Dec 22, 2021

FYI @gsmiller and I looked into using a FOR-like encoding for doc values at https://issues.apache.org/jira/browse/LUCENE-10033. In short, it made queries that look at most docs, like the Browse* faceting tasks, much faster, but queries that only look at a subset of the index, like HighTermMonthSort, much slower.

I like this change, it's a net win that is not only going to speed up binary doc values but also sorted set and sorted numeric doc values, which also use a direct monotonic reader to map doc IDs to value indexes. DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
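For context on why reading two consecutive values matters here: a binary doc value's bytes are delimited by two adjacent entries of the address table, addresses[i] and addresses[i + 1], so a paired read can resolve the block metadata once instead of twice. Below is a hedged sketch of such an API; the LongPair holder and the two-argument get are illustrative, not the exact shape of the patch.

```java
// Holder for two consecutive address-table entries.
final class LongPair {
  long first;   // addresses[index]     -> start offset of the value bytes
  long second;  // addresses[index + 1] -> end offset (exclusive)
}

// Sketch of a reader offering a paired lookup next to the usual single-value lookup.
abstract class PairedLongValues {
  /** Single-value lookup, as on org.apache.lucene.util.LongValues. */
  public abstract long get(long index);

  /** Naive default: two independent lookups, each paying the block/metadata resolution. */
  public void get(long index, LongPair pair) {
    pair.first = get(index);
    pair.second = get(index + 1);
  }
  // A specialized reader (e.g. the DirectMonotonicReader change in this PR) can override
  // get(index, pair) to decode both values from the same block in a single pass.
}
```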

@gf2121 (Contributor, Author) commented Dec 22, 2021

@jpountz Thanks a lot for the suggestion!

> Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

Actually, DirectMonotonicReader is not the only impl involved: line 176 of DirectMonotonicReader could call DirectReader#get(index, twin).

I tried limiting the change to DirectMonotonicReader, and luceneutil shows the QPS increase (mentioned in the Jira issue) dropped from 20%+ to around 7%.

Inline review comment on the DirectMonotonicReader diff:

```java
        block++;
        twin.second = readers[block].get(0) + mins[block];
      } else {
        readers[block].get(blockIndex, twin);
```
@gf2121 (Contributor, Author): This line (readers[block].get(blockIndex, twin)) could call DirectReader#get(index, twin).

@rmuir (Member) commented Dec 22, 2021

> I like this change, it's a net win that is not only going to speed up binary doc values but also sorted set and sorted numeric doc values, which also use a direct monotonic reader to map doc IDs to value indexes. DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

I guess I see it as optimizing corner cases, which is why improvements aren't reflected in the current luceneutil benchmarks? But as I mentioned, my biggest concern is adding stuff to the LongValues API for this; IMO it isn't justified. I'm also not a fan of doubling the compression logic for every BPV. But I'm not personally going to block the change; just please help out on the LongValues API part, and I'd like to see some realistic use cases where we have benchmarks showing that it speeds things up.

github-actions bot commented Jan 9, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label on Jan 9, 2024