
LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues #557

Open · gf2121 wants to merge 2 commits into main

Conversation

@gf2121 (Contributor, Author) commented Dec 21, 2021

See https://issues.apache.org/jira/browse/LUCENE-10333
#11369

PS: I'm pushing this code to quickly see whether the approach makes sense to you; I'll add more tests if it passes the first round of review :)

@rmuir (Member) commented Dec 21, 2021

Personally, I'm not a fan of the API change on LongValues and the additional complexity in DirectReader just to read blocks of 2 values at a time.

I'd be more interested in a more general solution, e.g. a new DocValuesFormat that compresses the values in blocks with FOR/PFOR, similar to the compression of the postings lists? It should be possible to do now that DocValues has a forward-only next/advance API (again similar to postings).
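For readers less familiar with the encoding being discussed, below is a minimal, self-contained sketch of frame-of-reference (FOR) packing on a single block of values. It is illustrative only, not Lucene's actual ForUtil; PFOR additionally stores a few oversized values as exceptions so the rest of the block can use a smaller bit width.

```java
// Illustrative FOR encoding of one block: subtract the block minimum and bit-pack the
// deltas at the smallest width that fits. Not Lucene's ForUtil, just a sketch of the idea.
final class ForBlockSketch {
  final long min;        // reference value subtracted from every entry in the block
  final int bpv;         // bits per packed delta
  final long[] packed;   // bit-packed deltas

  ForBlockSketch(long[] block) {
    long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
    for (long v : block) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
    this.min = lo;
    this.bpv = Math.max(1, 64 - Long.numberOfLeadingZeros(hi - lo));
    this.packed = new long[(block.length * bpv + 63) / 64];
    for (int i = 0; i < block.length; i++) {
      long delta = block[i] - lo;
      int bit = i * bpv, word = bit >>> 6, shift = bit & 63;
      packed[word] |= delta << shift;
      if (shift + bpv > 64) {                 // delta spans two packed longs
        packed[word + 1] |= delta >>> (64 - shift);
      }
    }
  }

  /** Decodes the value at {@code offset} within this block. */
  long get(int offset) {
    int bit = offset * bpv, word = bit >>> 6, shift = bit & 63;
    long delta = packed[word] >>> shift;
    if (shift + bpv > 64) {
      delta |= packed[word + 1] << (64 - shift);
    }
    long mask = bpv == 64 ? -1L : (1L << bpv) - 1;
    return (delta & mask) + min;
  }
}
```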

@gf2121 (Contributor, Author) commented Dec 21, 2021

Thanks a lot for the quick feedback! Using FOR/PFOR sounds promising, I'll give it a try.

@rmuir (Member) commented Dec 21, 2021

Might be easiest to prototype a new DocValuesFormat (I would fork the existing Lucene90 as a start)? That's how I would attack the problem. Then I'd try to remove (at least some) usages of stuff like DirectReader and instead use iterators that decompress FOR/PFOR blocks, much like the postings lists?

It's a heavy-duty investigation / big task, as there are two cases (dense and sparse). For the dense case, if we want docid 500, we know how to get to its value (its offset within the block is 500 % BLOCKSIZE). For the sparse case, some docs may be missing: perhaps we also need skip data, again like the postings do.
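A small sketch of the two lookup paths described above, assuming values live in fixed-size compressed blocks that have been decoded into long[] arrays. The class name, BLOCK_SIZE, and the docID-to-index mapping are illustrative stand-ins; in Lucene the sparse mapping is what IndexedDISI maintains while iterating.

```java
import java.util.function.IntToLongFunction;

// Illustration of dense vs. sparse addressing over fixed-size value blocks.
final class BlockAddressingSketch {
  static final int BLOCK_SIZE = 128;

  final long[][] blocks;                      // decoded value blocks (stand-in for FOR/PFOR blocks)
  final IntToLongFunction docIdToValueIndex;  // sparse-case docID -> dense value index mapping

  BlockAddressingSketch(long[][] blocks, IntToLongFunction docIdToValueIndex) {
    this.blocks = blocks;
    this.docIdToValueIndex = docIdToValueIndex;
  }

  // Dense case: every doc has a value, so the docID is the value index itself;
  // "500 % BLOCKSIZE" is the offset inside block 500 / BLOCKSIZE.
  long denseValue(int docID) {
    return blocks[docID / BLOCK_SIZE][docID % BLOCK_SIZE];
  }

  // Sparse case: first map the docID to its dense value index, then address the block
  // the same way; without such a mapping, skip data over the blocks would be needed.
  long sparseValue(int docID) {
    long index = docIdToValueIndex.applyAsLong(docID);
    return blocks[(int) (index / BLOCK_SIZE)][(int) (index % BLOCK_SIZE)];
  }
}
```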

@gf2121 (Contributor, Author) commented Dec 21, 2021

Thank you very much for your guidance! This is of great help to me. I will fork a new DocValuesFormat for this experiment.

Actually, what I thought of at first was to only change the structure of the addresses: implement a new LongValues, e.g. a ForUtilLongValues, to replace the DirectReader or DirectMonotonicReader used to read addresses. When a caller gets a long by index, it would use ForUtil to decompress the required block (caching the block, so that if the next call lands in the same block we can reuse it). Then it seems I would not need to care about skip data for the sparse case, since IndexedDISI already maps the current docid to its value index. That way, all I need to take care of is the new LongValues implementation. Does this approach make sense?

Thanks again for the quick feedback and your patience!
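A rough sketch of what such a block-caching LongValues could look like. Here decodeBlock(...) is a hypothetical stand-in for ForUtil-based decompression and the block size is arbitrary, so this is only an outline of the idea described above, not the actual patch.

```java
import org.apache.lucene.util.LongValues;

// Sketch: a LongValues that decompresses one block at a time and caches it, so
// consecutive lookups within the same block only decode once.
abstract class CachingBlockLongValues extends LongValues {
  private static final int BLOCK_SHIFT = 7;               // 128 values per block
  private static final int BLOCK_MASK = (1 << BLOCK_SHIFT) - 1;

  private final long[] buffer = new long[1 << BLOCK_SHIFT];
  private long cachedBlock = -1;                          // no block decoded yet

  /** Decompress block {@code block} into {@code buffer}; hypothetical ForUtil-based helper. */
  protected abstract void decodeBlock(long block, long[] buffer);

  @Override
  public long get(long index) {
    final long block = index >>> BLOCK_SHIFT;
    if (block != cachedBlock) {                           // cache miss: decode the whole block
      decodeBlock(block, buffer);
      cachedBlock = block;
    }
    return buffer[(int) (index & BLOCK_MASK)];
  }
}
```

Because IndexedDISI has already mapped the docID to a dense value index before this lookup happens, the sparse case needs no extra skip data here, which is the point made in the comment above.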

@rmuir (Member) commented Dec 21, 2021

This seems like it is worth investigating! We would just try to speed up the "addressing" without dealing with all the rest of doc values (which may be a much more complex change).

@jpountz (Contributor) commented Dec 22, 2021

FYI @gsmiller and I looked into using a FOR-like encoding for doc values at https://issues.apache.org/jira/browse/LUCENE-10033. In short, it made queries that look at most docs, like the Browse* faceting tasks, much faster, but queries that only look at a subset of the index, like HighTermMonthSort, much slower.

I like this change, it's a net win that is not only going to speed up binary doc values but also sorted set and sorted numeric doc values, which also use a direct monotonic reader to map doc IDs to value indexes. DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
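For context on why reading two consecutive values matters here: a binary doc value's bytes are delimited by two adjacent entries of the address table, addresses[i] and addresses[i + 1], so a paired read can resolve the block metadata once instead of twice. Below is a hedged sketch of such an API; the LongPair holder and the two-argument get are illustrative, not the exact shape of the patch.

```java
// Holder for two consecutive address-table entries.
final class LongPair {
  long first;   // addresses[index]     -> start offset of the value bytes
  long second;  // addresses[index + 1] -> end offset (exclusive)
}

// Sketch of a reader offering a paired lookup next to the usual single-value lookup.
abstract class PairedLongValues {
  /** Single-value lookup, as on org.apache.lucene.util.LongValues. */
  public abstract long get(long index);

  /** Naive default: two independent lookups, each paying the block/metadata resolution. */
  public void get(long index, LongPair pair) {
    pair.first = get(index);
    pair.second = get(index + 1);
  }
  // A specialized reader (e.g. the DirectMonotonicReader change in this PR) can override
  // get(index, pair) to decode both values from the same block in a single pass.
}
```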

@gf2121 (Contributor, Author) commented Dec 22, 2021

@jpountz Thanks a lot for the suggestion!

> Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

Actually, DirectMonotonicReader is not the only impl involved: line 176 of DirectMonotonicReader could call DirectReader#get(index, twin).

I tried limiting the change to DirectMonotonicReader, and luceneutil shows the QPS increase (mentioned in the Jira issue) dropped from 20%+ to around 7%.

Inline review comment on the DirectMonotonicReader diff:

```java
        block++;
        twin.second = readers[block].get(0) + mins[block];
      } else {
        readers[block].get(blockIndex, twin);
```
@gf2121 (Contributor, Author): This line (readers[block].get(blockIndex, twin)) could call DirectReader#get(index, twin).

@rmuir (Member) commented Dec 22, 2021

> I like this change, it's a net win that is not only going to speed up binary doc values but also sorted set and sorted numeric doc values, which also use a direct monotonic reader to map doc IDs to value indexes. DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

I guess I see it as optimizing corner cases, which is why improvements aren't reflected in the current luceneutil benchmarks? But as I mentioned, my biggest concern is adding stuff to the LongValues API for this; IMO it isn't justified. I'm also not a fan of doubling the compression logic for every BPV. But I'm not personally going to block the change; just please help out on the LongValues API part, and I'd like to see some realistic use cases where we have benchmarks showing that it speeds things up.

github-actions bot commented Jan 9, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label on Jan 9, 2024