LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues #557
base: main
Conversation
Personally, I'm not a fan of the API change on LongValues and the additional complexity in DirectReader just to read blocks of 2 values at a time. I'd be more interested in a more general solution, e.g. a new DocValuesFormat that compresses the values in blocks with FOR/PFOR, similar to the compression of the postings lists. It should be possible to do now that DocValues has a forward-only next/advance API (again similar to postings).
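For readers unfamiliar with the FOR (frame-of-reference) idea mentioned above, here is a minimal sketch: each block stores its minimum once and bit-packs the per-value deltas with just enough bits for the largest delta. The class `ForBlockSketch` and all of its members are hypothetical names made up for illustration; this is not Lucene's postings or doc-values code, and it assumes `max - min` within a block does not overflow a `long`.

```java
/** Hypothetical illustration of frame-of-reference (FOR) packing; not Lucene code. */
public final class ForBlockSketch {

  /** An encoded block: the minimum, the bit width, and the packed deltas. */
  public static final class Block {
    public final long min;
    public final int bitsPerValue;
    public final long[] packed;

    Block(long min, int bitsPerValue, long[] packed) {
      this.min = min;
      this.bitsPerValue = bitsPerValue;
      this.packed = packed;
    }
  }

  /** Encodes a non-empty block. Assumes (max - min) fits in a non-negative long. */
  public static Block encode(long[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    long range = max - min;
    // Number of bits needed to represent the largest delta (0 if all values are equal).
    int bits = range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    long[] packed = new long[(int) (((long) values.length * bits + 63) / 64)];
    for (int i = 0; i < values.length; i++) {
      write(packed, i, bits, values[i] - min);
    }
    return new Block(min, bits, packed);
  }

  private static void write(long[] packed, int index, int bits, long delta) {
    if (bits == 0) {
      return;
    }
    long bitPos = (long) index * bits;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    packed[word] |= delta << shift;
    if (shift + bits > 64) {
      // The value straddles two 64-bit words: spill the high bits into the next word.
      packed[word + 1] |= delta >>> (64 - shift);
    }
  }

  /** Decodes the value at the given position of the block. */
  public static long get(Block block, int index) {
    int bits = block.bitsPerValue;
    if (bits == 0) {
      return block.min;
    }
    long bitPos = (long) index * bits;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long delta = block.packed[word] >>> shift;
    if (shift + bits > 64) {
      delta |= block.packed[word + 1] << (64 - shift);
    }
    long mask = bits == 64 ? -1L : (1L << bits) - 1;
    return block.min + (delta & mask);
  }
}
```

Decoding a whole block at once is what makes this kind of layout attractive for forward-only iteration; the random-access `get` above is only there to keep the sketch easy to check.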
Thanks a lot for the quick feedback! Using FOR/PFOR sounds promising, I'll give it a try.
Might be easiest to prototype a new DocValuesFormat (I would fork the existing Lucene90 as a start)? That's how I would attack the problem. Then I'd try to remove (at least some) usages of stuff like […] It's a heavy-duty investigation / big task, as there are two cases (dense and sparse). For the dense case, if we want docid 500, we know how to get to its value ([…]
Thank you very much for your guidance! This is of great help to me. I will fork a new […] Actually, what I thought at first was to only change the structure of addresses, implementing a new […] Thanks again for the quick feedback and your patience!
This seems like it is worth investigating! We would just try to speed up the "addressing" without dealing with all the rest of docvalues (which may be a much more complex change).
FYI @gsmiller and I looked into using a FOR-like encoding for doc values at https://issues.apache.org/jira/browse/LUCENE-10033. In short, this made queries that look at most docs like the […] I like this change, it's a net win that is not only going to speed up binary doc values but also sorted set and sorted numeric doc values, which also use a direct monotonic reader to map doc IDs to value indexes.
@jpountz Thanks a lot for the suggestion!
Actually, DirectMonotonicReader is not the only impl using this, line 176 in […] I tried to limit the change to […]
  block++;
  twin.second = readers[block].get(0) + mins[block];
} else {
  readers[block].get(blockIndex, twin);
This line could call DirectReader#get(index, twin)
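To make the hunk under review easier to follow, here is a minimal, self-contained sketch of the idea of reading two adjacent values in one call, including the case where the second value falls into the next block. Everything here is hypothetical and simplified: `TwinReadSketch` and `Twin` are illustrative names, the deltas are kept in plain `long[]` arrays instead of bit-packed readers, and the per-block average slope that a monotonic encoding would normally add is left out. It also assumes the input is non-decreasing and that the caller never asks for the very last index as `first`.

```java
/** Hypothetical sketch of a "read two adjacent values" API over blocked monotonic data. */
public final class TwinReadSketch {

  /** Hypothetical holder for two adjacent values, e.g. a start and an end offset. */
  public static final class Twin {
    public long first;
    public long second;
  }

  private final int blockShift;
  private final long[] mins;     // per-block minimum
  private final long[][] deltas; // per-block (value - min); bit-packed in a real reader

  /** Assumes values are non-decreasing, so the first value of a block is its minimum. */
  public TwinReadSketch(long[] values, int blockShift) {
    this.blockShift = blockShift;
    int blockSize = 1 << blockShift;
    int numBlocks = (values.length + blockSize - 1) >>> blockShift;
    mins = new long[numBlocks];
    deltas = new long[numBlocks][];
    for (int b = 0; b < numBlocks; b++) {
      int from = b << blockShift;
      int to = Math.min(values.length, from + blockSize);
      mins[b] = values[from];
      deltas[b] = new long[to - from];
      for (int i = from; i < to; i++) {
        deltas[b][i - from] = values[i] - mins[b];
      }
    }
  }

  /** Single-value read: one block lookup per call. */
  public long get(long index) {
    int block = (int) (index >>> blockShift);
    int blockIndex = (int) (index & ((1L << blockShift) - 1));
    return mins[block] + deltas[block][blockIndex];
  }

  /** Reads values[index] and values[index + 1]; requires index + 1 to be in range. */
  public void get(long index, Twin twin) {
    int block = (int) (index >>> blockShift);
    int blockIndex = (int) (index & ((1L << blockShift) - 1));
    twin.first = mins[block] + deltas[block][blockIndex];
    if (blockIndex == deltas[block].length - 1) {
      // The second value is the first entry of the next block.
      block++;
      twin.second = mins[block] + deltas[block][0];
    } else {
      // Both values live in the same block, so the block lookup is shared.
      twin.second = mins[block] + deltas[block][blockIndex + 1];
    }
  }
}
```

With variable-length binary values, the two results would typically serve as the start and end offsets of one document's bytes, which is why a paired read saves a second lookup in the common same-block case.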
I guess I see it as optimizing corner cases, which is why improvements aren't reflected in the current luceneutil benchmarks? But as I mentioned, my biggest concern is adding stuff to […]
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
See https://issues.apache.org/jira/browse/LUCENE-10333
#11369
PS: I'm pushing this code to quickly see if this approach makes sense to you. I'll add some more tests if this passes the first round of review :)