LUCENE-7714: Add a range query that takes advantage of index sorting. #715
Conversation
int mid = (low + high) >>> 1;
if (comparator.compare(mid) <= 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, lowerValue, context);
This can be expensive since the worst case can be when the element does not exist, so there can be N comparisons. Can we do a worst case analysis of the number of reloads that we need, and how expensive they are in reality?
Can you elaborate on this, the worst case runs in logarithmic time so I don't understand the concern.
My point was about the number of comparisons we do. In the worst case, the number of comparisons done would be log(n), and for each of them, we would reload the comparator. I am curious to understand the cost of loading the comparator for different scenarios, i.e. large n, dense docValues, etc. I think we should run some tests where we take a large data set, force-merge it to a single segment, sort it by a dense field, and run a query which would require a large range.
@atris the issue description includes benchmarks on a realistic dataset with dense docvalues. I share your concern that this reload may be expensive and plan to run more benchmarks with different docvalues types (including sparse docvalues).
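To make the reload cost concrete, here is a rough sketch (plain Java; the class and method names are invented for illustration, and a `long[]` stands in for the per-document doc values) of the lower-bound search under discussion. Doc values iterators are forward-only, so whenever the search moves backward (`high = mid - 1`) the real implementation has to reload the iterator; that happens at most on the order of log2(maxDoc) times per bound.

```java
// Hypothetical sketch, not the actual Lucene code.
final class SortedRangeSearch {
    /** values[doc] stands in for the doc values of document `doc` in a sorted index. */
    static int firstDocAtLeast(long[] values, long lowerValue) {
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (values[mid] >= lowerValue) {
                // Moving backward: a forward-only doc values iterator
                // would have to be reloaded at this point.
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return low; // first doc ID whose value is >= lowerValue
    }
}
```

The loop runs at most ⌈log2(n)⌉ times, so the question in this thread is really about the constant cost of each reload, not the asymptotics.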
}

SortedNumericDocValues sortedDocValues = reader.getSortedNumericDocValues(field);
NumericDocValues docValues = DocValues.unwrapSingleton(sortedDocValues);
I am curious to understand why we cannot support multiple values here. Is it because of duplicate elimination in binary searches?
When a document has multiple values, the index sort must choose one of them to decide where the document should be placed. So a binary search may miss a document whose sort value is outside the range but that also contains another value falling within the range.
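A toy illustration of this point (plain Java, names invented): suppose the index sort places each document by its minimum value. A check that only consults the sort key, as a binary search on the index sort would, can then skip a document whose other values do fall inside the range.

```java
// Hypothetical illustration, not Lucene code.
final class MultiValuedExample {
    /** Checks only the representative sort key (the minimum value, assumed first). */
    static boolean sortKeyInRange(long[] docValues, long lower, long upper) {
        long sortKey = docValues[0];
        return sortKey >= lower && sortKey <= upper;
    }

    /** Checks every value, which is what a correct range query must do. */
    static boolean anyValueInRange(long[] docValues, long lower, long upper) {
        for (long v : docValues) {
            if (v >= lower && v <= upper) {
                return true;
            }
        }
        return false;
    }
}
```

A document with values {1, 100} fails the sort-key check for the range [50, 200] even though 100 matches, which is exactly the case a binary search on the sort would miss.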
int high = maxDoc - 1;

while (low <= high) {
  int mid = (low + high) >>> 1;
Let's ensure we take care of the known edge cases of binary searches here. Maybe add some targeted tests?
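One classic binary-search edge case worth a targeted test is midpoint overflow: `(low + high) / 2` can overflow for large values, while the unsigned shift used in this patch stays correct. A minimal check (plain Java, class name invented):

```java
// Hypothetical demo of the overflow-safe midpoint used in the patch.
final class MidpointDemo {
    /** (low + high) may wrap around; >>> 1 reads the wrapped sum as unsigned. */
    static int midpoint(int low, int high) {
        return (low + high) >>> 1;
    }
}
```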
* A doc ID set iterator that wraps a delegate iterator and only returns doc IDs in
* the range [firstDocInclusive, lastDoc).
*/
private static class BoundedDocSetIdIterator extends DocIdSetIterator {
Do we need this class? Why not populate a DocSetIdIterator with just the relevant DocValues and return, much like what PointRangeQuery does?
Because we don't need to create a bitset beforehand. Unlike points, docvalues iterators access the values in the order of the internal document ID, so we can use the original iterator there.
Got it, so the idea is to have the same original iterator wrapped in multiple BoundedDocSetIdIterators, each with a different range.
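A rough model of that idea (plain Java, not Lucene's API: `PrimitiveIterator.OfInt` stands in for `DocIdSetIterator`, names invented). Because doc values are iterated in doc ID order, the docs matching a range in a sorted index form a contiguous doc ID window, and the delegate only needs to be clamped to that window:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

// Hypothetical sketch of the bounding behavior of BoundedDocSetIdIterator.
final class BoundedIteratorDemo {
    /** Returns the delegate's doc IDs restricted to [firstInclusive, lastExclusive). */
    static int[] collectInWindow(PrimitiveIterator.OfInt delegate,
                                 int firstInclusive, int lastExclusive) {
        List<Integer> out = new ArrayList<>();
        while (delegate.hasNext()) {
            int doc = delegate.nextInt();
            if (doc < firstInclusive) {
                continue; // before the window: skip
            }
            if (doc >= lastExclusive) {
                break;    // past the window: exhausted
            }
            out.add(doc);
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```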
int mid = (low + high) >>> 1;
if (comparator.compare(mid) < 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, upperValue, context);
As mentioned above, can we ensure the cost of this iteration does not override the benefits?
Thanks @jtibshirani. The approach looks good to me. However I wonder if we should expose this query as is, or use it only internally in the IndexOrDocValuesQuery? We can decide on a per-segment basis the best strategy to match: if the segment is sorted and eligible for the IndexSortDocValuesRangeQuery, we can use it automatically, or fall back to the other strategies otherwise. It would be more difficult for users to build queries with an IndexSortDocValuesRangeQuery directly, since there are cases (multi-values) where it doesn't work even if the index is sorted.
Thanks @atris and @jimczi for taking a look!
I agree that the query is not so helpful on its own. I was unsure about integrating it into IndexOrDocValuesQuery, since that deals with queries generally and this query is specifically for long ranges.
Any numeric ranges should work as long as the index is sorted on the field. Numeric docvalues use longs internally, so everything is translated into a long when querying (for instance, doubles use NumericUtils#sortableLongToDouble). I also agree that IndexOrDocValuesQuery is too generic to add the logic there, but we could start by adding the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery. Then we can discuss the best way to ensure that IndexOrDocValuesQuery chooses the docvalues one when the optimization can be applied. Does that make sense?
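For reference, the sortable-long encoding mentioned above relies on a standard bit trick, along the lines of what NumericUtils does (this sketch reimplements it rather than calling the Lucene API, so treat the details as an assumption): flipping the lower 63 bits of negative doubles makes signed-long comparison agree with double ordering.

```java
// Hypothetical sketch of the double-to-sortable-long encoding.
final class SortableLongDemo {
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // For negative doubles (sign bit set), bits >> 63 is all ones,
        // so the lower 63 bits get flipped, reversing their raw-bit order.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```

This is why a single long-based range query can serve all numeric types: the encoded longs sort exactly like the original doubles.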
long topValue,
LeafReaderContext context) throws IOException {
@SuppressWarnings("unchecked")
FieldComparator<Long> fieldComparator = (FieldComparator<Long>) sortField.getComparator(1, 0);
Since we only need to compare values from a single source, I don't think we should use a FieldComparator here. Using the docvalues iterator directly should be possible, so we don't need to add this extra indirection?
I used a field comparator because it dealt with missing values (otherwise I would have to handle missing values explicitly and also make sure to use the right default missing value). This approach felt a bit more solid, but I don't have a strong preference.
Got it thanks, we need to handle missing values too so +1 for this approach.
Got it, I understand your recommendation now -- I will give that a try and push some new commits.
Force-pushed 8e7795d to 735be11.
Force-pushed 735be11 to 1b2bf46.
@jimczi I tried moving the optimization into
We discussed offline with Julie and agreed that it doesn't really make sense to add this optimization in
Agreed, the old approach makes more sense for these cases that advance to a specific document directly.
Thanks @jimczi, I pushed another commit to ensure we use the two-phase iterator for explain and match.
The change looks good @jtibshirani. I also discussed with @jpountz offline and he made two comments that I think will help to move further:
- First, we should move the query to the sandbox. It's still unclear if we're gonna keep the design as it is, so moving it to the sandbox would help if we need to make big changes afterward (even removing it entirely).
- Instead of adding this alternative in the NumericDocValuesField#slowRangeQuery, we could add an alternative query in the SortedNumericDocValuesRangeQuery that would be used if the conditions to run in the sorted case are not met (a multi-valued field or a different default value, for instance). This alternative query could be a NumericDocValuesField#slowRangeQuery or even an IndexOrDocValuesQuery; the important part is that it would be executed only as a fallback.

What do you think?
This makes the most sense to me in terms of design, I had actually considered it originally but wasn't totally sure about it! I pushed some new changes that move the logic to a new query.
*
* This optimized execution strategy is only used if the following conditions hold:
* - The index is sorted, and its primary sort is on the same field as the query.
* - The segments must have at most one field value per document (otherwise we cannot easily
we should use ul/li tags to format this list
👍
return this;
} else {
  return new IndexSortSortedNumericDocValuesRangeQuery(
      field, lowerValue, upperValue, fallbackQuery);
this should use rewrittenFallback, otherwise rewriting would end up in infinite loops
Should we add a test for this?
👍
Oops, thanks for catching this! Will add a test.
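To see why rebuilding the query from the original fallbackQuery loops, here is a toy model (plain Java; Query, RangeQuery, and the helper are invented stand-ins, not Lucene's API). The searcher rewrites repeatedly until a query returns itself, so rewrite() must make progress by wrapping the rewritten fallback:

```java
// Hypothetical model of query rewriting reaching a fixed point.
final class RewriteLoopDemo {
    interface Query { Query rewrite(); }

    /** A fully rewritten query: rewriting it returns itself. */
    static final Query TERMINAL = new Query() {
        @Override public Query rewrite() { return this; }
    };

    /** A fallback that rewrites to the terminal form in one step. */
    static final Query FALLBACK = new Query() {
        @Override public Query rewrite() { return TERMINAL; }
    };

    static final class RangeQuery implements Query {
        final Query fallback;
        RangeQuery(Query fallback) { this.fallback = fallback; }
        @Override public Query rewrite() {
            Query rewritten = fallback.rewrite();
            if (rewritten == fallback) {
                return this; // fallback is a fixed point: so are we
            }
            // Key point from the review: wrap the REWRITTEN fallback,
            // not the original one, so repeated rewriting converges.
            return new RangeQuery(rewritten);
        }
    }

    /** Rewrites until a fixed point, mimicking what a searcher does. */
    static int stepsToFixedPoint(Query q, int maxSteps) {
        for (int steps = 0; steps < maxSteps; steps++) {
            Query next = q.rewrite();
            if (next == q) return steps;
            q = next;
        }
        return maxSteps; // did not converge
    }
}
```

If rewrite() instead returned `new RangeQuery(fallback)` built from the original fallback, every call would produce a fresh object and the loop would never terminate before maxSteps.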
return new ConstantScoreWeight(this, boost) {
  @Override
  public Scorer scorer(LeafReaderContext context) throws IOException {
    SortedNumericDocValues values = context.reader().getSortedNumericDocValues(field);
if you replaced this with a call to DocValues#getSortedNumeric, then this would work for both sorted numeric and single-valued numeric doc values. (Note that it can't return null.)
👍
int maxDoc = context.reader().maxDoc();
if (maxDoc <= 0) {
  return DocIdSetIterator.empty();
}
This doesn't seem necessary as the below logic should work with a maxDoc of 0?
👍
}

int lastDocIdExclusive = high + 1;
return new BoundedDocSetIdIterator(firstDocIdInclusive, lastDocIdExclusive, delegate);
why do we need the delegate, could we return a DocIdSetIterator#range?
Apologies, I should have read the earlier comments. :) I understand now why we have it.
return doc -> {
  int value = leafFieldComparator.compareTop(doc);
  return direction * value;
};
I think we need to do the comparison manually instead of using the comparator because of missing values. The comparator will consider documents that don't have a value and documents that have the missing value equal, which doesn't work for us.
Please discard my comment, this makes sense to me now.
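A toy model of the missing-value ambiguity discussed above (plain Java, names invented): a comparator that substitutes a default for missing values cannot distinguish "doc has no value" from "doc's value equals the default", which is why the comparison must be done manually here.

```java
// Hypothetical illustration: null stands in for a missing doc value.
final class MissingValueDemo {
    /** What a missing-value-substituting comparator effectively sees. */
    static long valueOrMissing(Long docValue, long missingValue) {
        return docValue == null ? missingValue : docValue;
    }
}
```

A doc with no value and a doc whose value equals the default become indistinguishable, so the former could wrongly be counted inside the range.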
Thanks @jpountz for the review, it's now ready for another look.
I merged the PR manually, hence closing.
I’m opening this draft PR to get feedback on an approach to LUCENE-7714.
The PR adds the new query type IndexSortDocValuesRangeQuery, a range query that takes advantage of the fact that the index is sorted on the same field as the query. It performs binary search on the field's doc values to find the doc IDs at the lower and upper ends of the range.

The query can only be used if all of these conditions hold:
- The index is sorted, and its primary sort is on the same field as the query.
- The field is indexed with SortedNumericDocValues.
- Each document has at most one value for the field (otherwise we cannot easily determine the matching document IDs through a binary search).
I was hoping for feedback on the overall approach, and also had a few open questions:
- Should we wrap this in a query that uses IndexSortDocValuesRangeQuery if the right conditions are met, and otherwise falls back to a standard range query? This wrapper query would have similarities to IndexOrDocValuesQuery.
- I experimented with cutting down on comparator reloads, for example by sharing work between the searches for lowerValue and upperValue, then moving on to the two individual binary searches. However, these efforts didn't show any performance improvements in my benchmarks. I plan to do more research around sparse + blocked doc values to understand the circumstances in which reloading can be more expensive.

Benchmarking results
I ingested part of the http logs dataset (123M total documents) into an index sorted by @timestamp. Every document has a single @timestamp, although the values are not unique across documents.

The following date ranges were tested: