
LUCENE-7714: Add a range query that takes advantage of index sorting. #715

Closed
wants to merge 12 commits

Conversation

@jtibshirani (Member) commented Jun 12, 2019

I’m opening this draft PR to get feedback on an approach to LUCENE-7714.

The PR adds the new query type IndexSortDocValuesRangeQuery, a range query that takes advantage of the fact that the index is sorted on the same field as the query. It performs binary search on the field's doc values to find the doc IDs at the lower and upper ends of the range.

The query can only be used if all of these conditions hold:

  • The index is sorted, and its primary sort is on the same field as the query.
  • The query field has SortedNumericDocValues.
  • Each segment has at most one field value per document (otherwise we cannot easily
    determine the matching document IDs through a binary search).
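To make the binary-search idea concrete, here is a simplified, self-contained sketch (not the PR's actual code): it models the segment as a plain sorted `long[]` with exactly one value per doc, and finds the docID bounds of the range.

```java
// Simplified model of the bounds search: values[doc] is the single value of
// each doc, sorted ascending by the index sort. Missing values and the
// forward-only doc values iterator are ignored here.
class IndexSortRangeBounds {

    /** Returns the first docID whose value is >= lower. */
    static int firstDoc(long[] values, long lower) {
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (values[mid] >= lower) {
                high = mid - 1;   // candidate; keep searching to the left
            } else {
                low = mid + 1;
            }
        }
        return low;
    }

    /** Returns one past the last docID whose value is <= upper. */
    static int lastDocExclusive(long[] values, long upper) {
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (values[mid] > upper) {
                high = mid - 1;
            } else {
                low = mid + 1;    // candidate; keep searching to the right
            }
        }
        return low;
    }
}
```

Every docID in `[firstDoc, lastDocExclusive)` then matches the range, so no per-document filtering is needed at search time.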

I was hoping for feedback on the overall approach, and also had a few open questions:

  • I wasn’t sure on the best way to structure the query. As it stands, it requires that segments are sorted correctly and contain at most one value per document. Perhaps we could introduce a wrapper query that can make the decision on a segment-by-segment basis: it would only use IndexSortDocValuesRangeQuery if the right conditions are met, and otherwise fall back to a standard range query. This wrapper query would have similarities to IndexOrDocValuesQuery.
  • Because doc values only support forward iteration, we need to recreate the comparators every time we backtrack in the binary search. I assumed this recreation would be expensive and experimented with some strategies to avoid it, such as starting with a shared binary search that checks both lowerValue and upperValue, then moving on to the two individual binary searches. However, these efforts didn’t show any performance improvements in my benchmarks. I plan to do more research around sparse + blocked doc values to understand the circumstances in which reloading can be more expensive.
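The backtracking cost can be illustrated with a toy, hypothetical model (not the PR's code): treat the doc values as a cursor that can only move forward, and count how often the binary search has to re-open it. The number of reloads is bounded by the number of leftward moves, i.e. roughly log2(n).

```java
// Toy model of forward-only iteration: the "iterator" is just the last
// position visited; asking for a smaller docID forces a reload, mirroring
// re-creating the doc values iterator in the real code.
class ForwardOnlySearch {
    static int reloads;  // number of iterator re-creations in the last search

    static int firstDocAtLeast(long[] values, long target) {
        reloads = 0;
        int cursorPos = -1;  // last docID the current iterator has visited
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (mid < cursorPos) {
                reloads++;  // can't move backwards: re-open the iterator
            }
            cursorPos = mid;
            if (values[mid] >= target) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return low;
    }
}
```

In this model a search over 1024 docs reloads at most about 10 times, matching the point below that the worst case stays logarithmic; the open question is how expensive each individual reload is for sparse or blocked doc values.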

Benchmarking results
I ingested part of the http logs dataset (123M total documents) into an index sorted by @timestamp. Every document has a single @timestamp, although they are not unique across documents.

The following date ranges were tested:

  • range with single point [897303051, 897303051], 124 docs
  • small range (897633930, 897655999], ~2M docs
  • medium range (897623930, 897655999], ~5M docs
  • large range (897259801, 897503930], ~21M docs
| Metric | Query | Latency (ms) |
| --- | --- | --- |
| 50th percentile service time | range-single | 7.50658 |
| 90th percentile service time | range-single | 7.81589 |
| 50th percentile service time | range-optimized-single | 8.02532 |
| 90th percentile service time | range-optimized-single | 8.38834 |
| 50th percentile service time | range-small | 12.7203 |
| 90th percentile service time | range-small | 13.3798 |
| 50th percentile service time | range-optimized-small | 7.43383 |
| 90th percentile service time | range-optimized-small | 7.67655 |
| 50th percentile service time | range-medium | 20.8554 |
| 90th percentile service time | range-medium | 22.1637 |
| 50th percentile service time | range-optimized-medium | 8.38864 |
| 90th percentile service time | range-optimized-medium | 8.66427 |
| 50th percentile service time | range-large | 53.8697 |
| 90th percentile service time | range-large | 60.2573 |
| 50th percentile service time | range-optimized-large | 7.81584 |
| 90th percentile service time | range-optimized-large | 8.02755 |

```java
int mid = (low + high) >>> 1;
if (comparator.compare(mid) <= 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, lowerValue, context);
```

Contributor:

This can be expensive, since the worst case is when the element does not exist, so there can be N comparisons. Can we do a worst-case analysis of the number of reloads we need, and of how expensive they are in practice?

Contributor:

Can you elaborate? The worst case runs in logarithmic time, so I don't understand the concern.

Contributor:

My point was about the number of comparisons we do. In the worst case there would be log(n) comparisons, and for each of them we would reload the comparator. I am curious to understand the cost of loading the comparator in different scenarios, i.e. large n, dense doc values, etc. I think we should run some tests where we take a large dataset, force-merge it to a single segment, sort it by a dense field, and run a query that requires a large range.

Member Author:

@atris the issue description includes benchmarks on a realistic dataset with dense docvalues. I share your concern that this reload may be expensive and plan to run more benchmarks with different docvalues types (including sparse docvalues).

}

```java
SortedNumericDocValues sortedDocValues = reader.getSortedNumericDocValues(field);
NumericDocValues docValues = DocValues.unwrapSingleton(sortedDocValues);
```

Contributor:

I am curious to understand why we cannot support multiple values here. Is it because of duplicate elimination in binary searches?

Member Author:

When a document has multiple values, the index sort must choose one of them to decide where the document should be placed. So a binary search may miss a document whose sort value falls outside the range but that also contains another value falling within the range.
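A tiny hypothetical example of the miss (self-contained, not Lucene code): suppose three docs are sorted by their minimum value, with per-doc sort keys {1, 2, 5}, where doc 0 actually holds the values {1, 10}. A binary search over the sort keys reports the range [9, 11] as empty even though doc 0's value 10 matches.

```java
import java.util.Arrays;

// Binary search over one sort key per doc: returns true if it finds no doc
// whose sort key falls inside [lower, upper].
class MultiValuedMiss {
    static boolean searchSaysEmpty(long[] sortKeys, long lower, long upper) {
        int i = Arrays.binarySearch(sortKeys, lower);
        if (i < 0) {
            i = -i - 1;  // insertion point: first key >= lower
        }
        return i >= sortKeys.length || sortKeys[i] > upper;
    }
}
```

Here `searchSaysEmpty(new long[] {1, 2, 5}, 9, 11)` returns true, yet the doc with values {1, 10} sorted under key 1 should match, which is exactly why the optimization requires at most one value per document.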

```java
int high = maxDoc - 1;

while (low <= high) {
  int mid = (low + high) >>> 1;
```

Contributor:

Let's ensure we take care of the known edge cases of binary search here. Maybe add some targeted tests?

```java
 * A doc ID set iterator that wraps a delegate iterator and only returns doc IDs in
 * the range [firstDocInclusive, lastDoc).
 */
private static class BoundedDocSetIdIterator extends DocIdSetIterator {
```

Contributor:

Do we need this class? Why not populate a DocSetIdIterator with just the relevant DocValues and return, much like what PointRangeQuery does?

Contributor:

Because we don't need to create a bitset beforehand. Unlike points, the doc values iterator accesses values in internal document ID order, so we can use the original iterator there.

Contributor:

Got it, so the idea is to have the same original iterator wrapped in multiple BoundedDocSetIdIterators, each with a different range.
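A rough, self-contained sketch of that idea (the real BoundedDocSetIdIterator extends Lucene's DocIdSetIterator and also supports advance; the delegate is modeled here as a plain array of docIDs):

```java
// Clamp a delegate iterator to [firstDocInclusive, lastDocExclusive): docs
// before the range are skipped, and iteration stops at the range's end.
class BoundedIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int firstDocInclusive;
    private final int lastDocExclusive;
    private final int[] delegateDocs;  // stand-in for the delegate iterator
    private int idx = -1;

    BoundedIterator(int firstDocInclusive, int lastDocExclusive, int[] delegateDocs) {
        this.firstDocInclusive = firstDocInclusive;
        this.lastDocExclusive = lastDocExclusive;
        this.delegateDocs = delegateDocs;
    }

    int nextDoc() {
        while (++idx < delegateDocs.length) {
            int doc = delegateDocs[idx];
            if (doc >= lastDocExclusive) {
                break;  // past the range: exhausted
            }
            if (doc >= firstDocInclusive) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

For example, with delegate docs {0, 3, 5, 8, 12} and bounds [3, 9), successive nextDoc() calls return 3, 5, 8 and then NO_MORE_DOCS.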

```java
int mid = (low + high) >>> 1;
if (comparator.compare(mid) < 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, upperValue, context);
```

Contributor:

As mentioned above, can we ensure the cost of this iteration does not override the benefits?

@jimczi (Contributor) left a comment:

Thanks @jtibshirani. The approach looks good to me. However, I wonder if we should expose this query as is, or only use it internally in IndexOrDocValuesQuery. We can decide the best matching strategy on a per-segment basis: if the segment is sorted and eligible for the IndexSortDocValuesRangeQuery we can use it automatically, and fall back to the other strategies otherwise. It would be more difficult for users to build queries with an IndexSortDocValuesRangeQuery directly, since there are cases (multi-valued fields) where it doesn't work even if the index is sorted.

@jtibshirani (Member Author) commented Jun 12, 2019:

Thanks @atris and @jimczi for taking a look!

> However I wonder if we should expose this query as is or if we should use it only internally in the IndexOrDocValuesQuery?

I agree that the query is not so helpful on its own. I was unsure about integrating it into IndexOrDocValuesQuery, since that deals with queries generally and this query is specifically for long ranges. Would IndexOrDocValuesQuery optionally accept a third query, and only run it if the segment is sorted and also contains NumericDocValues? That seemed a bit specific to add to a fairly general query type. Another idea is for IndexSortDocValuesRangeQuery to accept a fallback range query, and delegate to it if the necessary conditions aren't met?

@jimczi (Contributor) left a comment:

> I agree that the query is not so helpful on its own. I was unsure about integrating it into IndexOrDocValuesQuery, since that deals with queries generally and this query is specifically for long ranges.

Any numeric range should work as long as the index is sorted on the field. Numeric doc values use longs internally, so everything is translated into a long when querying (for instance, doubles use NumericUtils#sortableLongToDouble). I also agree that IndexOrDocValuesQuery is too generic to add the logic there, but we could start by adding the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery. Then we can discuss the best way to ensure that IndexOrDocValuesQuery chooses the doc values query when the optimization can be applied. Does that make sense?
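For reference, the sortable-long trick mentioned here can be sketched as follows. This mirrors what Lucene's NumericUtils#doubleToSortableLong does, written out self-contained: flip the low 63 bits of negative doubles so that signed long order matches floating-point order.

```java
class SortableDoubles {
    /**
     * Maps a double to a long whose signed order matches the double's natural
     * order, so a single long-based range query can serve double fields.
     */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // For negative values (sign bit set), bits >> 63 is all ones, so this
        // flips the low 63 bits and reverses the order of negative doubles.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```

NumericUtils#sortableLongToDouble is the inverse mapping, used when decoding a stored doc value back into a double.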

```java
    long topValue,
    LeafReaderContext context) throws IOException {
  @SuppressWarnings("unchecked")
  FieldComparator<Long> fieldComparator = (FieldComparator<Long>) sortField.getComparator(1, 0);
```

Contributor:

Since we only need to compare values from a single source I don't think we should use a FieldComparator here. Using the docvalues iterator directly should be possible so we don't need to add this extra indirection ?

Member Author:

I used a field comparator because it dealt with missing values (otherwise I would have to handle missing values explicitly and also make sure to use the right default missing value). This approach felt a bit more solid, but I don't have a strong preference.

Contributor:

Got it, thanks. We need to handle missing values too, so +1 for this approach.

@jtibshirani (Member Author):

> I also agree that IndexOrDocValuesQuery is too generic to add the logic there but we could start by adding the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery. Then we can discuss the best way to ensure that IndexOrDocValuesQuery chooses the docvalues one when the optimization can be applied. Does that make sense?

Got it, I understand your recommendation now -- I will give that a try and push some new commits.

@jtibshirani (Member Author):

@jimczi I tried moving the optimization into SortedNumericDocValuesRangeQuery. I had a couple questions:

  • It is now possible that match and explain will be slower for SortedNumericDocValuesRangeQuery, since we may perform a binary search instead of just checking a single document value. Perhaps I should override these methods and ensure we always use the old approach?
  • I don't see a good way to integrate the optimization into IndexOrDocValuesQuery, any guidance would be very welcome.

@jimczi (Contributor) commented Jun 14, 2019:

We discussed offline with Julie and agreed that it doesn't really make sense to add this optimization to IndexOrDocValuesQuery. If range queries are fast with doc values on a field that is used to sort the index, then it becomes useless to also index the field as a point. For multi-valued fields, points will remain the best solution since the doc values optimization cannot work in that case, but it should be enough to introduce the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery, or in its own query if we don't want to mix them.

> It is now possible that match and explain will be slower for SortedNumericDocValuesRangeQuery, since we may perform a binary search instead of just checking a single document value. Perhaps I should override these methods and ensure we always use the old approach?

Agreed, the old approach makes more sense for these cases that advance to a specific document directly.

@jtibshirani (Member Author) commented Jun 14, 2019:

Thanks @jimczi, I pushed another commit to ensure we use the two-phase iterator for explain and match.

@jimczi (Contributor) left a comment:

The change looks good @jtibshirani . I also discussed with @jpountz offline and he made two comments that I think will help to move further:

  • First, we should move the query to the sandbox. It's still unclear whether we'll keep the design as it is, so moving it to the sandbox would help if we need to make big changes afterward (or even remove it entirely).
  • Instead of adding this alternative in NumericDocValuesField#slowRangeQuery, we could add an alternative query to the SortedNumericDocValuesRangeQuery that would be used if the conditions to run in the sorted case are not met (a multi-valued field or a different default value, for instance). This alternative query could be a NumericDocValuesField#slowRangeQuery or even an IndexOrDocValuesQuery; the important part is that it would be executed only as a fallback.

What do you think ?

@jtibshirani (Member Author):

> Instead of adding this alternative in the NumericDocValuesField#slowRangeQuery we could add an alternative query in the SortedNumericDocValuesRangeQuery that would be used if the conditions to run in the sorted case are not met (multi-valued field or a different default value for instance).

This makes the most sense to me in terms of design, I had actually considered it originally but wasn't totally sure about it!

I pushed some new changes that move the logic to a new query IndexSortSortedNumericDocValuesRangeQuery in sandbox. I undid the changes to the existing SortedNumericDocValuesRangeQuery. For now, the new query only handles SortedNumericDocValues fields, since I think that is the most important use case and it was a bit tricky to add support for NumericDocValues as well.

```java
 *
 * This optimized execution strategy is only used if the following conditions hold:
 * - The index is sorted, and its primary sort is on the same field as the query.
 * - The segments must have at most one field value per document (otherwise we cannot easily
```

Contributor:

we should use ul/li tags to format this list

Member Author:

👍

```java
  return this;
} else {
  return new IndexSortSortedNumericDocValuesRangeQuery(
      field, lowerValue, upperValue, fallbackQuery);
```

Contributor:

this should use rewrittenFallback, otherwise rewriting could end up in an infinite loop

Contributor:

Should we add a test for this?

Contributor:

👍

Member Author:

Oops, thanks for catching this! Will add a test.
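The rewrite contract behind this bug can be shown with a miniature, hypothetical model (MiniQuery is not a Lucene type): rewrite() is typically driven in a loop until it returns the same instance, so the new query must wrap the rewritten fallback or the loop never reaches a fixed point.

```java
interface MiniQuery {
    MiniQuery rewrite();
}

// Miniature of the fallback-wrapping query's rewrite logic.
class MiniRangeQuery implements MiniQuery {
    final MiniQuery fallback;

    MiniRangeQuery(MiniQuery fallback) {
        this.fallback = fallback;
    }

    @Override
    public MiniQuery rewrite() {
        MiniQuery rewrittenFallback = fallback.rewrite();
        if (rewrittenFallback == fallback) {
            return this;  // fixed point: nothing changed
        }
        // Correct: wrap the rewritten fallback. Wrapping the original
        // `fallback` here would make the loop below spin forever.
        return new MiniRangeQuery(rewrittenFallback);
    }

    // How callers drive rewriting: loop until the query stops changing.
    static MiniQuery fullyRewrite(MiniQuery q) {
        MiniQuery rewritten = q.rewrite();
        while (rewritten != q) {
            q = rewritten;
            rewritten = q.rewrite();
        }
        return rewritten;
    }
}
```

With a fallback that rewrites once and then stabilizes, fullyRewrite terminates and the result wraps the stabilized fallback.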

```java
return new ConstantScoreWeight(this, boost) {
  @Override
  public Scorer scorer(LeafReaderContext context) throws IOException {
    SortedNumericDocValues values = context.reader().getSortedNumericDocValues(field);
```

Contributor:

if you replaced this with a call to DocValues#getSortedNumeric, then this would work for both sorted numeric and single-valued numeric doc values. (Note that it can't return null.)

Member Author:

👍

```java
int maxDoc = context.reader().maxDoc();
if (maxDoc <= 0) {
  return DocIdSetIterator.empty();
}
```

Contributor:

This doesn't seem necessary, as the logic below should work with a maxDoc of 0?

Member Author:

👍

```java
}

int lastDocIdExclusive = high + 1;
return new BoundedDocSetIdIterator(firstDocIdInclusive, lastDocIdExclusive, delegate);
```

Contributor:

Why do we need the delegate? Could we return a DocIdSetIterator#range instead?

Contributor:

Apologies, I should have read the earlier comments. :) I understand now why we have it.

```java
return doc -> {
  int value = leafFieldComparator.compareTop(doc);
  return direction * value;
};
```

Contributor:

I think we need to do the comparison manually instead of using the comparator because of missing values. The comparator will consider documents that don't have a value and documents that have the missing value equal, which doesn't work for us.

Contributor:

Please discard my comment, this makes sense to me now.

@jtibshirani (Member Author):

Thanks @jpountz for the review, it's now ready for another look.

@jimczi (Contributor) commented Jun 26, 2019:

I merged the PR manually, hence closing.

@jimczi jimczi closed this Jun 26, 2019
@jtibshirani jtibshirani deleted the index-sort-range-query branch June 26, 2019 15:53
@jtibshirani jtibshirani changed the title LUCENE-7714 Add a range query that takes advantage of index sorting. LUCENE-7714: Add a range query that takes advantage of index sorting. Aug 14, 2019