
LUCENE-7714: Add a range query that takes advantage of index sorting. #715

Closed
wants to merge 12 commits

Conversation

@jtibshirani (Member) commented Jun 12, 2019

I’m opening this draft PR to get feedback on an approach to LUCENE-7714.

The PR adds the new query type IndexSortDocValuesRangeQuery, a range query that takes advantage of the fact that the index is sorted on the same field as the query. It performs binary search on the field's doc values to find the doc IDs at the lower and upper ends of the range.

The query can only be used if all of these conditions hold:

  • The index is sorted, and its primary sort is on the same field as the query.
  • The query field has SortedNumericDocValues.
  • Each segment has at most one field value per document (otherwise we cannot easily
    determine the matching document IDs through a binary search).
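To make the binary-search idea concrete, here is a simplified, self-contained sketch (not the PR's actual code): it models the segment as a plain sorted `long[]` with exactly one value per doc, and finds the docID bounds of the range.

```java
// Simplified model of the bounds search: values[doc] is the single value of
// each doc, sorted ascending by the index sort. Missing values and the
// forward-only doc values iterator are ignored here.
class IndexSortRangeBounds {

    /** Returns the first docID whose value is >= lower. */
    static int firstDoc(long[] values, long lower) {
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (values[mid] >= lower) {
                high = mid - 1;   // candidate; keep searching to the left
            } else {
                low = mid + 1;
            }
        }
        return low;
    }

    /** Returns one past the last docID whose value is <= upper. */
    static int lastDocExclusive(long[] values, long upper) {
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (values[mid] > upper) {
                high = mid - 1;
            } else {
                low = mid + 1;    // candidate; keep searching to the right
            }
        }
        return low;
    }
}
```

Every docID in `[firstDoc, lastDocExclusive)` then matches the range, so no per-document filtering is needed at search time.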

I was hoping for feedback on the overall approach, and also had a few open questions:

  • I wasn’t sure on the best way to structure the query. As it stands, it requires that segments are sorted correctly and contain at most one value per document. Perhaps we could introduce a wrapper query that can make the decision on a segment-by-segment basis: it would only use IndexSortDocValuesRangeQuery if the right conditions are met, and otherwise fall back to a standard range query. This wrapper query would have similarities to IndexOrDocValuesQuery.
  • Because doc values only support forward iteration, we need to recreate the comparators every time we backtrack in the binary search. I assumed this recreation would be expensive and experimented with some strategies to avoid it, such as starting with a shared binary search that checks both lowerValue and upperValue, then moving on to the two individual binary searches. However, these efforts didn’t show any performance improvements in my benchmarks. I plan to do more research around sparse + blocked doc values to understand the circumstances in which reloading can be more expensive.
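The backtracking cost can be illustrated with a toy, hypothetical model (not the PR's code): treat the doc values as a cursor that can only move forward, and count how often the binary search has to re-open it. The number of reloads is bounded by the number of leftward moves, i.e. roughly log2(n).

```java
// Toy model of forward-only iteration: the "iterator" is just the last
// position visited; asking for a smaller docID forces a reload, mirroring
// re-creating the doc values iterator in the real code.
class ForwardOnlySearch {
    static int reloads;  // number of iterator re-creations in the last search

    static int firstDocAtLeast(long[] values, long target) {
        reloads = 0;
        int cursorPos = -1;  // last docID the current iterator has visited
        int low = 0, high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (mid < cursorPos) {
                reloads++;  // can't move backwards: re-open the iterator
            }
            cursorPos = mid;
            if (values[mid] >= target) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return low;
    }
}
```

In this model a search over 1024 docs reloads at most about 10 times, matching the point below that the worst case stays logarithmic; the open question is how expensive each individual reload is for sparse or blocked doc values.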

Benchmarking results
I ingested part of the http logs dataset (123M total documents) into an index sorted by @timestamp. Every document has a single @timestamp, although they are not unique across documents.

The following date ranges were tested:

  • range with single point [897303051, 897303051], 124 docs
  • small range (897633930, 897655999], ~2M docs
  • medium range (897623930, 897655999], ~5M docs
  • large range (897259801, 897503930], ~21M docs
| Metric | Query | Latency (ms) |
| --- | --- | --- |
| 50th percentile service time | range-single | 7.50658 |
| 90th percentile service time | range-single | 7.81589 |
| 50th percentile service time | range-optimized-single | 8.02532 |
| 90th percentile service time | range-optimized-single | 8.38834 |
| 50th percentile service time | range-small | 12.7203 |
| 90th percentile service time | range-small | 13.3798 |
| 50th percentile service time | range-optimized-small | 7.43383 |
| 90th percentile service time | range-optimized-small | 7.67655 |
| 50th percentile service time | range-medium | 20.8554 |
| 90th percentile service time | range-medium | 22.1637 |
| 50th percentile service time | range-optimized-medium | 8.38864 |
| 90th percentile service time | range-optimized-medium | 8.66427 |
| 50th percentile service time | range-large | 53.8697 |
| 90th percentile service time | range-large | 60.2573 |
| 50th percentile service time | range-optimized-large | 7.81584 |
| 90th percentile service time | range-optimized-large | 8.02755 |

```java
int mid = (low + high) >>> 1;
if (comparator.compare(mid) <= 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, lowerValue, context);
```

Contributor:

This can be expensive, since the worst case is when the element does not exist, so there can be N comparisons. Can we do a worst-case analysis of the number of reloads we need, and of how expensive they are in practice?

Contributor:

Can you elaborate? The worst case runs in logarithmic time, so I don't understand the concern.

Contributor:

My point was about the number of comparisons we do. In the worst case there would be log(n) comparisons, and for each of them we would reload the comparator. I am curious to understand the cost of loading the comparator in different scenarios, i.e. large n, dense doc values, etc. I think we should run some tests where we take a large dataset, force-merge it to a single segment, sort it by a dense field, and run a query that requires a large range.

Member Author:

@atris the issue description includes benchmarks on a realistic dataset with dense docvalues. I share your concern that this reload may be expensive and plan to run more benchmarks with different docvalues types (including sparse docvalues).

}

```java
SortedNumericDocValues sortedDocValues = reader.getSortedNumericDocValues(field);
NumericDocValues docValues = DocValues.unwrapSingleton(sortedDocValues);
```

Contributor:

I am curious to understand why we cannot support multiple values here. Is it because of duplicate elimination in binary searches?

Member Author:

When a document has multiple values, the index sort must choose one of them to decide where the document should be placed. So a binary search may miss a document whose sort value falls outside the range but that also contains another value falling within the range.
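A tiny hypothetical example of the miss (self-contained, not Lucene code): suppose three docs are sorted by their minimum value, with per-doc sort keys {1, 2, 5}, where doc 0 actually holds the values {1, 10}. A binary search over the sort keys reports the range [9, 11] as empty even though doc 0's value 10 matches.

```java
import java.util.Arrays;

// Binary search over one sort key per doc: returns true if it finds no doc
// whose sort key falls inside [lower, upper].
class MultiValuedMiss {
    static boolean searchSaysEmpty(long[] sortKeys, long lower, long upper) {
        int i = Arrays.binarySearch(sortKeys, lower);
        if (i < 0) {
            i = -i - 1;  // insertion point: first key >= lower
        }
        return i >= sortKeys.length || sortKeys[i] > upper;
    }
}
```

Here `searchSaysEmpty(new long[] {1, 2, 5}, 9, 11)` returns true, yet the doc with values {1, 10} sorted under key 1 should match, which is exactly why the optimization requires at most one value per document.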

```java
int high = maxDoc - 1;

while (low <= high) {
  int mid = (low + high) >>> 1;
```

Contributor:

Let's ensure we take care of the known edge cases of binary search here. Maybe add some targeted tests?

```java
 * A doc ID set iterator that wraps a delegate iterator and only returns doc IDs in
 * the range [firstDocInclusive, lastDoc).
 */
private static class BoundedDocSetIdIterator extends DocIdSetIterator {
```

Contributor:

Do we need this class? Why not populate a DocSetIdIterator with just the relevant DocValues and return, much like what PointRangeQuery does?

Contributor:

Because we don't need to create a bitset beforehand. Unlike points, the doc values iterator accesses values in internal document ID order, so we can use the original iterator there.

Contributor:

Got it, so the idea is to have the same original iterator wrapped in multiple BoundedDocSetIdIterators, each with a different range.
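A rough, self-contained sketch of that idea (the real BoundedDocSetIdIterator extends Lucene's DocIdSetIterator and also supports advance; the delegate is modeled here as a plain array of docIDs):

```java
// Clamp a delegate iterator to [firstDocInclusive, lastDocExclusive): docs
// before the range are skipped, and iteration stops at the range's end.
class BoundedIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int firstDocInclusive;
    private final int lastDocExclusive;
    private final int[] delegateDocs;  // stand-in for the delegate iterator
    private int idx = -1;

    BoundedIterator(int firstDocInclusive, int lastDocExclusive, int[] delegateDocs) {
        this.firstDocInclusive = firstDocInclusive;
        this.lastDocExclusive = lastDocExclusive;
        this.delegateDocs = delegateDocs;
    }

    int nextDoc() {
        while (++idx < delegateDocs.length) {
            int doc = delegateDocs[idx];
            if (doc >= lastDocExclusive) {
                break;  // past the range: exhausted
            }
            if (doc >= firstDocInclusive) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

For example, with delegate docs {0, 3, 5, 8, 12} and bounds [3, 9), successive nextDoc() calls return 3, 5, 8 and then NO_MORE_DOCS.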

```java
int mid = (low + high) >>> 1;
if (comparator.compare(mid) < 0) {
  high = mid - 1;
  comparator = loadComparator(sortField, upperValue, context);
```

Contributor:

As mentioned above, can we ensure the cost of this iteration does not override the benefits?

@jimczi (Contributor) left a comment:

Thanks @jtibshirani. The approach looks good to me. However, I wonder if we should expose this query as is, or only use it internally in IndexOrDocValuesQuery. We can decide the best matching strategy on a per-segment basis: if the segment is sorted and eligible for the IndexSortDocValuesRangeQuery we can use it automatically, and fall back to the other strategies otherwise. It would be more difficult for users to build queries with an IndexSortDocValuesRangeQuery directly, since there are cases (multi-valued fields) where it doesn't work even if the index is sorted.

@jtibshirani (Member Author) commented Jun 12, 2019:

Thanks @atris and @jimczi for taking a look!

> However I wonder if we should expose this query as is or if we should use it only internally in the IndexOrDocValuesQuery?

I agree that the query is not so helpful on its own. I was unsure about integrating it into IndexOrDocValuesQuery, since that deals with queries generally and this query is specifically for long ranges. Would IndexOrDocValuesQuery optionally accept a third query, and only run it if the segment is sorted and also contains NumericDocValues? That seemed a bit specific to add to a fairly general query type. Another idea is for IndexSortDocValuesRangeQuery to accept a fallback range query, and delegate to it if the necessary conditions aren't met?

@jimczi (Contributor) left a comment:

> I agree that the query is not so helpful on its own. I was unsure about integrating it into IndexOrDocValuesQuery, since that deals with queries generally and this query is specifically for long ranges.

Any numeric range should work as long as the index is sorted on the field. Numeric doc values use longs internally, so everything is translated into a long when querying (for instance, doubles use NumericUtils#sortableLongToDouble). I also agree that IndexOrDocValuesQuery is too generic to add the logic there, but we could start by adding the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery. Then we can discuss the best way to ensure that IndexOrDocValuesQuery chooses the doc values query when the optimization can be applied. Does that make sense?
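For reference, the sortable-long trick mentioned here can be sketched as follows. This mirrors what Lucene's NumericUtils#doubleToSortableLong does, written out self-contained: flip the low 63 bits of negative doubles so that signed long order matches floating-point order.

```java
class SortableDoubles {
    /**
     * Maps a double to a long whose signed order matches the double's natural
     * order, so a single long-based range query can serve double fields.
     */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // For negative values (sign bit set), bits >> 63 is all ones, so this
        // flips the low 63 bits and reverses the order of negative doubles.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```

NumericUtils#sortableLongToDouble is the inverse mapping, used when decoding a stored doc value back into a double.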

```java
    long topValue,
    LeafReaderContext context) throws IOException {
  @SuppressWarnings("unchecked")
  FieldComparator<Long> fieldComparator = (FieldComparator<Long>) sortField.getComparator(1, 0);
```

Contributor:

Since we only need to compare values from a single source I don't think we should use a FieldComparator here. Using the docvalues iterator directly should be possible so we don't need to add this extra indirection ?

Member Author:

I used a field comparator because it dealt with missing values (otherwise I would have to handle missing values explicitly and also make sure to use the right default missing value). This approach felt a bit more solid, but I don't have a strong preference.

Contributor:

Got it, thanks. We need to handle missing values too, so +1 for this approach.

@jtibshirani (Member Author):

> I also agree that IndexOrDocValuesQuery is too generic to add the logic there but we could start by adding the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery. Then we can discuss the best way to ensure that IndexOrDocValuesQuery chooses the docvalues one when the optimization can be applied. Does that make sense?

Got it, I understand your recommendation now -- I will give that a try and push some new commits.

@jtibshirani (Member Author):

@jimczi I tried moving the optimization into SortedNumericDocValuesRangeQuery. I had a couple questions:

  • It is now possible that match and explain will be slower for SortedNumericDocValuesRangeQuery, since we may perform a binary search instead of just checking a single document value. Perhaps I should override these methods and ensure we always use the old approach?
  • I don't see a good way to integrate the optimization into IndexOrDocValuesQuery, any guidance would be very welcome.

@jimczi (Contributor) commented Jun 14, 2019:

We discussed offline with Julie and agreed that it doesn't really make sense to add this optimization to IndexOrDocValuesQuery. If range queries are fast with doc values on a field that is used to sort the index, then it becomes useless to also index the field as a point. For multi-valued fields, points will remain the best solution since the doc values optimization cannot work in that case, but it should be enough to introduce the optimization in SortedNumericDocValuesField#newSlowRangeQuery and NumericDocValuesField#newSlowRangeQuery, or in its own query if we don't want to mix them.

> It is now possible that match and explain will be slower for SortedNumericDocValuesRangeQuery, since we may perform a binary search instead of just checking a single document value. Perhaps I should override these methods and ensure we always use the old approach?

Agreed, the old approach makes more sense for these cases that advance to a specific document directly.

@jtibshirani (Member Author) commented Jun 14, 2019:

Thanks @jimczi, I pushed another commit to ensure we use the two-phase iterator for explain and match.

@jimczi (Contributor) left a comment:

The change looks good @jtibshirani . I also discussed with @jpountz offline and he made two comments that I think will help to move further:

  • First, we should move the query to the sandbox. It's still unclear whether we'll keep the design as it is, so moving it to the sandbox would help if we need to make big changes afterward (or even remove it entirely).
  • Instead of adding this alternative in NumericDocValuesField#slowRangeQuery, we could add an alternative query to the SortedNumericDocValuesRangeQuery that would be used if the conditions to run in the sorted case are not met (a multi-valued field or a different default value, for instance). This alternative query could be a NumericDocValuesField#slowRangeQuery or even an IndexOrDocValuesQuery; the important part is that it would be executed only as a fallback.

What do you think ?

@jtibshirani (Member Author):

> Instead of adding this alternative in the NumericDocValuesField#slowRangeQuery we could add an alternative query in the SortedNumericDocValuesRangeQuery that would be used if the conditions to run in the sorted case are not met (multi-valued field or a different default value for instance).

This makes the most sense to me in terms of design, I had actually considered it originally but wasn't totally sure about it!

I pushed some new changes that move the logic to a new query IndexSortSortedNumericDocValuesRangeQuery in sandbox. I undid the changes to the existing SortedNumericDocValuesRangeQuery. For now, the new query only handles SortedNumericDocValues fields, since I think that is the most important use case and it was a bit tricky to add support for NumericDocValues as well.

```java
 *
 * This optimized execution strategy is only used if the following conditions hold:
 * - The index is sorted, and its primary sort is on the same field as the query.
 * - The segments must have at most one field value per document (otherwise we cannot easily
```

Contributor:

we should use ul/li tags to format this list

Member Author:

👍

```java
  return this;
} else {
  return new IndexSortSortedNumericDocValuesRangeQuery(
      field, lowerValue, upperValue, fallbackQuery);
```

Contributor:

this should use rewrittenFallback, otherwise rewriting could end up in an infinite loop

Contributor:

Should we add a test for this?

Contributor:

👍

Member Author:

Oops, thanks for catching this! Will add a test.
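The rewrite contract behind this bug can be shown with a miniature, hypothetical model (MiniQuery is not a Lucene type): rewrite() is typically driven in a loop until it returns the same instance, so the new query must wrap the rewritten fallback or the loop never reaches a fixed point.

```java
interface MiniQuery {
    MiniQuery rewrite();
}

// Miniature of the fallback-wrapping query's rewrite logic.
class MiniRangeQuery implements MiniQuery {
    final MiniQuery fallback;

    MiniRangeQuery(MiniQuery fallback) {
        this.fallback = fallback;
    }

    @Override
    public MiniQuery rewrite() {
        MiniQuery rewrittenFallback = fallback.rewrite();
        if (rewrittenFallback == fallback) {
            return this;  // fixed point: nothing changed
        }
        // Correct: wrap the rewritten fallback. Wrapping the original
        // `fallback` here would make the loop below spin forever.
        return new MiniRangeQuery(rewrittenFallback);
    }

    // How callers drive rewriting: loop until the query stops changing.
    static MiniQuery fullyRewrite(MiniQuery q) {
        MiniQuery rewritten = q.rewrite();
        while (rewritten != q) {
            q = rewritten;
            rewritten = q.rewrite();
        }
        return rewritten;
    }
}
```

With a fallback that rewrites once and then stabilizes, fullyRewrite terminates and the result wraps the stabilized fallback.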

```java
return new ConstantScoreWeight(this, boost) {
  @Override
  public Scorer scorer(LeafReaderContext context) throws IOException {
    SortedNumericDocValues values = context.reader().getSortedNumericDocValues(field);
```

Contributor:

if you replaced this with a call to DocValues#getSortedNumeric, then this would work for both sorted numeric and single-valued numeric doc values. (Note that it can't return null.)

Member Author:

👍

```java
int maxDoc = context.reader().maxDoc();
if (maxDoc <= 0) {
  return DocIdSetIterator.empty();
}
```

Contributor:

This doesn't seem necessary, as the logic below should work with a maxDoc of 0?

Member Author:

👍

```java
}

int lastDocIdExclusive = high + 1;
return new BoundedDocSetIdIterator(firstDocIdInclusive, lastDocIdExclusive, delegate);
```

Contributor:

Why do we need the delegate? Could we return a DocIdSetIterator#range instead?

Contributor:

Apologies, I should have read the earlier comments. :) I understand now why we have it.

```java
return doc -> {
  int value = leafFieldComparator.compareTop(doc);
  return direction * value;
};
```

Contributor:

I think we need to do the comparison manually instead of using the comparator because of missing values. The comparator will consider documents that don't have a value and documents that have the missing value equal, which doesn't work for us.

Contributor:

Please discard my comment, this makes sense to me now.

@jtibshirani (Member Author):

Thanks @jpountz for the review, it's now ready for another look.

@jimczi (Contributor) commented Jun 26, 2019:

I merged the PR manually, hence closing.

@jimczi jimczi closed this Jun 26, 2019
@jtibshirani jtibshirani deleted the index-sort-range-query branch June 26, 2019 15:53
@jtibshirani jtibshirani changed the title LUCENE-7714 Add a range query that takes advantage of index sorting. LUCENE-7714: Add a range query that takes advantage of index sorting. Aug 14, 2019