LUCENE-9449 Skip docs with _doc sort and "after" #1725

mayya-sharipova · 2020-08-07T15:49:24Z

Enhance DocComparator to provide an iterator over competitive documents
when searching with "after" FieldDoc.
This iterator can quickly position on the desired "after" document,
and skip all documents before "after" or even whole segments
that contain only documents before "after".

Related to LUCENE-9280
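
A minimal sketch of the per-segment skipping decision described above. It assumes the "after" document is given as a global doc ID and that each segment is described by a `docBase` and a `maxDoc`. The class and method names are hypothetical, chosen for illustration, and are not taken from the patch.

```java
final class AfterDocSkipping {
    /** Sentinel mirroring the "no more docs" convention of doc ID iterators. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /**
     * Returns the first segment-local doc ID that sorts after the global "after" doc,
     * or NO_MORE_DOCS if every document in the segment is at or before "after".
     * Assumes afterGlobalDoc is a valid doc ID, well below Integer.MAX_VALUE.
     */
    static int firstCompetitiveDoc(int docBase, int maxDoc, int afterGlobalDoc) {
        int firstGlobal = afterGlobalDoc + 1;
        if (firstGlobal >= docBase + maxDoc) {
            return NO_MORE_DOCS; // the whole segment can be skipped
        }
        // Segments that start after "after" are competitive from their first doc.
        return Math.max(0, firstGlobal - docBase);
    }
}
```

With three segments of 100 documents each and "after" at global doc 149, the first segment is skipped entirely, the second starts at local doc 50, and the third starts at local doc 0.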

jimczi

I did a first pass and left some comments but it looks great. I like the fact that the optimization can be done entirely outside of the TopFieldCollector.

lucene/core/src/java/org/apache/lucene/search/FilteringDocLeafComparator.java

jimczi · 2020-08-11T18:17:01Z

lucene/core/src/java/org/apache/lucene/search/FilteringFieldComparator.java

   * @return comparator wrapped as a filtering comparator or the original comparator if the filtering functionality
   * is not implemented for it
   */
-  public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort) {
+  public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort,
+      boolean hasAfter) {


Do we really need to add hasAfter? Can we instead check whether the topValue in the DocComparator is greater than 0?

At that moment topValue is not set yet; it is set later, in the constructor of PagingFieldCollector.

jimczi · 2020-08-11T18:17:50Z

lucene/core/src/java/org/apache/lucene/search/FieldValueHitQueue.java

   */
  public static <T extends FieldValueHitQueue.Entry> FieldValueHitQueue<T> create(SortField[] fields, int size,
-      boolean filterNonCompetitiveDocs) {
+      boolean filterNonCompetitiveDocs, boolean hasAfter) {


Can we avoid adding hasAfter here ? See my comment below.

jimczi · 2020-08-11T18:18:16Z

lucene/core/src/java/org/apache/lucene/search/FieldValueHitQueue.java

@@ -121,7 +121,7 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) {
  }

  // prevent instantiation and extension.
-  private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) {
+  private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) {


Not sure that hasAfter is really needed here.

At this point, topValue for the comparators is not set yet, which is why we need hasAfter.
As an alternative to this implementation, we can pass the FieldDoc after to FieldValueHitQueue.create and call setTopValue during FieldValueHitQueue creation.

jimczi · 2020-08-11T18:18:31Z

lucene/core/src/java/org/apache/lucene/search/FieldValueHitQueue.java

@@ -95,8 +95,8 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) {
   */
  private static final class MultiComparatorsFieldValueHitQueue<T extends FieldValueHitQueue.Entry> extends FieldValueHitQueue<T> {

-    public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) {
-      super(fields, size, filterNonCompetitiveDocs);
+    public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) {


Not sure that hasAfter is really needed here.

jimczi · 2020-08-11T18:20:17Z

lucene/core/src/java/org/apache/lucene/search/FilteringDocLeafComparator.java

+ * This comparator is used when there is sort by _doc asc together with "after" FieldDoc.
+ * The comparator provides an iterator that can quickly skip to the desired "after" document.
+ */
+public class FilteringDocLeafComparator implements FilteringLeafFieldComparator {


Maybe rename to AfterDocLeafComparator ?

I like AfterDocLeafComparator, but I renamed it to FilteringAfterDocLeafComparator for consistency with all the other filtering comparators. Please let me know if you would still like it to be renamed.

mayya-sharipova · 2020-08-12T21:54:21Z

@jimczi Thank you for the initial feedback. I tried to address it, can you please continue the review

mayya-sharipova · 2020-08-20T19:14:50Z

@jimczi Thank you for the initial feedback. I have tried to address it. Please continue to review, when you have time.

I have caught up with @jimczi offline, and his main comment was that we need to redesign all comparators in such a way that they all provide skipping functionality by default (without wrapping them into FilteringComparator and FilteringLeafComparator). While this will be the target of follow-up PRs, the goal of this PR is to do this only for DocComparator.

So addressing @jimczi's feedback, in this PR:

DocComparator was redesigned to have skipping functionality by default.

jimczi

Thanks @mayya-sharipova. I left two comments that I find important. One regards the introduction of new functions in the filtering leaf comparator; I don't think they are needed. The other regards the early termination of queries sorted by doc id: we don't need to visit more documents if the queue is full and the hits threshold has been reached.

lucene/core/src/java/org/apache/lucene/search/FieldComparator.java

lucene/core/src/java/org/apache/lucene/search/FilteringLeafFieldComparator.java

jimczi · 2020-08-20T20:02:50Z

lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java

+
+        @Override
+        public void setBottom(int slot) {
+            bottom = docIDs[slot];


We should be able to early terminate here if the total hits threshold has been reached. If it's not reached yet, we can early terminate later in setHitsThresholdReached.

Any query sorted by doc id can early terminate after N matches. That's an important aspect of the optimization since it can be handled by the hits threshold transparently. If there is no after value, the threshold should be an upper bound on the number of documents that we will collect in the comparator.

@jimczi Thank you for your comment. I would like to clarify something:

can early terminate after N matches

What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?

Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate?
Is the plan to remove the code around canEarlyTerminate in TopFieldCollector? Should we do that after we also handle early termination for the case where the index sort and the query sort are the same?

What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?

I think so, yes. Updating to an empty iterator is what we do for constant score queries for instance.

Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate?

The code is only for sorted index and while I think that we should move this code in the sort comparator, I agree that it's out of the scope of this PR. Early termination should be handled in the field comparator so that we don't need to add new logic in the main collector.

@jimczi Thanks for the feedback. I have added an early termination after N matches for DocComparator in 21de242.

The code is only for sorted index

We also have an early termination in a collector based on _doc order. I can remove this code later once I can make sure that all queries use DefaultBulkScorer that uses collector's iterator.

Right, sorry, I forgot that we added the early termination logic for _doc order in the collector. Maybe it's ok to leave it as it is then. We can revisit after we ensure that all bulk scorers use the collector's iterator?

@jimczi That sounds good to me. Thank you for the feedback. Do you have any further comments for this PR?
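
The early termination discussed in this thread (swap the competitive iterator for an empty one once the top N matches are collected) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from 21de242: `RangeIterator` and `collectTopN` are hypothetical stand-ins, every document is assumed to match the query, and `numHits` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;

final class DocOrderEarlyTermination {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a competitive iterator over doc IDs [0, maxDoc). */
    static final class RangeIterator {
        private final int maxDoc;
        private int doc = -1;

        RangeIterator(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc; // stay exhausted
            }
            doc = (doc + 1 < maxDoc) ? doc + 1 : NO_MORE_DOCS;
            return doc;
        }
    }

    /** Collects docs in _doc order and stops visiting once numHits docs are collected. */
    static List<Integer> collectTopN(int maxDoc, int numHits) {
        List<Integer> hits = new ArrayList<>();
        RangeIterator competitive = new RangeIterator(maxDoc);
        for (int doc = competitive.nextDoc(); doc != NO_MORE_DOCS; doc = competitive.nextDoc()) {
            hits.add(doc);
            if (hits.size() == numHits) {
                // With an ascending _doc sort, no later doc can be competitive,
                // so replace the iterator with an empty one.
                competitive = new RangeIterator(0);
            }
        }
        return hits;
    }
}
```

Because documents arrive in ascending doc ID order, the first N matches are already the final top N, so nothing after them needs to be visited.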

lucene/core/src/java/org/apache/lucene/search/FilteringLeafFieldComparator.java

lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java

- Redesign numeric comparators so that they provide skipping functionality by default. This resulted in moving the comparators to a separate package.
- Remove unnecessary filtering comparator classes, as comparators now provide skipping functionality by default.
- Remove unnecessary checks in TopFieldCollector.

mayya-sharipova · 2020-08-26T18:42:55Z

@jimczi I have tried to address your comments in 746c8fa. Can you please continue to review when you have time.

jimczi

I left two questions but it's getting close. Thanks for iterating on this @mayya-sharipova !

jimczi · 2020-09-01T19:30:00Z

lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java

+    /** Creates a new comparator based on document ids for {@code numHits} */
+    public DocComparator(int numHits, boolean reverse) {
+        this.docIDs = new int[numHits];
+        // skipping functionality is enabled if we are sorting by _doc in asc order


Should it be activated only if the comparator is used as the primary sort ?

@jimczi Thanks Jim. We also had an implicit protection for a primary sort in MultiLeafFieldComparator.java, but I agree it is good to explicitly indicate a primary sort.

Addressed in 485fe4f

We also had an implicit protection for a primary sort

Ok I see so the expectation is that setHitsThresholdReached is only called on the primary sort and we use it as a signal to enable skipping.

Addressed in 485fe4f

I was thinking that we could use the information exposed in SortField#getComparator, so we don't need to add a new callback. However, I think we can rely on what you describe above: setHitsThresholdReached and competitiveIterator are the callbacks that we need to enable the optimization. Sorry for the back and forth, but we probably don't need 485fe4f ;).

@jimczi Thanks for the feedback.

I was thinking that we could use the information exposed in SortField#getComparator, we don't need to add a new callback

Good point, I hadn't noticed that before. I've refactored the code to use sortPos from this function. Addressed in 92fa246

However, I think we can rely on what you describe above, setHitsThresholdReached and competitiveIterator are the callbacks that we need to enable the optimization. ... we probably don't need 485fe4f

Restricting the sort optimization to the primary sort in MultiLeafFieldComparator is implicit, and a future refactoring of that class could produce wrong results. So I think it is good to have the extra protection.
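
The explicit guard discussed in this thread can be sketched as below, assuming the comparator learns its position in the sort from the `sortPos` value mentioned above. `DocComparatorSketch` is a hypothetical class for illustration and may not match the constructor that was merged.

```java
final class DocComparatorSketch {
    final int[] docIDs;
    final boolean enableSkipping;

    /**
     * @param numHits number of top hits to keep
     * @param reverse true for descending _doc order
     * @param sortPos position of this comparator in the sort; 0 means primary sort
     */
    DocComparatorSketch(int numHits, boolean reverse, int sortPos) {
        this.docIDs = new int[numHits];
        // Skipping is only safe for an ascending _doc sort used as the primary sort:
        // as a secondary sort, a doc may still be competitive on the primary key.
        this.enableSkipping = (reverse == false && sortPos == 0);
    }
}
```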

jimczi · 2020-09-01T19:31:41Z

lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java

+        private void updateIterator() {
+            if (enableSkipping == false || hitsThresholdReached == false) return;
+            if (bottomValueSet) {
+                // since we've collected top N matches, we can early terminate


Can you add a comment explaining that early termination is already implemented in the collector, but that we'll remove it in a follow-up?

addressed in 485fe4f

jimczi

I left some comments around testing but the changes look good to me.

lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java

mayya-sharipova · 2020-09-04T10:33:04Z

@jimczi Thanks for the review so far, I am wondering if you have any further comments?

- Enhance DocComparator to provide an iterator over competitive documents when searching with "after". This iterator can quickly position on the desired "after" document, skipping all documents and segments before "after".
- Redesign numeric comparators to move to a separate package.

Backport for #LUCENE-9449

Some collectors provide iterators that can efficiently skip non-competitive docs. When using the DefaultBulkScorer#score function, we create a conjunction of scorerIterator and collectorIterator. Because collectorIterator always starts from docID = -1, and creating a conjunction iterator requires all of its sub-iterators to be on the same doc, creating the conjunction iterator fails if scorerIterator has already been advanced to some other document. This patch ensures that we create the conjunction between scorerIterator and collectorIterator only if scorerIterator has not been advanced yet. Relates to apache#1725 Relates to apache#1937
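
To illustrate the constraint described in this commit message, here is a self-contained sketch of a two-way conjunction that refuses to start unless both sub-iterators are on the same doc. `ArrayIterator`, `canUseCollectorIterator`, and `intersect` are hypothetical names; this is not Lucene's conjunction implementation, only a leapfrog intersection with the same precondition.

```java
import java.util.ArrayList;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical iterator over a sorted array of doc IDs; unpositioned at -1. */
    static final class ArrayIterator {
        private final int[] docs;
        private int idx = -1;

        ArrayIterator(int... docs) {
            this.docs = docs;
        }

        int docID() {
            if (idx < 0) {
                return -1;
            }
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        /** Moves forward to the first doc >= target that lies beyond the current position. */
        int advance(int target) {
            do {
                idx++;
            } while (idx < docs.length && docs[idx] < target);
            return docID();
        }
    }

    /** The guard from the commit message: only intersect if the scorer is unadvanced. */
    static boolean canUseCollectorIterator(ArrayIterator scorer) {
        return scorer.docID() == -1;
    }

    static List<Integer> intersect(ArrayIterator scorer, ArrayIterator collector) {
        if (scorer.docID() != collector.docID()) {
            throw new IllegalArgumentException("sub-iterators must be on the same doc");
        }
        List<Integer> out = new ArrayList<>();
        int doc = scorer.advance(0);
        while (doc != NO_MORE_DOCS) {
            int other = collector.docID() < doc ? collector.advance(doc) : collector.docID();
            if (other == doc) {
                out.add(doc);
                doc = scorer.advance(doc + 1);
            } else if (other == NO_MORE_DOCS) {
                break;
            } else {
                doc = scorer.advance(other); // leapfrog the scorer up to the collector
            }
        }
        return out;
    }
}
```

In this sketch a caller would check `canUseCollectorIterator` first and fall back to the scorer's iterator alone when it returns false.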

mayya-sharipova requested a review from jimczi August 7, 2020 15:49

jimczi reviewed Aug 11, 2020

Address feedback

5fcddf7

Implement filtering functionality in DocComparator

fabfca5

mayya-sharipova force-pushed the sort-by-doc-optim branch from 8616330 to fabfca5 Compare August 20, 2020 19:26

jimczi requested changes Aug 20, 2020

mayya-sharipova added 3 commits August 31, 2020 16:27

DocComparator return empty iterator after topN hits

21de242

Merge remote-tracking branch 'upstream/master' into sort-by-doc-optim

26534c5

Add package info

4252fcd

jimczi reviewed Sep 1, 2020

mayya-sharipova added 3 commits September 1, 2020 17:24

Enable skipping functionality only on primary sort

485fe4f

Add a note to remove early termination in collector

Revert "Enable skipping functionality only on primary sort"

a7c4e85

This reverts commit 485fe4f.

Enable skipping functionality only on primary sort

92fa246

Add a note to remove early termination in collector

mayya-sharipova force-pushed the sort-by-doc-optim branch from 1c79ae6 to 92fa246 Compare September 2, 2020 14:34

Adding documentation to leaf comparators

27ff519

jimczi approved these changes Sep 4, 2020

mayya-sharipova added 3 commits September 8, 2020 10:10

Address feedback for test

493d9eb

Merge remote-tracking branch 'upstream/master' into sort-by-doc-optim

45577aa

Make a constructor of NumericComparator protected

e69fa27

mayya-sharipova merged commit 9922067 into apache:master Sep 8, 2020

mayya-sharipova deleted the sort-by-doc-optim branch September 8, 2020 18:16

mayya-sharipova mentioned this pull request Sep 10, 2020

LUCENE-9449 Skip docs with _doc sort and "after" (#1725) #1856

Merged

mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Sep 21, 2020

Fix bug in sort optimization

4773625

Fix bug how iterator with skipping functionality advances and produces docs Relates to apache#1725

mayya-sharipova mentioned this pull request Sep 21, 2020

Fix bug in sort optimization #1903

Merged

mayya-sharipova added a commit that referenced this pull request Sep 23, 2020

Fix bug in sort optimization (#1903)

7d90b85

Fix bug how iterator with skipping functionality advances and produces docs Relates to #1725

mayya-sharipova mentioned this pull request Sep 23, 2020

Fix bug in sort optimization (#1903) #1915

Merged

mayya-sharipova added a commit that referenced this pull request Sep 23, 2020

Fix bug in sort optimization (#1903) (#1915)

da03351

Fix bug how iterator with skipping functionality advances and produces docs Relates to #1725 Backport for #1903

mayya-sharipova mentioned this pull request Oct 2, 2020

LUCENE-9555: Advance conjuction Iterator for two phase iteration #1943

Merged

asfimport mentioned this pull request Jul 7, 2021

Skip non-competitive documents when sort by _doc with search after [LUCENE-9449] apache/lucene#10489

Closed
