Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9449 Skip docs with _doc sort and "after" #1725

Merged
merged 14 commits into from
Sep 8, 2020

Conversation

mayya-sharipova
Copy link
Contributor

Enhance DocComparator to provide an iterator over competitive documents
when searching with "after" FieldDoc.
This iterator can quickly position on the desired "after" document,
and skip all documents before "after" or even whole segments
that contain only documents before "after".

Related to LUCENE-9280

Enhance DocComparator to provide an iterator over competitive documents
when searching with "after" FieldDoc.
This iterator can quickly position on the desired "after" document,
and skip all documents before "after" or even whole segments
that contain only documents before "after".

Related to LUCENE-9280
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass and left some comments but it looks great. I like the fact that the optimization can be done entirely outside of the TopFieldCollector.

* @return comparator wrapped as a filtering comparator or the original comparator if the filtering functionality
* is not implemented for it
*/
public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort) {
public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort,
boolean hasAfter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to add the hasAfter ? Can we check the if the topValue in the DocComparator is greater than 0 instead ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that moment topValue is not set yet, it will be set later in the constructor of PagingFieldCollector.

*/
public static <T extends FieldValueHitQueue.Entry> FieldValueHitQueue<T> create(SortField[] fields, int size,
boolean filterNonCompetitiveDocs) {
boolean filterNonCompetitiveDocs, boolean hasAfter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid adding hasAfter here ? See my comment below.

@@ -121,7 +121,7 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) {
}

// prevent instantiation and extension.
private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) {
private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that hasAfter is really needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point of time, topValue for comparators is not set yet, that's why we need hasAfter.
As an alternative to this implementation, we can pass FieldDoc after to FieldValueHitQueue.create and setTopValue during FieldValueHitQueue creation.

@@ -95,8 +95,8 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) {
*/
private static final class MultiComparatorsFieldValueHitQueue<T extends FieldValueHitQueue.Entry> extends FieldValueHitQueue<T> {

public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) {
super(fields, size, filterNonCompetitiveDocs);
public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that hasAfter is really needed here.

* This comparator is used when there is sort by _doc asc together with "after" FieldDoc.
* The comparator provides an iterator that can quickly skip to the desired "after" document.
*/
public class FilteringDocLeafComparator implements FilteringLeafFieldComparator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to AfterDocLeafComparator ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like AfterDocLeafComparator, but I renamed to FilteringAfterDocLeafComparator for consistency with all other filtering comparators. Please let me know if you still like it to be renamed

@mayya-sharipova
Copy link
Contributor Author

@jimczi Thank you for the initial feedback. I tried to address it, can you please continue the review

@mayya-sharipova
Copy link
Contributor Author

@jimczi Thank you for the initial feedback. I have tried to address it. Please continue to review, when you have time.

I have caught up with @jimczi offline, and his main comment was that we need to redesign all comparators is such a way that they all provide skipping functionality by default (without wrapping them into FilteringComparator and FilteringLeafComparator). While this be a target of following PRs, the goal of this PR is do this only for DocComparator.

So addressing @jimczi's feedback, in this PR:

  • DocComparator was redesigned to have skipping functionality be default.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mayya-sharipova . I left two comments that I find important, one regarding the introduction of new functions in the filtering leaf comparator. I don't think they are needed. And the other regarding the early termination of queries sorted by doc id. We don't need to visit more documents if the queue is full and the the hits threshold has been reached.


@Override
public void setBottom(int slot) {
bottom = docIDs[slot];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to early terminate here if the total hits threshold has been reached. If it's not reached yet, we can early terminate later in setHitsThresholdReached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any query sorted by doc id can early terminate after N matches. That's an important aspect of the optimization since it can be handled by the hits threshold transparently. If there is no after value, the threshold should be an upper bound of the number of document that we will collect in the comparator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Thank you for your comment. I would like to clarify something:

can early terminate after N matches

What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?

Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate?
Is the plan to remove the code around canEarlyTerminate in TopFieldCollector? Should we do after we also handle the case on early termination with the same index and query sort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?

I think so, yes. Updating to an empty iterator is what we do for constant score queries for instance.

Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate?

The code is only for sorted index and while I think that we should move this code in the sort comparator, I agree that it's out of the scope of this PR. Early termination should be handled in the field comparator so that we don't need to add new logic in the main collector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Thanks for the feedback. I have added an early termination after N matches for DocComparator in 21de242.

The code is only for sorted index

We also have an early termination in a collector based on _doc order. I can remove this code later once I can make sure that all queries use DefaultBulkScorer that uses collector's iterator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry I forgot that we added the early termination logic for _doc order in the collector. Maybe that's ok to leave it as it is then. We can revise after we ensure that all bulk scorer uses the collector's iterator ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi That sounds good to me. Thank you for the feedback. Do you have any further comments for this PR?

- Redesign numeric comparators so by default they provide skipping
functionality. This resuled in moving comparators to a separate
package.

- Remove unnecessary filtering comparator classes, as by default
comparators provide skipping functionality

- Remove unncessary checks in TopFieldCollector
@mayya-sharipova
Copy link
Contributor Author

@jimczi I have tried to address your comments in 746c8fa. Can you please continue to review when you have time.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two questions but it's getting close. Thanks for iterating on this @mayya-sharipova !

/** Creates a new comparator based on document ids for {@code numHits} */
public DocComparator(int numHits, boolean reverse) {
this.docIDs = new int[numHits];
// skipping functionality is enabled if we are sorting by _doc in asc order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be activated only if the comparator is used as the primary sort ?

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Thanks Jim. We also had an implicit protection for a primary sort in MultiLeafFieldComparator.java, but I agree it is good to explicitly indicate a primary sort.

Addressed in 485fe4f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also had an implicit protection for a primary sort

Ok I see so the expectation is that setHitsThresholdReached is only called on the primary sort and we use it as a signal to enable skipping.

Addressed in 485fe4f

I was thinking that we could use the information exposed in SortField#getComparator, we don't need to add a new callback. However, I think we can rely on what you describe above, setHitsThresholdReached and competitiveIterator are the callbacks that we need to enable the optimization. Sorry for the back and forth but I we probably don't need 485fe4f ;).

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Thanks for the feedback.

I was thinking that we could use the information exposed in SortField#getComparator, we don't need to add a new callback

Good point, I haven't noticed that before, I've refactored the code to use sortPos from this function. Addressed in 92fa246

However, I think we can rely on what you describe above, setHitsThresholdReached and competitiveIterator are the callbacks that we need to enable the optimization. ... we probably don't need 485fe4f

Doing sort optimization only on primary sort in MultiLeafFieldComparator is implicit and some refactoring of it may bring wrong result. So I think it is good to have extra protection

private void updateIterator() {
if (enableSkipping == false || hitsThresholdReached == false) return;
if (bottomValueSet) {
// since we've collected top N matches, we can early terminate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining that early termination is already implemented in the collector but we'll remove in a follow up ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 485fe4f

Add a note to remove early termination in collector
Add a note to remove early termination in collector
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments around testing but the changes look good to me.

@mayya-sharipova
Copy link
Contributor Author

@jimczi Thanks for the review so far, I am wondering if you have any further comments?

@mayya-sharipova mayya-sharipova merged commit 9922067 into apache:master Sep 8, 2020
@mayya-sharipova mayya-sharipova deleted the sort-by-doc-optim branch September 8, 2020 18:16
mayya-sharipova added a commit that referenced this pull request Sep 17, 2020
- Enhance DocComparator to provide an iterator over competitive
documents when searching with "after". This iterator can quickly position
on the desired "after" document skipping all documents and segments before
"after".

- Redesign numeric comparators to move to separate package.

Backport for #LUCENE-9449
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Sep 21, 2020
Fix bug how iterator with skipping functionality
advances and produces docs

Relates to apache#1725
mayya-sharipova added a commit that referenced this pull request Sep 23, 2020
Fix bug how iterator with skipping functionality
advances and produces docs

Relates to #1725
mayya-sharipova added a commit that referenced this pull request Sep 23, 2020
Fix bug how iterator with skipping functionality
advances and produces docs

Relates to #1725
Backport for #1903
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Oct 2, 2020
Some collectors provide iterators that can efficiently skip
non-competitive docs. When using DefaultBulkScorer#score function
we create a conjunction of scorerIterator and collectorIterator.
As collectorIterator always starts from a docID = -1,
and for creation of conjunction iterator we need all of its
sub-iterators to be on the same doc, the creation of conjuction
iterator will fail if scorerIterator has already been advanced
to some other document.

This patch ensures that we create conjunction between scorerIterator
and collectorIterator only if scorerIterator has not been advanced yet.

Relates to apache#1725
Relates to apache#1937
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants