LUCENE-9280: Collectors to skip noncompetitive documents #1351

Merged

mayya-sharipova
Contributor

Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.

This could be useful if we want to sort by some field.

@mayya-sharipova
Contributor Author

@jimczi I have created a draft PR for comparators and collectors to skip non-competitive docs. Can you please have a look at it and see if we are happy with this approach?

@mayya-sharipova mayya-sharipova changed the title from "Collectors to skip noncompetitive documents" to "LUCENE-9280: Collectors to skip noncompetitive documents" Mar 16, 2020
Contributor

@jimczi jimczi left a comment

It's a great start @mayya-sharipova ! I left some comments to make this change less invasive but I really like the simplicity of the new long sort field.
It could also be nice to run some early benchmarks with luceneutil to show how useful this change can be for numeric sort ?


public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator {
abstract DocIdSetIterator iterator();
abstract void updateIterator() throws IOException;
Contributor

Why do we need this ? We could update the iterator every time a bottom value is set ?

Contributor Author

@mayya-sharipova mayya-sharipova Mar 18, 2020

Indeed it is more straightforward to just update an iterator in setBottom function of a comparator.

But I was thinking it is better to have a special function for two reasons:

  1. After updating an iterator, in TopFieldCollector we need to change
    totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;

  2. we also need to check hitsThresholdChecker.isThresholdReached(), and passing hitsThresholdChecker, an object not strictly related to comparison, to a comparator's constructor doesn't look nice to me.

Please let me know if you think otherwise

Contributor

For 1. we could set the totalHitsRelation when we reach the total hits threshold in the TOP_DOCS mode ?
For 2. I wonder if we could pass the hitsThresholdChecker to the LeafFieldComparator like we do for the scorer ?
This way we can update the iterator internally when a new bottom is set or when compareBottom is called ?

Contributor

The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?

Contributor Author

@msokolov Thanks for the suggestion, naming is tough, addressed in 95e1bc1.

return PointValues.Relation.CELL_CROSSES_QUERY;
}
};
pointValues.intersect(visitor);
Contributor

We should update the iterator only if it allows us to skip "lots" of documents; in the distance feature query we set the threshold to an 8x reduction.
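
To make the threshold concrete, here is a minimal, self-contained sketch in plain Java with no Lucene dependency. The class name `SkipThresholdSketch`, its methods, and the sorted array used as a stand-in for a points-index estimate are all hypothetical; only the 8x factor comes from the comment above.

```java
/** Hypothetical helper: decides whether building a narrower competitive iterator is worthwhile. */
final class SkipThresholdSketch {
    // Assumed factor, taken from the "8x reduction" mentioned above.
    static final int MIN_REDUCTION_FACTOR = 8;

    /** Counts values strictly below bottom in an ascending array (stand-in for a points-index estimate). */
    static long countCompetitive(long[] sortedValues, long bottom) {
        int lo = 0, hi = sortedValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedValues[mid] < bottom) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** True only if the new iterator would visit at most 1/8 of the docs the current one visits. */
    static boolean shouldUpdateIterator(long currentCost, long estimatedCompetitive) {
        return estimatedCompetitive <= currentCost / MIN_REDUCTION_FACTOR;
    }

    public static void main(String[] args) {
        long[] values = new long[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i; // one value per doc, already sorted
        }
        for (long bottom : new long[] {100, 200}) {
            long competitive = countCompetitive(values, bottom);
            System.out.println("bottom=" + bottom + " competitive=" + competitive
                + " update=" + shouldUpdateIterator(values.length, competitive));
        }
    }
}
```

The point of the guard is that rebuilding the iterator has a cost of its own, so it is only paid when the expected saving is large.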

Contributor Author

Addressed in 6384b15

return iterator;
}

public void updateIterator() throws IOException {
Contributor

We should throttle the checks here (if the bottom value changes frequently). In the distance feature query we start throttling after 256 calls, we should replicate here ?
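
A minimal sketch of this kind of throttling in plain Java. The 256-call warm-up comes from the comment above; the class name `UpdateThrottleSketch` and the every-32nd-call cadence after the warm-up are assumptions for illustration, not necessarily what this PR ended up doing.

```java
/** Hypothetical throttle for an expensive "update the iterator" step. */
final class UpdateThrottleSketch {
    private int updateCounter = 0;

    /** Returns true when the expensive iterator update should actually be attempted. */
    boolean shouldAttemptUpdate() {
        updateCounter++;
        // First 256 calls: always attempt. Afterwards: only every 32nd call (assumed cadence).
        return updateCounter <= 256 || (updateCounter & 0x1f) == 0x1f;
    }

    public static void main(String[] args) {
        UpdateThrottleSketch throttle = new UpdateThrottleSketch();
        int attempts = 0;
        for (int i = 0; i < 1024; i++) {
            if (throttle.shouldAttemptUpdate()) {
                attempts++;
            }
        }
        System.out.println("attempted " + attempts + " of 1024 calls");
    }
}
```

The bitmask keeps the check cheap, which matters because it runs on every change of the bottom value.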

Contributor Author

Addressed in 6384b15


@Override
public void setBottom(int slot) {
this.bottom = values[slot];
Contributor

Can you update the iterator here ? We would need to check the total hits threshold so maybe pass the HitsThresholdChecker in the ctr somehow ?

} else {
LongPoint.encodeDimension(bottom, minValueAsBytes, 0);
};

Contributor

you should also take the topValue into account here (searchAfter) ?

Contributor Author

Addressed in 6384b15

lucene/core/src/java/org/apache/lucene/search/Weight.java (outdated, resolved)
Contributor

@jimczi jimczi left a comment

> I could not think of any clever way to do this in IndexSearcher, I would appreciate your help if you can suggest any such way. I just redesigned DefaultBulkScorer to use a conjunction of a scorer's and collector's iterators.

I left some comments regarding the refactor but I like it better. I think you're right, the bulk scorer is a good entry point to handle the leaf collector iterator.

}
}

// conjunction iterator between scorer's iterator and collector's iterator
Contributor

you can replace this with ConjunctionDISI#intersectIterators ?
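
For readers unfamiliar with the idea, the conjunction of the scorer's iterator and the collector's iterator can be sketched as a leapfrog intersection. This is plain Java over sorted arrays; `ConjunctionSketch` and `ArrayDocIterator` are hypothetical stand-ins for illustration and are not Lucene's ConjunctionDISI.

```java
import java.util.ArrayList;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal stand-in for a doc-ID iterator over a sorted array of doc IDs. */
    static final class ArrayDocIterator {
        private final int[] docs;
        private int idx = -1;

        ArrayDocIterator(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Moves forward to the first doc >= target; target must be beyond the current doc. */
        int advance(int target) {
            do {
                idx++;
            } while (idx < docs.length && docs[idx] < target);
            return docID();
        }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }
    }

    /** Collects only docs present in both iterators, letting each side skip ahead to the other. */
    static List<Integer> intersect(ArrayDocIterator scorer, ArrayDocIterator competitive) {
        List<Integer> hits = new ArrayList<>();
        int doc = scorer.advance(0);
        while (doc != NO_MORE_DOCS) {
            int other = competitive.docID() < doc ? competitive.advance(doc) : competitive.docID();
            if (other == doc) {
                hits.add(doc);
                doc = scorer.advance(doc + 1);
            } else if (other == NO_MORE_DOCS) {
                break; // no competitive docs left, nothing more to collect
            } else {
                doc = scorer.advance(other); // other > doc: skip the scorer ahead
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ArrayDocIterator scorer = new ArrayDocIterator(new int[] {1, 3, 5, 7, 9});
        ArrayDocIterator competitive = new ArrayDocIterator(new int[] {3, 4, 7, 10});
        System.out.println(intersect(scorer, competitive));
    }
}
```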

Contributor Author

Comment on lines 226 to 266
if (twoPhase == null) {
while (currentDoc < end) {
if (acceptDocs == null || acceptDocs.get(currentDoc)) {
collector.collect(currentDoc);
}
currentDoc = iterator.nextDoc();
}
return currentDoc;
} else {
final DocIdSetIterator approximation = twoPhase.approximation();
while (currentDoc < end) {
if ((acceptDocs == null || acceptDocs.get(currentDoc)) && twoPhase.matches()) {
collector.collect(currentDoc);
}
currentDoc = approximation.nextDoc();
while (currentDoc < end) {
if ((acceptDocs == null || acceptDocs.get(currentDoc)) && (twoPhase == null || twoPhase.matches())) {
collector.collect(currentDoc);
}
return currentDoc;
currentDoc = iterator.nextDoc();
}
return currentDoc;
Contributor

this change is not required ? I see hotspot in the javadoc comment above so we shouldn't touch it if it's not required ;).

Contributor Author

Addressed in d732d7e

doc = iterator.advance(min);
} else {
doc = twoPhase.approximation().advance(min);
if (doc < min) scorerIterator.advance(min);
Contributor

Suggested change
if (doc < min) scorerIterator.advance(min);
if (doc < min) {
doc = combinedIterator.advance(min);
}

?

Contributor Author

Addressed in d732d7e

this.bottom = values[slot];
// can't use hitsThresholdChecker.isThresholdReached() as it uses > numHits,
// while we want to update iterator as soon as threshold reaches numHits
if (hitsThresholdChecker != null && (hitsThresholdChecker.getHitsThreshold() >= numHits)) {
Contributor Author

@mayya-sharipova mayya-sharipova Mar 19, 2020

@jimczi I am not very happy about this change, for 2 reasons:

  1. We can't use hitsThresholdChecker.isThresholdReached as it checks for greater than numHits, but we need to check starting with equal, because if there are no competitive docs later, setBottom will not be called.
    Do you know the reason why hitsThresholdChecker.isThresholdReached checks for greater than numHits and not greater than or equal to numHits?
  2. totalHitsRelation may not end up being set to TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, as we set it only when we have later competitive hits.

I think it is better to keep the previous implementation with a dedicated updateIterator function called from TopFieldCollector. WDYT?

Contributor

@msokolov msokolov left a comment

I'm a little fuzzy on my understanding of how you are making use of Points here, but I left a few micro-comments. I'll echo Jim's comment; it'd be great to see some results from luceneutil (or any reproducible benchmark) demonstrating this new idea.


return;
}

final byte[] maxValueAsBytes = reverse == false ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null;
Contributor

Can we move this initialization into the constructor, or is this not shareable and must be local storage? I think we call updateIterator in collect() right? If we can avoid object creation in an inner loop, that would be good. We could create both arrays unconditionally I think and set a boolean here to be used below?

Contributor Author

@msokolov Thanks for the suggestion, indeed these values can be initialized in the comparator's constructor. As each TopFieldCollector has its own comparator and processes segments sequentially, these values should be shareable. Addressed in 95e1bc1

@mayya-sharipova
Contributor Author

mayya-sharipova commented Mar 26, 2020

I have run some benchmarking using luceneutil.
As the new sort optimization uses a new LongDocValuesPointSortField that is not present in luceneutil, I had to hack luceneutil as follows:

  1. I added a sort task on a long field, TermDateTimeSort, to wikimedium.1M.nostopwords.tasks. This task was present in wikinightly.tasks, but was not available for the wikimedium 1M and 10M tasks.
  2. I indexed the corresponding field lastModNDV as LongPoint as well. It was only indexed as NumericDocValuesField before, but for the sort optimization we need long values to be indexed both as docValues and as points.
  3. I modified SearchTask.java to have TopFieldCollector with totalHitsThreshold set to topK: final TopFieldCollector c = TopFieldCollector.create(s, topN, null, topN); Sort optimization only works when we set the total hits threshold.
  4. For the patch version, I modified sort in TaskParser.java. Instead of lastModNDVSort = new Sort(new SortField("lastModNDV", SortField.Type.LONG)); I used the optimized sort: lastModNDVSort = new Sort(new LongDocValuesPointSortField("lastModNDV"));

Here the main point of comparison is TermDTSort, as it is the only sort on a long field. The other sorts are presented to show whether or not they regress.


wikimedium1m

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    507.20  (11.2%)                   550.02  (16.1%)
HighTermMonthSort             550.06  (10.4%)                   443.69  (16.1%)
HighTermDayOfYearSort         105.62  (24.9%)                    91.93  (22.1%)

wikimedium10m

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    147.64  (11.5%)                   547.80   (6.6%)
HighTermMonthSort             147.85  (12.2%)                   239.28   (7.3%)
HighTermDayOfYearSort          74.44   (7.7%)                    42.56  (12.1%)

For wikimedium1m, TermDTSort using LongDocValuesPointSortField doesn't seem to have much effect, probably because segments in this index are smaller and the optimization was completely skipped on them.
For wikimedium10m, TermDTSort using LongDocValuesPointSortField instead of the usual SortField.Type.LONG brings about 3x speedups.
There are some regressions/speedups for the HighTermMonthSort and HighTermDayOfYearSort sort tasks; I don't know the reason, as they should not be affected.

@mayya-sharipova mayya-sharipova marked this pull request as ready for review March 26, 2020 20:05
@msokolov
Contributor

That 3x speedup is very nice! My experience with these benchmarks is they can be pretty noisy, maybe accounting for the regressions? I tend to increase comp.taskRepeatCount = 500. I'd also be interested to see how this optimization fares for higher values of topN - I think the default is 10, but you can edit in benchUtil.py. You did not sort the index right (eg: comp.newIndex('baseline', sourceData, facets=facets, indexSort='lastModNDV:long', addDVFields=True)? It would be interesting to see if this has the same impact for sorted index, large N, especially running with an executor (.competitor(...concurrentSearchers = True ).

@mayya-sharipova
Contributor Author

mayya-sharipova commented Mar 27, 2020

Update: these are wrong results. Please disregard them

@msokolov Thanks for suggesting additional benchmarks that we can use.
Below are the results on the dataset wikimedium10m.

First I will repeat the results from the previous round of benchmarking:

topN=10, taskRepeatCount = 20, concurrentSearchers = False

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    147.64  (11.5%)                   547.80   (6.6%)
HighTermMonthSort             147.85  (12.2%)                   239.28   (7.3%)
HighTermDayOfYearSort          74.44   (7.7%)                    42.56  (12.1%)

topN=10, taskRepeatCount = 500, concurrentSearchers = False

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    184.60   (8.2%)                  3046.19   (4.4%)
HighTermMonthSort             209.43   (6.5%)                   253.90  (10.5%)
HighTermDayOfYearSort         130.97   (5.8%)                    73.25  (11.8%)

This seemed to speed up all operations, and here the speedup for TermDTSort is even bigger: 16.5x. There also seems to be more regression for HighTermDayOfYearSort.


topN=500, taskRepeatCount = 20, concurrentSearchers = False

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    210.24   (9.7%)                   537.65   (6.7%)
HighTermMonthSort             116.02   (8.9%)                   189.96  (13.5%)
HighTermDayOfYearSort          42.33   (7.6%)                    67.93   (9.3%)

With increased topN the sort optimization yields smaller speedups, up to 2x, which is expected, as it can only run after topN docs have been collected.


topN=10, taskRepeatCount = 20, concurrentSearchers = True

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    132.09  (14.3%)                   287.93  (11.8%)
HighTermMonthSort             211.01  (12.2%)                   116.46   (7.1%)
HighTermDayOfYearSort          72.28   (6.1%)                    68.21  (11.4%)

With concurrent searchers the speedups are also smaller, up to 2x. This is expected, as segments are now spread between several TopFieldCollectors/Comparators that don't exchange bottom values. As a follow-up to this PR, we can think about how to have a global bottom value, similar to how MaxScoreAccumulator is used to set up a global competitive min score.
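
As a rough sketch only, one way such a shared bottom value could look in plain Java. `GlobalBottomSketch` and its methods are hypothetical; the sketch assumes an ascending sort on a long field and that every collector uses the same topN, so that any collector with a full queue can publish its local bottom as a bound for all the others.

```java
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical shared bound for concurrent collectors sorting ascending on a long field. */
final class GlobalBottomSketch {
    // The tightest bound is the smallest local bottom published by any collector so far.
    private final LongAccumulator globalBottom = new LongAccumulator(Math::min, Long.MAX_VALUE);

    /** Called by a collector once its own top-N queue is full. Safe to call from many threads. */
    void offerLocalBottom(long localBottom) {
        globalBottom.accumulate(localBottom);
    }

    /** Ties are kept, which is conservative: they may still lose on doc ID but are never wrongly skipped. */
    boolean isCompetitive(long value) {
        return value <= globalBottom.get();
    }

    public static void main(String[] args) {
        GlobalBottomSketch shared = new GlobalBottomSketch();
        shared.offerLocalBottom(500); // collector A filled its queue
        shared.offerLocalBottom(200); // collector B filled its queue with better values
        System.out.println(shared.isCompetitive(150));
        System.out.println(shared.isCompetitive(300));
    }
}
```

The bound is valid because a collector whose queue holds topN values at or below its bottom already proves that the global top hits cannot contain anything above that bottom.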


with indexSort='lastModNDV:long' topN=10, taskRepeatCount = 20, concurrentSearchers = False

Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev
TermDTSort                    321.75  (11.5%)                   364.83   (7.8%)
HighTermMonthSort             205.20   (5.7%)                   178.16   (7.8%)
HighTermDayOfYearSort          66.07  (12.0%)                    58.84   (9.3%)

Contributor

@msokolov msokolov left a comment

This is very compelling! I think you've addressed most of the outstanding comments: it seems like only the question about when to update the iterator remains (discussion below about moving it to setBottom). I'm not too concerned either way.

private int maxDoc;
private int maxDocVisited;
private int updateCounter = 0;
private byte[] cmaxValueAsBytes = null;
Contributor

Can these be final, and allocated only in the constructor? I think it might be clearer to add a boolean "hasTopValues" and set that in setTopValue, rather than use the existence of these byte[]? Then you could make these final and eliminate the local variables where they get copied below

Contributor

@msokolov msokolov left a comment

Um I approved and then realized - there is still the mystery of the regressions you observed in the non-optimized cases. I think we should try to understand where that is coming from before committing this?

Ensure optimized sort works as expected (as long sort)
of a field that is not indexed with points.
@mayya-sharipova
Contributor Author

@msokolov Thank you for an additional review. I realized I ran benchmarks incorrectly, not indexing documents with docValues. Sorry, I am still learning the Lucene benchmarking tool. Please disregard the previous benchmarking results, I will be rerunning them.

@msokolov
Contributor

@mayya-sharipova sounds good - I'd also encourage you to post a PR with your modifications to luceneutil

@mayya-sharipova
Contributor Author

@msokolov Sorry again for reporting incorrect benchmarking results. Below are my latest results, and I feel quite confident in their correctness.

First about the benchmarking setup.

  1. Here are the changes made to luceneutil
  2. The patch folder is a checkout of this PR.
  3. The trunk folder is also a checkout of this PR, with a modification. As there is no LongDocValuesPointSortField in master, I can't benchmark sorting using this field on master. What I did in the trunk
    folder is delegate sorting to the traditional sort on a long field, like this:
public class LongDocValuesPointSortField extends SortField {
    public LongDocValuesPointSortField(String field) {
        super(field, SortField.Type.LONG);
    }
    public LongDocValuesPointSortField(String field, boolean reverse) {
        super(field, SortField.Type.LONG, reverse);
    }
}

So basically I was benchmarking a traditional long sort VS a long sort using a new field LongDocValuesPointSortField.

wikimedium10m: 10 million docs, up to 2x speedups

                   Task   QPS baseline    StdDev   QPS patch    StdDev   Pct diff
             TermDTSort          64.53    (6.4%)      155.29   (42.3%)   140.7% (  86% -  202%)
  HighTermDayOfYearSort          47.63    (5.4%)       50.47    (6.8%)     6.0% (  -5% -   19%)
      HighTermMonthSort         110.07    (7.3%)      121.13    (6.8%)    10.0% (  -3% -   26%)
WARNING: cat=TermDTSort: hit counts differ: 754451 vs 1669+

wikimediumall: about 33 million docs, up to 3.5 x speedups

                   Task   QPS baseline    StdDev   QPS patch    StdDev   Pct diff
             TermDTSort          28.96    (4.3%)      108.45   (56.9%)   274.5% ( 204% -  350%)
  HighTermDayOfYearSort           9.69    (5.1%)        9.56    (6.1%)    -1.3% ( -11% -   10%)
      HighTermMonthSort          39.41    (4.7%)       47.99   (10.0%)    21.8% (   6% -   38%)
WARNING: cat=TermDTSort: hit counts differ: 1474717 vs 1070+

Please let me know if these results and methodology make sense.

Contributor

@jpountz jpountz left a comment

The API looks good to me: the additional ScoreMode enum constants and the new LeafFieldComparator#iterator method. I wonder whether we could make it easier to write implementations. I haven't spent much time thinking about it, but for instance would it be possible to wrap existing comparators to add the skipping functionality? Alternatively we could add the skipping logic to the existing comparators, but the fact that Lucene doesn't require that the same data be stored in indexes and doc values makes me a bit nervous about enabling it by default, and I'd like to avoid adding a new constructor argument.

@@ -93,4 +93,11 @@
*/
void collect(int doc) throws IOException;

/*
* optionally returns an iterator over competitive documents
Contributor

Can you document that the default is to return null, which Lucene interprets as the collector not filtering any documents? It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents.
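
To illustrate the convention being asked for, here is a simplified sketch in plain Java. `NullCompetitiveIteratorSketch`, `LeafCollectorSketch`, and `competitiveFilter` are hypothetical names, and an `IntPredicate` stands in for a real doc-ID iterator; the only point is that a null return means "no filtering" rather than "matches nothing".

```java
import java.util.function.IntPredicate;

final class NullCompetitiveIteratorSketch {
    /** Hypothetical, simplified stand-in for a leaf collector. */
    interface LeafCollectorSketch {
        void collect(int doc);

        /**
         * Optionally returns a filter over competitive documents.
         * The default of null means the collector does not filter any documents;
         * it does NOT mean that no documents match.
         */
        default IntPredicate competitiveFilter() {
            return null;
        }
    }

    /** Collects every matching doc unless the collector supplies a filter that rejects it. */
    static int scoreAll(int[] matchingDocs, LeafCollectorSketch collector) {
        IntPredicate filter = collector.competitiveFilter();
        int collected = 0;
        for (int doc : matchingDocs) {
            if (filter == null || filter.test(doc)) {
                collector.collect(doc);
                collected++;
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3, 4};
        LeafCollectorSketch unfiltered = doc -> { };
        System.out.println(scoreAll(docs, unfiltered) + " docs collected with the default null filter");
    }
}
```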

Contributor Author

Thanks @jpountz

> It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents

What is the way to make this explicit?

@mayya-sharipova
Contributor Author

@jpountz Thank you for the review.

> I wonder whether we could make it easier to write implementations. I haven't spent much time thinking about it, but for instance would it be possible to wrap existing comparators to add the skipping functionality? Alternatively we could add the skipping logic to the existing comparators, but the fact that Lucene doesn't require that the same data be stored in indexes and doc values makes me a bit nervous about enabling it by default, and I'd like to avoid adding a new constructor argument.

Would it make sense for each numeric FieldComparator to add an extra class that would wrap a numeric comparator and provide additional methods for skipping logic (getting an iterator and updating an iterator)?

Add a decorator for FieldComparator to add functionality to skip
 over non-competitive docs
@mayya-sharipova
Contributor Author

@jpountz What do you think of this design in eeb23c1?

  1. IterableFieldComparator wraps a FieldComparator to provide skipping functionality. All numeric comparators are wrapped in corresponding iterable comparators.
  2. SortField has a new method allowSkipNonCompetitveDocs that, if set, will use a comparator that provides skipping functionality.

In this case, we would not need the other classes that I previously introduced, LongDocValuesPointComparator and LongDocValuesPointSortField.
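
The wrapping idea can be sketched in plain Java as a decorator. Everything here is hypothetical and heavily simplified (`IterableComparatorSketch`, `LongComparatorSketch`, `SkippingComparator`, `canSkip`); it assumes an ascending sort on long values and is not the actual IterableFieldComparator from the commit.

```java
final class IterableComparatorSketch {
    /** Hypothetical, heavily simplified comparator over long values (ascending sort). */
    interface LongComparatorSketch {
        void setBottom(long bottom);

        /** Negative if bottom sorts before value, 0 if equal, positive if bottom sorts after value. */
        int compareBottom(long value);
    }

    /** The existing comparator, left untouched. */
    static final class PlainLongComparator implements LongComparatorSketch {
        private long bottom = Long.MAX_VALUE;

        public void setBottom(long bottom) {
            this.bottom = bottom;
        }

        public int compareBottom(long value) {
            return Long.compare(bottom, value);
        }
    }

    /** Decorator: delegates all comparisons and additionally tracks a bound usable for skipping. */
    static final class SkippingComparator implements LongComparatorSketch {
        private final LongComparatorSketch delegate;
        private long maxCompetitiveValue = Long.MAX_VALUE;

        SkippingComparator(LongComparatorSketch delegate) {
            this.delegate = delegate;
        }

        public void setBottom(long bottom) {
            delegate.setBottom(bottom);
            maxCompetitiveValue = bottom; // values above this can no longer enter the top hits
        }

        public int compareBottom(long value) {
            return delegate.compareBottom(value);
        }

        /** Equal values are kept, which is the conservative choice for ties. */
        boolean canSkip(long value) {
            return value > maxCompetitiveValue;
        }
    }

    public static void main(String[] args) {
        SkippingComparator comparator = new SkippingComparator(new PlainLongComparator());
        comparator.setBottom(42);
        System.out.println(comparator.canSkip(43));
        System.out.println(comparator.canSkip(42));
    }
}
```

Keeping the skipping state in the wrapper leaves the wrapped comparator usable on its own when the optimization is not enabled.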

@romseygeek
Contributor

I like the idea of wrapping things up, and I think we may be able to take this further by pushing more of the logic into the comparator:

  • add a wrapDocIdSetIterator(DocIdSetIterator in) method to LeafCollector that by default returns the passed-in iterator. This gets called in DefaultBulkScorer#score to wrap the iterator for a query.
  • add a wrapDocIdSetIterator(DocIdSetIterator in) method to FieldComparator that by default returns the passed-in iterator. TopFieldCollector delegates its wrapDocIdSetIterator method to this method on its first comparator. This allows us to completely contain the logic that combines a query's iterator with sorting shortcuts to the SortField and associated FieldComparator implementation.
  • Move the logic that checks whether or not to update the iterator into setBottom on the leaf comparator. I know this involves passing the HitsThresholdChecker into the leaf comparator constructor, but I think that's reasonable if the point of this API change is to make it possible for comparators to skip hits

mayya-sharipova added a commit to mayya-sharipova/luceneutil that referenced this pull request Jun 23, 2020
Sort optimization introduced in apache/lucene-solr#1351
depends on numeric fields being indexed both as doc_values and points.

This PR does the following:
- add a LongPoint field – lastModLP, last modified timestamp
- add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp
- add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks
- don't fail a task if hitCounts don't match in benchUtil.py. As we
don't collect all hits in the optimized runs, we don't expect hits total
to match.
@mayya-sharipova mayya-sharipova merged commit b0333ab into apache:master Jun 23, 2020
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Jun 24, 2020
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351)

Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:
1) the field should be indexed with both doc_values and points, that
must have the same field name and same data
2) SortField#setCanSkipNonCompetitiveDocs must be set
3) totalHitsThreshold should not be set to max value.
mayya-sharipova added a commit that referenced this pull request Jul 31, 2020
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351)

Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:
1) the field should be indexed with both doc_values and points, that
must have the same field name and same data
2) SortField#setCanUsePoints must be set
3) totalHitsThreshold should not be set to max value.
gus-asf pushed a commit to gus-asf/lucene-solr that referenced this pull request Sep 4, 2020
mayya-sharipova added a commit that referenced this pull request Oct 6, 2020
)

PR #1351 introduced a sort optimization where
documents can be skipped.
But there was a bug in case we were using two phase
approximation, as we would advance it without advancing
an overall conjunction iterator.

This patch fixed it.

Relates to #1351
mayya-sharipova added a commit that referenced this pull request Oct 6, 2020
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Oct 6, 2020
PR apache#1351 introduced a sort optimization where documents can be skipped.
But iteration over competitive iterators was not properly organized,
as they were not storing the current docID, and
when competitive iterator was updated the current doc ID was lost.

This patch fixed it.

Relates to apache#1351
mayya-sharipova added a commit that referenced this pull request Oct 6, 2020
mayya-sharipova added a commit that referenced this pull request Oct 6, 2020
mayya-sharipova added a commit to mikemccand/luceneutil that referenced this pull request Oct 22, 2020
Sort optimization introduced in apache/lucene-solr#1351
depends on numeric fields being indexed both as doc_values and points.

This PR does the following:
- add a LongPoint field – lastModLP, last modified timestamp
- add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp
- add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks

If we compare against a run where sort optimization is not enabled, hit counts
may differ. For a task not to fail, `competition` in `localrun.py`
should be modified to:

```
 comp =  competition.Competition(verifyCounts=False)
```
msokolov pushed a commit to msokolov/luceneutil that referenced this pull request Oct 26, 2020
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Nov 4, 2020
Currently, if search sort is equal to index sort,  we have an early
termination in TopFieldCollector. As we work to enhance comparators
to provide skipping functionality (PR apache#1351), we would like to
move this termination functionality on index sort from
TopFieldCollector to comparators.

This patch does the following:
- Add method usesIndexSort to LeafFieldComparator
- Make numeric comparators aware of index sort and early terminate on
  collecting all competitive hits
- Move TermValComparator and TermOrdValComparator from FieldComparator
  to comparator package, for all comparators to be in the same package
- Enhance TermOrdValComparator to provide skipping functionality when
  index is sorted

One item left for TODO for a following PR is to remove the logic of
early termination from TopFieldCollector. We can do that once
we ensure that all BulkScorers are using iterators from collectors
that can skip non-competitive docs.

Relates to apache#1351
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Nov 5, 2020
Currently, if the search sort is equal to the index sort, TopFieldCollector
terminates early. As we work to enhance comparators
to provide skipping functionality (PR apache#1351), we would like to
move this early termination on index sort from
TopFieldCollector to comparators.

This patch does the following:
- Add method usesIndexSort to LeafFieldComparator
- Make numeric comparators aware of index sort and early terminate on
  collecting all competitive hits
- Move TermValComparator and TermOrdValComparator from FieldComparator
  to comparator package, for all comparators to be in the same package
- Enhance TermOrdValComparator to provide skipping functionality when
  index is sorted

One item left for TODO for a following PR is to remove the logic of
early termination from TopFieldCollector. We can do that once
we ensure that all BulkScorers are using iterators from collectors
that can skip non-competitive docs.

Relates to apache#1351
mayya-sharipova added a commit to mayya-sharipova/lucene-solr that referenced this pull request Nov 10, 2020
Disable sort optimization in comparators on index sort.

Currently, if search sort is equal to or a part of the index sort, we have
an early termination in TopFieldCollector.
But comparators are not aware of the index sort, and may run
sort optimization even if the search sort is congruent with
the index sort.

This patch:
- makes leaf comparators aware that the search sort is congruent
with the index sort.
- disables sort optimization in comparators in this case.
- removes the private MultiComparatorLeafCollector class, because the only
class that extended it was TopFieldLeafCollector, which
now incorporates the logic of the deleted class.
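
"Congruent" here means the search sort is equal to the index sort or is a leading part of it. A minimal sketch of such a check, with sort fields modelled as hypothetical `(field_name, reverse)` tuples rather than Lucene `SortField` objects:

```python
def is_congruent(search_sort, index_sort):
    """True if search_sort equals index_sort or is a leading prefix of it."""
    if index_sort is None or len(search_sort) > len(index_sort):
        return False
    # Compare field by field, including the sort direction.
    return list(index_sort[:len(search_sort)]) == list(search_sort)
```

A search sort on only the second index-sort field, or on the first field in the opposite direction, is not congruent under this definition.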

Relates to apache#1351
mayya-sharipova added a commit that referenced this pull request Dec 3, 2020
Disable sort optimization in comparators on index sort.

Currently, if search sort is equal to or a part of the index sort, we have
an early termination in TopFieldCollector.
But comparators are not aware of the index sort, and may run
sort optimization even if the search sort is congruent with
the index sort.

This patch:
- adds a `disableSkipping` method to `FieldComparator`.
  This method is called by `TopFieldCollector`, currently only
  when the search sort is congruent with the index sort,
  but more conditions can be added.
- disables sort optimization in comparators in this case.
- removes the private `MultiComparatorLeafCollector` class, because the only
  class that extended `MultiComparatorLeafCollector` was `TopFieldLeafCollector`.
  The logic of the deleted `MultiComparatorLeafCollector` is added to `TopFieldLeafCollector`.

Relates to #1351
mayya-sharipova added a commit that referenced this pull request Dec 4, 2020
Disable sort optimization in comparators on index sort.

Currently, if search sort is equal to or a part of the index sort, we have
an early termination in TopFieldCollector.
But comparators are not aware of the index sort, and may run
sort optimization even if the search sort is congruent with
the index sort.

This patch:
- adds a `disableSkipping` method to `FieldComparator`.
  This method is called by `TopFieldCollector` when the search sort
  is congruent with the index sort.
  It is also called when we can't use points for sort optimization.
- disables sort optimization in comparators in these cases.
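
The two conditions above can be sketched as a small model. The class and function names below are hypothetical stand-ins written in plain Python; they only mirror the described behaviour and are not the Lucene `FieldComparator` or `TopFieldCollector` code.

```python
class SketchComparator:
    """Hypothetical stand-in for a numeric comparator with sort optimization."""

    def __init__(self, has_points):
        self.has_points = has_points  # assumption: field is also indexed as points
        self.skipping_enabled = True

    def disable_skipping(self):
        self.skipping_enabled = False


def configure(comparator, search_sort_congruent_with_index_sort):
    """Switch skipping off in the two cases named in the commit message."""
    if search_sort_congruent_with_index_sort or not comparator.has_points:
        comparator.disable_skipping()
    return comparator
```

Skipping stays enabled only when the field has points and the search sort is not already served by the index sort.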

Relates to #1351
Backport for #2075
ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Dec 16, 2020
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021