LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

atris · 2019-07-23T08:08:12Z

This commit introduces a new collector and collectormanager with
capability to accurately early terminate across all collectors
when the total hits threshold is globally reached.

atris · 2019-07-23T08:10:04Z

cc @jpountz

atris · 2019-07-23T12:09:00Z

Updated tests to do forced single segments tests.

Ran beaster for 50 iterations on the new tests and it came in clean

atris · 2019-07-24T06:55:06Z

Updated PR to be inline with #806

This commit introduces a new collector and collectormanager with capability to accurately early terminate across all collectors when the total hits threshold is globally reached.

jpountz · 2019-07-25T18:11:49Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java

+          // Hit was not locally competitive hence it cannot
+          // be globally competitive
+          return false;
+        }


I think you got a bit confused between field sort and score sort here. TopFieldCollector is about sorting by field, which can be score with SortField.SCORE, but is generally not the case. This logic of sharing score information between collectors about the min competitive score would be more applicable to TopScoreCollector, or TopFieldCollector when the first sort field is Field.SCORE.

If you want to check whether a hit is globally competitive with TopFieldCollector, you'd need to track hits of all slices in a single hit queue.

@jpountz Thanks for clarifying.

So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?

Also,If I understand correctly, this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).

Makes sense?

So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?

I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.

this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).

Correct.

So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?

I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.

I was thinking if it is worth collecting in individual PQs until all collectors have collected numHits, then perform a merge operation and populate a global shared PQ with the top hits across all of the numHits hits. Post that, the second phase of collection can operate on the shared queue.

This can help in cases where number of relevant hits < numHits, and for cases where numHits is close to the totalHitsThreshold.

Thoughts?

this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).

Correct.

Ok, I will raise a separate PR for this

My point was not about merging, but about the fact that collecting less than numHits on each slice on average should be our goal, so there is no benefit to enable an optimization only after each slice has collected numHits hits already?

I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.

I agree -- will start off with a shared PQ from get go.

IMO, the prorated number of hits is a great strategy for larger number of slices -- for a reasonable number of slices, we could use the proposed PR to allow accurate termination. Maybe have both as independent collectors?

I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.

I think that your approach to have a shared AtomicInteger would be good enough for this case. So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?

There are a lot of different ideas here! I agree that early terminating after collecting hits, but still counting only needs a shared counter.

The shared/global PQ idea might be worth trying. Of course it's appealing to have something that works well without having to rely on random document distribution. I wonder whether thread-contention would give up the gains from avoiding the extra collection work though. In the case of index-sort, the cost of collection is usually quite low (just read a docvalue and compare, no scoring function computation), so eking out gains here is tricky.

I wonder if you could combine this idea with pro-rating to avoid some contention by collecting into per-segment queues until the pro-rated size is reached, then posting the results collected so far to a shared queue, and finally collecting into the global queue? Then in the best (most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today, and afterwards, check one hit in each segment, find it's not (globally) competitive, and exit.

So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?

++, I agree. I will split this into three different collectors, starting with the shared counter one

I wonder if you could combine this idea with pro-rating to avoid some contention by collecting into per-segment queues until the pro-rated size is reached, then posting the results collected so far to a shared queue, and finally collecting into the global queue? Then in the best (most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today, and afterwards, check one hit in each segment, find it's not (globally) competitive, and exit.

That is pretty much what I suggested above, except that I was thinking of collecting numHits per collector. I agree that this can work well for cases when the number of slices is quite large. However, given a low number of "heavy" slices, I would still be inclined towards a shared PQ approach.

I think there are three distinct use cases here, and each of them would require a slightly different approach. I will start with a patch for the shared counter approach, and then follow up with the shared PQ + prorated number of hits per local queue, and min competitive hit score propagation.

WDYT?

atris force-pushed the LUCENE-8929 branch 4 times, most recently from a7e3e35 to 5901ba0 Compare July 23, 2019 11:28

atris force-pushed the LUCENE-8929 branch from 5901ba0 to c6f2a68 Compare July 24, 2019 03:43

LUCENE-8929: Early Terminating CollectorManager with Global Hitcount

156cfb0

This commit introduces a new collector and collectormanager with capability to accurately early terminate across all collectors when the total hits threshold is globally reached.

atris force-pushed the LUCENE-8929 branch from c6f2a68 to 156cfb0 Compare July 25, 2019 04:49

jpountz reviewed Jul 25, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

atris commented Jul 23, 2019

atris commented Jul 23, 2019

atris commented Jul 23, 2019

atris commented Jul 24, 2019

jpountz Jul 25, 2019

atris Jul 26, 2019

jpountz Jul 26, 2019

atris Jul 26, 2019

atris Jul 26, 2019

atris Jul 26, 2019

jpountz Jul 26, 2019

msokolov Jul 28, 2019

atris Jul 29, 2019

atris Jul 29, 2019

LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

Are you sure you want to change the base?

LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

Conversation

atris commented Jul 23, 2019

atris commented Jul 23, 2019

atris commented Jul 23, 2019

atris commented Jul 24, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment