-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803
base: master
Are you sure you want to change the base?
Conversation
cc @jpountz |
a7e3e35
to
5901ba0
Compare
Updated tests to do forced single segments tests. Ran beaster for 50 iterations on the new tests and it came in clean |
Updated PR to be inline with #806 |
This commit introduces a new collector and collectormanager with capability to accurately early terminate across all collectors when the total hits threshold is globally reached.
// Hit was not locally competitive hence it cannot | ||
// be globally competitive | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you got a bit confused between field sort and score sort here. TopFieldCollector is about sorting by field, which can be score with SortField.SCORE, but is generally not the case. This logic of sharing score information between collectors about the min competitive score would be more applicable to TopScoreCollector, or TopFieldCollector when the first sort field is Field.SCORE.
If you want to check whether a hit is globally competitive with TopFieldCollector, you'd need to track hits of all slices in a single hit queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz Thanks for clarifying.
So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?
Also,If I understand correctly, this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).
Makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?
I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.
this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).
Correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we will need a thread safe implementation of PriorityQueue to work in a shared manner across all the collectors belong to EarlyTerminatingCollectorManager?
I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.
I was thinking if it is worth collecting in individual PQs until all collectors have collected numHits, then perform a merge operation and populate a global shared PQ with the top hits across all of the numHits hits. Post that, the second phase of collection can operate on the shared queue.
This can help in cases where number of relevant hits < numHits, and for cases where numHits is close to the totalHitsThreshold.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits, and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values ( for eg, a collector may accept a value which is lower than its local PQ's bottom, but better than the global worst hit score.).
Correct.
Ok, I will raise a separate PR for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was not about merging, but about the fact that collecting less than
numHits
on each slice on average should be our goal, so there is no benefit to enable an optimization only after each slice has collectednumHits
hits already?
I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.
I agree -- will start off with a shared PQ from get go.
IMO, the prorated number of hits is a great strategy for larger number of slices -- for a reasonable number of slices, we could use the proposed PR to allow accurate termination. Maybe have both as independent collectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.
I think that your approach to have a shared AtomicInteger would be good enough for this case. So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of different ideas here! I agree that early terminating after collecting hits, but still counting only needs a shared counter.
The shared/global PQ idea might be worth trying. Of course it's appealing to have something that works well without having to rely on random document distribution. I wonder whether thread-contention would give up the gains from avoiding the extra collection work though. In the case of index-sort, the cost of collection is usually quite low (just read a docvalue and compare, no scoring function computation), so eking out gains here is tricky.
I wonder if you could combine this idea with pro-rating to avoid some contention by collecting into per-segment queues until the pro-rated size is reached, then posting the results collected so far to a shared queue, and finally collecting into the global queue? Then in the best (most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today, and afterwards, check one hit in each segment, find it's not (globally) competitive, and exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?
++, I agree. I will split this into three different collectors, starting with the shared counter one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you could combine this idea with pro-rating to avoid some contention by collecting into per-segment queues until the pro-rated size is reached, then posting the results collected so far to a shared queue, and finally collecting into the global queue? Then in the best (most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today, and afterwards, check one hit in each segment, find it's not (globally) competitive, and exit.
That is pretty much what I suggested above, except that I was thinking of collecting numHits per collector. I agree that this can work well for cases when the number of slices is quite large. However, given a low number of "heavy" slices, I would still be inclined towards a shared PQ approach.
I think there are three distinct use cases here, and each of them would require a slightly different approach. I will start with a patch for the shared counter approach, and then follow up with the shared PQ + prorated number of hits per local queue, and min competitive hit score propagation.
WDYT?
This commit introduces a new collector and collectormanager with
capability to accurately early terminate across all collectors
when the total hits threshold is globally reached.