
LUCENE-8929: Early Terminating CollectorManager with Global Hitcount #803

Open: atris wants to merge 1 commit from the LUCENE-8929 branch into master.
Conversation

atris (Contributor) commented Jul 23, 2019

This commit introduces a new collector and CollectorManager with the capability to accurately early terminate across all collectors when the total hits threshold is globally reached.
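For illustration only, the shared-counter core of this idea can be sketched roughly as below. This is not the PR's actual code: the class name GlobalHitCountCollector is hypothetical, and the sketch only counts hits and terminates, leaving the collection of top hits aside. It assumes the Lucene 8.x collector API (Collector, LeafCollector, CollectionTerminatedException), where IndexSearcher catches CollectionTerminatedException and moves on to the next segment.

    // Hypothetical sketch, not the PR's implementation: a per-slice collector that
    // shares a single AtomicInteger with its sibling collectors and stops collecting
    // once the total hits threshold has been reached globally.
    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.LeafCollector;
    import org.apache.lucene.search.Scorable;
    import org.apache.lucene.search.ScoreMode;

    public class GlobalHitCountCollector implements Collector {

      private final AtomicInteger globalHitCount; // shared across all slices
      private final int totalHitsThreshold;

      public GlobalHitCountCollector(AtomicInteger globalHitCount, int totalHitsThreshold) {
        this.globalHitCount = globalHitCount;
        this.totalHitsThreshold = totalHitsThreshold;
      }

      @Override
      public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;
      }

      @Override
      public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
        if (globalHitCount.get() > totalHitsThreshold) {
          // Another slice already pushed the global count past the threshold,
          // so this segment can be skipped entirely.
          throw new CollectionTerminatedException();
        }
        return new LeafCollector() {
          @Override
          public void setScorer(Scorable scorer) {}

          @Override
          public void collect(int doc) {
            if (globalHitCount.incrementAndGet() > totalHitsThreshold) {
              // The threshold has been crossed globally; terminate this leaf early.
              throw new CollectionTerminatedException();
            }
          }
        };
      }
    }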

atris (Contributor, Author) commented Jul 23, 2019

cc @jpountz

atris force-pushed the LUCENE-8929 branch 4 times, most recently from a7e3e35 to 5901ba0 on July 23, 2019 at 11:28.
atris (Contributor, Author) commented Jul 23, 2019

Updated the tests to run forced single-segment tests.

Ran beaster for 50 iterations on the new tests and it came back clean.

atris (Contributor, Author) commented Jul 24, 2019

Updated the PR to be in line with #806.

Review comment on this diff hunk:

    // Hit was not locally competitive hence it cannot
    // be globally competitive
    return false;
    }
Contributor:

I think you got a bit confused between field sort and score sort here. TopFieldCollector is about sorting by field, which can be the score with SortField.SCORE, but generally is not. This logic of sharing min competitive score information between collectors would be more applicable to TopScoreCollector, or to TopFieldCollector when the first sort field is Field.SCORE.

If you want to check whether a hit is globally competitive with TopFieldCollector, you'd need to track hits of all slices in a single hit queue.
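For illustration, a single hit queue shared across slices could look roughly like the minimal sketch below. The class is hypothetical (not from Lucene or this PR) and keeps the top maxSize entries of any Comparable sort key behind a monitor, so that every slice checks competitiveness against the same global bottom entry.

    // Hypothetical sketch: a bounded, thread-safe min-heap shared by all slice
    // collectors, so competitiveness is always checked against the global bottom
    // rather than a per-slice one.
    import java.util.PriorityQueue;

    public final class SharedHitQueue<T extends Comparable<T>> {

      private final PriorityQueue<T> heap = new PriorityQueue<>(); // min-heap
      private final int maxSize;

      public SharedHitQueue(int maxSize) {
        this.maxSize = maxSize;
      }

      // Inserts the hit if it is competitive and returns the current global bottom,
      // or null while the queue is not yet full.
      public synchronized T insertWithOverflow(T hit) {
        if (heap.size() < maxSize) {
          heap.add(hit);
        } else if (hit.compareTo(heap.peek()) > 0) {
          heap.poll(); // evict the previous global bottom
          heap.add(hit);
        }
        return heap.size() == maxSize ? heap.peek() : null;
      }

      // Snapshot of the current global bottom (the worst competitive hit).
      public synchronized T bottom() {
        return heap.size() == maxSize ? heap.peek() : null;
      }
    }

Whether the synchronization cost of such a queue is acceptable is exactly the contention question raised further down in this thread.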

atris (Contributor, Author):

@jpountz Thanks for clarifying.

So, we will need a thread-safe implementation of PriorityQueue that works in a shared manner across all the collectors belonging to the EarlyTerminatingCollectorManager?

Also, if I understand correctly, this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values (e.g., a collector may accept a value that is lower than its local PQ's bottom but better than the global worst hit score).

Makes sense?
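For illustration, the score-publishing idea could be prototyped with something as small as the sketch below (hypothetical class; the use of DoubleAccumulator is an assumption, not anything from the PR). Scores only ever raise the floor, so a max accumulator is enough and avoids an explicit CAS loop.

    // Hypothetical sketch of a "global worst competitive score" shared between
    // slice collectors; not the PR's code.
    import java.util.concurrent.atomic.DoubleAccumulator;

    public final class GlobalMinCompetitiveScore {

      // Starts at -Infinity: nothing is filtered until some slice publishes a floor.
      private final DoubleAccumulator floor =
          new DoubleAccumulator(Double::max, Double.NEGATIVE_INFINITY);

      // Called by a slice once its local PQ is full, with the score of its bottom hit.
      public void publish(float localBottomScore) {
        floor.accumulate(localBottomScore);
      }

      // A hit that cannot beat the global floor can be skipped; real collectors
      // would additionally break ties by doc id.
      public boolean isCompetitive(float score) {
        return score >= floor.get();
      }
    }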

Contributor:

> So, we will need a thread-safe implementation of PriorityQueue that works in a shared manner across all the collectors belonging to the EarlyTerminatingCollectorManager?

I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.

> this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values (e.g., a collector may accept a value that is lower than its local PQ's bottom but better than the global worst hit score).

Correct.

atris (Contributor, Author):

> So, we will need a thread-safe implementation of PriorityQueue that works in a shared manner across all the collectors belonging to the EarlyTerminatingCollectorManager?
>
> I think it's an idea worth playing with indeed. I'm curious whether @msokolov has thoughts on this, since it has some intersection with pro-rated collection given that it helps avoid extra work when collecting multiple slices in parallel.

I was wondering whether it is worth collecting into individual PQs until each collector has collected numHits hits, then performing a merge operation to populate a global shared PQ with the top hits across all of them. After that, the second phase of collection can operate on the shared queue.

This can help in cases where the number of relevant hits is smaller than numHits, and in cases where numHits is close to totalHitsThreshold.

Thoughts?

atris (Contributor, Author):

> this PR's approach can be used for TopScoreCollector to create a new collector and collector manager, where each collector collects numHits hits and then publishes the score of its worst hit (the bottom of the local PQ). This value can then be used by all other collectors to filter further values (e.g., a collector may accept a value that is lower than its local PQ's bottom but better than the global worst hit score).
>
> Correct.

OK, I will raise a separate PR for this.

atris (Contributor, Author):

> My point was not about merging, but about the fact that collecting fewer than numHits hits on each slice on average should be our goal, so there is no benefit to enabling an optimization only after each slice has already collected numHits hits?

I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.

I agree -- I will start off with a shared PQ from the get-go.

IMO, the prorated number of hits is a great strategy for a larger number of slices -- for a reasonable number of slices, we could use the approach proposed in this PR to allow accurate termination. Maybe have both as independent collectors?

Contributor:

> I was thinking it would help for cases when totalHitsThreshold is still larger than numHits.

I think that your approach of having a shared AtomicInteger would be good enough for this case. So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?
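For illustration, the "first simple change" could be wired roughly as below (hypothetical names, not the PR's code; it reuses the GlobalHitCountCollector sketched near the top of this conversation). The manager owns the shared counter, hands it to every per-slice collector it creates, and reports the global hit count in reduce().

    // Hypothetical wiring of the shared AtomicInteger through a CollectorManager.
    import java.io.IOException;
    import java.util.Collection;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.lucene.search.CollectorManager;

    public class GlobalHitCountCollectorManager
        implements CollectorManager<GlobalHitCountCollector, Integer> {

      private final AtomicInteger globalHitCount = new AtomicInteger();
      private final int totalHitsThreshold;

      public GlobalHitCountCollectorManager(int totalHitsThreshold) {
        this.totalHitsThreshold = totalHitsThreshold;
      }

      @Override
      public GlobalHitCountCollector newCollector() throws IOException {
        // Every slice increments the same counter.
        return new GlobalHitCountCollector(globalHitCount, totalHitsThreshold);
      }

      @Override
      public Integer reduce(Collection<GlobalHitCountCollector> collectors) throws IOException {
        // The counter is already global, so reduction is just reading it.
        return globalHitCount.get();
      }
    }

An IndexSearcher configured with an executor could then run searcher.search(query, new GlobalHitCountCollectorManager(totalHitsThreshold)), and all slices would stop counting shortly after the threshold is crossed.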

Contributor:

There are a lot of different ideas here! I agree that early terminating after collecting enough hits, while still counting, only needs a shared counter.

The shared/global PQ idea might be worth trying. Of course it's appealing to have something that works well without having to rely on a random document distribution. I wonder whether thread contention would cancel out the gains from avoiding the extra collection work, though. In the case of index sort, the cost of collection is usually quite low (just read a docvalue and compare, no scoring function computation), so eking out gains here is tricky.

I wonder if you could combine this idea with pro-rating to avoid some contention: collect into per-segment queues until the pro-rated size is reached, then post the results collected so far to a shared queue, and finally collect into the global queue. Then, in the best (and most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today; afterwards, you check one hit in each segment, find it is not (globally) competitive, and exit.
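For illustration, the two-phase idea could be sketched as below (hypothetical names, scores only, ignoring doc ids and sort values): each slice fills a private queue up to its pro-rated budget with no coordination, then publishes it to a shared queue and keeps collecting against the global bottom.

    // Hypothetical sketch of the hybrid approach; not from the PR.
    import java.util.PriorityQueue;

    public final class TwoPhaseSliceCollector {

      // Bounded, thread-safe min-heap of scores shared by all slices.
      public static final class GlobalQueue {
        private final PriorityQueue<Float> heap = new PriorityQueue<>();
        private final int numHits;

        public GlobalQueue(int numHits) {
          this.numHits = numHits;
        }

        // Inserts the score if competitive and returns the current global bottom.
        public synchronized float insert(float score) {
          if (heap.size() < numHits) {
            heap.add(score);
          } else if (score > heap.peek()) {
            heap.poll();
            heap.add(score);
          }
          return heap.size() == numHits ? heap.peek() : Float.NEGATIVE_INFINITY;
        }
      }

      private final PriorityQueue<Float> local = new PriorityQueue<>();
      private final GlobalQueue global;
      private final int proratedBudget;
      private boolean published;

      public TwoPhaseSliceCollector(GlobalQueue global, int proratedBudget) {
        this.global = global;
        this.proratedBudget = proratedBudget;
      }

      // Returns false once a hit from this slice is no longer globally competitive,
      // which is where an index-sorted segment could early terminate. Ties with the
      // global bottom are treated as non-competitive in this simplified sketch.
      public boolean collect(float score) {
        if (!published && local.size() < proratedBudget) {
          local.add(score); // phase 1: no coordination at all
          return true;
        }
        if (!published) {
          // Budget reached: publish the locally collected hits exactly once.
          while (!local.isEmpty()) {
            global.insert(local.poll());
          }
          published = true;
        }
        // Phase 2: every further hit goes through the shared queue.
        return score > global.insert(score);
      }
    }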

atris (Contributor, Author):

> So maybe we should split this in two and have a first simple change that is just about optimizing the hit count with this AtomicInteger, and later evaluate whether we can also speed up the retrieval of the top hits, e.g. with a shared PQ when sorting by field, and by propagating the min competitive score across collectors when sorting by score?

++, I agree. I will split this into three different collectors, starting with the shared-counter one.

atris (Contributor, Author):

> I wonder if you could combine this idea with pro-rating to avoid some contention: collect into per-segment queues until the pro-rated size is reached, then post the results collected so far to a shared queue, and finally collect into the global queue. Then, in the best (and most common, when docs are random) case, you do all the collection with no coordination and merge-sort as we do today; afterwards, you check one hit in each segment, find it is not (globally) competitive, and exit.

That is pretty much what I suggested above, except that I was thinking of collecting numHits hits per collector. I agree that this can work well when the number of slices is quite large. However, given a low number of "heavy" slices, I would still be inclined towards a shared PQ approach.

I think there are three distinct use cases here, and each of them would require a slightly different approach. I will start with a patch for the shared-counter approach, and then follow up with the shared PQ plus a prorated number of hits per local queue, and with min competitive score propagation.

WDYT?
