-
Notifications
You must be signed in to change notification settings - Fork 975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't let merged passages push out lower-scoring ones #11990
Don't let merged passages push out lower-scoring ones #11990
Conversation
@@ -89,8 +89,9 @@ public List<Passage> pickBest( | |||
} | |||
|
|||
// Best passages so far. | |||
int pqSize = Math.max(markers.size(), maxPassages); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can make the priority queue huge in degenerate cases when there is a lot of markers. And there may be for queries that expand to hundreds of terms (prefix queries are an example of this). I wonder if this shouldn't be a function of maxPassages rather than the number of markers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at it and I think it was actually a deliberate decision (the selection of passages being independent from merging and capped at the user-requested max) so that regardless of how many markers there are, the overhead of the pq remains fairly low. I realize viewpoints will vary - I use this code in cases where the highlighter takes hits from multiple queries (and like I said, there can be hundreds of markers...). This change will degrade the performance significantly at literally no gain.
I'd try overestimating pqSize based on maxPassages: say, min(markers.size(), maxPassages * 3). The parameter could even be configurable so that the overhead can be tuned from the outside (with a reasonable default). WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min(markers.size(), maxPassages * 3)
Another option might be to just have a minimum size of 16, given that this is only really a problem when you're asking for two or three passages. Once you get above 10 passages then one or two less becomes less of an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Alan.
PassageScorer uses a priority queue of size maxPassages to keep track of which highlighted passages are worth returning to the user. Once all passages have been collected, we go through and merge overlapping passages together, but this reduction in the number of passages is not compensated for by re-adding the highest-scoring passages that were pushed out of the queue by passages which have been merged away. This commit increases the size of the priority queue to try and account for overlapping passages that will subsequently be merged together.
PassageScorer uses a priority queue of size
maxPassages
to keep track ofwhich highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.
This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.