Don't let merged passages push out lower-scoring ones #11990

Merged
merged 4 commits into from
Dec 1, 2022

romseygeek
Contributor

PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.

This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
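
The effect described above can be reproduced with a small, self-contained sketch. It is not Lucene's PassageScorer: `Span`, `topK`, `mergeOverlapping`, and `select` are hypothetical names invented for illustration, and the merge rule (keep the higher score) is an assumption. With a queue capped at `maxPassages`, two overlapping high scorers evict everything else and then collapse into one passage; a larger queue leaves a runner-up available after the merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MergeShrinkDemo {
  // Hypothetical stand-in for a highlighted passage: offsets [from, to] plus a score.
  static final class Span {
    final int from;
    final int to;
    final double score;

    Span(int from, int to, double score) {
      this.from = from;
      this.to = to;
      this.score = score;
    }
  }

  static final Comparator<Span> BY_SCORE = Comparator.comparingDouble(s -> s.score);
  static final Comparator<Span> BY_START = Comparator.comparingInt(s -> s.from);

  // Keep the `capacity` best-scoring spans using a bounded min-heap.
  static List<Span> topK(List<Span> candidates, int capacity) {
    PriorityQueue<Span> pq = new PriorityQueue<>(BY_SCORE);
    for (Span s : candidates) {
      pq.add(s);
      if (pq.size() > capacity) {
        pq.poll(); // evict the current lowest-scoring span
      }
    }
    return new ArrayList<>(pq);
  }

  // Merge spans whose offset ranges overlap (assumption: merged span keeps the higher score).
  static List<Span> mergeOverlapping(List<Span> spans) {
    List<Span> sorted = new ArrayList<>(spans);
    sorted.sort(BY_START);
    List<Span> merged = new ArrayList<>();
    for (Span s : sorted) {
      int last = merged.size() - 1;
      if (last >= 0 && s.from <= merged.get(last).to) {
        Span prev = merged.get(last);
        merged.set(last,
            new Span(prev.from, Math.max(prev.to, s.to), Math.max(prev.score, s.score)));
      } else {
        merged.add(s);
      }
    }
    return merged;
  }

  // Select with the given queue capacity, merge, then trim to maxPassages by score.
  static List<Span> select(List<Span> candidates, int capacity, int maxPassages) {
    List<Span> merged = mergeOverlapping(topK(candidates, capacity));
    merged.sort(BY_SCORE.reversed());
    return new ArrayList<>(merged.subList(0, Math.min(maxPassages, merged.size())));
  }

  // Two overlapping high scorers followed by two isolated lower scorers.
  static List<Span> sample() {
    return List.of(
        new Span(0, 10, 5.0), new Span(8, 20, 4.0),
        new Span(40, 50, 3.0), new Span(70, 80, 2.0));
  }

  public static void main(String[] args) {
    // Queue capped at maxPassages = 2: the two survivors merge into one passage.
    System.out.println(select(sample(), 2, 2).size()); // prints 1
    // Larger queue: a lower-scoring passage is still around after the merge.
    System.out.println(select(sample(), 4, 2).size()); // prints 2
  }
}
```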

@romseygeek romseygeek self-assigned this Nov 30, 2022
@@ -89,8 +89,9 @@ public List<Passage> pickBest(
}

// Best passages so far.
int pqSize = Math.max(markers.size(), maxPassages);
Contributor

This can make the priority queue huge in degenerate cases when there are a lot of markers, which can happen for queries that expand to hundreds of terms (prefix queries are an example of this). I wonder if this shouldn't be a function of maxPassages rather than the number of markers...

Contributor

I looked at it and I think it was actually a deliberate decision (the selection of passages being independent from merging and capped at the user-requested max) so that regardless of how many markers there are, the overhead of the pq remains fairly low. I realize viewpoints will vary - I use this code in cases where the highlighter takes hits from multiple queries (and like I said, there can be hundreds of markers...). This change will degrade the performance significantly at literally no gain.

I'd try overestimating pqSize based on maxPassages: say, min(markers.size(), maxPassages * 3). The parameter could even be configurable so that the overhead can be tuned from the outside (with a reasonable default). WDYT?
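
A minimal sketch of that sizing rule, assuming the oversize factor is exposed as a parameter. `PqSizing`, `cappedByFactor`, and `DEFAULT_OVERSIZE_FACTOR` are hypothetical names for illustration, not part of the Lucene API; only the expression `min(markers.size(), maxPassages * 3)` comes from the comment above.

```java
class PqSizing {
  // Hypothetical default, taken from the "maxPassages * 3" suggestion.
  static final int DEFAULT_OVERSIZE_FACTOR = 3;

  // Queue size grows with maxPassages but never exceeds the number of markers.
  static int cappedByFactor(int markerCount, int maxPassages, int factor) {
    if (markerCount < 0 || maxPassages < 0 || factor < 1) {
      throw new IllegalArgumentException("negative count or factor < 1");
    }
    // Long arithmetic so that a large maxPassages cannot overflow int.
    long oversized = (long) maxPassages * factor;
    // The result is at most markerCount, so the narrowing cast is safe.
    return (int) Math.min(markerCount, oversized);
  }

  public static void main(String[] args) {
    // Hundreds of markers (e.g. an expanded prefix query): bounded by maxPassages * 3.
    System.out.println(cappedByFactor(500, 3, DEFAULT_OVERSIZE_FACTOR)); // prints 9
    // Few markers: bounded by the marker count instead.
    System.out.println(cappedByFactor(4, 3, DEFAULT_OVERSIZE_FACTOR)); // prints 4
  }
}
```

With this rule the queue's overhead stays proportional to what the user asked for, regardless of how many markers the query produces.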

Contributor Author

> min(markers.size(), maxPassages * 3)

Another option might be to just have a minimum size of 16, given that this is only really a problem when you're asking for two or three passages. Once you get above 10 passages, losing one or two becomes less of an issue.
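
One reading of the fixed-floor suggestion, as a sketch. The class and constant names are hypothetical, and whether the merged commit uses exactly this expression is not confirmed by this thread.

```java
class PqFloor {
  // The floor proposed in the comment above.
  static final int MIN_PQ_SIZE = 16;

  // Small requests get headroom for merged passages; large requests are left unchanged.
  static int withFloor(int maxPassages) {
    return Math.max(MIN_PQ_SIZE, maxPassages);
  }

  public static void main(String[] args) {
    for (int maxPassages : new int[] {2, 3, 10, 50}) {
      System.out.println(maxPassages + " -> " + withFloor(maxPassages));
    }
    // prints:
    // 2 -> 16
    // 3 -> 16
    // 10 -> 16
    // 50 -> 50
  }
}
```

Unlike a size tied to the marker count, this bound does not depend on the query at all, so a query that expands to hundreds of terms cannot inflate the queue.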

Contributor

That sounds good to me as well.

Contributor

@dweiss dweiss left a comment

Thanks Alan.

@romseygeek romseygeek merged commit 72ff140 into apache:main Dec 1, 2022
@romseygeek romseygeek deleted the highlight/overlapping-matches branch December 1, 2022 12:25
asfgit pushed a commit that referenced this pull request Dec 1, 2022
@rmuir rmuir added this to the 9.5.0 milestone Jan 17, 2023