Sometimes intersect the essential clause and the best non-essential clause. #12589
…lause.

The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and more clauses from the essential list to the non-essential list as the minimum competitive score increases. For instance, a query such as `the book of life`, which I found in the Tantivy benchmark, ends up running as `+book the of life` after some time, i.e. with one required clause and the other clauses optional. This is because matching `the`, `of` and `life` alone is not good enough to yield a match.

Here are some statistics in that case:
- min competitive score: 3.4781857
- max_window_score(book): 2.8796153
- max_window_score(life): 2.037863
- max_window_score(the): 0.103848875
- max_window_score(of): 0.19427927

Actually, if you look at these statistics, we could do better: a match can only be competitive if it matches both `book` and `life` (the maximum scores of `book`, `the` and `of` sum to about 3.18, which is below the minimum competitive score). So this query could actually execute as `+book +life the of`, which may help evaluate fewer documents compared to `+book the of life`, especially if you enable recursive graph bisection.

This is what this PR tries to achieve: when there is a single essential clause and matching all clauses but the best non-essential clause cannot produce a competitive match, the scorer only evaluates documents that match the intersection of the essential clause and the best non-essential clause.

It's worth noting that this optimization would kick in very frequently on 2-clause disjunctions.
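The partitioning described above can be sketched roughly as follows. This is a simplified Python illustration, not Lucene's Java implementation: the function names `partition` and `rewrite`, the strict `<` comparisons, and the assumption that a hit is competitive when its score is at least the minimum competitive score are all choices made for this example. The maximum window scores are the ones quoted above.

```python
from typing import Dict, List, Tuple


def partition(max_scores: Dict[str, float],
              min_competitive: float) -> Tuple[List[str], List[str]]:
    """Split clauses into (essential, non_essential), MAXSCORE-style.

    Clauses are visited by increasing maximum score. A clause stays
    non-essential while the running sum of maximum scores is still below
    the minimum competitive score (hypothetical simplification).
    """
    ordered = sorted(max_scores, key=max_scores.get)
    non_essential: List[str] = []
    running = 0.0
    for term in ordered:
        if running + max_scores[term] < min_competitive:
            running += max_scores[term]
            non_essential.append(term)
        else:
            break
    essential = ordered[len(non_essential):]
    return essential, non_essential


def rewrite(max_scores: Dict[str, float], min_competitive: float) -> str:
    """Render how the disjunction would effectively run, e.g. '+book the of life'."""
    essential, non_essential = partition(max_scores, min_competitive)
    if not essential:
        # Even matching every clause cannot reach the minimum competitive score.
        return ""
    required = list(essential) if len(essential) == 1 else []
    if required and non_essential:
        best = non_essential[-1]  # non-essential clause with the highest max score
        # Best possible score of a document that does NOT match `best`.
        rest = sum(s for t, s in max_scores.items() if t != best)
        if rest < min_competitive:
            required.append(best)  # intersect essential clause with `best`
    parts = ["+" + t for t in required]
    if len(essential) > 1:
        parts.append("+(" + " ".join(t for t in max_scores if t in essential) + ")")
    parts += [t for t in max_scores if t in non_essential and t not in required]
    return " ".join(parts)


# Maximum window scores from the description, in query order.
STATS = {"the": 0.103848875, "book": 2.8796153, "of": 0.19427927, "life": 2.037863}

print(rewrite(STATS, 3.0))        # +book the of life
print(rewrite(STATS, 3.4781857))  # +book +life the of
print(rewrite({"a": 2.0, "b": 1.5}, 2.5))  # +a +b
```

The last call illustrates the two-clause case: as soon as the minimum competitive score exceeds the larger of the two maximum scores, the disjunction can only be satisfied by documents matching both clauses.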
Opening as a draft as I still need to figure out how to test this optimization. I tested on wikibigall where this yielded a good speedup. I would expect an even better speedup if recursive graph bisection is enabled.
…ng scores. This adds a `ScoreQuantizingCollector`, which quantizes scores with a configurable number of accuracy bits. This allows dynamic pruning to more efficiently skip hits that would have similar scores. While this should be considered rank-unsafe since top-hits are different compared to running the top-score collector on its own, it's worth noting that top hits are correct in quantized space.
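To give an intuition for what quantizing scores with a number of accuracy bits can mean, here is a minimal Python sketch. It is not taken from that change: the helper name `quantize_score` and the approach of keeping only the top `accuracy_bits` bits of the float32 mantissa (rounding non-negative scores down) are assumptions made for illustration only.

```python
import struct


def quantize_score(score: float, accuracy_bits: int) -> float:
    """Round a non-negative score down, keeping `accuracy_bits` mantissa bits.

    Hypothetical helper: it treats the score as an IEEE 754 float32 and
    zeroes the low-order bits of its 23-bit mantissa.
    """
    if not 1 <= accuracy_bits <= 23:
        raise ValueError("accuracy_bits must be in [1, 23]")
    if score < 0:
        raise ValueError("score must be non-negative")
    bits = struct.unpack(">I", struct.pack(">f", score))[0]
    # Mask that keeps sign, exponent and the top `accuracy_bits` mantissa bits.
    mask = ~((1 << (23 - accuracy_bits)) - 1) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]


# Two nearby scores collapse to the same quantized value with 5 accuracy bits.
print(quantize_score(3.4781857, 5))  # 3.4375
print(quantize_score(3.45, 5))       # 3.4375
```

Because nearby scores collapse to the same value, a hit whose quantized score does not beat the current threshold can be skipped even if its exact score would have been marginally higher, which is why the resulting top hits are only correct in quantized space.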
I moved the optimization as part of the partitioning logic so that it's easier to test. It's ready for review.
Updated luceneutil results using wikibigall:
I plan on merging in the next couple of days if there are no objections.