Sometimes intersect the essential clause and the best non-essential clause. #12589

Merged 14 commits on Oct 24, 2023

@jpountz jpountz commented Sep 25, 2023

The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and more clauses from the essential list to the non-essential list as the minimum competitive score increases. For instance, a query such as `the book of life`, which I found in the Tantivy benchmark, ends up running as `+book the of life` after some time, i.e. with one required clause and the other clauses optional. This is because matching `the`, `of` and `life` alone cannot produce a competitive score.

Here are some statistics for that case:

  • min competitive score: 3.4781857
  • max_window_score(book): 2.8796153
  • max_window_score(life): 2.037863
  • max_window_score(the): 0.103848875
  • max_window_score(of): 0.19427927

Looking at these statistics, we could do better: a match can only be competitive if it matches both `book` and `life`, since `book`, `the` and `of` together score at most about 3.18, which is below the minimum competitive score. So this query could actually execute as `+book +life the of`, which may help evaluate fewer documents than `+book the of life`, especially if recursive graph bisection is enabled.

This is what this PR tries to achieve: when there is a single essential clause and matching all clauses except the best non-essential clause cannot produce a competitive match, the scorer only evaluates documents that match the intersection of the essential clause and the best non-essential clause.

It's worth noting that this optimization would kick in very frequently on two-clause disjunctions.
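
As a rough illustration of the check described above, here is a minimal sketch. It is not the code in this PR: the class name, the method name and the array-based inputs are hypothetical, and it assumes that a hit is competitive when its score is greater than or equal to the minimum competitive score. The inputs are the per-window maximum scores listed above.

```java
public class MaxScoreIntersectionSketch {

  /**
   * Hypothetical helper: given the max window score of the single essential
   * clause and the max window scores of the non-essential clauses, returns
   * the index of the best non-essential clause if every competitive hit must
   * match it, or -1 if the intersection cannot be applied.
   */
  public static int requiredNonEssentialClause(
      float essentialMaxScore, float[] nonEssentialMaxScores, float minCompetitiveScore) {
    if (nonEssentialMaxScores.length == 0) {
      return -1;
    }
    // Find the non-essential clause with the highest max window score.
    int best = 0;
    for (int i = 1; i < nonEssentialMaxScores.length; i++) {
      if (nonEssentialMaxScores[i] > nonEssentialMaxScores[best]) {
        best = i;
      }
    }
    // Upper bound on the score of a hit that matches every clause but the best one.
    double maxScoreWithoutBest = essentialMaxScore;
    for (int i = 0; i < nonEssentialMaxScores.length; i++) {
      if (i != best) {
        maxScoreWithoutBest += nonEssentialMaxScores[i];
      }
    }
    // If even that upper bound is not competitive, the best clause is required.
    return maxScoreWithoutBest < minCompetitiveScore ? best : -1;
  }

  public static void main(String[] args) {
    // Non-essential clauses: life, the, of. Essential clause: book.
    float[] nonEssential = {2.037863f, 0.103848875f, 0.19427927f};
    int required = requiredNonEssentialClause(2.8796153f, nonEssential, 3.4781857f);
    System.out.println(required); // prints 0, i.e. `life` becomes required
  }
}
```

With a two-clause disjunction the bound reduces to the max score of the essential clause alone, so the intersection applies as soon as the essential clause cannot produce a competitive hit by itself.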

jpountz commented Sep 25, 2023

Opening as a draft as I still need to figure out how to test this optimization.

I tested on wikibigall, where this yielded a good speedup on disjunctive queries (OrHighMed +7.7%, OrHighHigh +13.2%). I would expect an even better speedup if recursive graph bisection is enabled.

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 CountOrHighHigh       59.20     (15.6%)       57.96     (14.8%)   -2.1% ( -28% -   33%) 0.671
                  CountOrHighMed       91.87     (15.6%)       90.04     (14.7%)   -2.0% ( -27% -   33%) 0.684
                         MedTerm      451.03      (9.1%)      446.68      (8.1%)   -1.0% ( -16% -   17%) 0.730
                         LowTerm      751.75      (8.3%)      744.56      (7.1%)   -1.0% ( -15% -   15%) 0.703
                       MedPhrase       40.42      (7.8%)       40.05      (7.9%)   -0.9% ( -15% -   16%) 0.717
                        HighTerm      339.69     (10.6%)      337.52      (9.5%)   -0.6% ( -18% -   21%) 0.844
                      HighPhrase       35.00      (6.4%)       34.82      (6.7%)   -0.5% ( -12% -   13%) 0.809
                     AndHighHigh       63.82      (4.2%)       63.55      (2.9%)   -0.4% (  -7% -    7%) 0.721
                      AndHighMed      182.62      (3.6%)      181.94      (2.4%)   -0.4% (  -6% -    5%) 0.705
                       LowPhrase       41.82      (4.8%)       41.69      (5.6%)   -0.3% ( -10% -   10%) 0.848
           HighTermDayOfYearSort      234.81      (1.5%)      234.33      (1.8%)   -0.2% (  -3% -    3%) 0.697
                      AndHighLow      823.51      (3.2%)      821.97      (2.6%)   -0.2% (  -5% -    5%) 0.846
                       OrHighLow      630.34      (2.9%)      630.52      (3.9%)    0.0% (  -6% -    6%) 0.979
                         Respell       66.16      (1.4%)       66.19      (1.7%)    0.0% (  -2% -    3%) 0.938
                          Fuzzy1      128.80      (1.4%)      128.97      (1.5%)    0.1% (  -2% -    3%) 0.777
                 CountAndHighMed      123.28      (2.8%)      123.60      (3.6%)    0.3% (  -5% -    6%) 0.803
                          Fuzzy2      114.67      (1.3%)      114.97      (1.5%)    0.3% (  -2% -    3%) 0.567
                     CountPhrase        3.38      (9.0%)        3.39      (9.3%)    0.3% ( -16% -   20%) 0.918
                CountAndHighHigh       40.81      (2.8%)       40.96      (3.8%)    0.4% (  -6% -    7%) 0.742
                        PKLookup      224.07      (2.3%)      225.12      (2.7%)    0.5% (  -4% -    5%) 0.566
                        Wildcard      144.23      (2.4%)      145.22      (3.1%)    0.7% (  -4% -    6%) 0.451
               HighTermMonthSort     5090.41      (2.8%)     5142.25      (4.2%)    1.0% (  -5% -    8%) 0.384
                         Prefix3      241.24      (4.3%)      244.33      (4.4%)    1.3% (  -7% -   10%) 0.368
                       CountTerm    17070.19      (5.1%)    17289.73      (4.9%)    1.3% (  -8% -   11%) 0.429
                          IntNRQ       88.47     (11.5%)       89.76     (13.8%)    1.5% ( -21% -   30%) 0.725
                       OrHighMed      191.58      (3.5%)      206.42      (3.7%)    7.7% (   0% -   15%) 0.000
                      OrHighHigh       60.88      (4.4%)       68.91      (4.4%)   13.2% (   4% -   23%) 0.000

…ng scores.

This adds a `ScoreQuantizingCollector`, which quantizes scores with a
configurable number of accuracy bits. This allows dynamic pruning to more
efficiently skip hits that would have similar scores. While this should be
considered rank-unsafe since top-hits are different compared to running the
top-score collector on its own, it's worth noting that top hits are correct in
quantized space.
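
To make the idea of "accuracy bits" concrete, here is a minimal sketch of one possible way to quantize a score. It is an assumption for illustration, not the implementation of `ScoreQuantizingCollector`: the class and method names are hypothetical, and it simply keeps the top `accuracyBits` bits of the float mantissa, which rounds finite non-negative scores down.

```java
public class ScoreQuantizationSketch {

  /**
   * Hypothetical helper: keeps only the {@code accuracyBits} most significant
   * mantissa bits of a finite, non-negative score and zeroes the rest.
   */
  public static float quantize(float score, int accuracyBits) {
    if (accuracyBits < 1 || accuracyBits > 23) {
      throw new IllegalArgumentException("accuracyBits must be in [1, 23]");
    }
    int bits = Float.floatToIntBits(score);
    // A float has 23 explicit mantissa bits; clear the low (23 - accuracyBits).
    int mask = -1 << (23 - accuracyBits);
    return Float.intBitsToFloat(bits & mask);
  }

  public static void main(String[] args) {
    // Two nearby scores collapse to the same value with 5 accuracy bits.
    System.out.println(quantize(3.4781857f, 5)); // prints 3.4375
    System.out.println(quantize(3.45f, 5)); // prints 3.4375
  }
}
```

Once nearby scores collapse to the same quantized value, a hit has to beat the current minimum by a full quantization step to be competitive, which is what lets dynamic pruning skip more hits.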
@jpountz jpountz marked this pull request as ready for review October 18, 2023 08:30

jpountz commented Oct 18, 2023

I moved the optimization into the partitioning logic so that it's easier to test. It's ready for review.

jpountz commented Oct 18, 2023

Updated luceneutil results using wikibigall:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       61.99      (4.0%)       61.33      (4.1%)   -1.1% (  -8% -    7%) 0.408
               HighTermMonthSort     4984.72      (2.3%)     4942.50      (2.6%)   -0.8% (  -5% -    4%) 0.272
                         Prefix3      230.04      (3.7%)      229.11      (3.5%)   -0.4% (  -7% -    7%) 0.720
                         Respell       81.35      (1.4%)       81.25      (2.0%)   -0.1% (  -3% -    3%) 0.816
                CountAndHighHigh       40.98      (4.2%)       40.97      (3.9%)   -0.0% (  -7% -    8%) 0.981
                       CountTerm    16343.93      (4.0%)    16345.74      (4.3%)    0.0% (  -7% -    8%) 0.993
                     CountPhrase        4.46      (3.5%)        4.46      (4.1%)    0.0% (  -7% -    7%) 0.980
           HighTermDayOfYearSort      238.80      (1.4%)      238.88      (1.4%)    0.0% (  -2% -    2%) 0.938
                        PKLookup      223.95      (2.1%)      224.09      (2.1%)    0.1% (  -4% -    4%) 0.925
                 CountAndHighMed      123.80      (3.9%)      123.90      (3.4%)    0.1% (  -6% -    7%) 0.943
                 CountOrHighHigh       57.72     (17.0%)       57.92     (17.7%)    0.3% ( -29% -   42%) 0.950
                      AndHighLow     1006.20      (3.0%)     1010.52      (2.7%)    0.4% (  -5% -    6%) 0.632
                  CountOrHighMed       89.50     (16.7%)       89.90     (17.5%)    0.4% ( -28% -   41%) 0.934
                     AndHighHigh       64.48      (3.6%)       64.85      (3.4%)    0.6% (  -6% -    7%) 0.608
                      AndHighMed      146.96      (3.5%)      147.93      (3.2%)    0.7% (  -5% -    7%) 0.530
                         LowTerm      994.04      (5.9%)     1008.50      (5.9%)    1.5% (  -9% -   14%) 0.434
                          Fuzzy1      150.93      (1.1%)      153.62      (2.0%)    1.8% (  -1% -    4%) 0.000
                         MedTerm      629.58      (6.2%)      641.76      (6.2%)    1.9% (  -9% -   15%) 0.324
                       MedPhrase       28.07      (5.7%)       28.62      (3.6%)    1.9% (  -6% -   11%) 0.201
                          Fuzzy2      107.70      (0.9%)      109.90      (1.7%)    2.0% (   0% -    4%) 0.000
                       LowPhrase       26.51      (6.3%)       27.15      (5.2%)    2.4% (  -8% -   14%) 0.184
                      HighPhrase       40.05      (6.5%)       41.02      (3.7%)    2.4% (  -7% -   13%) 0.147
                          IntNRQ      128.31     (15.0%)      131.43     (17.3%)    2.4% ( -25% -   40%) 0.636
                        HighTerm      399.42      (7.4%)      409.65      (8.0%)    2.6% ( -11% -   19%) 0.293
                       OrHighLow      647.57      (3.4%)      668.35      (3.6%)    3.2% (  -3% -   10%) 0.004
                       OrHighMed      143.30      (5.0%)      163.00      (6.1%)   13.7% (   2% -   26%) 0.000
                      OrHighHigh       56.47      (6.2%)       64.35      (7.8%)   14.0% (   0% -   29%) 0.000

jpountz commented Oct 23, 2023

I plan on merging in the next couple of days if there are no objections.

@jpountz jpountz merged commit 611bbbd into apache:main Oct 24, 2023
4 checks passed
@jpountz jpountz deleted the run_disjunctions_as_conjunctions branch October 24, 2023 15:54
jpountz added a commit that referenced this pull request Oct 24, 2023
…lause. (#12589)
