Sometimes intersect the essential clause and the best non-essential clause. #12589

jpountz · 2023-09-25T14:48:34Z

The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and more clauses from the essential list to the non-essential list as the minimum competitive score increases. For instance, the query `the book of life`, which I found in the Tantivy benchmark, ends up running as `+book the of life` after some time, i.e. with one required clause and the other clauses optional. This is because matching `the`, `of` and `life` alone cannot score high enough to produce a competitive hit.

Here are some statistics for that case:

- min competitive score: 3.4781857
- max_window_score(book): 2.8796153
- max_window_score(life): 2.037863
- max_window_score(the): 0.103848875
- max_window_score(of): 0.19427927
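For illustration, here is a minimal sketch of how these numbers split the clauses into essential and non-essential ones. It is not the code in this PR: the class and method names are hypothetical, it uses doubles where Lucene works with float scores, and it ignores rounding concerns.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of Lucene.
final class MaxScorePartitionSketch {

  /**
   * Returns how many clauses can be treated as non-essential: the longest
   * prefix of clauses, sorted by ascending max window score, whose summed
   * max scores stay below the minimum competitive score.
   */
  static int countNonEssential(double[] maxWindowScores, double minCompetitiveScore) {
    double[] sorted = maxWindowScores.clone();
    Arrays.sort(sorted); // lowest-scoring clauses first
    double sum = 0;
    int nonEssential = 0;
    for (double score : sorted) {
      if (sum + score >= minCompetitiveScore) {
        break; // this clause and all remaining ones are essential
      }
      sum += score;
      nonEssential++;
    }
    return nonEssential;
  }

  public static void main(String[] args) {
    // book, life, the, of
    double[] scores = {2.8796153, 2.037863, 0.103848875, 0.19427927};
    System.out.println(countNonEssential(scores, 3.4781857)); // prints 3
  }
}
```

With the statistics above, `the`, `of` and `life` sum to about 2.34, which is below 3.4781857, so all three are non-essential and `book` is the only essential clause.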

Looking at these statistics, we could do better: a match can only be competitive if it matches both `book` and `life`, since `book` together with `the` and `of` tops out at about 3.18, which is below the minimum competitive score. So this query could execute as `+book +life the of`, which may evaluate fewer documents than `+book the of life`, especially if recursive graph bisection is enabled.

This is what this PR tries to achieve: when there is a single essential clause and matching all clauses except the best non-essential clause cannot produce a competitive match, the scorer only evaluates documents that match the intersection of the essential clause and the best non-essential clause.
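The condition can be sketched as follows. This is a simplified illustration rather than the actual change: the class and method names are hypothetical, the caller is assumed to have already established that there is exactly one essential clause, and doubles stand in for Lucene's float scores.

```java
// Hypothetical helper, for illustration only; not part of Lucene.
final class IntersectionCheckSketch {

  /**
   * Given the max window score of the single essential clause and the max
   * window scores of the non-essential clauses, returns true if a hit that
   * misses the best non-essential clause can never be competitive.
   */
  static boolean shouldIntersect(
      double essentialMax, double[] nonEssentialMax, double minCompetitiveScore) {
    if (nonEssentialMax.length == 0) {
      return false; // nothing to intersect with
    }
    double total = essentialMax;
    double best = Double.NEGATIVE_INFINITY;
    for (double score : nonEssentialMax) {
      total += score;
      best = Math.max(best, score);
    }
    // Best possible score for a document that does not match the best
    // non-essential clause.
    double maxWithoutBest = total - best;
    return maxWithoutBest < minCompetitiveScore;
  }

  public static void main(String[] args) {
    // essential: book; non-essential: life, the, of
    double[] nonEssential = {2.037863, 0.103848875, 0.19427927};
    System.out.println(shouldIntersect(2.8796153, nonEssential, 3.4781857)); // prints true
  }
}
```

For `the book of life`, the best score without `life` is about 3.18, below 3.4781857, so the check returns true and the query can run as `+book +life the of`.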

It's worth noting that this optimization would kick in very frequently on two-clause disjunctions.


jpountz · 2023-09-25T14:50:10Z

Opening as a draft as I still need to figure out how to test this optimization.

I tested on wikibigall where this yielded a good speedup. I would expect an even better speedup if recursive graph bisection is enabled.

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 CountOrHighHigh       59.20     (15.6%)       57.96     (14.8%)   -2.1% ( -28% -   33%) 0.671
                  CountOrHighMed       91.87     (15.6%)       90.04     (14.7%)   -2.0% ( -27% -   33%) 0.684
                         MedTerm      451.03      (9.1%)      446.68      (8.1%)   -1.0% ( -16% -   17%) 0.730
                         LowTerm      751.75      (8.3%)      744.56      (7.1%)   -1.0% ( -15% -   15%) 0.703
                       MedPhrase       40.42      (7.8%)       40.05      (7.9%)   -0.9% ( -15% -   16%) 0.717
                        HighTerm      339.69     (10.6%)      337.52      (9.5%)   -0.6% ( -18% -   21%) 0.844
                      HighPhrase       35.00      (6.4%)       34.82      (6.7%)   -0.5% ( -12% -   13%) 0.809
                     AndHighHigh       63.82      (4.2%)       63.55      (2.9%)   -0.4% (  -7% -    7%) 0.721
                      AndHighMed      182.62      (3.6%)      181.94      (2.4%)   -0.4% (  -6% -    5%) 0.705
                       LowPhrase       41.82      (4.8%)       41.69      (5.6%)   -0.3% ( -10% -   10%) 0.848
           HighTermDayOfYearSort      234.81      (1.5%)      234.33      (1.8%)   -0.2% (  -3% -    3%) 0.697
                      AndHighLow      823.51      (3.2%)      821.97      (2.6%)   -0.2% (  -5% -    5%) 0.846
                       OrHighLow      630.34      (2.9%)      630.52      (3.9%)    0.0% (  -6% -    6%) 0.979
                         Respell       66.16      (1.4%)       66.19      (1.7%)    0.0% (  -2% -    3%) 0.938
                          Fuzzy1      128.80      (1.4%)      128.97      (1.5%)    0.1% (  -2% -    3%) 0.777
                 CountAndHighMed      123.28      (2.8%)      123.60      (3.6%)    0.3% (  -5% -    6%) 0.803
                          Fuzzy2      114.67      (1.3%)      114.97      (1.5%)    0.3% (  -2% -    3%) 0.567
                     CountPhrase        3.38      (9.0%)        3.39      (9.3%)    0.3% ( -16% -   20%) 0.918
                CountAndHighHigh       40.81      (2.8%)       40.96      (3.8%)    0.4% (  -6% -    7%) 0.742
                        PKLookup      224.07      (2.3%)      225.12      (2.7%)    0.5% (  -4% -    5%) 0.566
                        Wildcard      144.23      (2.4%)      145.22      (3.1%)    0.7% (  -4% -    6%) 0.451
               HighTermMonthSort     5090.41      (2.8%)     5142.25      (4.2%)    1.0% (  -5% -    8%) 0.384
                         Prefix3      241.24      (4.3%)      244.33      (4.4%)    1.3% (  -7% -   10%) 0.368
                       CountTerm    17070.19      (5.1%)    17289.73      (4.9%)    1.3% (  -8% -   11%) 0.429
                          IntNRQ       88.47     (11.5%)       89.76     (13.8%)    1.5% ( -21% -   30%) 0.725
                       OrHighMed      191.58      (3.5%)      206.42      (3.7%)    7.7% (   0% -   15%) 0.000
                      OrHighHigh       60.88      (4.4%)       68.91      (4.4%)   13.2% (   4% -   23%) 0.000

…ng scores. This adds a `ScoreQuantizingCollector`, which quantizes scores with a configurable number of accuracy bits. This allows dynamic pruning to more efficiently skip hits that would have similar scores. While this should be considered rank-unsafe since top-hits are different compared to running the top-score collector on its own, it's worth noting that top hits are correct in quantized space.

jpountz · 2023-10-18T08:31:11Z

I moved the optimization into the partitioning logic so that it's easier to test. It's ready for review.

jpountz · 2023-10-18T09:10:45Z

Updated luceneutil results using wikibigall:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       61.99      (4.0%)       61.33      (4.1%)   -1.1% (  -8% -    7%) 0.408
               HighTermMonthSort     4984.72      (2.3%)     4942.50      (2.6%)   -0.8% (  -5% -    4%) 0.272
                         Prefix3      230.04      (3.7%)      229.11      (3.5%)   -0.4% (  -7% -    7%) 0.720
                         Respell       81.35      (1.4%)       81.25      (2.0%)   -0.1% (  -3% -    3%) 0.816
                CountAndHighHigh       40.98      (4.2%)       40.97      (3.9%)   -0.0% (  -7% -    8%) 0.981
                       CountTerm    16343.93      (4.0%)    16345.74      (4.3%)    0.0% (  -7% -    8%) 0.993
                     CountPhrase        4.46      (3.5%)        4.46      (4.1%)    0.0% (  -7% -    7%) 0.980
           HighTermDayOfYearSort      238.80      (1.4%)      238.88      (1.4%)    0.0% (  -2% -    2%) 0.938
                        PKLookup      223.95      (2.1%)      224.09      (2.1%)    0.1% (  -4% -    4%) 0.925
                 CountAndHighMed      123.80      (3.9%)      123.90      (3.4%)    0.1% (  -6% -    7%) 0.943
                 CountOrHighHigh       57.72     (17.0%)       57.92     (17.7%)    0.3% ( -29% -   42%) 0.950
                      AndHighLow     1006.20      (3.0%)     1010.52      (2.7%)    0.4% (  -5% -    6%) 0.632
                  CountOrHighMed       89.50     (16.7%)       89.90     (17.5%)    0.4% ( -28% -   41%) 0.934
                     AndHighHigh       64.48      (3.6%)       64.85      (3.4%)    0.6% (  -6% -    7%) 0.608
                      AndHighMed      146.96      (3.5%)      147.93      (3.2%)    0.7% (  -5% -    7%) 0.530
                         LowTerm      994.04      (5.9%)     1008.50      (5.9%)    1.5% (  -9% -   14%) 0.434
                          Fuzzy1      150.93      (1.1%)      153.62      (2.0%)    1.8% (  -1% -    4%) 0.000
                         MedTerm      629.58      (6.2%)      641.76      (6.2%)    1.9% (  -9% -   15%) 0.324
                       MedPhrase       28.07      (5.7%)       28.62      (3.6%)    1.9% (  -6% -   11%) 0.201
                          Fuzzy2      107.70      (0.9%)      109.90      (1.7%)    2.0% (   0% -    4%) 0.000
                       LowPhrase       26.51      (6.3%)       27.15      (5.2%)    2.4% (  -8% -   14%) 0.184
                      HighPhrase       40.05      (6.5%)       41.02      (3.7%)    2.4% (  -7% -   13%) 0.147
                          IntNRQ      128.31     (15.0%)      131.43     (17.3%)    2.4% ( -25% -   40%) 0.636
                        HighTerm      399.42      (7.4%)      409.65      (8.0%)    2.6% ( -11% -   19%) 0.293
                       OrHighLow      647.57      (3.4%)      668.35      (3.6%)    3.2% (  -3% -   10%) 0.004
                       OrHighMed      143.30      (5.0%)      163.00      (6.1%)   13.7% (   2% -   26%) 0.000
                      OrHighHigh       56.47      (6.2%)       64.35      (7.8%)   14.0% (   0% -   29%) 0.000

jpountz · 2023-10-23T12:36:36Z

I plan on merging in the next couple of days if there are no objections.


jpountz added 7 commits October 9, 2023 08:47

Merge branch 'main' into run_disjunctions_as_conjunctions

9804947

iter

43c0d3f

iter

f86a620

Merge branch 'main' into run_disjunctions_as_conjunctions

6e9236b

Remove unrelated code.

0813fb3

iter

a6e23a6

jpountz marked this pull request as ready for review October 18, 2023 08:30

Undo unintended change.

a893fe0

forbidden API

9465a5d

jpountz added 4 commits October 24, 2023 15:58

Merge branch 'main' into run_disjunctions_as_conjunctions

52e1e58

Fix indentation

e01bd8e

CHANGES

d3f4eff

Oops deleted docs.

31bab4b

jpountz merged commit 611bbbd into apache:main Oct 24, 2023
4 checks passed

jpountz deleted the run_disjunctions_as_conjunctions branch October 24, 2023 15:54
