LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction #1006

zacharymorn · 2022-07-06T04:08:39Z

Description (or a Jira issue link if you have one)

Follow-up changes for https://issues.apache.org/jira/browse/LUCENE-10480 to improve performance for disjunction within conjunction queries.
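
To illustrate the idea behind the change, here is a minimal, self-contained sketch. It is not Lucene's actual `BlockMaxMaxscoreScorer` code: the names `TwoPhaseSketch`, `Postings`, `ThresholdClause`, and `countMatches`, as well as the toy scoring function, are made up for illustration. It assumes a conjunction with a sparse lead clause and a second clause that only matches documents whose score reaches a threshold, and it counts how often that clause scores a document when the check runs inside `advance` versus inside `matches`.

```java
import java.util.function.IntToDoubleFunction;
import java.util.stream.IntStream;

// Illustrative sketch only: hypothetical names, not Lucene's API.
class TwoPhaseSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Iterator over a sorted doc-id array, standing in for a postings list.
  static final class Postings {
    private final int[] docs;
    private int idx = -1;

    Postings(int... docs) {
      this.docs = docs;
    }

    int docID() {
      if (idx < 0) return -1;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    // Moves to the first doc >= target that lies after the current position.
    int advance(int target) {
      do {
        idx++;
      } while (idx < docs.length && docs[idx] < target);
      return docID();
    }
  }

  // Clause that only matches docs whose score reaches a threshold.
  static final class ThresholdClause {
    final Postings approximation;
    final IntToDoubleFunction scoreOf;
    final double minCompetitiveScore;
    final boolean scoreInAdvance;
    int scoreCalls = 0;

    ThresholdClause(Postings approximation, IntToDoubleFunction scoreOf,
                    double minCompetitiveScore, boolean scoreInAdvance) {
      this.approximation = approximation;
      this.scoreOf = scoreOf;
      this.minCompetitiveScore = minCompetitiveScore;
      this.scoreInAdvance = scoreInAdvance;
    }

    int advance(int target) {
      int doc = approximation.advance(target);
      // Eager mode: keep scoring until a competitive doc is found.
      while (scoreInAdvance && doc != NO_MORE_DOCS && score(doc) < minCompetitiveScore) {
        doc = approximation.advance(doc + 1);
      }
      return doc;
    }

    // Lazy mode: score only the doc that the conjunction already agreed on.
    boolean matches() {
      return scoreInAdvance || score(approximation.docID()) >= minCompetitiveScore;
    }

    private double score(int doc) {
      scoreCalls++;
      return scoreOf.applyAsDouble(doc);
    }
  }

  // Leapfrogs the two clauses and confirms each shared candidate via matches().
  static int countMatches(Postings lead, ThresholdClause other) {
    int hits = 0;
    int doc = lead.advance(0);
    while (doc != NO_MORE_DOCS) {
      int otherDoc = other.approximation.docID();
      if (otherDoc < doc) otherDoc = other.advance(doc);
      if (otherDoc == doc) {
        if (other.matches()) hits++;
        doc = lead.advance(doc + 1);
      } else if (otherDoc == NO_MORE_DOCS) {
        break;
      } else {
        doc = lead.advance(otherDoc);
      }
    }
    return hits;
  }

  // Returns {hits, number of score computations} for the chosen mode.
  static int[] run(boolean scoreInAdvance) {
    Postings lead = new Postings(5, 50, 98);
    Postings dense = new Postings(IntStream.range(0, 100).toArray());
    ThresholdClause clause = new ThresholdClause(dense, doc -> doc % 10, 8.0, scoreInAdvance);
    int hits = countMatches(lead, clause);
    return new int[] {hits, clause.scoreCalls};
  }

  public static void main(String[] args) {
    int[] eager = run(true);
    int[] lazy = run(false);
    System.out.println("score in advance: hits=" + eager[0] + " scoreCalls=" + eager[1]);
    System.out.println("score in matches: hits=" + lazy[0] + " scoreCalls=" + lazy[1]);
  }
}
```

In this toy setup both modes find the same hit, but scoring in `matches` only scores documents that the lead clause also contains. That is presumably the kind of saving this change targets for disjunctions nested inside conjunctions.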

Benchmark results with wikinightly.tasks boolean queries below:

AndHighHigh: +be +up # freq=2115632 freq=824628
AndHighHigh: +cite +had # freq=1367577 freq=1223103
AndHighHigh: +is +he # freq=4214104 freq=1663980
AndHighHigh: +no +4 # freq=1060681 freq=944177
AndHighHigh: +title +see # freq=2077102 freq=1100862
AndHighMed: +2010 +16 # freq=933686 freq=531050
AndHighMed: +5 +power # freq=849829 freq=257919
AndHighMed: +only +particularly # freq=895806 freq=100045
AndHighMed: +united +1983 # freq=1185528 freq=150075
AndHighMed: +who +ed # freq=1201585 freq=127497
OrHighHigh: are last # freq=1921211 freq=830278
OrHighHigh: at united # freq=2834104 freq=1185528
OrHighHigh: but year # freq=1484398 freq=1098425
OrHighHigh: name its # freq=2577591 freq=1160703
OrHighHigh: to but # freq=6105155 freq=1484398
OrHighMed: at mostly # freq=2834104 freq=89401
OrHighMed: his interview # freq=1771920 freq=94736
OrHighMed: http 9 # freq=3289683 freq=541405
OrHighMed: they hard # freq=1031516 freq=92045
OrHighMed: title bay # freq=2077102 freq=117167
AndHighOrMedMed: +be +(mostly interview) # freq=2115632 freq=89401 freq=94736
AndHighOrMedMed: +cite +(9 hard) # freq=1367577 freq=541405 freq=92045
AndHighOrMedMed: +is +(bay 16) # freq=4214104 freq=117167 freq=531050
AndHighOrMedMed: +no +(power particularly) # freq=1060681 freq=257919 freq=100045
AndHighOrMedMed: +title +(1983 ed) # freq=2077102 freq=150075 freq=127497
AndMedOrHighHigh: +mostly +(are last) # freq=89401 freq=1921211 freq=830278
AndMedOrHighHigh: +interview +(at united) # freq=94736 freq=2834104 freq=1185528
AndMedOrHighHigh: +hard +(but year) # freq=92045 freq=1484398 freq=1098425
AndMedOrHighHigh: +9 +(name its) # freq=541405 freq=2577591 freq=1160703
AndMedOrHighHigh: +bay +(to but) # freq=117167 freq=6105155 freq=1484398

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                     AndHighHigh       40.93      (2.8%)       40.72      (4.2%)   -0.5% (  -7% -    6%) 0.659
                      AndHighMed      150.71      (3.4%)      152.22      (3.7%)    1.0% (  -5% -    8%) 0.371
                        PKLookup      250.85      (8.7%)      257.51      (8.9%)    2.7% ( -13% -   22%) 0.340
                 AndHighOrMedMed       66.87      (4.0%)       68.70      (2.7%)    2.7% (  -3% -    9%) 0.012
                AndMedOrHighHigh       89.04      (2.6%)       93.28      (3.1%)    4.8% (   0% -   10%) 0.000
                      OrHighHigh       21.71      (6.0%)       34.50      (6.8%)   58.9% (  43% -   76%) 0.000
                       OrHighMed       85.11      (5.0%)      189.37      (8.0%)  122.5% ( 104% -  142%) 0.000
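
For readers unfamiliar with these tables, the leading percentage in the Pct diff column matches the relative change between the two mean QPS columns. The snippet below is a small sketch of that arithmetic using values from the table above; `PctDiffSketch` and `pctDiff` are hypothetical names, and because the printed QPS values are already rounded, a recomputed figure may differ slightly in the last digit.

```java
// Illustrative arithmetic only; not part of the benchmark tooling.
class PctDiffSketch {
  // Relative change of the candidate QPS versus the baseline QPS, in percent.
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // Values copied from the first table of this comment.
    System.out.printf("AndHighHigh: %.1f%%%n", pctDiff(40.93, 40.72));
    System.out.printf("OrHighHigh:  %.1f%%%n", pctDiff(21.71, 34.50));
    System.out.printf("OrHighMed:   %.1f%%%n", pctDiff(85.11, 189.37));
  }
}
```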

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       68.90      (4.5%)       67.15      (4.3%)   -2.5% ( -10% -    6%) 0.074
                     AndHighHigh       73.07      (3.0%)       72.11      (3.5%)   -1.3% (  -7% -    5%) 0.212
                      AndHighMed      146.94      (4.7%)      145.56      (4.9%)   -0.9% ( -10% -    9%) 0.550
                        PKLookup      252.01      (9.3%)      249.71     (13.2%)   -0.9% ( -21% -   23%) 0.806
                 AndHighOrMedMed       65.49      (5.8%)       66.09      (4.9%)    0.9% (  -9% -   12%) 0.600
                      OrHighHigh       21.34      (6.7%)       29.63      (6.7%)   38.8% (  23% -   55%) 0.000
                       OrHighMed      122.61      (8.2%)      227.04      (9.0%)   85.2% (  62% -  111%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      AndHighMed      113.58      (2.8%)      113.98      (4.8%)    0.3% (  -7% -    8%) 0.779
                     AndHighHigh       51.37      (3.2%)       51.58      (5.2%)    0.4% (  -7% -    9%) 0.759
                        PKLookup      272.05      (8.9%)      276.89     (12.6%)    1.8% ( -18% -   25%) 0.605
                 AndHighOrMedMed      102.86      (5.1%)      107.47      (5.4%)    4.5% (  -5% -   15%) 0.007
                AndMedOrHighHigh       91.55      (3.8%)       96.43      (5.2%)    5.3% (  -3% -   14%) 0.000
                      OrHighHigh       27.08      (6.5%)       47.16     (11.3%)   74.2% (  52% -   98%) 0.000
                       OrHighMed       78.78      (5.9%)      153.46     (12.1%)   94.8% (  72% -  119%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      260.41      (9.8%)      261.79     (10.0%)    0.5% ( -17% -   22%) 0.866
                     AndHighHigh      122.91      (4.0%)      124.37      (5.0%)    1.2% (  -7% -   10%) 0.406
                      AndHighMed      112.99      (4.6%)      114.77      (5.9%)    1.6% (  -8% -   12%) 0.345
                 AndHighOrMedMed       81.97      (5.6%)       83.37      (5.9%)    1.7% (  -9% -   13%) 0.342
                AndMedOrHighHigh       91.34      (4.7%)       98.16      (5.8%)    7.5% (  -2% -   18%) 0.000
                      OrHighHigh       21.05      (5.5%)       30.30      (5.7%)   43.9% (  31% -   58%) 0.000
                       OrHighMed       98.48      (6.3%)      274.14     (11.2%)  178.4% ( 151% -  208%) 0.000

jpountz

We should be able to remove the while loop now; otherwise LGTM.

jpountz · 2022-07-06T07:19:11Z

lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java

+          do {
+            top.doc = top.iterator.nextDoc();
+            top = essentialsScorers.updateTop();
+          } while (top.doc == docId);


Advancing the priority queue here shouldn't be necessary, so we should be able to remove this while loop. The scorers will be advanced on the next call to advance.

Yup, good catch! I've removed it.
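
The reasoning in this thread can be sketched with a small, self-contained example. It is not the `BlockMaxMaxscoreScorer` code: `LazyAdvanceSketch`, `Cursor`, and `collect` are hypothetical names, and a `java.util.PriorityQueue` stands in for the scorer's own priority queue. The point it shows is that nothing has to be moved past the current doc eagerly, because the next `advance(target)` call catches up every cursor that is still behind `target`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: hypothetical names, not Lucene's API.
class LazyAdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Cursor over a sorted doc-id array, standing in for one scorer's iterator.
  static final class Cursor {
    final int[] docs;
    int idx = -1;
    int doc = -1;

    Cursor(int[] docs) {
      this.docs = docs;
    }

    // Moves to the first doc >= target that lies after the current position.
    int advance(int target) {
      do {
        idx++;
      } while (idx < docs.length && docs[idx] < target);
      doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
      return doc;
    }
  }

  // Advances every cursor that is behind target; returns the smallest doc >= target.
  static int advance(PriorityQueue<Cursor> heap, int target) {
    while (heap.peek().doc < target) {
      Cursor top = heap.poll();
      top.advance(target);
      heap.add(top);
    }
    return heap.peek().doc;
  }

  // Returns the union of the given postings (at least one array is expected).
  static List<Integer> collect(int[]... postings) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.doc));
    for (int[] docs : postings) {
      heap.add(new Cursor(docs));
    }
    List<Integer> out = new ArrayList<>();
    int doc = advance(heap, 0);
    while (doc != NO_MORE_DOCS) {
      out.add(doc);
      // No eager "move past doc" loop here: the next advance call catches up
      // every cursor that is still positioned on doc.
      doc = advance(heap, doc + 1);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(collect(new int[] {1, 4, 9}, new int[] {1, 5, 9}));
  }
}
```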

zacharymorn · 2022-07-07T01:59:11Z

Here are the latest benchmark results:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      237.99     (11.8%)      239.14     (15.3%)    0.5% ( -23% -   31%) 0.911
                      AndHighMed      144.70      (5.9%)      146.20      (6.0%)    1.0% ( -10% -   13%) 0.581
                     AndHighHigh       38.99      (5.5%)       39.44      (5.8%)    1.2% (  -9% -   13%) 0.518
                 AndHighOrMedMed      107.06      (6.1%)      108.65      (5.0%)    1.5% (  -9% -   13%) 0.399
                AndMedOrHighHigh       93.02      (4.5%)       96.45      (5.7%)    3.7% (  -6% -   14%) 0.023
                      OrHighHigh       27.21      (6.8%)       48.15     (10.8%)   76.9% (  55% -  101%) 0.000
                       OrHighMed       79.43      (6.5%)      147.82     (12.8%)   86.1% (  62% -  112%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                     AndHighHigh      122.10      (4.2%)      119.71      (4.6%)   -2.0% ( -10% -    7%) 0.157
                      AndHighMed      110.48      (4.9%)      110.66      (4.8%)    0.2% (  -9% -   10%) 0.918
                        PKLookup      235.57     (11.8%)      237.25      (8.7%)    0.7% ( -17% -   24%) 0.829
                 AndHighOrMedMed       62.93      (6.2%)       64.80      (4.2%)    3.0% (  -7% -   14%) 0.078
                AndMedOrHighHigh       89.72      (4.6%)       94.33      (4.8%)    5.1% (  -4% -   15%) 0.001
                      OrHighHigh      107.23      (4.7%)      139.99     (14.4%)   30.6% (  10% -   52%) 0.000
                       OrHighMed       96.16      (6.2%)      261.62     (15.4%)  172.1% ( 141% -  206%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      AndHighMed      148.59      (4.0%)      148.27      (4.2%)   -0.2% (  -8% -    8%) 0.865
                     AndHighHigh       39.74      (3.8%)       39.88      (4.1%)    0.3% (  -7% -    8%) 0.783
                        PKLookup      253.64      (6.8%)      256.62     (10.0%)    1.2% ( -14% -   19%) 0.664
                AndMedOrHighHigh       87.58      (3.9%)       90.93      (2.7%)    3.8% (  -2% -   10%) 0.000
                 AndHighOrMedMed      106.05      (4.7%)      110.28      (4.5%)    4.0% (  -5% -   13%) 0.006
                      OrHighHigh       19.56      (6.3%)       25.00      (4.9%)   27.8% (  15% -   41%) 0.000
                       OrHighMed       83.40      (5.2%)      191.56      (8.3%)  129.7% ( 110% -  150%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       83.55      (5.2%)       82.09      (4.5%)   -1.7% ( -10% -    8%) 0.255
                        PKLookup      244.80      (7.2%)      243.63      (6.1%)   -0.5% ( -12% -   13%) 0.821
                     AndHighHigh       51.79      (4.8%)       51.71      (3.5%)   -0.2% (  -8% -    8%) 0.899
                      AndHighMed      141.27      (4.9%)      142.38      (4.1%)    0.8% (  -7% -   10%) 0.583
                AndMedOrHighHigh       33.17      (3.9%)       34.37      (3.4%)    3.6% (  -3% -   11%) 0.002
                      OrHighHigh       20.46      (4.3%)       24.23      (5.8%)   18.4% (   7% -   29%) 0.000
                       OrHighMed      122.35      (3.8%)      228.37      (7.8%)   86.7% (  72% -  102%) 0.000

jpountz

I don't understand why top-level disjunctions get faster but the change looks good to me.

zacharymorn · 2022-07-07T04:54:48Z

> I don't understand why top-level disjunctions get faster but the change looks good to me.

Oh sorry @jpountz, I should have mentioned this explicitly. The benchmark baseline I used predates all BMM changes (specifically, the head of the baseline is 08a9dfd), so the results reflect the net effect of the BMM changes on those boolean queries.

Before the changes in this PR, I was able to consistently reproduce the slowdown noted in https://issues.apache.org/jira/browse/LUCENE-10480:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       85.67      (5.3%)       73.23      (5.7%)  -14.5% ( -24% -   -3%) 0.000
                        PKLookup      260.08     (13.4%)      253.74     (14.9%)   -2.4% ( -27% -   29%) 0.586
                     AndHighHigh       73.68      (4.7%)       72.70      (4.1%)   -1.3% (  -9% -    7%) 0.339
                      AndHighMed       89.52      (5.1%)       88.55      (4.4%)   -1.1% ( -10% -    8%) 0.470
                 AndHighOrMedMed       63.27      (6.5%)       70.48      (5.7%)   11.4% (   0% -   25%) 0.000
                      OrHighHigh       19.60      (5.3%)       25.62      (7.6%)   30.8% (  16% -   46%) 0.000
                       OrHighMed      121.08      (5.7%)      236.34     (10.2%)   95.2% (  74% -  117%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       86.88      (3.4%)       76.60      (3.1%)  -11.8% ( -17% -   -5%) 0.000
                     AndHighHigh       30.49      (3.5%)       30.36      (3.5%)   -0.4% (  -7% -    6%) 0.697
                      AndHighMed      192.76      (3.4%)      193.72      (3.9%)    0.5% (  -6% -    8%) 0.671
                        PKLookup      262.59      (5.5%)      264.52      (7.9%)    0.7% ( -11% -   14%) 0.731
                 AndHighOrMedMed       65.47      (3.8%)       73.43      (3.0%)   12.2% (   5% -   19%) 0.000
                      OrHighHigh       21.47      (4.1%)       36.94      (8.3%)   72.1% (  57% -   88%) 0.000
                       OrHighMed       99.91      (4.3%)      292.05     (12.9%)  192.3% ( 167% -  218%) 0.000

jpountz · 2022-07-07T07:05:19Z

Ah, that makes sense to me now! Thanks for explaining.

zacharymorn · 2022-07-07T08:10:30Z

> Ah, that makes sense to me now! Thanks for explaining.

No problem!

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction

5264f87

zacharymorn requested a review from jpountz July 6, 2022 04:08

jpountz approved these changes Jul 6, 2022

remove no longer needed loop

c6a3e5a

zacharymorn force-pushed the LUCENE-10480-MoveScoringIntoMatches branch from 28c02cd to c6a3e5a Compare July 7, 2022 01:41

zacharymorn requested a review from jpountz July 7, 2022 01:59

jpountz approved these changes Jul 7, 2022

zacharymorn merged commit da8143b into apache:main Jul 7, 2022

zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Jul 7, 2022

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (apache#1006) (cherry picked from commit da8143b)

6e4c792

zacharymorn mentioned this pull request Jul 7, 2022

LUCENE-10480: (Backporting) Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006) #1008

Merged

zacharymorn added a commit that referenced this pull request Jul 8, 2022

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006) (#1008) (cherry picked from commit da8143b)

090cbc5
