LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction #1006

Merged

zacharymorn

Description (or a Jira issue link if you have one)

Follow-up changes for https://issues.apache.org/jira/browse/LUCENE-10480 to improve performance for disjunction within conjunction queries.
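As a rough illustration of the idea in the title, here is a minimal, self-contained sketch that does not use Lucene. `Cursor`, `ScoreCheck`, `run`, and the `doc % 10 == 0` competitiveness rule are all hypothetical stand-ins, not the classes touched by this PR. It intersects a sparse required clause with a dense scored clause in two ways: scoring inside `advance` (skipping ahead until a competitive doc is found) versus scoring only once both clauses agree on a doc, which is the role `TwoPhaseIterator#matches` plays.

```java
import java.util.stream.IntStream;

class TwoPhaseSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical stand-in for a postings iterator over sorted doc ids. */
  static final class Cursor {
    private final int[] docs;
    private int idx = -1;

    Cursor(int[] docs) { this.docs = docs; }

    int docID() {
      if (idx < 0) return -1;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves to the first doc >= target that lies after the current position. */
    int advance(int target) {
      do { idx++; } while (idx < docs.length && docs[idx] < target);
      return docID();
    }
  }

  /** Toy "expensive" score check that counts how often it runs. */
  static final class ScoreCheck {
    int calls = 0;

    boolean isCompetitive(int doc) {
      calls++;
      return doc % 10 == 0; // illustrative rule only, not a real scoring model
    }
  }

  /** Returns {hits, scoreCalls} for: required AND (scored optional clause). */
  static int[] run(int[] required, int[] optional, boolean scoreInAdvance) {
    Cursor req = new Cursor(required);
    Cursor opt = new Cursor(optional);
    ScoreCheck score = new ScoreCheck();
    int hits = 0;
    int doc = req.advance(0);
    while (doc != NO_MORE_DOCS) {
      int other = opt.docID();
      if (other < doc) {
        other = opt.advance(doc);
        // Old shape: advance() keeps scoring until it lands on a competitive doc.
        while (scoreInAdvance && other != NO_MORE_DOCS && !score.isCompetitive(other)) {
          other = opt.advance(other + 1);
        }
      }
      if (other != doc) {
        doc = req.advance(other); // leapfrog the required clause forward
        continue;
      }
      // Both clauses agree on doc. New shape: score only now, as matches() would.
      if (scoreInAdvance || score.isCompetitive(doc)) hits++;
      doc = req.advance(doc + 1);
    }
    return new int[] {hits, score.calls};
  }

  public static void main(String[] args) {
    int[] required = {3, 10, 17, 20};
    int[] optional = IntStream.range(0, 30).toArray();
    int[] before = run(required, optional, true);
    int[] after = run(required, optional, false);
    System.out.println("score in advance: hits=" + before[0] + " scoreCalls=" + before[1]);
    System.out.println("score in matches: hits=" + after[0] + " scoreCalls=" + after[1]);
  }
}
```

Both variants return the same hits on this toy input, but the second one only scores docs the required clause has already confirmed. How much that saves in practice depends on the real scorer and data, which is what the benchmarks in this PR measure.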

Benchmark results with wikinightly.tasks boolean queries below:

AndHighHigh: +be +up # freq=2115632 freq=824628
AndHighHigh: +cite +had # freq=1367577 freq=1223103
AndHighHigh: +is +he # freq=4214104 freq=1663980
AndHighHigh: +no +4 # freq=1060681 freq=944177
AndHighHigh: +title +see # freq=2077102 freq=1100862
AndHighMed: +2010 +16 # freq=933686 freq=531050
AndHighMed: +5 +power # freq=849829 freq=257919
AndHighMed: +only +particularly # freq=895806 freq=100045
AndHighMed: +united +1983 # freq=1185528 freq=150075
AndHighMed: +who +ed # freq=1201585 freq=127497
OrHighHigh: are last # freq=1921211 freq=830278
OrHighHigh: at united # freq=2834104 freq=1185528
OrHighHigh: but year # freq=1484398 freq=1098425
OrHighHigh: name its # freq=2577591 freq=1160703
OrHighHigh: to but # freq=6105155 freq=1484398
OrHighMed: at mostly # freq=2834104 freq=89401
OrHighMed: his interview # freq=1771920 freq=94736
OrHighMed: http 9 # freq=3289683 freq=541405
OrHighMed: they hard # freq=1031516 freq=92045
OrHighMed: title bay # freq=2077102 freq=117167
AndHighOrMedMed: +be +(mostly interview) # freq=2115632 freq=89401 freq=94736
AndHighOrMedMed: +cite +(9 hard) # freq=1367577 freq=541405 freq=92045
AndHighOrMedMed: +is +(bay 16) # freq=4214104 freq=117167 freq=531050
AndHighOrMedMed: +no +(power particularly) # freq=1060681 freq=257919 freq=100045
AndHighOrMedMed: +title +(1983 ed) # freq=2077102 freq=150075 freq=127497
AndMedOrHighHigh: +mostly +(are last) # freq=89401 freq=1921211 freq=830278
AndMedOrHighHigh: +interview +(at united) # freq=94736 freq=2834104 freq=1185528
AndMedOrHighHigh: +hard +(but year) # freq=92045 freq=1484398 freq=1098425
AndMedOrHighHigh: +9 +(name its) # freq=541405 freq=2577591 freq=1160703
AndMedOrHighHigh: +bay +(to but) # freq=117167 freq=6105155 freq=1484398

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                     AndHighHigh       40.93      (2.8%)       40.72      (4.2%)   -0.5% (  -7% -    6%) 0.659
                      AndHighMed      150.71      (3.4%)      152.22      (3.7%)    1.0% (  -5% -    8%) 0.371
                        PKLookup      250.85      (8.7%)      257.51      (8.9%)    2.7% ( -13% -   22%) 0.340
                 AndHighOrMedMed       66.87      (4.0%)       68.70      (2.7%)    2.7% (  -3% -    9%) 0.012
                AndMedOrHighHigh       89.04      (2.6%)       93.28      (3.1%)    4.8% (   0% -   10%) 0.000
                      OrHighHigh       21.71      (6.0%)       34.50      (6.8%)   58.9% (  43% -   76%) 0.000
                       OrHighMed       85.11      (5.0%)      189.37      (8.0%)  122.5% ( 104% -  142%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       68.90      (4.5%)       67.15      (4.3%)   -2.5% ( -10% -    6%) 0.074
                     AndHighHigh       73.07      (3.0%)       72.11      (3.5%)   -1.3% (  -7% -    5%) 0.212
                      AndHighMed      146.94      (4.7%)      145.56      (4.9%)   -0.9% ( -10% -    9%) 0.550
                        PKLookup      252.01      (9.3%)      249.71     (13.2%)   -0.9% ( -21% -   23%) 0.806
                 AndHighOrMedMed       65.49      (5.8%)       66.09      (4.9%)    0.9% (  -9% -   12%) 0.600
                      OrHighHigh       21.34      (6.7%)       29.63      (6.7%)   38.8% (  23% -   55%) 0.000
                       OrHighMed      122.61      (8.2%)      227.04      (9.0%)   85.2% (  62% -  111%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      AndHighMed      113.58      (2.8%)      113.98      (4.8%)    0.3% (  -7% -    8%) 0.779
                     AndHighHigh       51.37      (3.2%)       51.58      (5.2%)    0.4% (  -7% -    9%) 0.759
                        PKLookup      272.05      (8.9%)      276.89     (12.6%)    1.8% ( -18% -   25%) 0.605
                 AndHighOrMedMed      102.86      (5.1%)      107.47      (5.4%)    4.5% (  -5% -   15%) 0.007
                AndMedOrHighHigh       91.55      (3.8%)       96.43      (5.2%)    5.3% (  -3% -   14%) 0.000
                      OrHighHigh       27.08      (6.5%)       47.16     (11.3%)   74.2% (  52% -   98%) 0.000
                       OrHighMed       78.78      (5.9%)      153.46     (12.1%)   94.8% (  72% -  119%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      260.41      (9.8%)      261.79     (10.0%)    0.5% ( -17% -   22%) 0.866
                     AndHighHigh      122.91      (4.0%)      124.37      (5.0%)    1.2% (  -7% -   10%) 0.406
                      AndHighMed      112.99      (4.6%)      114.77      (5.9%)    1.6% (  -8% -   12%) 0.345
                 AndHighOrMedMed       81.97      (5.6%)       83.37      (5.9%)    1.7% (  -9% -   13%) 0.342
                AndMedOrHighHigh       91.34      (4.7%)       98.16      (5.8%)    7.5% (  -2% -   18%) 0.000
                      OrHighHigh       21.05      (5.5%)       30.30      (5.7%)   43.9% (  31% -   58%) 0.000
                       OrHighMed       98.48      (6.3%)      274.14     (11.2%)  178.4% ( 151% -  208%) 0.000
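For readers checking the tables, the "Pct diff" point estimate appears to be the relative change in mean QPS from baseline to the modified version. The helper below is a hypothetical sketch of that arithmetic (`PctDiffSketch` and `pctDiff` are not part of luceneutil), applied to rows from the first run above. It does not try to reproduce the parenthesized range or the p-value.

```java
class PctDiffSketch {
  /** Relative change in QPS, in percent, from baseline to candidate. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // QPS values copied from the first benchmark run in this description.
    System.out.printf("AndMedOrHighHigh: %.1f%%%n", pctDiff(89.04, 93.28));
    System.out.printf("OrHighHigh:       %.1f%%%n", pctDiff(21.71, 34.50));
    System.out.printf("OrHighMed:        %.1f%%%n", pctDiff(85.11, 189.37));
  }
}
```

Because the printed QPS values are themselves rounded, recomputed percentages can differ from the table in the last digit.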

@zacharymorn zacharymorn requested a review from jpountz July 6, 2022 04:08

@jpountz jpountz left a comment


We should be able to remove the while loop now, otherwise LGTM.

do {
  top.doc = top.iterator.nextDoc();
  top = essentialsScorers.updateTop();
} while (top.doc == docId);

Advancing the priority queue here shouldn't be necessary; we should be able to remove this do-while loop. Scorers will advance on the next call to advance().
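To make the reviewer's point concrete, here is a small sketch under stated assumptions. It uses `java.util.PriorityQueue` rather than Lucene's own queue, and `Cursor`, `advance`, and `collect` are hypothetical names. If `advance(target)` already catches up every clause that is behind `target`, then eagerly moving clauses past the current doc after a match changes nothing about the docs that get visited.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class LazyAdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical clause iterator over a sorted doc-id array. */
  static final class Cursor {
    private final int[] docs;
    private int idx = 0;

    Cursor(int[] docs) { this.docs = docs; }

    int doc() { return idx < docs.length ? docs[idx] : NO_MORE_DOCS; }

    /** Moves to the first doc >= target; a no-op if already there. */
    void advance(int target) {
      while (doc() < target) idx++;
    }
  }

  /** Catches up every clause that is behind target, then returns the smallest doc. */
  static int advance(PriorityQueue<Cursor> pq, int target) {
    while (pq.peek().doc() < target) {
      Cursor top = pq.poll();
      top.advance(target);
      pq.add(top);
    }
    return pq.peek().doc();
  }

  /** Visits all docs of the disjunction; assumes at least one clause. */
  static List<Integer> collect(int[][] postings, boolean eagerlyMovePastMatch) {
    PriorityQueue<Cursor> pq = new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
    for (int[] p : postings) pq.add(new Cursor(p));
    List<Integer> out = new ArrayList<>();
    int doc = advance(pq, 0);
    while (doc != NO_MORE_DOCS) {
      out.add(doc);
      if (eagerlyMovePastMatch) {
        // Analogue of the loop under review: push matching clauses past doc now.
        while (pq.peek().doc() == doc) {
          Cursor top = pq.poll();
          top.advance(doc + 1);
          pq.add(top);
        }
      }
      // The next advance() call catches up any lagging clause anyway.
      doc = advance(pq, doc + 1);
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] postings = {{1, 4, 7}, {4, 9}, {2, 7}};
    System.out.println("lazy:  " + collect(postings, false));
    System.out.println("eager: " + collect(postings, true));
  }
}
```

This only shows that the visited docs are identical in the toy model; it says nothing about the relative cost of the two variants in the actual scorer.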

@zacharymorn

Yup good catch! I've removed it.

@zacharymorn zacharymorn force-pushed the LUCENE-10480-MoveScoringIntoMatches branch from 28c02cd to c6a3e5a Compare July 7, 2022 01:41
@zacharymorn

Here are the latest benchmark results:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      237.99     (11.8%)      239.14     (15.3%)    0.5% ( -23% -   31%) 0.911
                      AndHighMed      144.70      (5.9%)      146.20      (6.0%)    1.0% ( -10% -   13%) 0.581
                     AndHighHigh       38.99      (5.5%)       39.44      (5.8%)    1.2% (  -9% -   13%) 0.518
                 AndHighOrMedMed      107.06      (6.1%)      108.65      (5.0%)    1.5% (  -9% -   13%) 0.399
                AndMedOrHighHigh       93.02      (4.5%)       96.45      (5.7%)    3.7% (  -6% -   14%) 0.023
                      OrHighHigh       27.21      (6.8%)       48.15     (10.8%)   76.9% (  55% -  101%) 0.000
                       OrHighMed       79.43      (6.5%)      147.82     (12.8%)   86.1% (  62% -  112%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                     AndHighHigh      122.10      (4.2%)      119.71      (4.6%)   -2.0% ( -10% -    7%) 0.157
                      AndHighMed      110.48      (4.9%)      110.66      (4.8%)    0.2% (  -9% -   10%) 0.918
                        PKLookup      235.57     (11.8%)      237.25      (8.7%)    0.7% ( -17% -   24%) 0.829
                 AndHighOrMedMed       62.93      (6.2%)       64.80      (4.2%)    3.0% (  -7% -   14%) 0.078
                AndMedOrHighHigh       89.72      (4.6%)       94.33      (4.8%)    5.1% (  -4% -   15%) 0.001
                      OrHighHigh      107.23      (4.7%)      139.99     (14.4%)   30.6% (  10% -   52%) 0.000
                       OrHighMed       96.16      (6.2%)      261.62     (15.4%)  172.1% ( 141% -  206%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      AndHighMed      148.59      (4.0%)      148.27      (4.2%)   -0.2% (  -8% -    8%) 0.865
                     AndHighHigh       39.74      (3.8%)       39.88      (4.1%)    0.3% (  -7% -    8%) 0.783
                        PKLookup      253.64      (6.8%)      256.62     (10.0%)    1.2% ( -14% -   19%) 0.664
                AndMedOrHighHigh       87.58      (3.9%)       90.93      (2.7%)    3.8% (  -2% -   10%) 0.000
                 AndHighOrMedMed      106.05      (4.7%)      110.28      (4.5%)    4.0% (  -5% -   13%) 0.006
                      OrHighHigh       19.56      (6.3%)       25.00      (4.9%)   27.8% (  15% -   41%) 0.000
                       OrHighMed       83.40      (5.2%)      191.56      (8.3%)  129.7% ( 110% -  150%) 0.000

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       83.55      (5.2%)       82.09      (4.5%)   -1.7% ( -10% -    8%) 0.255
                        PKLookup      244.80      (7.2%)      243.63      (6.1%)   -0.5% ( -12% -   13%) 0.821
                     AndHighHigh       51.79      (4.8%)       51.71      (3.5%)   -0.2% (  -8% -    8%) 0.899
                      AndHighMed      141.27      (4.9%)      142.38      (4.1%)    0.8% (  -7% -   10%) 0.583
                AndMedOrHighHigh       33.17      (3.9%)       34.37      (3.4%)    3.6% (  -3% -   11%) 0.002
                      OrHighHigh       20.46      (4.3%)       24.23      (5.8%)   18.4% (   7% -   29%) 0.000
                       OrHighMed      122.35      (3.8%)      228.37      (7.8%)   86.7% (  72% -  102%) 0.000

@zacharymorn zacharymorn requested a review from jpountz July 7, 2022 01:59

@jpountz jpountz left a comment


I don't understand why top-level disjunctions get faster but the change looks good to me.


zacharymorn commented Jul 7, 2022

> I don't understand why top-level disjunctions get faster but the change looks good to me.

Oh sorry @jpountz, I should have mentioned it explicitly as well. The benchmark baseline I used was prior to all BMM changes (specifically, the head of baseline is 08a9dfd), so the results reflect the net effect of the BMM changes on those boolean queries.

Before the changes in this PR, I was able to consistently reproduce the slowdown noted in https://issues.apache.org/jira/browse/LUCENE-10480:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       85.67      (5.3%)       73.23      (5.7%)  -14.5% ( -24% -   -3%) 0.000
                        PKLookup      260.08     (13.4%)      253.74     (14.9%)   -2.4% ( -27% -   29%) 0.586
                     AndHighHigh       73.68      (4.7%)       72.70      (4.1%)   -1.3% (  -9% -    7%) 0.339
                      AndHighMed       89.52      (5.1%)       88.55      (4.4%)   -1.1% ( -10% -    8%) 0.470
                 AndHighOrMedMed       63.27      (6.5%)       70.48      (5.7%)   11.4% (   0% -   25%) 0.000
                      OrHighHigh       19.60      (5.3%)       25.62      (7.6%)   30.8% (  16% -   46%) 0.000
                       OrHighMed      121.08      (5.7%)      236.34     (10.2%)   95.2% (  74% -  117%) 0.000 

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       86.88      (3.4%)       76.60      (3.1%)  -11.8% ( -17% -   -5%) 0.000
                     AndHighHigh       30.49      (3.5%)       30.36      (3.5%)   -0.4% (  -7% -    6%) 0.697
                      AndHighMed      192.76      (3.4%)      193.72      (3.9%)    0.5% (  -6% -    8%) 0.671
                        PKLookup      262.59      (5.5%)      264.52      (7.9%)    0.7% ( -11% -   14%) 0.731
                 AndHighOrMedMed       65.47      (3.8%)       73.43      (3.0%)   12.2% (   5% -   19%) 0.000
                      OrHighHigh       21.47      (4.1%)       36.94      (8.3%)   72.1% (  57% -   88%) 0.000
                       OrHighMed       99.91      (4.3%)      292.05     (12.9%)  192.3% ( 167% -  218%) 0.000 


jpountz commented Jul 7, 2022

Ah, that makes sense to me now! Thanks for explaining.

@zacharymorn

> Ah, that makes sense to me now! Thanks for explaining.

No problem!

@zacharymorn zacharymorn merged commit da8143b into apache:main Jul 7, 2022
zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Jul 7, 2022
…o improve disjunction within conjunction (apache#1006)

(cherry picked from commit da8143b)
zacharymorn added a commit that referenced this pull request Jul 8, 2022
…o improve disjunction within conjunction (#1006) (#1008)

(cherry picked from commit da8143b)