Run top-level conjunctions of term queries with a specialized BulkScorer. #12382

Merged (12 commits) on Sep 25, 2023

jpountz
Contributor

@jpountz jpountz commented Jun 22, 2023

This implements a specialized BlockMaxConjunctionBulkScorer, which is similar to BlockMaxConjunctionScorer with the following differences:

  • Implemented as a BulkScorer instead of a Scorer (i.e. for top-level conjunctions only).
  • Only used for clauses that have simple iterators (not two-phase iterators).
  • Checks if a competitive score is still possible before advancing the next clause. This helps because this check is often cheaper than advancing the next clause, which might in turn require decoding an entire block of postings.
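
To make the third point concrete, here is a minimal, self-contained sketch of the idea. It is not the code from this PR: `ConjunctionSketch`, `Clause`, `collect` and `advanceCalls` are hypothetical names, the minimum competitive score is assumed to be fixed, and each clause uses a single global score upper bound instead of per-block maximum scores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a conjunction with an early competitiveness check. */
final class ConjunctionSketch {

  /** Toy posting list: sorted doc IDs, one score per doc, and a global score upper bound. */
  static final class Clause {
    final int[] docs;
    final float[] scores;
    final float maxScore;
    int pos = 0;
    int advanceCalls = 0; // counts how often this clause had to be advanced

    Clause(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
      float max = 0f;
      for (float s : scores) {
        max = Math.max(max, s);
      }
      this.maxScore = max;
    }

    /** Moves to the first doc >= target and returns it, or Integer.MAX_VALUE if exhausted. */
    int advance(int target) {
      advanceCalls++;
      while (pos < docs.length && docs[pos] < target) {
        pos++;
      }
      return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
    }

    float score() {
      return scores[pos];
    }
  }

  /** Returns the docs that match every clause and reach minCompetitiveScore. */
  static List<Integer> collect(Clause[] clauses, float minCompetitiveScore) {
    // maxRemaining[i] bounds what clauses i..n-1 can still add to a partial score.
    float[] maxRemaining = new float[clauses.length + 1];
    for (int i = clauses.length - 1; i >= 0; i--) {
      maxRemaining[i] = maxRemaining[i + 1] + clauses[i].maxScore;
    }
    List<Integer> hits = new ArrayList<>();
    int doc = clauses[0].advance(0);
    outer:
    while (doc != Integer.MAX_VALUE) {
      float score = clauses[0].score();
      for (int i = 1; i < clauses.length; i++) {
        // Cheap check first: if the doc cannot be competitive, do not advance clause i at all.
        if (score + maxRemaining[i] < minCompetitiveScore) {
          doc = clauses[0].advance(doc + 1);
          continue outer;
        }
        int other = clauses[i].advance(doc);
        if (other != doc) {
          // Clause i does not contain doc; restart from its next candidate.
          doc = clauses[0].advance(other);
          continue outer;
        }
        score += clauses[i].score();
      }
      if (score >= minCompetitiveScore) {
        hits.add(doc);
      }
      doc = clauses[0].advance(doc + 1);
    }
    return hits;
  }

  public static void main(String[] args) {
    Clause lead = new Clause(new int[] {1, 3, 5, 7}, new float[] {1f, 5f, 1f, 5f});
    Clause other = new Clause(new int[] {1, 3, 5, 7}, new float[] {1f, 1f, 1f, 1f});
    List<Integer> hits = collect(new Clause[] {lead, other}, 6f);
    System.out.println("hits=" + hits + ", advances of second clause=" + other.advanceCalls);
  }
}
```

In this toy data all four docs match both clauses, yet the second clause is only advanced for the docs whose partial score can still reach the threshold. In the real scorer the threshold rises as the collector fills up, so the check would tend to prune more over time.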

…rer.

This implements a specialized `BlockMaxConjunctionBulkScorer`, which is really
the same as `BlockMaxConjunctionScorer`, but as a `BulkScorer` instead of a
`Scorer`. Also it doesn't support two-phase iterators in order to focus on the
common case when queries, such as term queries, do not have two-phase
iterators. If a clause has a two-phase iterator, it will keep running as a
`BlockMaxConjunctionScorer` wrapped in a `DefaultBulkScorer`.
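
The fallback rule described above can be sketched as a simple dispatch. This is only an illustration under assumptions: `BulkScorerSelectionSketch`, `ClauseInfo`, `Path` and `choosePath` are hypothetical names, and the real decision is made inside Lucene's boolean query code rather than on a flag like this.

```java
import java.util.List;

/** Hypothetical sketch of choosing between the two execution paths. */
final class BulkScorerSelectionSketch {

  /** Stand-in for a conjunction clause; only records whether it has a two-phase iterator. */
  static final class ClauseInfo {
    final String description;
    final boolean hasTwoPhaseIterator;

    ClauseInfo(String description, boolean hasTwoPhaseIterator) {
      this.description = description;
      this.hasTwoPhaseIterator = hasTwoPhaseIterator;
    }
  }

  /** The two paths named in the commit message. */
  enum Path {
    BLOCK_MAX_CONJUNCTION_BULK_SCORER,
    BLOCK_MAX_CONJUNCTION_SCORER_IN_DEFAULT_BULK_SCORER
  }

  /** Uses the specialized bulk scorer only when no clause has a two-phase iterator. */
  static Path choosePath(List<ClauseInfo> clauses) {
    for (ClauseInfo clause : clauses) {
      if (clause.hasTwoPhaseIterator) {
        return Path.BLOCK_MAX_CONJUNCTION_SCORER_IN_DEFAULT_BULK_SCORER;
      }
    }
    return Path.BLOCK_MAX_CONJUNCTION_BULK_SCORER;
  }

  public static void main(String[] args) {
    // Term clauses are assumed to have simple iterators; the phrase clause is assumed two-phase.
    List<ClauseInfo> termsOnly =
        List.of(new ClauseInfo("+foo:9", false), new ClauseInfo("+foo:10", false));
    List<ClauseInfo> withPhrase =
        List.of(new ClauseInfo("+foo:9", false), new ClauseInfo("+\"a b\"", true));
    System.out.println(choosePath(termsOnly));
    System.out.println(choosePath(withPhrase));
  }
}
```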
@jpountz jpountz added this to the 9.8.0 milestone Jun 22, 2023
@jpountz jpountz marked this pull request as draft June 22, 2023 17:17
@jpountz
Contributor Author

jpountz commented Jun 23, 2023

The speedup is not as high as I had hoped, so I'm leaning towards not merging this PR. Maybe someone figures out how to make this code run faster!

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      183.14      (1.6%)      179.82      (5.1%)   -1.8% (  -8% -    5%) 0.133
                          IntNRQ      105.92     (15.4%)      104.00     (13.4%)   -1.8% ( -26% -   32%) 0.693
           HighTermDayOfYearSort      436.32      (1.8%)      428.70      (5.8%)   -1.7% (  -9% -    6%) 0.203
               HighTermTitleSort      151.24      (6.0%)      148.87      (8.8%)   -1.6% ( -15% -   14%) 0.512
                      AndHighLow     1692.56      (3.7%)     1666.70      (2.8%)   -1.5% (  -7% -    5%) 0.140
     BrowseRandomLabelSSDVFacets       14.89      (8.3%)       14.75      (6.4%)   -0.9% ( -14% -   14%) 0.685
           BrowseMonthTaxoFacets       30.24      (6.9%)       29.95      (8.3%)   -0.9% ( -15% -   15%) 0.698
                      OrHighHigh       47.35      (6.6%)       46.96      (4.2%)   -0.8% ( -10% -   10%) 0.630
                        Wildcard      220.59      (4.7%)      218.79      (4.4%)   -0.8% (  -9% -    8%) 0.570
       BrowseDayOfYearTaxoFacets       45.10      (3.5%)       44.82      (2.7%)   -0.6% (  -6% -    5%) 0.525
                       OrHighLow      650.91      (4.4%)      647.28      (3.1%)   -0.6% (  -7% -    7%) 0.646
            HighTermTitleBDVSort       25.83      (1.7%)       25.70      (3.6%)   -0.5% (  -5% -    4%) 0.575
                       OrHighMed      167.05      (5.0%)      166.32      (3.6%)   -0.4% (  -8% -    8%) 0.752
             MedIntervalsOrdered       17.59      (3.5%)       17.53      (4.1%)   -0.4% (  -7% -    7%) 0.758
                      AndHighMed      391.35      (4.4%)      390.02      (3.7%)   -0.3% (  -8% -    8%) 0.789
                    OrNotHighLow     1442.98      (3.7%)     1438.18      (3.3%)   -0.3% (  -7% -    7%) 0.767
            BrowseDateTaxoFacets       44.30      (3.0%)       44.19      (3.0%)   -0.2% (  -6% -    5%) 0.795
                          Fuzzy1      139.23      (1.1%)      138.94      (1.0%)   -0.2% (  -2% -    1%) 0.539
                          Fuzzy2      112.21      (0.8%)      112.00      (0.9%)   -0.2% (  -1% -    1%) 0.502
                 MedSloppyPhrase       35.95      (2.2%)       35.90      (2.1%)   -0.1% (  -4% -    4%) 0.836
                         Respell       95.08      (1.5%)       94.97      (1.4%)   -0.1% (  -2% -    2%) 0.795
                     MedSpanNear      100.90      (5.9%)      100.89      (7.0%)   -0.0% ( -12% -   13%) 0.997
                        PKLookup      242.29      (5.0%)      242.27      (4.2%)   -0.0% (  -8% -    9%) 0.995
                HighSloppyPhrase       24.75      (3.2%)       24.75      (4.6%)   -0.0% (  -7% -    8%) 0.997
             LowIntervalsOrdered       13.76      (4.3%)       13.77      (4.2%)    0.0% (  -8% -    8%) 0.976
          OrHighMedDayTaxoFacets       17.13      (4.5%)       17.14      (3.6%)    0.1% (  -7% -    8%) 0.937
       BrowseDayOfYearSSDVFacets       20.81     (13.8%)       20.83     (13.5%)    0.1% ( -23% -   31%) 0.980
                         Prefix3      249.54      (5.3%)      250.06      (2.6%)    0.2% (  -7% -    8%) 0.875
                     LowSpanNear       51.89      (2.1%)       52.03      (2.7%)    0.3% (  -4% -    5%) 0.740
                    HighSpanNear       30.44      (4.2%)       30.54      (5.2%)    0.3% (  -8% -   10%) 0.841
                    OrNotHighMed      435.83      (6.3%)      437.24      (5.3%)    0.3% ( -10% -   12%) 0.860
            BrowseDateSSDVFacets        5.52      (7.9%)        5.54     (10.2%)    0.4% ( -16% -   20%) 0.894
                   OrHighNotHigh      332.17      (7.0%)      333.72      (6.0%)    0.5% ( -11% -   14%) 0.821
                         LowTerm      852.62      (6.5%)      856.98      (5.7%)    0.5% ( -10% -   13%) 0.791
                 LowSloppyPhrase       34.09      (4.2%)       34.28      (3.8%)    0.6% (  -7% -    8%) 0.654
                    OrHighNotMed      389.58      (7.0%)      392.08      (6.6%)    0.6% ( -12% -   15%) 0.765
            HighIntervalsOrdered       16.82      (5.4%)       16.93      (7.3%)    0.6% ( -11% -   14%) 0.750
                    OrHighNotLow      503.57      (7.5%)      507.57      (6.6%)    0.8% ( -12% -   16%) 0.721
                         MedTerm      954.20      (8.0%)      962.60      (6.3%)    0.9% ( -12% -   16%) 0.699
                   OrNotHighHigh      483.09      (6.5%)      487.85      (5.4%)    1.0% ( -10% -   13%) 0.601
           BrowseMonthSSDVFacets       21.00      (9.5%)       21.25     (10.3%)    1.2% ( -16% -   23%) 0.697
         AndHighMedDayTaxoFacets       20.84      (2.6%)       21.18      (2.9%)    1.6% (  -3% -    7%) 0.060
        AndHighHighDayTaxoFacets       28.62      (2.1%)       29.09      (2.3%)    1.6% (  -2% -    6%) 0.018
                      HighPhrase      211.10      (4.2%)      214.68      (3.3%)    1.7% (  -5% -    9%) 0.154
                        HighTerm      380.81      (8.8%)      387.70      (6.9%)    1.8% ( -12% -   19%) 0.472
                       MedPhrase       22.08      (4.8%)       22.48      (3.8%)    1.8% (  -6% -   10%) 0.180
            MedTermDayTaxoFacets       80.33      (3.1%)       81.88      (2.7%)    1.9% (  -3% -    7%) 0.036
                       LowPhrase      348.19      (3.5%)      355.21      (2.9%)    2.0% (  -4% -    8%) 0.048
               HighTermMonthSort     3789.19      (4.2%)     3879.83      (3.0%)    2.4% (  -4% -   10%) 0.039
     BrowseRandomLabelTaxoFacets       35.61      (6.2%)       36.77      (2.3%)    3.3% (  -4% -   12%) 0.028
                     AndHighHigh       51.92      (5.9%)       54.91      (5.0%)    5.7% (  -4% -   17%) 0.001

@jpountz jpountz closed this Jul 8, 2023
@jpountz jpountz reopened this Sep 21, 2023
@jpountz jpountz marked this pull request as ready for review September 21, 2023 13:25
@jpountz jpountz modified the milestones: 9.8.0, 9.9.0 Sep 21, 2023
@jpountz
Contributor Author

jpountz commented Sep 21, 2023

Reopening as I'm now seeing speedups. It's possible it's related to other changes that happened since last time I looked, or to the specific tasks that get picked by luceneutil. Here's the output of a run on wikibigall:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
               HighTermMonthSort     2585.95      (5.2%)     2525.81      (7.8%)   -2.3% ( -14% -   11%) 0.268
                         LowTerm      910.97      (5.9%)      890.32      (5.8%)   -2.3% ( -13% -    9%) 0.218
                        HighTerm      520.07      (5.2%)      510.12      (5.2%)   -1.9% ( -11% -    8%) 0.245
                         MedTerm      456.03      (5.5%)      448.52      (5.2%)   -1.6% ( -11% -    9%) 0.329
                       OrHighMed      190.06      (2.9%)      187.85      (3.6%)   -1.2% (  -7% -    5%) 0.263
           HighTermDayOfYearSort      268.40      (3.2%)      265.61      (2.6%)   -1.0% (  -6% -    4%) 0.261
                     MedSpanNear       43.74      (2.9%)       43.33      (2.8%)   -0.9% (  -6% -    4%) 0.297
                    HighSpanNear        4.33      (2.7%)        4.31      (3.0%)   -0.5% (  -6% -    5%) 0.571
                          Fuzzy2       58.11      (2.0%)       57.83      (2.3%)   -0.5% (  -4% -    3%) 0.482
                       MedPhrase       78.83      (2.3%)       78.47      (3.0%)   -0.5% (  -5% -    4%) 0.579
                     LowSpanNear       11.19      (1.6%)       11.15      (2.1%)   -0.4% (  -3% -    3%) 0.546
                       OrHighLow      625.94      (3.1%)      624.02      (3.4%)   -0.3% (  -6% -    6%) 0.768
                CountAndHighHigh       45.07      (3.5%)       44.94      (4.3%)   -0.3% (  -7% -    7%) 0.813
                      OrHighHigh       41.35      (5.5%)       41.23      (5.9%)   -0.3% ( -11% -   11%) 0.881
                 CountAndHighMed      149.52      (3.2%)      149.33      (3.3%)   -0.1% (  -6% -    6%) 0.903
                 LowSloppyPhrase       52.33      (1.7%)       52.28      (2.5%)   -0.1% (  -4% -    4%) 0.892
                         Prefix3      271.79      (3.4%)      271.62      (3.8%)   -0.1% (  -7% -    7%) 0.955
                        PKLookup      211.98      (3.4%)      211.93      (2.6%)   -0.0% (  -5% -    6%) 0.981
                       LowPhrase       53.95      (2.1%)       53.96      (2.4%)    0.0% (  -4% -    4%) 0.975
                     CountPhrase        2.78      (4.3%)        2.79      (2.0%)    0.1% (  -5% -    6%) 0.950
                          Fuzzy1      138.29      (2.1%)      138.39      (2.6%)    0.1% (  -4% -    4%) 0.918
                      HighPhrase       28.83      (3.2%)       28.86      (4.7%)    0.1% (  -7% -    8%) 0.932
                HighSloppyPhrase        7.74      (1.8%)        7.75      (2.6%)    0.2% (  -4% -    4%) 0.764
                  CountOrHighMed      138.64     (15.8%)      138.99     (15.0%)    0.2% ( -26% -   36%) 0.959
                         Respell       56.79      (1.8%)       56.95      (2.3%)    0.3% (  -3% -    4%) 0.664
                 CountOrHighHigh       40.89     (17.1%)       41.02     (15.9%)    0.3% ( -27% -   40%) 0.951
                      AndHighLow     1009.08      (3.2%)     1013.60      (4.4%)    0.4% (  -6% -    8%) 0.713
                       CountTerm     8120.00      (7.4%)     8193.70     (11.2%)    0.9% ( -16% -   21%) 0.763
                          IntNRQ      302.71      (6.2%)      305.63      (6.7%)    1.0% ( -11% -   14%) 0.634
                 MedSloppyPhrase       19.03      (4.7%)       19.28      (4.9%)    1.3% (  -7% -   11%) 0.387
                        Wildcard       25.15      (4.7%)       25.59      (4.3%)    1.7% (  -6% -   11%) 0.223
                      AndHighMed      171.18      (2.9%)      230.86      (4.4%)   34.9% (  26% -   43%) 0.000
                     AndHighHigh       46.05      (4.4%)       78.03      (5.6%)   69.4% (  56% -   82%) 0.000

Member

@benwtrent benwtrent left a comment

The numbers are impressive. I have some code design/flow comments.

@jpountz
Contributor Author

jpountz commented Sep 21, 2023

The numbers are impressive

I agree... it makes me suspicious. I'll verify my localrun.py and run with higher taskCountPerCat and taskRepeatCount to see if I'm still getting such good results.

@benwtrent
Member

I agree... it makes me suspicious. I'll verify my localrun.py and run with higher taskCountPerCat and taskRepeatCount to see if I'm still getting such good results.

I doubt this is your problem, but I have made the mistake in the past of benchmarking the incorrect commit because either:

  • I had merged in the wrong things and didn't double check
  • I forgot to point to the correct directory and run ./gradlew clean && ./gradlew jar to ensure everything was built ONLY with my changes.

@jpountz
Contributor Author

jpountz commented Sep 21, 2023

Running with taskCountPerCat=5 and taskRepeatCount=50 gave very similar results:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                        HighTerm      472.68      (4.7%)      464.58      (3.8%)   -1.7% (  -9% -    7%) 0.392
                         MedTerm      585.82      (3.6%)      580.49      (3.7%)   -0.9% (  -7% -    6%) 0.598
           HighTermDayOfYearSort      286.30      (1.2%)      284.36      (1.9%)   -0.7% (  -3% -    2%) 0.362
                      OrHighHigh       51.81      (6.7%)       51.54      (4.5%)   -0.5% ( -10% -   11%) 0.842
                        Wildcard      225.50      (3.3%)      224.32      (3.3%)   -0.5% (  -6% -    6%) 0.734
                       OrHighLow      611.45      (2.6%)      608.55      (3.0%)   -0.5% (  -5% -    5%) 0.723
                       OrHighMed      118.00      (5.0%)      117.46      (3.6%)   -0.5% (  -8% -    8%) 0.823
                         LowTerm     1000.83      (3.3%)      996.92      (3.5%)   -0.4% (  -7% -    6%) 0.809
                        PKLookup      222.91      (1.0%)      222.27      (3.2%)   -0.3% (  -4% -    4%) 0.800
                          IntNRQ      106.47     (20.2%)      106.19     (19.5%)   -0.3% ( -33% -   49%) 0.977
                       LowPhrase       26.20      (1.9%)       26.20      (1.5%)   -0.0% (  -3% -    3%) 0.990
                         Respell       73.45      (1.1%)       73.50      (1.5%)    0.1% (  -2% -    2%) 0.914
                          Fuzzy2      119.32      (0.4%)      119.64      (0.5%)    0.3% (   0% -    1%) 0.182
               HighTermMonthSort     5878.05      (1.9%)     5897.37      (2.7%)    0.3% (  -4% -    5%) 0.764
                 CountAndHighMed      121.92      (2.9%)      122.35      (3.2%)    0.4% (  -5% -    6%) 0.809
                      HighPhrase       46.45      (3.4%)       46.62      (2.1%)    0.4% (  -4% -    6%) 0.776
                          Fuzzy1      104.63      (0.7%)      105.18      (0.6%)    0.5% (   0% -    1%) 0.105
                       MedPhrase       27.38      (4.0%)       27.55      (2.2%)    0.6% (  -5% -    7%) 0.692
                         Prefix3      227.82      (3.7%)      229.47      (2.7%)    0.7% (  -5% -    7%) 0.633
                       CountTerm    20661.05      (2.5%)    20826.02      (2.3%)    0.8% (  -3% -    5%) 0.476
                CountAndHighHigh       40.20      (2.6%)       40.78      (3.4%)    1.4% (  -4% -    7%) 0.313
                  CountOrHighMed       86.48     (14.9%)       88.87     (17.9%)    2.8% ( -26% -   41%) 0.721
                 CountOrHighHigh       55.70     (15.0%)       57.27     (18.0%)    2.8% ( -26% -   42%) 0.719
                     CountPhrase        3.32      (7.6%)        3.45     (10.2%)    3.9% ( -12% -   23%) 0.356
                      AndHighLow      567.28      (3.7%)      742.59      (4.6%)   30.9% (  21% -   40%) 0.000
                      AndHighMed      115.23      (2.8%)      161.49      (4.9%)   40.1% (  31% -   49%) 0.000
                     AndHighHigh       32.62      (4.8%)       47.72      (4.8%)   46.3% (  34% -   58%) 0.000

I did a diff -r between my baseline and candidate and saw the code changes from this PR as expected.

Member

@benwtrent benwtrent left a comment

++

@jpountz
Contributor Author

jpountz commented Sep 25, 2023

FWIW I ran the benchmark from https://tantivy-search.github.io/bench/ and also observed a speedup on conjunctions, so I think that the speedup is indeed real.

@jpountz jpountz merged commit f2bd0bb into apache:main Sep 25, 2023
4 checks passed
@jpountz jpountz deleted the specialized_conjunction_bulk_scorer branch September 25, 2023 11:36
jpountz added a commit that referenced this pull request Sep 25, 2023
@jpountz
Contributor Author

jpountz commented Sep 29, 2023

This gave a good speedup to AndHighHigh on nightlies: +15%, but AndHighMed had a small drop: -1.5%. It still looks positive overall as it's making the slow query (AndHighHigh, 22 QPS) faster and the fast query (AndHighMed, 51 QPS) slightly slower but I'm still a bit puzzled why I'm seeing such different numbers on my local benchmark and on nightlies.
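
One way to read the "positive overall" argument is in terms of time per query rather than QPS. The sketch below is only an illustration under assumptions: it treats the quoted 22 and 51 QPS as the figures before the change, and it treats average time per query as 1000 / QPS milliseconds. `LatencyTradeoffSketch` and its methods are hypothetical names.

```java
/** Hypothetical sketch: converts the nightly QPS changes into per-query time changes. */
final class LatencyTradeoffSketch {

  /** Average milliseconds per query at a given throughput (assumes latency = 1 / QPS). */
  static double millisPerQuery(double qps) {
    return 1000.0 / qps;
  }

  /** Change in ms per query when QPS changes by pctDiff percent; negative means faster. */
  static double latencyDeltaMillis(double baselineQps, double pctDiff) {
    double newQps = baselineQps * (1 + pctDiff / 100.0);
    return millisPerQuery(newQps) - millisPerQuery(baselineQps);
  }

  public static void main(String[] args) {
    // Assumed baselines taken from the comment above: AndHighHigh 22 QPS, AndHighMed 51 QPS.
    System.out.printf("AndHighHigh: %+.2f ms/query%n", latencyDeltaMillis(22, 15));
    System.out.printf("AndHighMed:  %+.2f ms/query%n", latencyDeltaMillis(51, -1.5));
  }
}
```

Under these assumptions the slow query saves roughly 5.9 ms per execution while the fast query loses roughly 0.3 ms, which is consistent with calling the change positive overall.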

@hossman
Member

hossman commented Oct 3, 2023

git bisect has identified f2bd0bbcdd38cd3c681a9d302bdb856f1a62208d as the cause of a recent jenkins failure in TestBlockMaxConjunction.testRandom that reproduces reliably for me locally on main

Example of failure...

org.apache.lucene.search.TestBlockMaxConjunction > testRandom FAILED
    java.lang.AssertionError: Hit 0 docnumbers don't match
    Hits length1=1      length2=1
    hit=0: doc4=9.615059 shardIndex=-1,  doc1215=9.615059 shardIndex=-1
    for query:+foo:9 +foo:10 +foo:11 +foo:12 +foo:13
        at __randomizedtesting.SeedInfo.seed([C738B747E3320853:B57492485252BE20]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.apache.lucene.tests.search.CheckHits.checkEqual(CheckHits.java:229)
        at org.apache.lucene.tests.search.CheckHits.doCheckTopScores(CheckHits.java:709)
        at org.apache.lucene.tests.search.CheckHits.checkTopScores(CheckHits.java:694)
        at org.apache.lucene.search.TestBlockMaxConjunction.testRandom(TestBlockMaxConjunction.java:81)
...
  2> NOTE: reproduce with: gradlew test --tests TestBlockMaxConjunction.testRandom -Dtests.seed=C738B747E3320853 -Dtests.multiplier=2 -Dtests.locale=or -Dtests.timezone=Australia/Tasmania -Dtests.asserts=true -Dtests.file.encoding=UTF-8

git bisect log...

$ git bisect log
# bad: [1dd05c89b0836531d367d2692ea5eae7d54b78fd] Add missing create github release step to release wizard (#12607)
# good: [d62ca4a01f3693e0c6043a48080b33d03fdee8b4] add missing changelog entry for #12498
git bisect start '1dd05c89b0836531d367d2692ea5eae7d54b78fd' 'd62ca4a01f3693e0c6043a48080b33d03fdee8b4'
# good: [51ade888f3274ea42ed4beb2bf000d7f922de4c7] Update wrong PR number in CHANGES.txt
git bisect good 51ade888f3274ea42ed4beb2bf000d7f922de4c7
# bad: [ce464c7d6d20f49c2fe29126fcf400ee1cfeb112] Fix test failure.
git bisect bad ce464c7d6d20f49c2fe29126fcf400ee1cfeb112
# good: [3deead0ed32494d7159c0023dcc86c218c43f4eb] Remove deprecated IndexSearcher#getExecutor method (#12580)
git bisect good 3deead0ed32494d7159c0023dcc86c218c43f4eb
# good: [d48913a957392e2746b489fe5aef77a21250e4b4] Allow reading / writing binary stored fields as DataInput (#12581)
git bisect good d48913a957392e2746b489fe5aef77a21250e4b4
# bad: [483d28853a03aa57e727f0639032918e725a7032] Move CHANGES entry to correct version.
git bisect bad 483d28853a03aa57e727f0639032918e725a7032
# bad: [f2bd0bbcdd38cd3c681a9d302bdb856f1a62208d] Run top-level conjunctions of term queries with a specialized BulkScorer. (#12382)
git bisect bad f2bd0bbcdd38cd3c681a9d302bdb856f1a62208d
# first bad commit: [f2bd0bbcdd38cd3c681a9d302bdb856f1a62208d] Run top-level conjunctions of term queries with a specialized BulkScorer. (#12382)

In the week ending on 2023-09-15, TestBlockMaxConjunction.testRandom had a roughly 2% Jenkins failure rate, which I believe were the first known failures of this test. In the past 7 days (rolling window), 17 of 746 Jenkins runs have failed. I have not looked into the failures of the other Jenkins builds to confirm whether the test failure messages are identical.

EDIT: Updated misleading final paragraph because I realized I had misread the calendar.

@jpountz
Contributor Author

jpountz commented Oct 3, 2023

Thanks Hoss! I had missed this failure, it looks like a real one. I'm looking.

@jpountz
Contributor Author

jpountz commented Oct 3, 2023

I just pushed a fix: 3f81f2f.

jpountz added a commit to jpountz/lucene that referenced this pull request Oct 24, 2023
PR apache#12382 added a bulk scorer for top-k hits on conjunctions that yielded a
significant speedup (annotation
[FP](http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html)). This
change proposes a similar change for exhaustive collection of conjunctive
queries, e.g. for counting, faceting, etc.
jpountz added a commit that referenced this pull request Oct 30, 2023
PR #12382 added a bulk scorer for top-k hits on conjunctions that yielded a
significant speedup (annotation
[FP](http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html)). This
change proposes a similar change for exhaustive collection of conjunctive
queries, e.g. for counting, faceting, etc.