Change the MAXSCORE scorer to a bulk scorer. #12361

Merged (14 commits) on Jun 20, 2023

jpountz (Contributor) commented Jun 9, 2023

We currently use block-max maxscore for top-level disjunctions, implemented as
a scorer. Since we only use it for top-level disjunctions, we could actually
implement it as a bulk scorer, which helps save some overhead. luceneutil
reports the following numbers on wikimedium10m:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                    HighSpanNear        9.15      (3.7%)        9.03      (3.5%)   -1.4% (  -8% -    6%) 0.224
                         Prefix3      434.41      (2.2%)      429.16      (2.1%)   -1.2% (  -5% -    3%) 0.080
            MedTermDayTaxoFacets       37.33      (6.3%)       36.88      (6.6%)   -1.2% ( -13% -   12%) 0.558
                      AndHighLow     1315.31      (3.1%)     1299.90      (3.9%)   -1.2% (  -7% -    6%) 0.294
                     MedSpanNear       42.42      (2.5%)       41.96      (2.3%)   -1.1% (  -5% -    3%) 0.153
          OrHighMedDayTaxoFacets        5.66      (4.9%)        5.60      (4.9%)   -1.1% ( -10% -    9%) 0.488
                HighSloppyPhrase       16.72      (3.5%)       16.57      (5.4%)   -0.9% (  -9% -    8%) 0.539
                        Wildcard      129.90      (3.9%)      128.92      (3.8%)   -0.8% (  -8% -    7%) 0.537
                      HighPhrase       68.61      (5.5%)       68.10      (4.4%)   -0.7% ( -10% -    9%) 0.637
                       MedPhrase       27.46      (3.9%)       27.26      (3.5%)   -0.7% (  -7% -    6%) 0.538
     BrowseRandomLabelSSDVFacets       14.70      (7.4%)       14.61      (7.7%)   -0.7% ( -14% -   15%) 0.779
                         LowTerm      816.63      (5.6%)      811.34      (5.0%)   -0.6% ( -10% -   10%) 0.699
           BrowseMonthSSDVFacets       20.41      (1.1%)       20.28      (2.0%)   -0.6% (  -3% -    2%) 0.207
                       LowPhrase       43.61      (3.4%)       43.35      (3.0%)   -0.6% (  -6% -    6%) 0.561
                          Fuzzy1      135.81      (1.2%)      135.42      (1.5%)   -0.3% (  -3% -    2%) 0.504
                     LowSpanNear      114.09      (1.7%)      113.78      (1.8%)   -0.3% (  -3% -    3%) 0.626
                          Fuzzy2       71.78      (1.1%)       71.60      (1.0%)   -0.2% (  -2% -    1%) 0.454
        AndHighHighDayTaxoFacets       31.06      (2.2%)       30.98      (2.3%)   -0.2% (  -4% -    4%) 0.730
                          IntNRQ       88.61      (5.8%)       88.46      (5.7%)   -0.2% ( -11% -   12%) 0.926
               HighTermMonthSort     3779.27      (3.8%)     3775.75      (3.3%)   -0.1% (  -6% -    7%) 0.934
         AndHighMedDayTaxoFacets       58.44      (1.8%)       58.42      (1.9%)   -0.0% (  -3% -    3%) 0.948
                         Respell       80.73      (1.6%)       80.82      (1.4%)    0.1% (  -2% -    3%) 0.815
                         MedTerm      731.12      (6.8%)      732.37      (7.1%)    0.2% ( -12% -   15%) 0.938
                        PKLookup      236.79      (4.4%)      237.48      (4.6%)    0.3% (  -8% -    9%) 0.838
                      TermDTSort      181.53      (2.7%)      182.14      (2.1%)    0.3% (  -4% -    5%) 0.661
           HighTermDayOfYearSort      422.38      (3.2%)      423.81      (3.6%)    0.3% (  -6% -    7%) 0.752
                 LowSloppyPhrase       46.81      (2.7%)       46.98      (3.0%)    0.3% (  -5% -    6%) 0.696
                      AndHighMed      342.09      (4.6%)      343.63      (3.8%)    0.4% (  -7% -    9%) 0.737
                     AndHighHigh       46.06      (6.6%)       46.28      (5.8%)    0.5% ( -11% -   13%) 0.809
            HighTermTitleBDVSort       23.23      (3.6%)       23.34      (3.2%)    0.5% (  -6% -    7%) 0.650
                        HighTerm      685.44      (7.3%)      689.42      (7.5%)    0.6% ( -13% -   16%) 0.804
               HighTermTitleSort      156.76      (5.8%)      157.96      (5.6%)    0.8% ( -10% -   12%) 0.671
            HighIntervalsOrdered       25.11      (5.2%)       25.32      (5.2%)    0.8% (  -9% -   11%) 0.607
                    OrNotHighLow     1803.79      (3.7%)     1819.26      (3.5%)    0.9% (  -6% -    8%) 0.452
             LowIntervalsOrdered       62.41      (3.9%)       63.01      (3.7%)    1.0% (  -6% -    8%) 0.423
                    OrNotHighMed      456.34      (3.5%)      460.92      (4.3%)    1.0% (  -6% -    9%) 0.419
                    OrHighNotLow      365.78      (8.4%)      369.70      (9.0%)    1.1% ( -15% -   20%) 0.698
                   OrNotHighHigh      272.99      (6.9%)      276.13      (7.7%)    1.2% ( -12% -   16%) 0.618
                   OrHighNotHigh      438.11      (6.5%)      443.93      (7.3%)    1.3% ( -11% -   16%) 0.543
                    OrHighNotMed      371.40      (7.4%)      376.34      (8.4%)    1.3% ( -13% -   18%) 0.595
                 MedSloppyPhrase        6.47      (4.8%)        6.56      (4.7%)    1.4% (  -7% -   11%) 0.357
            BrowseDateSSDVFacets        5.50      (9.0%)        5.61     (10.0%)    2.0% ( -15% -   23%) 0.509
             MedIntervalsOrdered       29.98      (5.3%)       30.58      (5.5%)    2.0% (  -8% -   13%) 0.242
       BrowseDayOfYearSSDVFacets       19.88      (8.8%)       20.62     (14.4%)    3.7% ( -17% -   29%) 0.322
           BrowseMonthTaxoFacets       26.63     (20.6%)       27.75     (15.2%)    4.2% ( -26% -   50%) 0.462
                       OrHighLow      555.47      (3.1%)      579.50      (2.4%)    4.3% (  -1% -   10%) 0.000
     BrowseRandomLabelTaxoFacets       31.81     (23.1%)       33.32     (19.2%)    4.8% ( -30% -   61%) 0.477
            BrowseDateTaxoFacets       38.87     (24.8%)       40.81     (18.7%)    5.0% ( -30% -   64%) 0.471
       BrowseDayOfYearTaxoFacets       39.19     (24.8%)       41.52     (18.8%)    5.9% ( -30% -   65%) 0.394
                       OrHighMed      138.70      (3.4%)      149.08      (3.4%)    7.5% (   0% -   14%) 0.000
                      OrHighHigh       44.38      (3.3%)       51.20      (3.8%)   15.4% (   8% -   23%) 0.000
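
For reference, the "Pct diff" figures in the table are consistent with the relative change in QPS from the baseline to the modified version. The following is a minimal sketch of that arithmetic, assuming the column is computed this way; the class and method names are hypothetical and are not part of luceneutil.

```java
import java.util.Locale;

/** Sketch only: recomputes the "Pct diff" column from two QPS values. */
final class PctDiffSketch {

  /** Relative change from baseline to candidate, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  /** Formats a percentage with one decimal place, as in the table. */
  static String format(double pct) {
    return String.format(Locale.ROOT, "%.1f%%", pct);
  }

  public static void main(String[] args) {
    // QPS values copied from the OrHighHigh, OrHighMed and OrHighLow rows.
    System.out.println("OrHighHigh " + format(pctDiff(44.38, 51.20)));   // 15.4%
    System.out.println("OrHighMed  " + format(pctDiff(138.70, 149.08))); // 7.5%
    System.out.println("OrHighLow  " + format(pctDiff(555.47, 579.50))); // 4.3%
  }
}
```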

OrHighHigh (+15.4%), OrHighMed (+7.5%) and OrHighLow (+4.3%) all get a speedup with this change, each with a reported p-value of 0.000.
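
To illustrate the scorer versus bulk scorer distinction, here is a minimal, self-contained sketch of a plain disjunction evaluated both ways. It is not the code from this PR and it does not use Lucene's `Scorer` or `BulkScorer` classes: `Clause`, `Collector`, the constant per-clause score and the fixed window size are all assumptions made for illustration, and no MAXSCORE skipping is shown. The point is only that a bulk scorer owns the iteration loop and can process a whole range of doc IDs per pass, instead of being driven one document at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified types; these are NOT Lucene's Scorer/BulkScorer APIs. */
final class BulkScoringSketch {

  /** One disjunction clause: sorted doc IDs and a constant per-match score (an assumption). */
  static final class Clause {
    final int[] docs;
    final float score;

    Clause(int[] docs, float score) {
      this.docs = docs;
      this.score = score;
    }
  }

  /** Receives each matching doc together with its summed score. */
  interface Collector {
    void collect(int doc, float score);
  }

  /** Doc-at-a-time: find the next matching doc, sum its score, emit it, repeat. */
  static void scoreDocAtATime(List<Clause> clauses, Collector collector) {
    int[] pos = new int[clauses.size()];
    while (true) {
      int doc = Integer.MAX_VALUE;
      for (int i = 0; i < clauses.size(); i++) {
        int[] docs = clauses.get(i).docs;
        if (pos[i] < docs.length) doc = Math.min(doc, docs[pos[i]]);
      }
      if (doc == Integer.MAX_VALUE) return; // every clause is exhausted
      float score = 0f;
      for (int i = 0; i < clauses.size(); i++) {
        int[] docs = clauses.get(i).docs;
        if (pos[i] < docs.length && docs[pos[i]] == doc) {
          score += clauses.get(i).score;
          pos[i]++;
        }
      }
      collector.collect(doc, score);
    }
  }

  /** Bulk: the scorer owns the loop and handles a whole window of doc IDs per pass. */
  static void scoreInBulk(List<Clause> clauses, int maxDoc, int windowSize, Collector collector) {
    int[] pos = new int[clauses.size()];
    float[] scores = new float[windowSize];
    boolean[] matched = new boolean[windowSize];
    for (int base = 0; base < maxDoc; base += windowSize) {
      int end = Math.min(maxDoc, base + windowSize);
      // Drain each clause into the window in a tight loop.
      for (int i = 0; i < clauses.size(); i++) {
        Clause clause = clauses.get(i);
        while (pos[i] < clause.docs.length && clause.docs[pos[i]] < end) {
          int slot = clause.docs[pos[i]++] - base;
          scores[slot] += clause.score;
          matched[slot] = true;
        }
      }
      // Emit the window's matches in doc ID order and reset the slots for reuse.
      for (int slot = 0; slot < end - base; slot++) {
        if (matched[slot]) {
          collector.collect(base + slot, scores[slot]);
          scores[slot] = 0f;
          matched[slot] = false;
        }
      }
    }
  }

  public static void main(String[] args) {
    List<Clause> clauses = new ArrayList<>();
    clauses.add(new Clause(new int[] {1, 5, 9}, 1.0f));
    clauses.add(new Clause(new int[] {5, 7}, 2.0f));
    scoreInBulk(clauses, 10, 4, (doc, score) -> System.out.println(doc + " -> " + score));
  }
}
```

Both methods emit the same hits in the same order. The bulk variant can only work when the disjunction is the top-level query, since nothing else needs to advance it to an arbitrary doc ID between calls.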

jpountz requested a review from zacharymorn on June 9, 2023, 16:42

zacharymorn (Contributor) left a comment

Thanks @jpountz for the PR! I had thought about comparing your approach again after mine was merged, but somehow lost track of it. The changes look good to me, although I'm wondering whether some of the utility methods, such as list re-partitioning, could be shared between the two scorer implementations. But we can take those as future improvements.
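
As background for the list re-partitioning mentioned above, the following sketch shows the textbook MAXSCORE partitioning step: clauses are ordered by their maximum possible score, and the longest prefix whose summed maximum scores stays below the minimum competitive score is treated as non-essential, because a document matching only those clauses cannot be competitive. This is an illustration under stated assumptions, not the code from this PR: the class and method names are hypothetical, the real implementation may order clauses differently and works with block-level score bounds, and a production version would also need to be careful about floating-point rounding of the summed bounds.

```java
import java.util.Arrays;

/** Sketch of classic MAXSCORE partitioning; names are hypothetical. */
final class MaxScorePartitionSketch {

  /**
   * Returns how many clauses are non-essential: after sorting the per-clause
   * maximum scores in ascending order, the longest prefix whose sum is strictly
   * below minCompetitiveScore. The remaining clauses are essential.
   */
  static int countNonEssential(float[] maxScores, float minCompetitiveScore) {
    float[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double sum = 0;
    int count = 0;
    for (float maxScore : sorted) {
      sum += maxScore;
      if (sum >= minCompetitiveScore) {
        break; // adding this clause could produce a competitive hit: it is essential
      }
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    float[] maxScores = {4.0f, 0.5f, 2.0f, 1.0f};
    // As the minimum competitive score rises, more clauses become non-essential.
    for (float min : new float[] {0f, 1.5f, 2.0f, 8.0f}) {
      System.out.println("min=" + min + " nonEssential=" + countNonEssential(maxScores, min));
    }
  }
}
```

Only the essential clauses need to drive iteration; the non-essential ones are consulted just to complete the score of candidates found through the essential ones. Re-partitioning means recomputing this split whenever the minimum competitive score increases.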

jpountz (Contributor, Author) commented Jun 13, 2023

Your comment helped me remember that I had planned to remove the scorer (as opposed to bulk scorer) implementation of block-max maxscore. I just pushed a commit that does it. Does it make sense to you?

jpountz merged commit 8703e44 into apache:main on Jun 20, 2023
4 checks passed
jpountz deleted the maxscore branch on June 20, 2023, 16:55
jpountz added this to the 9.8.0 milestone on Jun 20, 2023