LUCENE-10488: Optimize Facets#getTopDims in IntTaxonomyFacets #779

Merged
9 commits merged on May 13, 2022

Contributor

@Yuti-G Yuti-G commented Mar 31, 2022

Description

This change overrides the default implementation of getTopDims with an optimized one in IntTaxonomyFacets, which is extended by FastTaxonomyFacetCounts and TaxonomyFacetSumIntAssociations.
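For readers unfamiliar with the method, here is a minimal sketch of the kind of selection getTopDims performs, assuming it ranks dimensions by their aggregated value (dimCount) and keeps the best N. The class TopDimsSketch, the method topDims, and the tie-breaking rule are hypothetical and are not taken from the Lucene source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Toy top-N selection over per-dimension counts; not the Lucene implementation. */
final class TopDimsSketch {

  /** Returns the labels of the topN dimensions, highest count first. */
  static List<String> topDims(Map<String, Integer> dimCounts, int topN) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0");
    }
    // Min-heap ordering: the weakest retained dimension sits at the head.
    Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
      int byCount = Integer.compare(a.getValue(), b.getValue());
      // Tie-break (an assumption): the alphabetically later label is the weaker one.
      return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
    };
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
    for (Map.Entry<String, Integer> entry : dimCounts.entrySet()) {
      heap.offer(entry);
      if (heap.size() > topN) {
        heap.poll(); // evict the weakest so at most topN entries remain
      }
    }
    // Draining the heap yields weakest-to-strongest, so reverse at the end.
    List<String> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      result.add(heap.poll().getKey());
    }
    Collections.reverse(result);
    return result;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = Map.of("Author", 5, "Publish Date", 9, "Genre", 2);
    System.out.println(topDims(counts, 2));
  }
}
```

The bounded heap keeps the work proportional to the number of dimensions times log N, so the cost of obtaining each dimension's count is what dominates, which is the part this PR targets.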

Solution

Override getTopDims and refactor getTopChildren in IntTaxonomyFacets so that dimCount (the aggregated value for a dimension) is obtained more efficiently. For a dimension that is hierarchical, or multiValued with requireDimCount set, first check whether dimCount was already populated at indexing time and use it directly. Only otherwise aggregate dimCount by iterating over the dimension's child ordinals.
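The check described above can be sketched as follows. This is a toy model under stated assumptions, not the code in this PR: the class DimCountSketch, the method dimCount, and its parameters are hypothetical stand-ins for the dimension's config flags and recorded values. The sketch models only the two branches the description names, so a multiValued dimension without requireDimCount also falls through to the summing path here, which is a simplification.

```java
/** Toy model of the dimCount lookup described above; not the Lucene source. */
final class DimCountSketch {

  /**
   * Returns the aggregated value (dimCount) for one dimension.
   *
   * @param hierarchical hypothetical stand-in for the dimension's "hierarchical" flag
   * @param multiValued hypothetical stand-in for the "multiValued" flag
   * @param requireDimCount hypothetical stand-in for the "requireDimCount" flag
   * @param storedDimValue value already recorded for the dimension's own ordinal
   * @param childValues values recorded for each direct child ordinal
   */
  static int dimCount(
      boolean hierarchical,
      boolean multiValued,
      boolean requireDimCount,
      int storedDimValue,
      int[] childValues) {
    // Fast path: the dimension's own value is already populated, so read it directly.
    if (hierarchical || (multiValued && requireDimCount)) {
      return storedDimValue;
    }
    // Slow path: aggregate by walking every child of the dimension.
    int sum = 0;
    for (int value : childValues) {
      sum += value;
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] children = {4, 3, 1};
    System.out.println(dimCount(true, false, false, 6, children)); // fast path
    System.out.println(dimCount(false, false, false, 0, children)); // slow path
  }
}
```

The point of the fast path is that it costs one lookup regardless of how many children the dimension has, while the slow path grows with the number of child ordinals.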

Tests

Added new tests for the overridden implementation of getTopDims in IntTaxonomyFacets.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@Yuti-G
Contributor Author

Yuti-G commented Mar 31, 2022

Please see attached for the benchmark results (updated after rebase):

                       Task QPS baseline      StdDev QPS candidate    StdDev                Pct diff p-value
       BrowseMonthTaxoFacets       19.31     (43.0%)       16.36     (20.9%)  -15.3% ( -55% -   85%) 0.153
   BrowseDayOfYearTaxoFacets       17.31     (29.5%)       15.50     (19.3%)  -10.4% ( -45% -   54%) 0.186
        BrowseDateTaxoFacets       17.15     (29.9%)       15.36     (19.7%)  -10.4% ( -46% -   55%) 0.194
 BrowseRandomLabelTaxoFacets       14.70     (29.6%)       13.19     (18.5%)  -10.3% ( -44% -   53%) 0.189
        BrowseDateSSDVFacets        2.49     (18.7%)        2.38     (14.1%)   -4.1% ( -31% -   35%) 0.432
                    Wildcard      105.30      (8.1%)      101.12      (9.2%)   -4.0% ( -19% -   14%) 0.145
                      IntNRQ      469.42     (19.0%)      452.49     (19.3%)   -3.6% ( -35% -   42%) 0.552
       BrowseMonthSSDVFacets       16.03     (19.9%)       15.56     (14.8%)   -2.9% ( -31% -   39%) 0.595
                OrNotHighMed     1234.76     (11.6%)     1201.10      (8.0%)   -2.7% ( -19% -   19%) 0.386
                  OrHighHigh       33.50      (8.8%)       32.59      (7.4%)   -2.7% ( -17% -   14%) 0.291
                 AndHighHigh       82.12      (6.8%)       80.06      (3.2%)   -2.5% ( -11% -    7%) 0.133
                   OrHighMed      104.31     (10.0%)      102.16      (6.4%)   -2.1% ( -16% -   15%) 0.438
               OrHighNotHigh     1219.05     (13.8%)     1194.62      (7.5%)   -2.0% ( -20% -   22%) 0.569
                  AndHighLow     1643.61      (9.2%)     1610.70      (8.5%)   -2.0% ( -18% -   17%) 0.474
 BrowseRandomLabelSSDVFacets       10.44     (11.4%)       10.26      (9.5%)   -1.7% ( -20% -   21%) 0.600
        MedTermDayTaxoFacets       37.45      (8.6%)       36.82      (5.2%)   -1.7% ( -14% -   13%) 0.453
                     Prefix3      149.96     (15.7%)      147.52     (15.2%)   -1.6% ( -28% -   34%) 0.739
      OrHighMedDayTaxoFacets        7.40      (9.3%)        7.28      (7.2%)   -1.6% ( -16% -   16%) 0.552
                OrHighNotLow     1201.46     (10.6%)     1186.95      (6.0%)   -1.2% ( -16% -   17%) 0.656
                    HighTerm     1805.03     (12.4%)     1785.06      (7.9%)   -1.1% ( -19% -   21%) 0.737
                OrHighNotMed     1365.33      (9.5%)     1351.74      (6.1%)   -1.0% ( -15% -   16%) 0.694
                  AndHighMed      157.55     (10.2%)      156.24      (3.0%)   -0.8% ( -12% -   13%) 0.725
   BrowseDayOfYearSSDVFacets       14.80     (20.6%)       14.69     (17.7%)   -0.7% ( -32% -   47%) 0.904
        HighTermTitleBDVSort      156.67     (10.9%)      155.64     (10.6%)   -0.7% ( -19% -   23%) 0.847
     AndHighMedDayTaxoFacets      100.40      (8.4%)       99.95      (3.3%)   -0.5% ( -11% -   12%) 0.822
                     MedTerm     2083.36      (9.2%)     2074.64      (7.1%)   -0.4% ( -15% -   17%) 0.872
    AndHighHighDayTaxoFacets       13.79      (7.4%)       13.74      (5.1%)   -0.3% ( -11% -   13%) 0.865
               OrNotHighHigh     1023.39     (10.7%)     1023.03      (6.8%)   -0.0% ( -15% -   19%) 0.990
                    PKLookup      214.31     (10.7%)      214.33      (5.7%)    0.0% ( -14% -   18%) 0.998
                   MedPhrase       46.47      (7.4%)       46.49      (6.1%)    0.0% ( -12% -   14%) 0.983
       HighTermDayOfYearSort      138.48     (17.1%)      138.68      (9.7%)    0.1% ( -22% -   32%) 0.974
                     LowTerm     2157.78     (11.5%)     2161.03      (5.9%)    0.2% ( -15% -   19%) 0.958
             MedSloppyPhrase       24.77      (7.4%)       24.84      (4.3%)    0.3% ( -10% -   12%) 0.883
                 MedSpanNear       30.14      (8.0%)       30.30      (6.7%)    0.5% ( -13% -   16%) 0.824
                      Fuzzy2       64.73      (8.6%)       65.10      (8.2%)    0.6% ( -14% -   19%) 0.830
                OrNotHighLow     1471.82     (13.3%)     1487.55      (7.7%)    1.1% ( -17% -   25%) 0.756
                  HighPhrase      160.02     (10.2%)      162.37      (4.8%)    1.5% ( -12% -   18%) 0.560
                   OrHighLow      562.09     (13.0%)      571.39      (4.6%)    1.7% ( -14% -   22%) 0.591
            HighSloppyPhrase       31.41     (11.8%)       31.97      (5.5%)    1.8% ( -13% -   21%) 0.546
             LowSloppyPhrase      274.40     (12.7%)      279.79      (6.1%)    2.0% ( -14% -   23%) 0.533
                 LowSpanNear       17.01      (9.3%)       17.38      (4.5%)    2.2% ( -10% -   17%) 0.351
                   LowPhrase       55.00     (11.7%)       56.20      (3.3%)    2.2% ( -11% -   19%) 0.423
                      Fuzzy1       90.05     (12.5%)       92.02      (6.0%)    2.2% ( -14% -   23%) 0.480
         LowIntervalsOrdered      142.98      (8.2%)      146.35      (4.6%)    2.4% (  -9% -   16%) 0.262
                HighSpanNear       18.03      (7.2%)       18.52      (5.7%)    2.7% (  -9% -   16%) 0.188
                     Respell       51.46     (10.6%)       52.87      (3.6%)    2.8% ( -10% -   18%) 0.272
         MedIntervalsOrdered       27.30      (8.1%)       28.26      (6.8%)    3.5% ( -10% -   20%) 0.139
           HighTermMonthSort       88.62     (12.2%)       91.98     (12.2%)    3.8% ( -18% -   32%) 0.326
                  TermDTSort      134.30     (10.4%)      139.50     (13.5%)    3.9% ( -18% -   31%) 0.310
        HighIntervalsOrdered        7.31      (6.7%)        7.66      (9.2%)    4.8% ( -10% -   22%) 0.057

@gautamworah96
Contributor

I have not taken a detailed look at the PR yet. From the benchmark results posted, I see regressions in several taxonomy tasks that use BinaryDocValues (-15.3%, -10.4%, -10.4%, -10.3%; rows quoted below). Would you mind re-running the benchmark just to be sure, @Yuti-G? Please also verify that the only difference between mainline and candidate is your commits. Thanks for the effort, btw!

                       Task QPS baseline      StdDev QPS candidate    StdDev                Pct diff p-value
       BrowseMonthTaxoFacets       19.31     (43.0%)       16.36     (20.9%)  -15.3% ( -55% -   85%) 0.153
   BrowseDayOfYearTaxoFacets       17.31     (29.5%)       15.50     (19.3%)  -10.4% ( -45% -   54%) 0.186
        BrowseDateTaxoFacets       17.15     (29.9%)       15.36     (19.7%)  -10.4% ( -46% -   55%) 0.194
 BrowseRandomLabelTaxoFacets       14.70     (29.6%)       13.19     (18.5%)  -10.3% ( -44% -   53%) 0.189

@Yuti-G
Contributor Author

Yuti-G commented Apr 18, 2022

Hi @gautamworah96, thank you so much! I have re-run the benchmark against the up-to-date mainline, and the regressions are gone, so they may have been noise. Please see the results below.

                           Task QPS baseline      StdDev QPS my_modified_version      StdDev                Pct diff p-value
            HighTermTitleBDVSort      102.88     (19.0%)       96.21     (13.6%)   -6.5% ( -32% -   32%) 0.214
               HighTermMonthSort      121.33     (19.8%)      113.96     (14.0%)   -6.1% ( -33% -   34%) 0.263
                        Wildcard      235.05      (9.0%)      222.60     (10.0%)   -5.3% ( -22% -   14%) 0.077
                         Prefix3      138.51     (11.3%)      132.26     (12.2%)   -4.5% ( -25% -   21%) 0.225
                      TermDTSort       83.76     (16.7%)       80.34     (14.5%)   -4.1% ( -30% -   32%) 0.409
     BrowseRandomLabelSSDVFacets        6.23     (10.3%)        6.14      (7.9%)   -1.5% ( -17% -   18%) 0.598
       BrowseDayOfYearSSDVFacets        9.97     (19.8%)        9.88     (20.4%)   -0.9% ( -34% -   49%) 0.888
                       MedPhrase       72.98      (3.9%)       72.42      (3.7%)   -0.8% (  -8% -    7%) 0.521
            MedTermDayTaxoFacets       19.40      (4.7%)       19.26      (5.0%)   -0.7% (  -9% -    9%) 0.627
                        PKLookup      137.96      (3.6%)      136.99      (3.4%)   -0.7% (  -7% -    6%) 0.521
                     MedSpanNear       13.70      (2.8%)       13.60      (3.5%)   -0.7% (  -6% -    5%) 0.483
                    HighSpanNear        4.41      (3.3%)        4.38      (4.5%)   -0.7% (  -8% -    7%) 0.581
                     LowSpanNear       12.33      (3.0%)       12.25      (4.2%)   -0.7% (  -7% -    6%) 0.553
                 MedSloppyPhrase       45.31      (1.2%)       45.07      (2.8%)   -0.5% (  -4% -    3%) 0.426
         AndHighMedDayTaxoFacets       98.70      (2.0%)       98.17      (1.7%)   -0.5% (  -4% -    3%) 0.351
          OrHighMedDayTaxoFacets        9.51      (4.2%)        9.47      (3.3%)   -0.4% (  -7% -    7%) 0.714
                      HighPhrase      230.28      (2.5%)      229.28      (2.8%)   -0.4% (  -5% -    5%) 0.610
           HighTermDayOfYearSort       30.38     (21.9%)       30.29     (22.8%)   -0.3% ( -36% -   56%) 0.963
                       LowPhrase      239.16      (3.9%)      238.51      (4.1%)   -0.3% (  -7% -    8%) 0.831
                     AndHighHigh       50.68      (4.8%)       50.56      (3.6%)   -0.2% (  -8% -    8%) 0.858
                HighSloppyPhrase        6.20      (4.1%)        6.19      (4.7%)   -0.2% (  -8% -    8%) 0.878
                         LowTerm     1323.68      (4.6%)     1321.10      (5.7%)   -0.2% ( -10% -   10%) 0.906
                      AndHighLow      752.80      (2.8%)      751.45      (3.5%)   -0.2% (  -6% -    6%) 0.859
        AndHighHighDayTaxoFacets       25.28      (2.0%)       25.23      (2.8%)   -0.2% (  -4% -    4%) 0.827
                 LowSloppyPhrase       28.05      (1.6%)       28.01      (2.7%)   -0.1% (  -4% -    4%) 0.838
                          Fuzzy1       66.15      (1.8%)       66.09      (1.6%)   -0.1% (  -3% -    3%) 0.875
                          IntNRQ      445.57      (1.3%)      445.54      (2.0%)   -0.0% (  -3% -    3%) 0.989
                       OrHighLow      496.07      (2.0%)      496.07      (2.5%)   -0.0% (  -4% -    4%) 1.000
                    OrHighNotLow      827.15      (4.2%)      827.69      (3.7%)    0.1% (  -7% -    8%) 0.958
                      AndHighMed      189.81      (3.4%)      190.17      (3.4%)    0.2% (  -6% -    7%) 0.857
            HighIntervalsOrdered        6.60      (3.4%)        6.61      (3.6%)    0.2% (  -6% -    7%) 0.835
                    OrNotHighLow      811.18      (2.9%)      813.56      (2.1%)    0.3% (  -4% -    5%) 0.718
             MedIntervalsOrdered       14.97      (2.2%)       15.02      (2.9%)    0.3% (  -4% -    5%) 0.699
             LowIntervalsOrdered        6.50      (2.0%)        6.52      (2.5%)    0.3% (  -4% -    5%) 0.648
                   OrHighNotHigh      609.78      (4.4%)      612.30      (3.6%)    0.4% (  -7% -    8%) 0.744
                          Fuzzy2       56.06      (1.6%)       56.34      (1.6%)    0.5% (  -2% -    3%) 0.327
                    OrNotHighMed      623.85      (4.5%)      627.41      (4.1%)    0.6% (  -7% -    9%) 0.675
       BrowseDayOfYearTaxoFacets       18.77     (19.2%)       18.88     (16.5%)    0.6% ( -29% -   44%) 0.916
                       OrHighMed      153.63      (3.6%)      154.59      (3.2%)    0.6% (  -5% -    7%) 0.562
                      OrHighHigh       32.59      (4.1%)       32.83      (4.1%)    0.8% (  -7% -    9%) 0.560
                         Respell       47.99      (2.0%)       48.35      (2.0%)    0.8% (  -3% -    4%) 0.224
            BrowseDateSSDVFacets        1.86      (8.9%)        1.88      (9.0%)    1.0% ( -15% -   20%) 0.730
                   OrNotHighHigh      828.78      (4.0%)      839.81      (4.3%)    1.3% (  -6% -   10%) 0.309
            BrowseDateTaxoFacets       17.75     (18.4%)       18.00     (16.1%)    1.4% ( -27% -   43%) 0.800
                        HighTerm     1293.69      (5.5%)     1313.19      (5.5%)    1.5% (  -9% -   13%) 0.387
                    OrHighNotMed      792.26      (4.1%)      805.29      (4.2%)    1.6% (  -6% -   10%) 0.211
                         MedTerm     1234.34      (4.7%)     1255.40      (5.8%)    1.7% (  -8% -   12%) 0.308
     BrowseRandomLabelTaxoFacets       12.65     (15.6%)       13.30     (11.8%)    5.2% ( -19% -   38%) 0.236
           BrowseMonthTaxoFacets       18.53     (21.8%)       19.68     (18.6%)    6.2% ( -28% -   59%) 0.331
           BrowseMonthSSDVFacets       10.69     (17.8%)       11.68     (27.4%)    9.2% ( -30% -   66%) 0.206
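As a quick sanity check on how to read these tables, the sketch below recomputes the "Pct diff" column from the two QPS columns and applies a significance cutoff to the p-value. The class and method names are hypothetical, and the 0.05 cutoff is a conventional choice assumed here, not something the benchmark tool or this thread prescribes.

```java
/** Recomputes "Pct diff" for a benchmark row and applies an assumed p-value cutoff. */
final class BenchmarkRowSketch {

  /** Relative QPS change of candidate versus baseline, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  /** True when the p-value falls below the chosen cutoff (alpha is an assumption). */
  static boolean looksSignificant(double pValue, double alpha) {
    return pValue < alpha;
  }

  public static void main(String[] args) {
    double alpha = 0.05; // conventional cutoff, assumed for illustration

    // BrowseMonthTaxoFacets, first run: 19.31 -> 16.36 QPS, p-value 0.153
    System.out.printf("first run:  %.1f%%, significant=%b%n",
        pctDiff(19.31, 16.36), looksSignificant(0.153, alpha));

    // BrowseMonthTaxoFacets, re-run: 18.53 -> 19.68 QPS, p-value 0.331
    System.out.printf("second run: %.1f%%, significant=%b%n",
        pctDiff(18.53, 19.68), looksSignificant(0.331, alpha));
  }
}
```

Under that cutoff neither run shows a significant change for BrowseMonthTaxoFacets, and the sign of the difference flips between runs, which is consistent with reading the earlier regression as noise.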

Contributor

@gsmiller gsmiller left a comment

Overall looks really good. Thanks! I left a few small comments.

@Yuti-G Yuti-G force-pushed the Lucene-10488-TaxonomyFacets branch from fd537e9 to 1285431 Compare May 11, 2022 03:39
Contributor

@gsmiller gsmiller left a comment

Thanks for addressing the feedback! I've got one more comment for you. Thanks again!

@Yuti-G
Contributor Author

Yuti-G commented May 13, 2022

Thanks! I renamed the parameter to pathLength and added more documentation to avoid confusion. Thanks again for your feedback, and please let me know if you have any questions!

Contributor

@gsmiller gsmiller left a comment

Looks good! I'll work on merging. Thanks @Yuti-G !

3 participants