LUCENE-10488: Optimize Facets#getTopDims in IntTaxonomyFacets #779

Yuti-G · 2022-03-31T20:54:41Z

Description

This change overrides the default implementation of getTopDims in IntTaxonomyFacets with an optimized one. IntTaxonomyFacets is extended by FastTaxonomyFacetCounts and TaxonomyFacetSumIntAssociations, so both subclasses pick up the change.

Solution

Override getTopDims and refactor getTopChildren in IntTaxonomyFacets to obtain dimCount (the aggregated value for a dim) more efficiently. If the dim is hierarchical, or is multiValued with requireDimCount set, dimCount was already populated at indexing time and is read directly. Only otherwise is dimCount aggregated by iterating over the dim's child ordinals.
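The check described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class `DimCountSketch`, the `DimConfig` stand-in, the `dimCount` method, and the parallel `children`/`siblings` arrays are hypothetical names chosen for illustration, and `-1` is assumed to mark "no more ordinals".

```java
// Hypothetical sketch of the dimCount shortcut; not the actual IntTaxonomyFacets code.
class DimCountSketch {

  /** Hypothetical stand-in for a per-dim config; field names follow the PR description. */
  static final class DimConfig {
    final boolean hierarchical;
    final boolean multiValued;
    final boolean requireDimCount;

    DimConfig(boolean hierarchical, boolean multiValued, boolean requireDimCount) {
      this.hierarchical = hierarchical;
      this.multiValued = multiValued;
      this.requireDimCount = requireDimCount;
    }
  }

  /** Assumed marker for "no child" / "no further sibling". */
  static final int INVALID_ORDINAL = -1;

  /**
   * Returns the aggregated value for the dim at dimOrd.
   * values[ord] holds the per-ordinal value, children[ord] the first child of ord,
   * and siblings[ord] the next sibling of ord.
   */
  static int dimCount(DimConfig config, int dimOrd, int[] values, int[] children, int[] siblings) {
    // Fast path: the value for the dim ordinal itself was populated at indexing time.
    if (config.hierarchical || (config.multiValued && config.requireDimCount)) {
      return values[dimOrd];
    }
    // Slow path: aggregate by walking the dim's child ordinals.
    int sum = 0;
    for (int ord = children[dimOrd]; ord != INVALID_ORDINAL; ord = siblings[ord]) {
      sum += values[ord];
    }
    return sum;
  }

  public static void main(String[] args) {
    // Ordinals: 0 = root, 1 = dim, 2 and 3 = the dim's two children.
    int[] values = {0, 4, 2, 3};
    int[] children = {1, 3, INVALID_ORDINAL, INVALID_ORDINAL};
    int[] siblings = {INVALID_ORDINAL, INVALID_ORDINAL, INVALID_ORDINAL, 2};

    DimConfig flat = new DimConfig(false, false, false);
    DimConfig hierarchical = new DimConfig(true, false, false);

    System.out.println(dimCount(flat, 1, values, children, siblings));         // sums the children
    System.out.println(dimCount(hierarchical, 1, values, children, siblings)); // reads values[1]
  }
}
```

In this sketch the fallback sum is only meaningful for single-valued dims, since a multi-valued document can contribute to more than one child. How the real implementation treats a multiValued dim without requireDimCount is not covered here.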

Tests

Added new tests for the overridden implementation of getTopDims in IntTaxonomyFacets.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

Yuti-G · 2022-03-31T21:26:52Z

Please see attached for the benchmark results (updated after rebase):

                        Task    QPS baseline      StdDev   QPS candidate      StdDev                Pct diff p-value
       BrowseMonthTaxoFacets       19.31     (43.0%)       16.36     (20.9%)  -15.3% ( -55% -   85%) 0.153
   BrowseDayOfYearTaxoFacets       17.31     (29.5%)       15.50     (19.3%)  -10.4% ( -45% -   54%) 0.186
        BrowseDateTaxoFacets       17.15     (29.9%)       15.36     (19.7%)  -10.4% ( -46% -   55%) 0.194
 BrowseRandomLabelTaxoFacets       14.70     (29.6%)       13.19     (18.5%)  -10.3% ( -44% -   53%) 0.189
        BrowseDateSSDVFacets        2.49     (18.7%)        2.38     (14.1%)   -4.1% ( -31% -   35%) 0.432
                    Wildcard      105.30      (8.1%)      101.12      (9.2%)   -4.0% ( -19% -   14%) 0.145
                      IntNRQ      469.42     (19.0%)      452.49     (19.3%)   -3.6% ( -35% -   42%) 0.552
       BrowseMonthSSDVFacets       16.03     (19.9%)       15.56     (14.8%)   -2.9% ( -31% -   39%) 0.595
                OrNotHighMed     1234.76     (11.6%)     1201.10      (8.0%)   -2.7% ( -19% -   19%) 0.386
                  OrHighHigh       33.50      (8.8%)       32.59      (7.4%)   -2.7% ( -17% -   14%) 0.291
                 AndHighHigh       82.12      (6.8%)       80.06      (3.2%)   -2.5% ( -11% -    7%) 0.133
                   OrHighMed      104.31     (10.0%)      102.16      (6.4%)   -2.1% ( -16% -   15%) 0.438
               OrHighNotHigh     1219.05     (13.8%)     1194.62      (7.5%)   -2.0% ( -20% -   22%) 0.569
                  AndHighLow     1643.61      (9.2%)     1610.70      (8.5%)   -2.0% ( -18% -   17%) 0.474
 BrowseRandomLabelSSDVFacets       10.44     (11.4%)       10.26      (9.5%)   -1.7% ( -20% -   21%) 0.600
        MedTermDayTaxoFacets       37.45      (8.6%)       36.82      (5.2%)   -1.7% ( -14% -   13%) 0.453
                     Prefix3      149.96     (15.7%)      147.52     (15.2%)   -1.6% ( -28% -   34%) 0.739
      OrHighMedDayTaxoFacets        7.40      (9.3%)        7.28      (7.2%)   -1.6% ( -16% -   16%) 0.552
                OrHighNotLow     1201.46     (10.6%)     1186.95      (6.0%)   -1.2% ( -16% -   17%) 0.656
                    HighTerm     1805.03     (12.4%)     1785.06      (7.9%)   -1.1% ( -19% -   21%) 0.737
                OrHighNotMed     1365.33      (9.5%)     1351.74      (6.1%)   -1.0% ( -15% -   16%) 0.694
                  AndHighMed      157.55     (10.2%)      156.24      (3.0%)   -0.8% ( -12% -   13%) 0.725
   BrowseDayOfYearSSDVFacets       14.80     (20.6%)       14.69     (17.7%)   -0.7% ( -32% -   47%) 0.904
        HighTermTitleBDVSort      156.67     (10.9%)      155.64     (10.6%)   -0.7% ( -19% -   23%) 0.847
     AndHighMedDayTaxoFacets      100.40      (8.4%)       99.95      (3.3%)   -0.5% ( -11% -   12%) 0.822
                     MedTerm     2083.36      (9.2%)     2074.64      (7.1%)   -0.4% ( -15% -   17%) 0.872
    AndHighHighDayTaxoFacets       13.79      (7.4%)       13.74      (5.1%)   -0.3% ( -11% -   13%) 0.865
               OrNotHighHigh     1023.39     (10.7%)     1023.03      (6.8%)   -0.0% ( -15% -   19%) 0.990
                    PKLookup      214.31     (10.7%)      214.33      (5.7%)    0.0% ( -14% -   18%) 0.998
                   MedPhrase       46.47      (7.4%)       46.49      (6.1%)    0.0% ( -12% -   14%) 0.983
       HighTermDayOfYearSort      138.48     (17.1%)      138.68      (9.7%)    0.1% ( -22% -   32%) 0.974
                     LowTerm     2157.78     (11.5%)     2161.03      (5.9%)    0.2% ( -15% -   19%) 0.958
             MedSloppyPhrase       24.77      (7.4%)       24.84      (4.3%)    0.3% ( -10% -   12%) 0.883
                 MedSpanNear       30.14      (8.0%)       30.30      (6.7%)    0.5% ( -13% -   16%) 0.824
                      Fuzzy2       64.73      (8.6%)       65.10      (8.2%)    0.6% ( -14% -   19%) 0.830
                OrNotHighLow     1471.82     (13.3%)     1487.55      (7.7%)    1.1% ( -17% -   25%) 0.756
                  HighPhrase      160.02     (10.2%)      162.37      (4.8%)    1.5% ( -12% -   18%) 0.560
                   OrHighLow      562.09     (13.0%)      571.39      (4.6%)    1.7% ( -14% -   22%) 0.591
            HighSloppyPhrase       31.41     (11.8%)       31.97      (5.5%)    1.8% ( -13% -   21%) 0.546
             LowSloppyPhrase      274.40     (12.7%)      279.79      (6.1%)    2.0% ( -14% -   23%) 0.533
                 LowSpanNear       17.01      (9.3%)       17.38      (4.5%)    2.2% ( -10% -   17%) 0.351
                   LowPhrase       55.00     (11.7%)       56.20      (3.3%)    2.2% ( -11% -   19%) 0.423
                      Fuzzy1       90.05     (12.5%)       92.02      (6.0%)    2.2% ( -14% -   23%) 0.480
         LowIntervalsOrdered      142.98      (8.2%)      146.35      (4.6%)    2.4% (  -9% -   16%) 0.262
                HighSpanNear       18.03      (7.2%)       18.52      (5.7%)    2.7% (  -9% -   16%) 0.188
                     Respell       51.46     (10.6%)       52.87      (3.6%)    2.8% ( -10% -   18%) 0.272
         MedIntervalsOrdered       27.30      (8.1%)       28.26      (6.8%)    3.5% ( -10% -   20%) 0.139
           HighTermMonthSort       88.62     (12.2%)       91.98     (12.2%)    3.8% ( -18% -   32%) 0.326
                  TermDTSort      134.30     (10.4%)      139.50     (13.5%)    3.9% ( -18% -   31%) 0.310
        HighIntervalsOrdered        7.31      (6.7%)        7.66      (9.2%)    4.8% ( -10% -   22%) 0.057

gautamworah96 · 2022-04-17T04:19:34Z

I have not taken a detailed look at the PR yet. From the benchmark results posted, I see a regression in several taxonomy tasks that use BinaryDocValues (roughly -15%, -10%, -10%, -10%; see the rows below). Would you mind re-running the benchmark just to be sure, @Yuti-G? Please also verify that the only difference between mainline and candidate is your commits. Thanks for the effort btw!

                        Task    QPS baseline      StdDev   QPS candidate      StdDev                Pct diff p-value
       BrowseMonthTaxoFacets       19.31     (43.0%)       16.36     (20.9%)  -15.3% ( -55% -   85%) 0.153
   BrowseDayOfYearTaxoFacets       17.31     (29.5%)       15.50     (19.3%)  -10.4% ( -45% -   54%) 0.186
        BrowseDateTaxoFacets       17.15     (29.9%)       15.36     (19.7%)  -10.4% ( -46% -   55%) 0.194
 BrowseRandomLabelTaxoFacets       14.70     (29.6%)       13.19     (18.5%)  -10.3% ( -44% -   53%) 0.189

Yuti-G · 2022-04-18T20:26:19Z

Hi @gautamworah96, thank you so much! I have re-run the benchmark against the up-to-date mainline, and the regressions are gone, so they were probably noise. Please see the results below.

                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
            HighTermTitleBDVSort      102.88     (19.0%)       96.21     (13.6%)   -6.5% ( -32% -   32%) 0.214
               HighTermMonthSort      121.33     (19.8%)      113.96     (14.0%)   -6.1% ( -33% -   34%) 0.263
                        Wildcard      235.05      (9.0%)      222.60     (10.0%)   -5.3% ( -22% -   14%) 0.077
                         Prefix3      138.51     (11.3%)      132.26     (12.2%)   -4.5% ( -25% -   21%) 0.225
                      TermDTSort       83.76     (16.7%)       80.34     (14.5%)   -4.1% ( -30% -   32%) 0.409
     BrowseRandomLabelSSDVFacets        6.23     (10.3%)        6.14      (7.9%)   -1.5% ( -17% -   18%) 0.598
       BrowseDayOfYearSSDVFacets        9.97     (19.8%)        9.88     (20.4%)   -0.9% ( -34% -   49%) 0.888
                       MedPhrase       72.98      (3.9%)       72.42      (3.7%)   -0.8% (  -8% -    7%) 0.521
            MedTermDayTaxoFacets       19.40      (4.7%)       19.26      (5.0%)   -0.7% (  -9% -    9%) 0.627
                        PKLookup      137.96      (3.6%)      136.99      (3.4%)   -0.7% (  -7% -    6%) 0.521
                     MedSpanNear       13.70      (2.8%)       13.60      (3.5%)   -0.7% (  -6% -    5%) 0.483
                    HighSpanNear        4.41      (3.3%)        4.38      (4.5%)   -0.7% (  -8% -    7%) 0.581
                     LowSpanNear       12.33      (3.0%)       12.25      (4.2%)   -0.7% (  -7% -    6%) 0.553
                 MedSloppyPhrase       45.31      (1.2%)       45.07      (2.8%)   -0.5% (  -4% -    3%) 0.426
         AndHighMedDayTaxoFacets       98.70      (2.0%)       98.17      (1.7%)   -0.5% (  -4% -    3%) 0.351
          OrHighMedDayTaxoFacets        9.51      (4.2%)        9.47      (3.3%)   -0.4% (  -7% -    7%) 0.714
                      HighPhrase      230.28      (2.5%)      229.28      (2.8%)   -0.4% (  -5% -    5%) 0.610
           HighTermDayOfYearSort       30.38     (21.9%)       30.29     (22.8%)   -0.3% ( -36% -   56%) 0.963
                       LowPhrase      239.16      (3.9%)      238.51      (4.1%)   -0.3% (  -7% -    8%) 0.831
                     AndHighHigh       50.68      (4.8%)       50.56      (3.6%)   -0.2% (  -8% -    8%) 0.858
                HighSloppyPhrase        6.20      (4.1%)        6.19      (4.7%)   -0.2% (  -8% -    8%) 0.878
                         LowTerm     1323.68      (4.6%)     1321.10      (5.7%)   -0.2% ( -10% -   10%) 0.906
                      AndHighLow      752.80      (2.8%)      751.45      (3.5%)   -0.2% (  -6% -    6%) 0.859
        AndHighHighDayTaxoFacets       25.28      (2.0%)       25.23      (2.8%)   -0.2% (  -4% -    4%) 0.827
                 LowSloppyPhrase       28.05      (1.6%)       28.01      (2.7%)   -0.1% (  -4% -    4%) 0.838
                          Fuzzy1       66.15      (1.8%)       66.09      (1.6%)   -0.1% (  -3% -    3%) 0.875
                          IntNRQ      445.57      (1.3%)      445.54      (2.0%)   -0.0% (  -3% -    3%) 0.989
                       OrHighLow      496.07      (2.0%)      496.07      (2.5%)   -0.0% (  -4% -    4%) 1.000
                    OrHighNotLow      827.15      (4.2%)      827.69      (3.7%)    0.1% (  -7% -    8%) 0.958
                      AndHighMed      189.81      (3.4%)      190.17      (3.4%)    0.2% (  -6% -    7%) 0.857
            HighIntervalsOrdered        6.60      (3.4%)        6.61      (3.6%)    0.2% (  -6% -    7%) 0.835
                    OrNotHighLow      811.18      (2.9%)      813.56      (2.1%)    0.3% (  -4% -    5%) 0.718
             MedIntervalsOrdered       14.97      (2.2%)       15.02      (2.9%)    0.3% (  -4% -    5%) 0.699
             LowIntervalsOrdered        6.50      (2.0%)        6.52      (2.5%)    0.3% (  -4% -    5%) 0.648
                   OrHighNotHigh      609.78      (4.4%)      612.30      (3.6%)    0.4% (  -7% -    8%) 0.744
                          Fuzzy2       56.06      (1.6%)       56.34      (1.6%)    0.5% (  -2% -    3%) 0.327
                    OrNotHighMed      623.85      (4.5%)      627.41      (4.1%)    0.6% (  -7% -    9%) 0.675
       BrowseDayOfYearTaxoFacets       18.77     (19.2%)       18.88     (16.5%)    0.6% ( -29% -   44%) 0.916
                       OrHighMed      153.63      (3.6%)      154.59      (3.2%)    0.6% (  -5% -    7%) 0.562
                      OrHighHigh       32.59      (4.1%)       32.83      (4.1%)    0.8% (  -7% -    9%) 0.560
                         Respell       47.99      (2.0%)       48.35      (2.0%)    0.8% (  -3% -    4%) 0.224
            BrowseDateSSDVFacets        1.86      (8.9%)        1.88      (9.0%)    1.0% ( -15% -   20%) 0.730
                   OrNotHighHigh      828.78      (4.0%)      839.81      (4.3%)    1.3% (  -6% -   10%) 0.309
            BrowseDateTaxoFacets       17.75     (18.4%)       18.00     (16.1%)    1.4% ( -27% -   43%) 0.800
                        HighTerm     1293.69      (5.5%)     1313.19      (5.5%)    1.5% (  -9% -   13%) 0.387
                    OrHighNotMed      792.26      (4.1%)      805.29      (4.2%)    1.6% (  -6% -   10%) 0.211
                         MedTerm     1234.34      (4.7%)     1255.40      (5.8%)    1.7% (  -8% -   12%) 0.308
     BrowseRandomLabelTaxoFacets       12.65     (15.6%)       13.30     (11.8%)    5.2% ( -19% -   38%) 0.236
           BrowseMonthTaxoFacets       18.53     (21.8%)       19.68     (18.6%)    6.2% ( -28% -   59%) 0.331
           BrowseMonthSSDVFacets       10.69     (17.8%)       11.68     (27.4%)    9.2% ( -30% -   66%) 0.206
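
As a rough aid for reading tables like the two above, the sketch below recomputes the "Pct diff" column from the two QPS values and flags a row as likely noise when its p-value is at or above a cutoff. The class `BenchmarkRow`, its methods, and the 0.05 cutoff are assumptions for illustration. The cutoff is a conventional choice, not one stated in this thread or taken from the benchmark tool.

```java
// Hypothetical helper for reading one row of the benchmark table; not part of luceneutil.
class BenchmarkRow {
  final String task;
  final double baselineQps;
  final double candidateQps;
  final double pValue;

  BenchmarkRow(String task, double baselineQps, double candidateQps, double pValue) {
    this.task = task;
    this.baselineQps = baselineQps;
    this.candidateQps = candidateQps;
    this.pValue = pValue;
  }

  /** Relative change of candidate QPS versus baseline QPS, in percent. */
  double pctDiff() {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  /** True when the p-value does not fall below the chosen cutoff (assumed, e.g. 0.05). */
  boolean likelyNoise(double alpha) {
    return pValue >= alpha;
  }

  public static void main(String[] args) {
    // Values copied from the BrowseMonthTaxoFacets rows of the first and second runs.
    BenchmarkRow firstRun = new BenchmarkRow("BrowseMonthTaxoFacets", 19.31, 16.36, 0.153);
    BenchmarkRow secondRun = new BenchmarkRow("BrowseMonthTaxoFacets", 18.53, 19.68, 0.331);

    for (BenchmarkRow row : new BenchmarkRow[] {firstRun, secondRun}) {
      System.out.printf(
          "%s: %.1f%% (likely noise at 0.05: %b)%n",
          row.task, row.pctDiff(), row.likelyNoise(0.05));
    }
  }
}
```

Under that assumed cutoff, neither the apparent regression in the first run nor the apparent gain in the second run for this task would count as significant, which is consistent with reading both as noise.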

gsmiller

Overall looks really good. Thanks! I left a few small comments.

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java

gsmiller

Thanks for addressing the feedback! I've got one more comment for you. Thanks again!

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java

Yuti-G · 2022-05-13T02:56:22Z

Thanks! I renamed the parameter to pathLength and added more documentation to avoid confusion. Thanks again for your feedback, and please let me know if you have any questions!

gsmiller

Looks good! I'll work on merging. Thanks @Yuti-G !

Yuti-G mentioned this pull request Apr 4, 2022

LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets #778

Merged

6 tasks

Yuti-G force-pushed the Lucene-10488-TaxonomyFacets branch from 16fe4fc to fd537e9 Compare April 7, 2022 23:32

gautamworah96 mentioned this pull request Apr 19, 2022

Add README instructions on how to interpret benchmark results mikemccand/luceneutil#167

Open

gsmiller reviewed May 10, 2022

Yuti-G added 7 commits May 10, 2022 20:35

optimize getTopDims in TaxonomyFacets

081365c

added tests

a8d5a43

added comments

f2f46ae

fixed a bug

f4ff763

applied gradewl check

78968e7

reset import

9787a2e

addressed feedback

1285431

Yuti-G force-pushed the Lucene-10488-TaxonomyFacets branch from fd537e9 to 1285431 Compare May 11, 2022 03:39

gsmiller reviewed May 12, 2022

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java (outdated, resolved)

gsmiller mentioned this pull request May 12, 2022

LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts #777

Merged

6 tasks

added documentation

1a40c4d

added TODO comment back

ee8adab

gsmiller approved these changes May 13, 2022

gsmiller merged commit 57f8cb2 into apache:main May 13, 2022

asfimport mentioned this pull request May 24, 2022

Optimize Facets#getTopDims across Facets implementations [LUCENE-10488] #11524

Closed
