LUCENE-10444: Support alternate aggregation functions in association facets #718

gsmiller · 2022-03-02T00:55:15Z

Description

This change adds support for "max" aggregations (in addition to "sum") to association faceting. It does so in a way that is (somewhat) extensible for future aggregation functionality.

Solution

Replaced the existing association faceting classes that were hardcoded to "sum" with new classes that allow the user to specify an aggregation function. Note that I will open a separate PR for a backport of this that remains backwards-compatible on 9x.
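As a rough illustration of the idea (not the classes added in this PR), a user-specified aggregation function could be modeled as in the sketch below. The names `AggFn`, `AggregationSketch`, `SUM` and `MAX` are hypothetical and exist only for this example. It assumes non-negative int weights folded into a zero-initialized array indexed by ordinal.

```java
// Hypothetical sketch: none of these names are taken from the Lucene API.
interface AggFn {
  // Combines the value already stored for an ordinal with a newly seen weight.
  int aggregate(int existing, int incoming);
}

final class AggregationSketch {
  static final AggFn SUM = (existing, incoming) -> existing + incoming;
  static final AggFn MAX = (existing, incoming) -> Math.max(existing, incoming);

  /** Folds (ordinal, weight) pairs into a zero-initialized array using the given function. */
  static int[] aggregate(int numOrds, int[] ords, int[] weights, AggFn fn) {
    int[] values = new int[numOrds];
    for (int i = 0; i < ords.length; i++) {
      values[ords[i]] = fn.aggregate(values[ords[i]], weights[i]);
    }
    return values;
  }

  public static void main(String[] args) {
    int[] ords = {0, 1, 0, 1};
    int[] weights = {3, 5, 4, 1};
    // Ordinal 0 sees weights 3 and 4; ordinal 1 sees weights 5 and 1.
    System.out.println(java.util.Arrays.toString(aggregate(2, ords, weights, SUM))); // [7, 6]
    System.out.println(java.util.Arrays.toString(aggregate(2, ords, weights, MAX))); // [4, 5]
  }
}
```

The only thing that differs between the two runs is the function passed in, which is the sense in which the approach is extensible.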

Tests

Added new tests for the new aggregation functionality.

I also tested the performance of this approach using my 9x backport. Results on luceneutil look pretty flat to me. I've posted that output in the backport PR: #719

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

msokolov

It makes sense to me to add this capability. I wonder if the extra abstraction hurts us though in these tight loops summing up values in an array? If it does, we might want to provide a specialization for such loops as well?

msokolov · 2022-03-03T20:35:00Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java

      return null;
    }

    if (dimConfig.multiValued) {
      if (dimConfig.requireDimCount) {
-        sumValues = values[dimOrd];
+        aggregatedValue = values[dimOrd];
      } else {
        // Our sum'd count is not correct, in general:


our "aggregated" count?

It's not necessarily a "count" though here right? It's an aggregated weight associated with the value.

msokolov · 2022-03-03T20:36:48Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java

          childCount++;
-          if (count > bottomValue) {
+          if (value > bottomValue) {


I guess we need to ensure that aggregation functions are nondecreasing? I mean min wouldn't work very well here

That's right. There are a number of things actually preventing us from cleanly adding something like min. I had it originally, but as I started looking at all the changes it would require, I backed off for the time being (especially since I don't have a concrete use-case in mind). One interesting challenge is that these facets implementations all assume the weights are positive values. There are a lot of > 0 checks floating around the various implementations to check whether or not a value had any "weight" associated with it. This makes sense when using counts, but it's weird when generally associating weights with the values. So min started to feel a little weird and I just left it out for now.

Ahh, OK, so then disregard my comment above -- we should not try to tackle min here!
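The difficulty with min can be shown with a simplified model (an assumption for illustration, not Lucene code): each slot starts at 0, and a slot only counts as having a weight if its value is > 0. The class and method names below are hypothetical.

```java
// Simplified model of the discussion above; not Lucene code.
final class MinAggregationPitfall {
  /** Folds weights into a single slot that starts at 0, like a freshly allocated int[] entry. */
  static int foldFromZero(int[] weights, boolean useMin) {
    int slot = 0;
    for (int w : weights) {
      slot = useMin ? Math.min(slot, w) : Math.max(slot, w);
    }
    return slot;
  }

  /** The style of presence check mentioned in the thread: only values > 0 count as seen. */
  static boolean hasWeight(int value) {
    return value > 0;
  }

  public static void main(String[] args) {
    int[] weights = {3, 5, 4};
    int max = foldFromZero(weights, false); // 5
    int min = foldFromZero(weights, true); // 0, because the initial value wins
    System.out.println("max=" + max + " present=" + hasWeight(max));
    System.out.println("min=" + min + " present=" + hasWeight(min));
  }
}
```

With positive weights, max and sum can only move a slot away from 0, so the > 0 check keeps working. Min never leaves the initial 0, and the slot then looks as if it had no weight at all.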

msokolov · 2022-03-03T21:14:30Z

It makes sense to me to add this capability. I wonder if the extra abstraction hurts us though in these tight loops summing up values in an array? If it does, we might want to provide a specialization for such loops as well?

Oh I missed the benchmarking you did - I guess it was on the backport PR. Looked like no significant change there, good.

gsmiller · 2022-03-14T16:02:58Z

Even though I ran benchmarks on the backport version of this change (#719), I figured it would be good to run benchmarks here as well. Below compares this patch against main using wikimedium10m:

                           Task QPS baseline    StdDev QPS candidate      StdDev                Pct diff p-value
       BrowseDayOfYearTaxoFacets       21.65     (22.8%)       20.61     (22.9%)   -4.8% ( -41% -   52%) 0.507
            BrowseDateTaxoFacets       21.61     (22.6%)       20.61     (22.8%)   -4.6% ( -40% -   52%) 0.519
                         Prefix3      362.81     (10.5%)      349.00     (11.4%)   -3.8% ( -23% -   20%) 0.273
     BrowseRandomLabelTaxoFacets       17.76     (18.5%)       17.15     (20.0%)   -3.4% ( -35% -   43%) 0.577
           BrowseMonthTaxoFacets       27.32     (27.1%)       26.45     (29.2%)   -3.2% ( -46% -   72%) 0.721
                        Wildcard       64.61      (5.6%)       63.58      (6.0%)   -1.6% ( -12% -   10%) 0.381
                    OrNotHighMed      887.00      (3.3%)      874.66      (3.4%)   -1.4% (  -7% -    5%) 0.191
                         LowTerm     2661.07      (2.9%)     2630.67      (3.2%)   -1.1% (  -7% -    5%) 0.240
                   OrNotHighHigh     1523.01      (3.8%)     1506.01      (4.2%)   -1.1% (  -8% -    7%) 0.379
         AndHighMedDayTaxoFacets      124.82      (1.4%)      123.59      (1.5%)   -1.0% (  -3% -    1%) 0.032
                    HighSpanNear       21.47      (4.9%)       21.27      (4.9%)   -0.9% ( -10% -    9%) 0.558
                       MedPhrase      342.22      (2.8%)      339.35      (3.2%)   -0.8% (  -6% -    5%) 0.373
                      HighPhrase      453.21      (2.2%)      449.61      (2.6%)   -0.8% (  -5% -    4%) 0.291
                     MedSpanNear       74.03      (4.3%)       73.45      (4.2%)   -0.8% (  -8% -    8%) 0.559
           BrowseMonthSSDVFacets       13.79     (19.1%)       13.69     (19.7%)   -0.7% ( -33% -   47%) 0.910
                       LowPhrase       85.89      (1.9%)       85.33      (2.0%)   -0.7% (  -4% -    3%) 0.294
                        PKLookup      169.25      (3.2%)      168.19      (2.3%)   -0.6% (  -5% -    5%) 0.481
                    OrNotHighLow      962.38      (3.0%)      956.47      (2.9%)   -0.6% (  -6% -    5%) 0.511
       BrowseDayOfYearSSDVFacets       12.23     (14.4%)       12.15     (14.5%)   -0.6% ( -25% -   33%) 0.897
            BrowseDateSSDVFacets        2.34      (6.1%)        2.32      (7.8%)   -0.6% ( -13% -   14%) 0.802
                       OrHighMed      134.10      (5.0%)      133.49      (4.6%)   -0.5% (  -9% -    9%) 0.766
                          Fuzzy1       91.09      (1.2%)       90.71      (1.8%)   -0.4% (  -3% -    2%) 0.373
                        HighTerm     1690.39      (4.6%)     1684.21      (5.3%)   -0.4% (  -9% -   10%) 0.816
                   OrHighNotHigh     1592.87      (2.4%)     1587.86      (3.7%)   -0.3% (  -6% -    5%) 0.751
        AndHighHighDayTaxoFacets       12.34      (2.4%)       12.31      (2.4%)   -0.3% (  -4% -    4%) 0.696
                      AndHighLow      927.42      (3.2%)      924.95      (3.1%)   -0.3% (  -6% -    6%) 0.790
                 MedSloppyPhrase      107.35      (2.5%)      107.06      (2.5%)   -0.3% (  -5% -    4%) 0.736
                          IntNRQ       83.16      (1.2%)       82.95      (1.0%)   -0.2% (  -2% -    1%) 0.469
                         MedTerm     1935.87      (4.3%)     1934.77      (4.8%)   -0.1% (  -8% -    9%) 0.969
                    OrHighNotLow     1109.87      (4.1%)     1109.25      (4.7%)   -0.1% (  -8% -    9%) 0.968
                HighSloppyPhrase       28.62      (1.8%)       28.61      (2.6%)   -0.0% (  -4% -    4%) 0.981
                         Respell       57.41      (1.1%)       57.40      (1.6%)   -0.0% (  -2% -    2%) 0.968
                     LowSpanNear      193.79      (3.4%)      193.81      (3.8%)    0.0% (  -6% -    7%) 0.992
                 LowSloppyPhrase       30.88      (1.5%)       30.90      (1.8%)    0.1% (  -3% -    3%) 0.885
                      OrHighHigh       38.71      (4.3%)       38.74      (4.2%)    0.1% (  -8% -    8%) 0.954
                       OrHighLow      606.33      (2.8%)      606.93      (2.5%)    0.1% (  -5% -    5%) 0.907
                    OrHighNotMed      985.50      (4.3%)      986.95      (4.6%)    0.1% (  -8% -    9%) 0.917
                          Fuzzy2       38.38      (1.5%)       38.45      (1.8%)    0.2% (  -3% -    3%) 0.698
          OrHighMedDayTaxoFacets        5.37      (4.8%)        5.39      (4.9%)    0.4% (  -8% -   10%) 0.801
            MedTermDayTaxoFacets       27.22      (3.9%)       27.41      (4.6%)    0.7% (  -7% -    9%) 0.606
             LowIntervalsOrdered       38.55      (3.8%)       38.89      (3.2%)    0.9% (  -5% -    8%) 0.415
     BrowseRandomLabelSSDVFacets        9.42      (7.4%)        9.53      (9.7%)    1.1% ( -14% -   19%) 0.677
             MedIntervalsOrdered       33.92      (4.9%)       34.32      (4.2%)    1.2% (  -7% -   10%) 0.412
                      AndHighMed      132.91      (5.1%)      135.21      (7.0%)    1.7% (  -9% -   14%) 0.375
            HighIntervalsOrdered       25.36      (8.5%)       25.81      (7.0%)    1.7% ( -12% -   18%) 0.478
            HighTermTitleBDVSort       82.25     (17.7%)       83.75     (12.8%)    1.8% ( -24% -   39%) 0.710
               HighTermMonthSort      133.31     (13.6%)      135.96     (13.2%)    2.0% ( -21% -   33%) 0.638
                     AndHighHigh       65.89      (5.0%)       67.30      (7.3%)    2.1% (  -9% -   15%) 0.282
                      TermDTSort       69.38     (14.1%)       73.72     (12.3%)    6.3% ( -17% -   38%) 0.135
           HighTermDayOfYearSort       43.49     (15.0%)       46.25     (18.9%)    6.3% ( -23% -   47%) 0.239

mikemccand

I left a few small-ish comments. I love this approach! It generalizes our simplistic aggregation capabilities.

I agree we should not try to tackle the further generalizations needed for min now, but maybe we can add a // TODO capturing your comment (and the implicit "always 0" initial value) for the future?

Thanks @gsmiller.

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java

mikemccand · 2022-03-29T13:59:22Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java

          childCount++;
-          if (count > bottomValue) {
+          if (value > bottomValue) {


Ahh, OK, so then disregard my comment above -- we should not try to tackle min here!

mikemccand · 2022-03-29T14:00:18Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetFloatAssociations.java

+
+      @Override
+      public boolean advanceExact(int doc) throws IOException {
+        index++;


Hmm, instead of index++ shouldn't we do index = doc?

Oh wow. Great catch! I lifted this from the current implementation but this is an uncaught bug. I just reproduced it with a test. I'm going to open a separate issue and fix this bug first, then merge the fix into this PR. Thanks Mike!
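A minimal sketch of why an incrementing index is fragile: it only lines up if advanceExact is called once for every document, in order. The sketch assumes, purely for illustration, that the backing array holds one value per doc ID. Both cursor classes are hypothetical stand-ins, and the sketch makes no claim about the fix that was eventually applied.

```java
// Hypothetical stand-ins; neither cursor is Lucene code.
final class AdvanceExactSketch {
  /** Ignores the requested doc and simply moves to the next array slot. */
  static final class IncrementingCursor {
    private final double[] valuesByDoc;
    private int index = -1;

    IncrementingCursor(double[] valuesByDoc) {
      this.valuesByDoc = valuesByDoc;
    }

    boolean advanceExact(int doc) {
      index++;
      return true;
    }

    double doubleValue() {
      return valuesByDoc[index];
    }
  }

  /** Positions itself on the requested doc, as suggested in the review comment. */
  static final class PositioningCursor {
    private final double[] valuesByDoc;
    private int index = -1;

    PositioningCursor(double[] valuesByDoc) {
      this.valuesByDoc = valuesByDoc;
    }

    boolean advanceExact(int doc) {
      index = doc;
      return true;
    }

    double doubleValue() {
      return valuesByDoc[index];
    }
  }

  public static void main(String[] args) {
    double[] valuesByDoc = {1.0, 2.0, 3.0, 4.0};
    IncrementingCursor incrementing = new IncrementingCursor(valuesByDoc);
    PositioningCursor positioning = new PositioningCursor(valuesByDoc);
    // The caller skips docs 0 and 2, so the incrementing cursor drifts out of sync.
    for (int doc : new int[] {1, 3}) {
      incrementing.advanceExact(doc);
      positioning.advanceExact(doc);
      System.out.println(
          "doc " + doc + ": " + incrementing.doubleValue() + " vs " + positioning.doubleValue());
    }
  }
}
```

When every document is visited in order the two cursors agree, which may be why a bug of this kind can go unnoticed until a test skips documents.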

lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java

gsmiller · 2022-04-06T18:18:50Z

@mikemccand or @msokolov, did either of you have additional feedback? It didn't really look like it beyond the pre-existing bug (which I've since addressed), but I wanted to check before merging to make sure. Thanks again for the reviews!

mikemccand

Thanks @gsmiller looks great!

msokolov

Yeah, sorry I checked out. It was looking good to me at the start, and I think you addressed my concerns, plus did clean up, rebasing, etc. I think it's ready.

gsmiller mentioned this pull request Mar 2, 2022

LUCENE-10444 BACKPORT: Support alternate aggregation functions in association facets #719

Merged

msokolov reviewed Mar 3, 2022

LUCENE-10444: Support alternate aggregation functions in association facets

6614900

gsmiller force-pushed the LUCENE-10444-association-facets-pr branch from d3c44b7 to 6614900 Compare March 14, 2022 15:38

gsmiller requested a review from msokolov March 14, 2022 16:03

mikemccand reviewed Mar 29, 2022

gsmiller added 7 commits March 29, 2022 10:40

Merge remote-tracking branch 'origin/main' into LUCENE-10444-association-facets-pr

5620395

Merge remote-tracking branch 'origin/main' into LUCENE-10444-association-facets-pr

a6fbe68

move changes entry to 9.2

eb22809

address PR feedback

5d0221b

Merge remote-tracking branch 'origin/main' into LUCENE-10444-association-facets-pr

1bd7850

spotless

0ac7f39

Merge remote-tracking branch 'origin/main' into LUCENE-10444-association-facets-pr

f64497e

gsmiller requested a review from mikemccand April 5, 2022 16:24

gsmiller added 2 commits April 5, 2022 10:02

Merge remote-tracking branch 'origin/main' into LUCENE-10444-association-facets-pr

52fc1c1

fixup unit test compile error after merging upstream changes

78f6e65

mikemccand approved these changes Apr 6, 2022

msokolov approved these changes Apr 6, 2022

gsmiller merged commit f870edf into apache:main Apr 6, 2022

gsmiller deleted the LUCENE-10444-association-facets-pr branch April 6, 2022 21:51
