LUCENE-10444: Support alternate aggregation functions in association facets #718

Merged: 10 commits merged on Apr 6, 2022

Contributor

@gsmiller gsmiller commented Mar 2, 2022

Description

This change adds support for "max" aggregations (in addition to "sum") to association faceting. It does so in a way that is (somewhat) extensible for future aggregation functionality.

Solution

Replaced the existing association faceting classes that were hardcoded to "sum" with new classes that allow the user to specify an aggregation function. Note that I will open a separate PR for a backport of this that remains backwards-compatible on 9x.
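
To illustrate the idea of letting the caller pick the aggregation, here is a minimal sketch of a pluggable aggregation function applied to association weights. It is not the code from this PR: the names `AggregationSketch`, `Aggregation`, and `aggregate` are hypothetical, and the `(ordinal, weight)` pairs stand in for the association values indexed with each document.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

/** Hypothetical sketch; the names here are illustrative and are not Lucene's API. */
final class AggregationSketch {

  /** A pluggable way to combine an existing aggregate with an incoming weight. */
  enum Aggregation implements IntBinaryOperator {
    SUM {
      @Override
      public int applyAsInt(int existing, int incoming) {
        return existing + incoming;
      }
    },
    MAX {
      @Override
      public int applyAsInt(int existing, int incoming) {
        return Math.max(existing, incoming);
      }
    }
  }

  /**
   * Folds (ordinal, weight) pairs into one aggregate per ordinal.
   * The array starts at 0, so this assumes weights are positive.
   */
  static int[] aggregate(int ordinalCount, int[][] associations, Aggregation fn) {
    int[] values = new int[ordinalCount];
    for (int[] association : associations) {
      int ord = association[0];
      int weight = association[1];
      values[ord] = fn.applyAsInt(values[ord], weight);
    }
    return values;
  }

  public static void main(String[] args) {
    int[][] associations = {{0, 3}, {1, 5}, {0, 4}, {1, 1}};
    // prints [7, 6]
    System.out.println(Arrays.toString(aggregate(2, associations, Aggregation.SUM)));
    // prints [4, 5]
    System.out.println(Arrays.toString(aggregate(2, associations, Aggregation.MAX)));
  }
}
```

The aggregation loop is identical for both functions; only the combining step changes, which is the kind of extensibility the description refers to.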

Tests

Added new testing for new aggregation functionality.

I also tested the performance of this approach using my 9x backport. Results on luceneutil look pretty flat to me. I've posted that output in the backport PR: #719

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

Contributor

@msokolov msokolov left a comment

It makes sense to me to add this capability. I wonder if the extra abstraction hurts us though in these tight loops summing up values in an array? If it does, we might want to provide a specialization for such loops as well?

    return null;
  }

  if (dimConfig.multiValued) {
    if (dimConfig.requireDimCount) {
-     sumValues = values[dimOrd];
+     aggregatedValue = values[dimOrd];
    } else {
      // Our sum'd count is not correct, in general:
Contributor

our "aggregated" count?

Contributor Author

It's not necessarily a "count" though here right? It's an aggregated weight associated with the value.

  childCount++;
- if (count > bottomValue) {
+ if (value > bottomValue) {
Contributor

I guess we need to ensure that aggregation functions are nondecreasing? I mean min wouldn't work very well here

Contributor Author

That's right. There are actually a number of things preventing us from cleanly adding something like min. I had it originally, but as I started looking at all the changes it would require, I backed off for the time being (especially since I don't have a concrete use case in mind). One interesting challenge is that these facets implementations all assume the weights are positive values. There are a lot of > 0 checks floating around the various implementations to check whether or not a value had any "weight" associated with it. This makes sense when using counts, but it's weird when associating arbitrary weights with the values. So min started to feel a little weird and I just left it out for now.
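
One of the obstacles mentioned here can be sketched in a few lines: if aggregates start at an implicit 0 and "has any weight" is tested with `> 0`, a min aggregation over positive weights never leaves 0. This is an illustration under those two assumptions only, not code from Lucene; `MinPitfallSketch` and its methods are hypothetical names.

```java
/** Hypothetical sketch of the zero-initialization problem; not code from Lucene. */
final class MinPitfallSketch {

  /** Aggregates positive weights per ordinal with max, starting from the implicit 0. */
  static int[] aggregateMax(int ordinalCount, int[][] associations) {
    int[] values = new int[ordinalCount];
    for (int[] a : associations) {
      values[a[0]] = Math.max(values[a[0]], a[1]);
    }
    return values;
  }

  /** Same loop with min: the implicit 0 always wins against positive weights. */
  static int[] aggregateMin(int ordinalCount, int[][] associations) {
    int[] values = new int[ordinalCount];
    for (int[] a : associations) {
      values[a[0]] = Math.min(values[a[0]], a[1]);
    }
    return values;
  }

  /** Mirrors a "> 0" presence check: counts ordinals that appear to carry a weight. */
  static int countWithWeight(int[] values) {
    int count = 0;
    for (int v : values) {
      if (v > 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[][] associations = {{0, 3}, {1, 5}, {0, 4}};
    // prints 2: both ordinals have a positive max
    System.out.println(countWithWeight(aggregateMax(2, associations)));
    // prints 0: every min collapsed to the initial 0, so nothing looks "present"
    System.out.println(countWithWeight(aggregateMin(2, associations)));
  }
}
```

Supporting min would presumably need a different initial value and a separate way to track which ordinals were seen, which is consistent with the decision to leave it out of this change.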

Member

Ahh, OK, so then disregard my comment above -- we should not try to tackle min here!

Contributor

msokolov commented Mar 3, 2022

> It makes sense to me to add this capability. I wonder if the extra abstraction hurts us though in these tight loops summing up values in an array? If it does, we might want to provide a specialization for such loops as well?

Oh I missed the benchmarking you did - I guess it was on the backport PR. Looked like no significant change there, good.

@gsmiller gsmiller force-pushed the LUCENE-10444-association-facets-pr branch from d3c44b7 to 6614900 Compare March 14, 2022 15:38
@gsmiller
Contributor Author

Even though I ran benchmarks on the backport version of this change (#719), I figured it would be good to run benchmarks here as well. Below compares this patch against main using wikimedium10m:

                            Task  QPS baseline      StdDev  QPS candidate      StdDev                Pct diff  p-value
       BrowseDayOfYearTaxoFacets       21.65     (22.8%)       20.61     (22.9%)   -4.8% ( -41% -   52%) 0.507
            BrowseDateTaxoFacets       21.61     (22.6%)       20.61     (22.8%)   -4.6% ( -40% -   52%) 0.519
                         Prefix3      362.81     (10.5%)      349.00     (11.4%)   -3.8% ( -23% -   20%) 0.273
     BrowseRandomLabelTaxoFacets       17.76     (18.5%)       17.15     (20.0%)   -3.4% ( -35% -   43%) 0.577
           BrowseMonthTaxoFacets       27.32     (27.1%)       26.45     (29.2%)   -3.2% ( -46% -   72%) 0.721
                        Wildcard       64.61      (5.6%)       63.58      (6.0%)   -1.6% ( -12% -   10%) 0.381
                    OrNotHighMed      887.00      (3.3%)      874.66      (3.4%)   -1.4% (  -7% -    5%) 0.191
                         LowTerm     2661.07      (2.9%)     2630.67      (3.2%)   -1.1% (  -7% -    5%) 0.240
                   OrNotHighHigh     1523.01      (3.8%)     1506.01      (4.2%)   -1.1% (  -8% -    7%) 0.379
         AndHighMedDayTaxoFacets      124.82      (1.4%)      123.59      (1.5%)   -1.0% (  -3% -    1%) 0.032
                    HighSpanNear       21.47      (4.9%)       21.27      (4.9%)   -0.9% ( -10% -    9%) 0.558
                       MedPhrase      342.22      (2.8%)      339.35      (3.2%)   -0.8% (  -6% -    5%) 0.373
                      HighPhrase      453.21      (2.2%)      449.61      (2.6%)   -0.8% (  -5% -    4%) 0.291
                     MedSpanNear       74.03      (4.3%)       73.45      (4.2%)   -0.8% (  -8% -    8%) 0.559
           BrowseMonthSSDVFacets       13.79     (19.1%)       13.69     (19.7%)   -0.7% ( -33% -   47%) 0.910
                       LowPhrase       85.89      (1.9%)       85.33      (2.0%)   -0.7% (  -4% -    3%) 0.294
                        PKLookup      169.25      (3.2%)      168.19      (2.3%)   -0.6% (  -5% -    5%) 0.481
                    OrNotHighLow      962.38      (3.0%)      956.47      (2.9%)   -0.6% (  -6% -    5%) 0.511
       BrowseDayOfYearSSDVFacets       12.23     (14.4%)       12.15     (14.5%)   -0.6% ( -25% -   33%) 0.897
            BrowseDateSSDVFacets        2.34      (6.1%)        2.32      (7.8%)   -0.6% ( -13% -   14%) 0.802
                       OrHighMed      134.10      (5.0%)      133.49      (4.6%)   -0.5% (  -9% -    9%) 0.766
                          Fuzzy1       91.09      (1.2%)       90.71      (1.8%)   -0.4% (  -3% -    2%) 0.373
                        HighTerm     1690.39      (4.6%)     1684.21      (5.3%)   -0.4% (  -9% -   10%) 0.816
                   OrHighNotHigh     1592.87      (2.4%)     1587.86      (3.7%)   -0.3% (  -6% -    5%) 0.751
        AndHighHighDayTaxoFacets       12.34      (2.4%)       12.31      (2.4%)   -0.3% (  -4% -    4%) 0.696
                      AndHighLow      927.42      (3.2%)      924.95      (3.1%)   -0.3% (  -6% -    6%) 0.790
                 MedSloppyPhrase      107.35      (2.5%)      107.06      (2.5%)   -0.3% (  -5% -    4%) 0.736
                          IntNRQ       83.16      (1.2%)       82.95      (1.0%)   -0.2% (  -2% -    1%) 0.469
                         MedTerm     1935.87      (4.3%)     1934.77      (4.8%)   -0.1% (  -8% -    9%) 0.969
                    OrHighNotLow     1109.87      (4.1%)     1109.25      (4.7%)   -0.1% (  -8% -    9%) 0.968
                HighSloppyPhrase       28.62      (1.8%)       28.61      (2.6%)   -0.0% (  -4% -    4%) 0.981
                         Respell       57.41      (1.1%)       57.40      (1.6%)   -0.0% (  -2% -    2%) 0.968
                     LowSpanNear      193.79      (3.4%)      193.81      (3.8%)    0.0% (  -6% -    7%) 0.992
                 LowSloppyPhrase       30.88      (1.5%)       30.90      (1.8%)    0.1% (  -3% -    3%) 0.885
                      OrHighHigh       38.71      (4.3%)       38.74      (4.2%)    0.1% (  -8% -    8%) 0.954
                       OrHighLow      606.33      (2.8%)      606.93      (2.5%)    0.1% (  -5% -    5%) 0.907
                    OrHighNotMed      985.50      (4.3%)      986.95      (4.6%)    0.1% (  -8% -    9%) 0.917
                          Fuzzy2       38.38      (1.5%)       38.45      (1.8%)    0.2% (  -3% -    3%) 0.698
          OrHighMedDayTaxoFacets        5.37      (4.8%)        5.39      (4.9%)    0.4% (  -8% -   10%) 0.801
            MedTermDayTaxoFacets       27.22      (3.9%)       27.41      (4.6%)    0.7% (  -7% -    9%) 0.606
             LowIntervalsOrdered       38.55      (3.8%)       38.89      (3.2%)    0.9% (  -5% -    8%) 0.415
     BrowseRandomLabelSSDVFacets        9.42      (7.4%)        9.53      (9.7%)    1.1% ( -14% -   19%) 0.677
             MedIntervalsOrdered       33.92      (4.9%)       34.32      (4.2%)    1.2% (  -7% -   10%) 0.412
                      AndHighMed      132.91      (5.1%)      135.21      (7.0%)    1.7% (  -9% -   14%) 0.375
            HighIntervalsOrdered       25.36      (8.5%)       25.81      (7.0%)    1.7% ( -12% -   18%) 0.478
            HighTermTitleBDVSort       82.25     (17.7%)       83.75     (12.8%)    1.8% ( -24% -   39%) 0.710
               HighTermMonthSort      133.31     (13.6%)      135.96     (13.2%)    2.0% ( -21% -   33%) 0.638
                     AndHighHigh       65.89      (5.0%)       67.30      (7.3%)    2.1% (  -9% -   15%) 0.282
                      TermDTSort       69.38     (14.1%)       73.72     (12.3%)    6.3% ( -17% -   38%) 0.135
           HighTermDayOfYearSort       43.49     (15.0%)       46.25     (18.9%)    6.3% ( -23% -   47%) 0.239

@gsmiller gsmiller requested a review from msokolov March 14, 2022 16:03
Member

@mikemccand mikemccand left a comment

I left a few small-ish comments. I love this approach! It generalizes our simplistic aggregation capabilities.

I agree we should not try to tackle the further generalizations needed for min now, but maybe we can add a // TODO capturing your comment (and the implicit "always 0" initial value) for the future?

Thanks @gsmiller.


@Override
public boolean advanceExact(int doc) throws IOException {
  index++;
Member

Hmm, instead of index++ shouldn't we do index = doc?

Contributor Author

Oh wow. Great catch! I lifted this from the current implementation but this is an uncaught bug. I just reproduced it with a test. I'm going to open a separate issue and fix this bug first, then merge the fix into this PR. Thanks Mike!
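
A small sketch of why `index++` goes wrong in `advanceExact`: the method is only called for documents the caller visits, so incrementing a counter stays in step with doc IDs only when every document is visited in order. The class containing the snippet is not shown in this thread, so the array-backed `ArrayBackedValues` below is a hypothetical stand-in, with a flag to switch between the two behaviors.

```java
/** Hypothetical sketch of the advanceExact bug; the class here is illustrative only. */
final class AdvanceExactSketch {

  /** Array-backed per-document values, indexed by doc ID. */
  static final class ArrayBackedValues {
    private final long[] valuesByDoc;
    private final boolean buggy;
    private int index = -1;

    ArrayBackedValues(long[] valuesByDoc, boolean buggy) {
      this.valuesByDoc = valuesByDoc;
      this.buggy = buggy;
    }

    /** Positions the cursor for the given doc ID. */
    boolean advanceExact(int doc) {
      if (buggy) {
        index++; // only correct if every doc is visited, in order
      } else {
        index = doc; // follows the doc ID the caller asked for
      }
      return index < valuesByDoc.length;
    }

    long longValue() {
      return valuesByDoc[index];
    }
  }

  /** Reads the values for the given docs, in order, through advanceExact. */
  static long[] read(ArrayBackedValues values, int[] docs) {
    long[] result = new long[docs.length];
    for (int i = 0; i < docs.length; i++) {
      values.advanceExact(docs[i]);
      result[i] = values.longValue();
    }
    return result;
  }

  public static void main(String[] args) {
    long[] data = {10, 20, 30, 40};
    int[] matchingDocs = {1, 3}; // docs 0 and 2 are skipped
    // prints [20, 40]
    System.out.println(java.util.Arrays.toString(read(new ArrayBackedValues(data, false), matchingDocs)));
    // prints [10, 20]: the counter drifted away from the doc IDs
    System.out.println(java.util.Arrays.toString(read(new ArrayBackedValues(data, true), matchingDocs)));
  }
}
```

A test that skips at least one document, as in `matchingDocs` above, is enough to expose the difference.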

@gsmiller gsmiller requested a review from mikemccand April 5, 2022 16:24
Contributor Author

gsmiller commented Apr 6, 2022

@mikemccand or @msokolov, did either of you have additional feedback? It didn't really look like it beyond the pre-existing bug (which I've since addressed), but I wanted to check before merging to make sure. Thanks again for the reviews!

Member

@mikemccand mikemccand left a comment

Thanks @gsmiller looks great!

Contributor

@msokolov msokolov left a comment

Yeah, sorry I checked out. It was looking good to me at the start, and I think you addressed my concerns, plus did cleanup, rebasing, etc. I think it's ready.

@gsmiller gsmiller merged commit f870edf into apache:main Apr 6, 2022
@gsmiller gsmiller deleted the LUCENE-10444-association-facets-pr branch April 6, 2022 21:51