LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit #915

Merged
merged 2 commits into from May 29, 2022

gsmiller
Contributor

Description

The facets module has seen a fair amount of recent development, some of which introduced copy/paste duplication. This PR cleans up that duplication and simplifies/refactors some of the internal functionality.

Solution

Minor internal refactoring of some facet implementations and extraction of a common parent class to contain some duplicated logic in the SSDV faceting implementations.
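
As a rough illustration of the pattern described here (not the code in this PR), the sketch below pulls shared top-N logic into an abstract parent and leaves only count lookup to the subclasses. All class and method names other than `getCount(int ord)` are hypothetical, and the two storage strategies (dense array, sparse map) are assumptions chosen to show why the lookup is the part that varies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical shared parent: holds the logic that would otherwise be copy/pasted. */
abstract class AbstractOrdinalCounts {
  private final String[] labels; // ordinal -> label

  AbstractOrdinalCounts(String[] labels) {
    this.labels = labels;
  }

  /** The only piece that differs between implementations: how a count is stored and read. */
  abstract int getCount(int ord);

  /** Shared logic: collect non-zero ordinals, sort by count, return the top N as label=count. */
  final List<String> topLabels(int topN) {
    List<Integer> ords = new ArrayList<>();
    for (int ord = 0; ord < labels.length; ord++) {
      if (getCount(ord) > 0) {
        ords.add(ord);
      }
    }
    // Highest count first; ties broken by lower ordinal.
    ords.sort(
        (a, b) -> {
          int cmp = Integer.compare(getCount(b), getCount(a));
          return cmp != 0 ? cmp : Integer.compare(a, b);
        });
    String[] results = new String[Math.min(topN, ords.size())];
    for (int i = 0; i < results.length; i++) {
      int ord = ords.get(i);
      results[i] = labels[ord] + "=" + getCount(ord);
    }
    return Arrays.asList(results);
  }
}

/** Assumed variant 1: counts kept in a dense array indexed by ordinal. */
final class DenseOrdinalCounts extends AbstractOrdinalCounts {
  private final int[] counts;

  DenseOrdinalCounts(String[] labels, int[] counts) {
    super(labels);
    this.counts = counts;
  }

  @Override
  int getCount(int ord) {
    return counts[ord];
  }
}

/** Assumed variant 2: counts kept in a map, with missing ordinals treated as zero. */
final class SparseOrdinalCounts extends AbstractOrdinalCounts {
  private final Map<Integer, Integer> counts;

  SparseOrdinalCounts(String[] labels, Map<Integer, Integer> counts) {
    super(labels);
    this.counts = counts;
  }

  @Override
  int getCount(int ord) {
    return counts.getOrDefault(ord, 0);
  }
}

final class FacetCountsSketch {
  public static void main(String[] args) {
    String[] labels = {"2020", "2021", "2022"};
    AbstractOrdinalCounts dense = new DenseOrdinalCounts(labels, new int[] {3, 0, 7});
    AbstractOrdinalCounts sparse = new SparseOrdinalCounts(labels, Map.of(0, 3, 2, 7));
    System.out.println(dense.topLabels(2)); // prints [2022=7, 2020=3]
    System.out.println(sparse.topLabels(2)); // prints [2022=7, 2020=3]
  }
}
```

Both subclasses produce the same result from one copy of the sorting and result-building code, which is the kind of de-duplication the description refers to.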

Tests

No new tests were written. All existing tests still pass.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@gsmiller
Contributor Author

Ran benchmarks to make sure this didn't regress performance in any meaningful way. I'm not seeing any regressions:

                            Task QPS baseline      StdDev QPS candidate      StdDev                Pct diff p-value
                        Wildcard      337.70     (16.7%)      323.50     (16.7%)   -4.2% ( -32% -   35%) 0.426
           BrowseMonthTaxoFacets       29.43     (13.8%)       28.72     (19.1%)   -2.4% ( -31% -   35%) 0.649
            BrowseDateSSDVFacets        2.68     (17.5%)        2.63     (17.0%)   -1.8% ( -30% -   39%) 0.743
       BrowseDayOfYearSSDVFacets       14.94     (16.5%)       14.68     (13.6%)   -1.8% ( -27% -   33%) 0.709
               HighTermMonthSort      147.90     (22.4%)      145.50     (19.4%)   -1.6% ( -35% -   51%) 0.807
           HighTermDayOfYearSort       61.68     (14.9%)       60.80     (15.3%)   -1.4% ( -27% -   33%) 0.766
                   OrHighNotHigh     1107.68      (4.1%)     1092.99      (2.8%)   -1.3% (  -7% -    5%) 0.232
                         LowTerm     2181.38      (3.5%)     2153.14      (3.1%)   -1.3% (  -7% -    5%) 0.213
                    OrHighNotMed     1420.79      (2.7%)     1404.11      (3.1%)   -1.2% (  -6% -    4%) 0.203
                      TermDTSort      117.08     (15.5%)      115.77     (15.3%)   -1.1% ( -27% -   35%) 0.818
     BrowseRandomLabelTaxoFacets       18.20     (10.0%)       18.01     (13.2%)   -1.1% ( -22% -   24%) 0.769
                 LowSloppyPhrase      121.57      (4.8%)      120.52      (3.7%)   -0.9% (  -8% -    8%) 0.527
                        HighTerm     1427.75      (4.8%)     1416.29      (3.0%)   -0.8% (  -8% -    7%) 0.529
                 MedSloppyPhrase       92.18      (5.1%)       91.45      (4.0%)   -0.8% (  -9% -    8%) 0.589
            BrowseDateTaxoFacets       20.98     (12.4%)       20.82     (15.5%)   -0.7% ( -25% -   30%) 0.866
                    OrHighNotLow     1233.19      (4.3%)     1223.99      (3.8%)   -0.7% (  -8% -    7%) 0.557
                       OrHighLow      515.20      (2.8%)      511.43      (2.7%)   -0.7% (  -6% -    4%) 0.400
       BrowseDayOfYearTaxoFacets       20.91     (12.5%)       20.77     (15.7%)   -0.6% ( -25% -   31%) 0.886
                     AndHighHigh       40.18      (4.6%)       40.06      (4.2%)   -0.3% (  -8% -    8%) 0.823
                    OrNotHighMed      923.89      (3.3%)      921.97      (2.8%)   -0.2% (  -6% -    6%) 0.830
                HighSloppyPhrase       17.59      (6.1%)       17.55      (4.3%)   -0.2% ( -10% -   10%) 0.906
                      AndHighMed      137.42      (3.7%)      137.17      (3.5%)   -0.2% (  -7% -    7%) 0.870
                        PKLookup      168.11      (2.6%)      168.01      (3.4%)   -0.1% (  -5% -    6%) 0.951
     BrowseRandomLabelSSDVFacets       10.14      (3.6%)       10.14      (3.3%)    0.0% (  -6% -    7%) 0.989
                          IntNRQ      204.47      (0.9%)      204.53      (0.6%)    0.0% (  -1% -    1%) 0.907
                   OrNotHighHigh     1480.42      (3.4%)     1480.99      (2.4%)    0.0% (  -5% -    6%) 0.967
                         Prefix3      221.07     (15.0%)      221.20     (13.6%)    0.1% ( -24% -   33%) 0.989
                          Fuzzy1       79.60      (1.8%)       79.74      (1.4%)    0.2% (  -2% -    3%) 0.727
                          Fuzzy2       75.46      (1.9%)       75.62      (1.6%)    0.2% (  -3% -    3%) 0.697
                         MedTerm     1563.09      (5.2%)     1567.15      (3.5%)    0.3% (  -8% -    9%) 0.853
                     MedSpanNear       27.45      (1.7%)       27.54      (1.9%)    0.3% (  -3% -    3%) 0.579
                       LowPhrase      123.19      (2.2%)      123.66      (1.7%)    0.4% (  -3% -    4%) 0.539
                         Respell       46.77      (1.8%)       46.97      (1.4%)    0.4% (  -2% -    3%) 0.422
                       MedPhrase      476.52      (2.1%)      478.54      (1.9%)    0.4% (  -3% -    4%) 0.497
                      HighPhrase      222.77      (2.3%)      223.82      (1.7%)    0.5% (  -3% -    4%) 0.458
                      AndHighLow     1161.41      (2.3%)     1167.56      (2.4%)    0.5% (  -4% -    5%) 0.482
                       OrHighMed      128.58      (3.7%)      129.28      (4.3%)    0.5% (  -7% -    8%) 0.671
                     LowSpanNear       13.26      (2.3%)       13.34      (2.2%)    0.6% (  -3% -    5%) 0.421
                    HighSpanNear       10.68      (2.2%)       10.74      (2.1%)    0.6% (  -3% -    5%) 0.392
                    OrNotHighLow     1314.95      (3.3%)     1324.37      (2.5%)    0.7% (  -4% -    6%) 0.443
                      OrHighHigh       19.95      (3.9%)       20.11      (4.6%)    0.8% (  -7% -    9%) 0.552
             LowIntervalsOrdered       57.37      (4.6%)       57.85      (2.1%)    0.8% (  -5% -    7%) 0.457
        AndHighHighDayTaxoFacets       19.87      (2.5%)       20.06      (3.0%)    1.0% (  -4% -    6%) 0.274
             MedIntervalsOrdered       56.23      (4.1%)       56.78      (2.3%)    1.0% (  -5% -    7%) 0.354
         AndHighMedDayTaxoFacets       81.95      (2.8%)       82.87      (2.0%)    1.1% (  -3% -    5%) 0.138
            HighIntervalsOrdered       22.29      (5.1%)       22.56      (3.5%)    1.2% (  -7% -   10%) 0.380
            MedTermDayTaxoFacets       39.54      (4.8%)       40.04      (4.9%)    1.3% (  -8% -   11%) 0.410
          OrHighMedDayTaxoFacets        4.09      (4.7%)        4.15      (4.2%)    1.5% (  -7% -   10%) 0.304
            HighTermTitleBDVSort       42.14     (15.8%)       43.15     (20.0%)    2.4% ( -28% -   45%) 0.675
           BrowseMonthSSDVFacets       15.74     (11.0%)       16.17     (13.1%)    2.7% ( -19% -   30%) 0.477
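
One way to read the table above, sketched under assumptions: the "Pct diff" column is the relative change in mean QPS from baseline to candidate, and a slowdown is only treated as a real regression when its p-value also falls below a cutoff. The 0.05 cutoff and the class names `BenchRow` and `BenchTableSketch` are hypothetical and are not taken from this thread or from the benchmark tool.

```java
import java.util.Locale;

/** One row of the benchmark table; only the columns needed for this sketch. */
final class BenchRow {
  final String task;
  final double baselineQps;
  final double candidateQps;
  final double pValue;

  BenchRow(String task, double baselineQps, double candidateQps, double pValue) {
    this.task = task;
    this.baselineQps = baselineQps;
    this.candidateQps = candidateQps;
    this.pValue = pValue;
  }

  /** Relative change of the candidate's mean QPS versus the baseline's, in percent. */
  double pctDiff() {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  /** Assumption: flag a slowdown only when its p-value is also below the chosen cutoff. */
  boolean looksLikeRegression(double alpha) {
    return pctDiff() < 0 && pValue < alpha;
  }
}

final class BenchTableSketch {
  public static void main(String[] args) {
    double alpha = 0.05; // assumed cutoff, not stated in the thread
    BenchRow[] rows = {
      new BenchRow("Wildcard", 337.70, 323.50, 0.426),
      new BenchRow("AndHighMedDayTaxoFacets", 81.95, 82.87, 0.138),
      new BenchRow("BrowseMonthSSDVFacets", 15.74, 16.17, 0.477),
    };
    for (BenchRow row : rows) {
      System.out.println(
          String.format(
              Locale.ROOT,
              "%-24s %+5.1f%% regression=%b",
              row.task,
              row.pctDiff(),
              row.looksLikeRegression(alpha)));
    }
  }
}
```

Under that reading, none of the three sample rows is flagged: the largest slowdown (Wildcard) has a p-value of 0.426, and the smallest p-value in the table (0.138) belongs to a row where the candidate is slightly faster.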

return Arrays.asList(results);
}

abstract int getCount(int ord);
Contributor

nit: maybe add a javadoc here explaining that if counts are aggregated at indexing time, we can just retrieve them.

Contributor Author

Good call. Thanks!

@Yuti-G
Contributor

Yuti-G commented May 23, 2022

Hi @gsmiller, thanks for making so many improvements to the code; it looks great to me! I also ran the facet benchmarks and did not observe much difference from the main branch. I added getTopDims to the benchmarks, but that PR hasn't been merged yet, so the attached results are from my local run. Thanks!

Main:
Screen Shot 2022-05-23 at 11 12 26 AM

pr/915:
Screen Shot 2022-05-23 at 12 17 42 PM

@gsmiller
Contributor Author

Thanks @Yuti-G for the feedback and benchmark results! I appreciate you taking a look since I know you're quite familiar with this code. I saw a couple opportunities to de-dupe some code, but you did all the hard work introducing the feature. Thanks again!

@mikemccand
Member

Thank you @Yuti-G for running the dedicated luceneutil faceting benchmark!

But: the getAllDims time for SSDV seems to have gotten much faster with this PR, which is great! Was that expected? Or is this some horrible noise? Is it repeatable?

@Yuti-G
Contributor

Yuti-G commented May 26, 2022

> But: the getAllDims time for SSDV seems to have gotten much faster with this PR, which is great! Was that expected? Or is this some horrible noise? Is it repeatable?

I think it's just noise. I just re-ran the dedicated luceneutil faceting benchmark against the main branch:

1st run:
Screen Shot 2022-05-26 at 9 55 35 AM

2nd run:
Screen Shot 2022-05-26 at 9 56 07 AM

3rd run:
Screen Shot 2022-05-26 at 10 49 52 AM

…en in getTopDims if we know dim counts will be inaccurate ahead of time
@gsmiller
Contributor Author

@Yuti-G I just updated the PR with some additional comments/javadoc and a very minor optimization in the SSDV#getTopDims case. Could you have a look at the latest changes when you get a chance?

@gsmiller
Contributor Author

Since this change is purely meant to remove some code duplication and make some very minor optimizations, and doesn't modify the API or expose any additional API surface area, I plan to merge in the next couple of days unless anyone objects. If anyone wants more time to review or has feedback, I'm more than happy to wait. Thanks!

@Yuti-G
Contributor

Yuti-G commented May 26, 2022

Looks good to me! I will rebase my current work at #914 - getAllChildren after this PR is merged. Thank you so much for adding more documentation and making the code so clean!

@gsmiller gsmiller merged commit 8db1e41 into apache:main May 29, 2022
@gsmiller gsmiller deleted the LUCENE-10585-facet-cleanup-pr branch May 29, 2022 08:26
@gsmiller
Contributor Author

@Yuti-G this is merged onto main and backported to branch_9x. Please feel free to rebase your PR on top of it, and let me know if you have any questions. Also, thanks for mentioning your PR. I'd missed that but would like to have a look. I'll wait for you to rebase then I'll review. Thanks again (and hopefully rebasing isn't too big of a pain!)!

shaie pushed a commit to mdmarshmallow/lucene that referenced this pull request Jun 22, 2022