
LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals #443

Merged
merged 15 commits into apache:branch_9_0 on Nov 19, 2021

Conversation

@gsmiller (Contributor) commented Nov 15, 2021:

Description

In benchmarks, using numeric doc values to store taxonomy facet ordinals (instead of custom delta-encoding into a binary doc values field) shows an almost 400% QPS improvement on browse-related taxonomy-based tasks. This PR changes the encoding of facet ordinals accordingly, while maintaining backwards compatibility with 8.x indexes.

Solution

This change moves to standard numeric doc values for storing taxonomy ordinals.
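
To make the change concrete, here is a minimal sketch of the new encoding on the indexing and reading side. It is illustrative only, not the actual FacetsConfig/TaxonomyFacets code from this PR; the field-name parameter and surrounding plumbing are assumptions.

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

final class OrdinalEncodingSketch {

  // Indexing: one numeric doc-values entry per facet ordinal, instead of
  // packing all of a document's ordinals into a single delta-encoded binary value.
  static void addOrdinals(Document doc, String field, int[] ordinals) {
    for (int ord : ordinals) {
      doc.add(new SortedNumericDocValuesField(field, ord));
    }
  }

  // Reading: iterate the multi-valued numeric doc values per document.
  static void countOrdinals(LeafReader reader, String field, int[] counts) throws IOException {
    SortedNumericDocValues dv = DocValues.getSortedNumeric(reader, field);
    for (int doc = dv.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = dv.nextDoc()) {
      for (int i = 0, n = dv.docValueCount(); i < n; i++) {
        counts[(int) dv.nextValue()]++; // aggregate per-ordinal counts
      }
    }
  }
}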

Tests

No new tests added. Lots of existing test coverage for taxonomy faceting functionality.

NOTE: I will add new tests that ensure backwards-compatibility support remains for 8.x indexes. I'm putting this PR out for feedback while doing so.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@gsmiller (Contributor, Author) commented:

NOTE: I'm working on additional testing but hoping to get some early feedback on this approach, particularly where backwards compatibility is concerned. I'll update with more tests and a CHANGES entry.

@jpountz (Contributor) left a comment:

I wonder if we could make the bw compat logic simpler by creating a SortedNumericDocValues instance that is backed by BinaryDocValues for 8.x indices?

@gsmiller (Contributor, Author) replied:

> I wonder if we could make the bw compat logic simpler by creating a SortedNumericDocValues instance that is backed by BinaryDocValues for 8.x indices?

I like this idea. Thanks! I'll give it a shot and see if it makes things a bit cleaner.

@gsmiller (Contributor, Author) commented:

OK should be ready for another look now. Thanks for the feedback everyone!

// so sub-class decoding implementations are honored:
SortedNumericDocValues wrapped =
    BackCompatSortedNumericDocValues.wrap(
        context.reader().getBinaryDocValues(field), this::decode);
(Contributor) commented on this snippet:

I'm confused: this is the only call site that takes a decoder, and yet it uses the same decoding logic as the default logic. Do we need to expose a decoding function at all?

@gsmiller (Contributor, Author) replied:

It's a bit wonky for sure, but I think this is necessary to truly be backwards compatible. The issue is that users could be extending DocValuesOrdinalsReader and implementing their own custom decode logic. So this "hook" makes sure we delegate to the decode method in case the user is doing that.
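
For readers following the thread, here is a rough sketch of what a BinaryDocValues-backed SortedNumericDocValues with such a decode hook could look like. This is not the actual BackCompatSortedNumericDocValues from the PR; the Decoder interface stands in for the real DocValuesOrdinalsReader#decode plumbing.

import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

/** Sketch: expose delta-encoded binary ordinals as SortedNumericDocValues. */
final class BinaryBackedOrdinalsSketch extends SortedNumericDocValues {

  /** Stand-in for the overridable decode hook (e.g. DocValuesOrdinalsReader#decode). */
  interface Decoder {
    void decode(BytesRef buf, IntsRef ordinals);
  }

  private final BinaryDocValues in;
  private final Decoder decoder;
  private final IntsRef ordinals = new IntsRef();
  private int upto;

  BinaryBackedOrdinalsSketch(BinaryDocValues in, Decoder decoder) {
    this.in = in;
    this.decoder = decoder;
  }

  // Decode the current document's ordinals once, then serve them via nextValue().
  private void reload() throws IOException {
    decoder.decode(in.binaryValue(), ordinals);
    upto = 0;
  }

  @Override
  public boolean advanceExact(int target) throws IOException {
    if (in.advanceExact(target) == false) {
      return false;
    }
    reload();
    return true;
  }

  @Override
  public int nextDoc() throws IOException {
    int doc = in.nextDoc();
    if (doc != NO_MORE_DOCS) {
      reload();
    }
    return doc;
  }

  @Override
  public int advance(int target) throws IOException {
    int doc = in.advance(target);
    if (doc != NO_MORE_DOCS) {
      reload();
    }
    return doc;
  }

  @Override
  public long nextValue() {
    return ordinals.ints[ordinals.offset + upto++];
  }

  @Override
  public int docValueCount() {
    return ordinals.length;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public long cost() {
    return in.cost();
  }
}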

@Deprecated
public static boolean usesOlderBinaryOrdinals(LeafReader reader) {
  return reader.getMetaData().getCreatedVersionMajor() <= 8;
}
(Contributor) commented on this snippet:

instead of just being a flag, maybe this could abstract reading ordinals from the index?

  public static SortedNumericDocValues getOrdinals(LeafReader reader, String field) {

@gsmiller (Contributor, Author) replied:

Makes sense. I went ahead and made this change. Thanks for the suggestion.
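
For context, here is a hypothetical shape for that helper (the real method in the PR may differ in name and details). The idea is to pick the encoding based on the index's creation version and keep callers on the standard SortedNumericDocValues API; wrapBinary stands in for the back-compat wrapper sketched earlier in the thread.

import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;

final class OrdinalsAccessSketch {

  /** Assumed helper shape, following the reviewer suggestion above. */
  static SortedNumericDocValues getOrdinals(LeafReader reader, String field) throws IOException {
    if (reader.getMetaData().getCreatedVersionMajor() <= 8) {
      // 8.x index: ordinals live in a delta-encoded binary doc values field;
      // wrap it so callers always see SortedNumericDocValues.
      BinaryDocValues binary = reader.getBinaryDocValues(field);
      return binary == null ? null : wrapBinary(binary); // null: field absent in this segment
    }
    // 9.x index: ordinals are stored natively as multi-valued numeric doc values.
    return DocValues.getSortedNumeric(reader, field);
  }

  // Placeholder for the BinaryDocValues-backed wrapper sketched earlier.
  private static SortedNumericDocValues wrapBinary(BinaryDocValues binary) {
    throw new UnsupportedOperationException("see the back-compat wrapper sketch above");
  }
}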

@mikemccand (Member) left a comment:

I think this one is very close -- I left mostly minor comments. Thanks @gsmiller!

@Deprecated
public static boolean usesOlderBinaryOrdinals(LeafReader reader) {
  return reader.getMetaData().getCreatedVersionMajor() <= 8;
}
(Member) commented on this snippet:

Is this checking whether the particular segment was created with version 8 or earlier? Hmm, or does this method .getCreatedVersionMajor() refer to the whole index, even though it is this one segment that stored it?

@jpountz (Contributor) replied:

It refers to the whole index indeed.

@gsmiller (Contributor, Author) replied:

Interesting. I think it works either way for this purpose, but I had assumed it was at the segment level. Thanks @jpountz for clarifying.

(Member) commented:

Maybe we should (separately) update the javadocs for this method to make it more clear that even though this is a leaf's metadata class, this method (and maybe also the other two) are really global properties to the index?
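
A tiny illustration of the point being discussed: the value is read through each leaf, but it is an index-wide property, so every segment of the same index reports the same major version (the class and method calls here are real Lucene APIs; the surrounding harness is assumed).

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;

final class CreatedVersionSketch {
  // Prints the same number for every leaf: the major version that created the
  // index, not the version of Lucene that wrote each individual segment.
  static void printCreatedVersion(DirectoryReader reader) {
    for (LeafReaderContext ctx : reader.leaves()) {
      System.out.println(ctx.reader().getMetaData().getCreatedVersionMajor());
    }
  }
}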

@gsmiller (Contributor, Author) commented:

> Maybe clarify that full index rebuild means start a new 9.x index and index all your documents. I.e. merely re-indexing all docs into your previous (8.x created) index is not sufficient.

Makes sense. Will tweak the language.

@gsmiller (Contributor, Author) commented:

For posterity, I re-ran benchmarks with this most recent change, comparing against branch_9_0 as a baseline (so it includes all the back-compat checks and such). Results still look very good:

                            Task    QPS baseline    StdDev   QPS candidate    StdDev              Pct diff  p-value
           HighTermDayOfYearSort      127.51     (15.4%)      124.04     (12.1%)   -2.7% ( -26% -   29%) 0.534
                      TermDTSort       96.36     (15.3%)       94.54     (15.1%)   -1.9% ( -28% -   33%) 0.695
               HighTermMonthSort      135.11     (13.3%)      133.31     (16.3%)   -1.3% ( -27% -   32%) 0.778
                          Fuzzy1       63.28      (8.3%)       62.55      (8.6%)   -1.2% ( -16% -   17%) 0.662
                        PKLookup      152.55      (2.1%)      151.14      (1.8%)   -0.9% (  -4% -    3%) 0.139
             MedIntervalsOrdered       82.93      (3.3%)       82.62      (2.4%)   -0.4% (  -5% -    5%) 0.681
       BrowseDayOfYearSSDVFacets       14.45     (14.5%)       14.40     (14.1%)   -0.3% ( -25% -   33%) 0.945
            HighTermTitleBDVSort      106.51     (21.3%)      106.18     (21.6%)   -0.3% ( -35% -   54%) 0.964
                         MedTerm     1238.25      (2.3%)     1235.30      (2.5%)   -0.2% (  -4% -    4%) 0.753
             LowIntervalsOrdered      112.01      (2.2%)      111.75      (1.9%)   -0.2% (  -4% -    4%) 0.724
                         Prefix3      246.62      (2.2%)      246.10      (2.2%)   -0.2% (  -4% -    4%) 0.764
                 MedSloppyPhrase       72.18      (2.9%)       72.06      (3.0%)   -0.2% (  -5% -    5%) 0.861
                   OrHighNotHigh      530.67      (2.9%)      529.98      (2.1%)   -0.1% (  -5% -    5%) 0.873
                    OrHighNotLow      581.58      (3.3%)      581.28      (2.3%)   -0.1% (  -5% -    5%) 0.954
           BrowseMonthSSDVFacets       14.32      (4.5%)       14.32      (4.5%)   -0.0% (  -8% -    9%) 0.977
                       MedPhrase      200.56      (1.9%)      200.54      (1.9%)   -0.0% (  -3% -    3%) 0.992
                    OrHighNotMed      600.02      (2.7%)      599.99      (3.1%)   -0.0% (  -5% -    5%) 0.995
                          IntNRQ       56.03      (3.5%)       56.03      (3.5%)    0.0% (  -6% -    7%) 0.995
            HighIntervalsOrdered        1.86      (2.2%)        1.86      (2.2%)    0.1% (  -4% -    4%) 0.908
                     LowSpanNear       44.66      (1.7%)       44.70      (1.6%)    0.1% (  -3% -    3%) 0.874
                        HighTerm      974.66      (3.1%)      975.67      (2.8%)    0.1% (  -5% -    6%) 0.912
                         Respell       59.10      (1.8%)       59.17      (1.9%)    0.1% (  -3% -    3%) 0.849
                      HighPhrase       30.48      (1.9%)       30.52      (1.8%)    0.1% (  -3% -    3%) 0.819
                       OrHighLow      310.16      (2.6%)      310.61      (2.3%)    0.1% (  -4% -    5%) 0.849
                         LowTerm     1419.43      (2.7%)     1421.64      (2.5%)    0.2% (  -4% -    5%) 0.851
                     MedSpanNear       14.05      (2.3%)       14.08      (2.2%)    0.2% (  -4% -    4%) 0.798
                       LowPhrase      218.41      (4.5%)      218.81      (4.7%)    0.2% (  -8% -    9%) 0.899
                   OrNotHighHigh      531.27      (2.5%)      532.41      (2.1%)    0.2% (  -4% -    4%) 0.768
                 LowSloppyPhrase       23.57      (1.6%)       23.63      (1.8%)    0.2% (  -3% -    3%) 0.649
                HighSloppyPhrase       34.72      (2.0%)       34.83      (2.0%)    0.3% (  -3% -    4%) 0.595
                    OrNotHighLow      786.86      (2.0%)      789.56      (1.9%)    0.3% (  -3% -    4%) 0.580
                    HighSpanNear       32.06      (2.9%)       32.17      (2.6%)    0.3% (  -4% -    5%) 0.690
                      AndHighMed      210.90      (3.2%)      212.04      (3.1%)    0.5% (  -5% -    7%) 0.589
                        Wildcard      106.13      (3.0%)      106.83      (3.3%)    0.7% (  -5% -    7%) 0.505
                       OrHighMed       60.24      (2.3%)       60.69      (2.4%)    0.7% (  -3% -    5%) 0.318
                     AndHighHigh       54.27      (3.3%)       54.70      (3.3%)    0.8% (  -5% -    7%) 0.448
                      AndHighLow      514.75      (2.5%)      518.95      (2.3%)    0.8% (  -3% -    5%) 0.284
                      OrHighHigh       25.30      (2.5%)       25.51      (2.3%)    0.8% (  -3% -    5%) 0.274
                    OrNotHighMed      636.45      (2.8%)      643.57      (2.8%)    1.1% (  -4% -    6%) 0.204
                          Fuzzy2       49.98      (7.1%)       50.55      (7.0%)    1.2% ( -12% -   16%) 0.605
        AndHighHighDayTaxoFacets       23.04      (2.6%)       26.15      (2.8%)   13.5% (   7% -   19%) 0.000
         AndHighMedDayTaxoFacets       44.10      (2.3%)       53.02      (3.1%)   20.2% (  14% -   26%) 0.000
            MedTermDayTaxoFacets       22.03      (3.9%)       31.37      (3.7%)   42.4% (  33% -   51%) 0.000
          OrHighMedDayTaxoFacets        6.94      (3.0%)       10.00      (4.2%)   44.0% (  35% -   52%) 0.000
           BrowseMonthTaxoFacets        2.58      (7.7%)       12.74     (71.1%)  393.7% ( 292% -  511%) 0.000
            BrowseDateTaxoFacets        2.43      (7.2%)       12.46     (78.3%)  412.0% ( 304% -  536%) 0.000
       BrowseDayOfYearTaxoFacets        2.43      (7.2%)       12.48     (78.6%)  412.5% ( 304% -  536%) 0.000

@gsmiller (Contributor, Author) commented:

OK, I believe this change is good to go now. I think I've addressed all the feedback so far, and I just did another benchmark run to make sure all the performance benefits still show up with the back-compat considerations in place. Please let me know if anyone has additional feedback. Thanks again everyone!

@mikemccand (Member) left a comment:

Thanks @gsmiller -- this looks awesome! The performance gains are great :) It shows all the benefits we've accumulated over time in our default Codec implementation for writing/reading multi-valued numeric doc values (SortedNumericDocValues)...

lucene/MIGRATE.md — review comment resolved
@gsmiller merged commit 5fe8f0e into apache:branch_9_0 on Nov 19, 2021
@gsmiller (Contributor, Author) commented:

I believe I've addressed all the PR feedback and got an approval from @mikemccand, so I went ahead and merged. Happy to iterate on this further if anyone has additional comments. Thanks again everyone! Somewhat complicated to make this one back-compat, but I think we ended up with a much cleaner solution thanks to all the feedback.

gsmiller added a commit to gsmiller/lucene that referenced this pull request Nov 19, 2021
@gsmiller deleted the LUCENE-10062-taxo-facet-opto-on9 branch on November 19, 2021 at 14:19
@gsmiller (Contributor, Author) commented:

FYI, I'm merging this onto branch_9x as well over in #458 (no need for a review unless someone wants to have a look). I've also updated #264 with the non-back-compat version of this change against main. If someone could have a look at that when you get a chance, I'd appreciate it. It's hopefully a bit easier to reason about since it doesn't require any of the back-compat logic.

@jpountz (Contributor) commented Nov 19, 2021:

I'm getting a test failure that looks caused by this change:

gradlew test --tests TestBackwardsCompatibility.testCreateNewTaxonomy -Dtests.seed=567E100D397BFC2E -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=brx-IN -Dtests.timezone=America/Chicago -Dtests.asserts=true -Dtests.file.encoding=UTF-8

org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility > testCreateNewTaxonomy FAILED
    java.lang.IllegalArgumentException: docs out of order: previous docId=1 current docId=0
        at __randomizedtesting.SeedInfo.seed([567E100D397BFC2E:DE2B08A7CA8D0189]:0)
        at org.apache.lucene.facet.taxonomy.TaxonomyFacetLabels$FacetLabelReader.nextFacetLabel(TaxonomyFacetLabels.java:146)
        at org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility.createNewTaxonomyIndex(TestBackwardsCompatibility.java:228)
        at org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility.testCreateNewTaxonomy(TestBackwardsCompatibility.java:78)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
        at java.base/java.lang.Thread.run(Thread.java:831)

@gsmiller (Contributor, Author) replied:

> I'm getting a test failure that looks caused by this change

Uh oh. Ok, looking into it. Thanks for letting me know @jpountz

@gsmiller (Contributor, Author) commented:

Ok, silly bug in the test case itself. The fix is here: #459. Feel free to have a look if you like, but I think it's simple enough that I'll just merge it after the approval checks pass (I'll make sure this makes it into branch_9x as well). Thanks for finding this (and apologies).

@gsmiller (Contributor, Author) commented:

OK, pushed the bug fix onto branch_9_0. @jpountz hopefully no more issues related to this change.

@jpountz (Contributor) commented Nov 19, 2021:

Thank you!
