New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals #443
LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals #443
Conversation
NOTE: I'm working on additional testing but hoping to get some early feedback on this approach, particularly as backwards-compatibility is concerned. I'll update with more tests and a CHANGES entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could make the bw compat logic simpler by creating a SortedNumericDocValues instance that is backed by BinaryDocValues for 8.x indices?
I like this idea. Thanks! I'll give it a shot and see if it makes things a bit cleaner. |
OK should be ready for another look now. Thanks for the feedback everyone! |
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/BackCompatSortedNumericDocValues.java
Outdated
Show resolved
Hide resolved
// so sub-class decoding implementations are honored: | ||
SortedNumericDocValues wrapped = | ||
BackCompatSortedNumericDocValues.wrap( | ||
context.reader().getBinaryDocValues(field), this::decode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused: this is the only call site that takes a decoder, and yet it uses the same decoding logic as the default logic. Do we need to expose a decoding function at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit wonky for sure, but I think this is necessary to truly be backwards compatible. The issue is that users could be extending DocValuesOrdinalsReader
and implementing their own custom decode
logic. So this "hook" makes sure we delegate to the decode
method in case the user is doing that.
@Deprecated | ||
public static boolean usesOlderBinaryOrdinals(LeafReader reader) { | ||
return reader.getMetaData().getCreatedVersionMajor() <= 8; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of just being a flag, maybe this could abstract reading ordinals from the index?
public static SortedNumericDocValues getOrdinals(LeafReader reader, String field) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I went ahead and make this change. Thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one is very close -- I left mostly minor comments. Thanks @gsmiller!
*/ | ||
@Deprecated | ||
public static boolean usesOlderBinaryOrdinals(LeafReader reader) { | ||
return reader.getMetaData().getCreatedVersionMajor() <= 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is checking if the particular segment was created with version 8 or earlier? Hmm, or, is this method .getCreatedVersionMajor()
referring to the whole index, even though it is this one segment that stored it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It refers to the whole index indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I think it works either way for this purpose, but I had assumed it was at the segment level. Thanks @jpountz for clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should (separately) update the javadocs for this method to make it more clear that even though this is a leaf
's metadata class, this method (and maybe also the other two) are really global properties to the index?
lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/BackCompatSortedNumericDocValues.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/CachedOrdinalsReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetLabels.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetLabels.java
Show resolved
Hide resolved
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/TestBackCompatSortedNumericDocValues.java
Outdated
Show resolved
Hide resolved
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
Outdated
Show resolved
Hide resolved
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
Outdated
Show resolved
Hide resolved
…ic ordering of facet labels
…xo-facet-opto-on9
Makes sense. Will tweak the language. |
For posterity, I re-ran benchmarks with this most recent change, comparing against
|
OK, I believe this change is good-to-go now. I think I've addressed all the feedback so far and just did another benchmark run to make sure all the performance benefits are still showing up with the back-compat considerations in place, etc. Please let me know if anyone has additional feedback. Thanks again everyone! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gsmiller -- this looks awesome! The performance gains are great :) Shows all the benefits we've accumulated by encoding our default Codec impl for writing/reading multi-valued numeric doc values (SortedNumericDocValues
) over time...
I believe I've addressed all the PR feedback and got an approval from @mikemccand, so I went ahead and merged. Happy to iterate on this further if anyone has additional comments. Thanks again everyone! Somewhat complicated to make this one back-compat, but I think we ended up with a much cleaner solution thanks to all the feedback. |
FYI, I'm merging this onto |
I'm getting a test failure that looks caused by this change:
|
Uh oh. Ok, looking into it. Thanks for letting me know @jpountz |
Ok, silly bug in the test case itself. Fix is here #459. Feel free to have a look if you like, but I think it's simple enough that I'll just merge it after the approval checks pass. (I'll make sure this makes it into |
OK, pushed the bug fix onto |
Thank you! |
Description
In benchmarks, using numeric doc values to store taxonomy facet ordinals shows almost a 400% qps improvement in browse-related taxonomy-based tasks (instead of custom delta-encoding into a binary doc values field). This PR changes the encoding of facet ordinals, while maintaining backwards compatibility with 8.x indexes.
Solution
This change moves to standard numeric doc values for storing taxonomy ordinals.
Tests
No new tests added. Lots of existing test coverage for taxonomy faceting functionality.
NOTE: I will add new testing that ensures backwards compatibility support remains with 8.x. I'm putting this PR out for feedback while doing so.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.