LUCENE-9378: Make it possible to configure how to trade speed for compression on doc values. #2069
…pression on doc values. This adds a switch to `Lucene80DocValuesFormat` which allows to configure whether to prioritize retrieval speed over compression ratio or the other way around. When prioritizing retrieval speed, binary doc values are written using the exact same format as before more aggressive compression got introduced.
Thank you @jpountz! I left a few small questions, but this looks great.
At write time, a Lucene user can choose whether they want a smaller index (compressed) or faster search (BEST_SPEED), and the resulting index has Lucene's normal full back-compat guarantee.
Let's make sure Lucene's test-framework fully randomizes these write-time Codec choices so we get good test coverage of all options.
private final Lucene87StoredFieldsFormat.Mode storedMode;
private final Lucene80DocValuesFormat.Mode dvMode;

private Mode(Lucene87StoredFieldsFormat.Mode storedMode, Lucene80DocValuesFormat.Mode dvMode) {
Nice! So we roll up the tradeoffs to Codec level which will then tell each format how to tradeoff.
Right. It's still possible to make different choices for stored fields and doc values, given that we allow configuration of doc values on a per-field basis, but this should at least keep simple use simple, with one switch that configures stored fields and doc values at the same time.
Great! Simple for common use cases ("I want best compression" or "I want fastest search"), and complex for complex use cases (I want separate control for each part of the index).
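The roll-up discussed in this thread can be modeled in a few lines. What follows is a minimal, self-contained sketch, not the code in this PR: `StoredMode`, `DocValuesMode`, `CodecMode`, and `PerFieldModes` are hypothetical stand-ins for the real format classes, and the field names are made up. It shows one codec-level switch that sets both formats at once, plus a per-field doc values override for the complex case.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecModeSketch {

  // Hypothetical stand-ins for the per-format mode enums.
  enum StoredMode { BEST_SPEED, BEST_COMPRESSION }
  enum DocValuesMode { BEST_SPEED, BEST_COMPRESSION }

  // One codec-level switch that configures both formats at once.
  enum CodecMode {
    BEST_SPEED(StoredMode.BEST_SPEED, DocValuesMode.BEST_SPEED),
    BEST_COMPRESSION(StoredMode.BEST_COMPRESSION, DocValuesMode.BEST_COMPRESSION);

    final StoredMode storedMode;
    final DocValuesMode dvMode;

    CodecMode(StoredMode storedMode, DocValuesMode dvMode) {
      this.storedMode = storedMode;
      this.dvMode = dvMode;
    }
  }

  // Per-field override: fields without an explicit entry use the codec-level default.
  static final class PerFieldModes {
    private final CodecMode defaults;
    private final Map<String, DocValuesMode> overrides = new HashMap<>();

    PerFieldModes(CodecMode defaults) {
      this.defaults = defaults;
    }

    PerFieldModes override(String field, DocValuesMode mode) {
      overrides.put(field, mode);
      return this;
    }

    DocValuesMode docValuesModeFor(String field) {
      return overrides.getOrDefault(field, defaults.dvMode);
    }
  }

  public static void main(String[] args) {
    // Compress everything by default, but keep one (hypothetical) hot field fast.
    PerFieldModes modes = new PerFieldModes(CodecMode.BEST_COMPRESSION)
        .override("sort_key", DocValuesMode.BEST_SPEED);
    System.out.println(modes.docValuesModeFor("sort_key"));
    System.out.println(modes.docValuesModeFor("payload"));
  }
}
```

The common case needs only a `CodecMode` value; the override map is touched only by users who want separate control per field.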
@@ -286,7 +278,7 @@ private void doTestTermsEnumRandom(int numDocs, Supplier<String> valuesProducer)
 conf.setMergeScheduler(new SerialMergeScheduler());
 // set to duel against a codec which has ordinals:
 final PostingsFormat pf = TestUtil.getPostingsFormatWithOrds(random());
-final DocValuesFormat dv = new Lucene80DocValuesFormat();
+final DocValuesFormat dv = getCodec().docValuesFormat();
Will this randomize between the different Mode tradeoffs?
It's not randomizing; we test both modes explicitly, via TestBestSpeedLucene80DocValuesFormat on one hand and TestBestCompressionLucene80DocValuesFormat on the other.
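The explicit approach described here, one concrete test class per mode sharing a common base, can be sketched as below. This is an illustrative model only: it uses no test framework, and `Mode`, `BaseFormatTestCase`, `TestBestSpeed`, `TestBestCompression`, and the placeholder `run` check are hypothetical, not the classes in this PR.

```java
import java.util.List;

public class ExplicitModeTests {

  // Hypothetical stand-in for the doc values mode enum.
  enum Mode { BEST_SPEED, BEST_COMPRESSION }

  // Shared test logic lives in the base class; subclasses only pick the mode.
  abstract static class BaseFormatTestCase {
    abstract Mode mode();

    // Placeholder for a real write-then-read check: reports the mode it ran under.
    String run() {
      return getClass().getSimpleName() + " ran with " + mode();
    }
  }

  static final class TestBestSpeed extends BaseFormatTestCase {
    @Override
    Mode mode() {
      return Mode.BEST_SPEED;
    }
  }

  static final class TestBestCompression extends BaseFormatTestCase {
    @Override
    Mode mode() {
      return Mode.BEST_COMPRESSION;
    }
  }

  // Every mode is exercised on every run, with no dependence on a random seed.
  static List<String> runAll() {
    return List.of(new TestBestSpeed().run(), new TestBestCompression().run());
  }

  public static void main(String[] args) {
    runAll().forEach(System.out::println);
  }
}
```

Compared with randomizing the mode per run, this trades some breadth of random combinations for the guarantee that a failure in either mode shows up on every run.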
/**
 * Tests Lucene80DocValuesFormat
 */
public class TestBestSpeedLucene80DocValuesFormat extends BaseLucene80DocValuesFormatTestCase {
Do we also have a dedicated TestBestCompressedLucene80DocValuesFormat?
Oh, never mind, I see you opened a follow-on issue for this: https://issues.apache.org/jira/browse/LUCENE-9602
You should see a TestBestCompressedLucene80DocValuesFormat file as well in this PR. I opened LUCENE-9602 specifically for backward compatibility: to make sure we check in indices created by BEST_COMPRESSION in our source tree after every release, so that we have good bw compatibility coverage.
Thanks @jpountz!
This adds a switch to `Lucene80DocValuesFormat` which allows configuring whether to prioritize retrieval speed over compression ratio or the other way around. When prioritizing retrieval speed, binary doc values are written using the exact same format as before more aggressive compression got introduced.