LUCENE-9378: Make it possible to configure how to trade speed for compression on doc values. #2069

Merged: jpountz merged 7 commits into apache:master from the lucene9378 branch on Nov 12, 2020

jpountz
Contributor

@jpountz jpountz commented Nov 9, 2020

This adds a switch to Lucene80DocValuesFormat that makes it possible to
configure whether to prioritize retrieval speed over compression ratio
or the other way around. When prioritizing retrieval speed, binary doc
values are written using the exact same format as before more aggressive
compression was introduced.
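
To make the trade-off concrete, here is a minimal, self-contained sketch of a write-time mode switch. It does not reproduce Lucene's on-disk format or call any Lucene API: SketchMode and BinaryValueSketch are hypothetical names, and java.util.zip's Deflater is only a stand-in for whatever compression the real format applies. The point is just that one mode stores bytes untouched (cheap to read back) while the other spends CPU to store fewer bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical stand-in for a format-level mode switch (not a Lucene class).
enum SketchMode { BEST_SPEED, BEST_COMPRESSION }

// Toy writer/reader for a single binary value (assumption: not Lucene's format).
final class BinaryValueSketch {

  // BEST_SPEED stores the bytes as-is; BEST_COMPRESSION deflates them.
  static byte[] encode(byte[] value, SketchMode mode) {
    if (mode == SketchMode.BEST_SPEED) {
      return value.clone();
    }
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(value);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[256];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Reading must use the same mode that was chosen at write time.
  static byte[] decode(byte[] stored, SketchMode mode) {
    if (mode == SketchMode.BEST_SPEED) {
      return stored.clone();
    }
    Inflater inflater = new Inflater();
    inflater.setInput(stored);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[256];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("truncated input");
        }
        out.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt input", e);
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] value = new byte[1000];
    Arrays.fill(value, (byte) 'a'); // highly repetitive, so it compresses well
    for (SketchMode mode : SketchMode.values()) {
      byte[] stored = encode(value, mode);
      boolean ok = Arrays.equals(value, decode(stored, mode));
      System.out.println(mode + ": stored " + stored.length + " bytes, round trip ok = " + ok);
    }
  }
}
```

In this sketch the reader has to be told the mode explicitly; how the real format records which variant was written is not covered here.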

Member

@mikemccand mikemccand left a comment

Thank you @jpountz! I left a few small questions, but this looks great.

At write time, a Lucene user can choose whether they want a smaller index (BEST_COMPRESSION) or faster search (BEST_SPEED), and the resulting index has Lucene's normal full back-compat guarantee.

Let's make sure Lucene's test-framework fully randomizes these write-time Codec choices so we get good test coverage of all options.

private final Lucene87StoredFieldsFormat.Mode storedMode;
private final Lucene80DocValuesFormat.Mode dvMode;

private Mode(Lucene87StoredFieldsFormat.Mode storedMode, Lucene80DocValuesFormat.Mode dvMode) {
Member

Nice! So we roll up the tradeoffs to the Codec level, which then tells each format how to trade off.

Contributor Author

Right. It's still possible to make different choices for stored fields and doc values, given that we allow configuration of doc values on a per-field basis, but this should at least keep simple use simple, with one switch that configures stored fields and doc values at the same time.

Member

Great! Simple for common use cases ("I want best compression" or "I want fastest search"), and complex for complex use cases (I want separate control for each part of the index).
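
The shape discussed in this thread (one Codec-level switch that fans out to each format, with per-field doc values configuration still available) can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: every type name below is a hypothetical stand-in rather than a Lucene class, and the override map is just one possible way to model per-field configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the per-format mode enums (not Lucene classes).
enum StoredFieldsModeSketch { BEST_SPEED, BEST_COMPRESSION }
enum DocValuesModeSketch { BEST_SPEED, BEST_COMPRESSION }

// Codec-level switch: one choice that configures both formats at once.
enum CodecModeSketch {
  BEST_SPEED(StoredFieldsModeSketch.BEST_SPEED, DocValuesModeSketch.BEST_SPEED),
  BEST_COMPRESSION(StoredFieldsModeSketch.BEST_COMPRESSION, DocValuesModeSketch.BEST_COMPRESSION);

  private final StoredFieldsModeSketch storedMode;
  private final DocValuesModeSketch dvMode;

  CodecModeSketch(StoredFieldsModeSketch storedMode, DocValuesModeSketch dvMode) {
    this.storedMode = storedMode;
    this.dvMode = dvMode;
  }

  StoredFieldsModeSketch storedMode() { return storedMode; }
  DocValuesModeSketch dvMode() { return dvMode; }
}

// Complex use: per-field overrides on top of the codec-level default.
// The map-based lookup is an assumption made for this sketch.
final class PerFieldDocValuesSketch {
  private final CodecModeSketch codecMode;
  private final Map<String, DocValuesModeSketch> overrides;

  PerFieldDocValuesSketch(CodecModeSketch codecMode, Map<String, DocValuesModeSketch> overrides) {
    this.codecMode = codecMode;
    this.overrides = new HashMap<>(overrides); // defensive copy
  }

  // Fields without an override inherit the codec-level doc values mode.
  DocValuesModeSketch docValuesModeFor(String field) {
    return overrides.getOrDefault(field, codecMode.dvMode());
  }

  public static void main(String[] args) {
    Map<String, DocValuesModeSketch> overrides = new HashMap<>();
    overrides.put("id", DocValuesModeSketch.BEST_SPEED); // "id" is a made-up field name
    PerFieldDocValuesSketch perField =
        new PerFieldDocValuesSketch(CodecModeSketch.BEST_COMPRESSION, overrides);
    System.out.println("id -> " + perField.docValuesModeFor("id"));
    System.out.println("title -> " + perField.docValuesModeFor("title"));
  }
}
```

The simple case only ever touches CodecModeSketch; the override map exists for users who want separate control per field.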

@@ -286,7 +278,7 @@ private void doTestTermsEnumRandom(int numDocs, Supplier<String> valuesProducer)
   conf.setMergeScheduler(new SerialMergeScheduler());
   // set to duel against a codec which has ordinals:
   final PostingsFormat pf = TestUtil.getPostingsFormatWithOrds(random());
-  final DocValuesFormat dv = new Lucene80DocValuesFormat();
+  final DocValuesFormat dv = getCodec().docValuesFormat();
Member

Will this randomize between the different Mode tradeoffs?

Contributor Author

It's not randomizing. We test both modes explicitly: TestBestSpeedLucene80DocValuesFormat covers one and TestBestCompressionLucene80DocValuesFormat covers the other.

/**
 * Tests Lucene80DocValuesFormat
 */
public class TestBestSpeedLucene80DocValuesFormat extends BaseLucene80DocValuesFormatTestCase {
Member

Do we also have a dedicated TestBestCompressedLucene80DocValuesFormat?

Member

Oh, never mind, I see you opened a follow-on issue for this: https://issues.apache.org/jira/browse/LUCENE-9602

Contributor Author

You should see a TestBestCompressionLucene80DocValuesFormat file as well in this PR. I opened LUCENE-9602 specifically for backward compatibility: to make sure we check in indices created by BEST_COMPRESSION in our source tree after every release, so that we have good backward-compatibility coverage.

Member

@mikemccand mikemccand left a comment

Thanks @jpountz!

@jpountz jpountz merged commit 06877b2 into apache:master Nov 12, 2020
@jpountz jpountz deleted the lucene9378 branch November 12, 2020 15:10