LUCENE-9378: Make it possible to configure how to trade speed for compression on doc values. #2069

jpountz · 2020-11-09T17:57:35Z

This adds a switch to Lucene80DocValuesFormat that lets users
configure whether to prioritize retrieval speed over compression ratio
or the other way around. When prioritizing retrieval speed, binary doc
values are written using the exact same format as before the more
aggressive compression was introduced.


mikemccand

Thank you @jpountz! I left a few small questions, but this looks great.

At write time, a Lucene user can choose whether they want a smaller index (BEST_COMPRESSION) or faster search (BEST_SPEED), and the resulting index has Lucene's normal full back-compat guarantee.

Let's make sure Lucene's test-framework fully randomizes these write-time Codec choices so we get good test coverage of all options.

mikemccand · 2020-11-09T18:22:26Z

lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene87/Lucene87Codec.java

+    private final Lucene87StoredFieldsFormat.Mode storedMode;
+    private final Lucene80DocValuesFormat.Mode dvMode;
+
+    private Mode(Lucene87StoredFieldsFormat.Mode storedMode, Lucene80DocValuesFormat.Mode dvMode) {


Nice! So we roll up the tradeoffs to the Codec level, which then tells each format how to trade off.

Right. It's still possible to make different choices for stored fields and doc values, given that we allow configuration of doc values on a per-field basis, but this should at least keep simple use simple, with one switch that configures stored fields and doc values at the same time.

Great! Simple for common use cases ("I want best compression" or "I want fastest search"), and complex for complex use cases (I want separate control for each part of the index).
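
To illustrate the design discussed in this thread, here is a minimal, self-contained sketch of a codec-level mode that bundles a stored-fields mode and a doc-values mode, with an optional per-field override for doc values. All names (`CodecModeSketch`, `StoredFieldsMode`, `DocValuesMode`, `CodecMode`, `docValuesModeFor`) are hypothetical stand-ins for illustration only; this is not the actual Lucene87Codec code, and the real per-field mechanism in Lucene may look different.

```java
import java.util.Map;

// Hypothetical sketch; these types only mimic the shape of the design
// discussed above and are NOT the real Lucene classes.
public class CodecModeSketch {
    enum StoredFieldsMode { BEST_SPEED, BEST_COMPRESSION }
    enum DocValuesMode { BEST_SPEED, BEST_COMPRESSION }

    // One codec-level switch that fixes the tradeoff for both formats.
    enum CodecMode {
        BEST_SPEED(StoredFieldsMode.BEST_SPEED, DocValuesMode.BEST_SPEED),
        BEST_COMPRESSION(StoredFieldsMode.BEST_COMPRESSION, DocValuesMode.BEST_COMPRESSION);

        final StoredFieldsMode storedMode;
        final DocValuesMode dvMode;

        CodecMode(StoredFieldsMode storedMode, DocValuesMode dvMode) {
            this.storedMode = storedMode;
            this.dvMode = dvMode;
        }
    }

    // Assumed helper: a per-field override wins, otherwise the codec-level default applies.
    static DocValuesMode docValuesModeFor(
            String field, CodecMode codecMode, Map<String, DocValuesMode> perFieldOverrides) {
        return perFieldOverrides.getOrDefault(field, codecMode.dvMode);
    }

    public static void main(String[] args) {
        Map<String, DocValuesMode> overrides = Map.of("title", DocValuesMode.BEST_SPEED);
        for (String field : new String[] {"title", "body"}) {
            System.out.println(field + " -> "
                    + docValuesModeFor(field, CodecMode.BEST_COMPRESSION, overrides));
        }
    }
}
```

The simple case needs only the single `CodecMode` value; the override map is the escape hatch for the "separate control for each part of the index" case.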

mikemccand · 2020-11-09T18:24:51Z

lucene/core/src/test/org/apache/lucene/codecs/lucene80/BaseLucene80DocValuesFormatTestCase.java

@@ -286,7 +278,7 @@ private void doTestTermsEnumRandom(int numDocs, Supplier<String> valuesProducer)
    conf.setMergeScheduler(new SerialMergeScheduler());
    // set to duel against a codec which has ordinals:
    final PostingsFormat pf = TestUtil.getPostingsFormatWithOrds(random());
-    final DocValuesFormat dv = new Lucene80DocValuesFormat();
+    final DocValuesFormat dv = getCodec().docValuesFormat();


Will this randomize between the different Mode tradeoffs?

It's not randomizing; we are testing both modes explicitly via TestBestSpeedLucene80DocValuesFormat on one hand and TestBestCompressionLucene80DocValuesFormat on the other hand.
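
The pattern described in this reply (a shared base test case, with one subclass per mode) could look roughly like the sketch below. It is a simplified illustration without JUnit or Lucene: `ModeTestSketch`, `BaseFormatTestCase`, `getMode`, and the two subclasses are hypothetical names, and in the PR the base class hook is `getCodec()` rather than a bare mode accessor.

```java
// Hypothetical sketch of "one base test case, one subclass per mode".
// Not the real Lucene test classes.
public class ModeTestSketch {
    enum Mode { BEST_SPEED, BEST_COMPRESSION }

    // Shared test logic lives here; subclasses only pick the mode under test.
    static abstract class BaseFormatTestCase {
        abstract Mode getMode();

        String describe() {
            return getClass().getSimpleName() + " exercises " + getMode();
        }
    }

    static class BestSpeedTest extends BaseFormatTestCase {
        @Override
        Mode getMode() { return Mode.BEST_SPEED; }
    }

    static class BestCompressionTest extends BaseFormatTestCase {
        @Override
        Mode getMode() { return Mode.BEST_COMPRESSION; }
    }

    public static void main(String[] args) {
        // Every mode is covered deterministically, with no randomization involved.
        BaseFormatTestCase[] suites = { new BestSpeedTest(), new BestCompressionTest() };
        for (BaseFormatTestCase suite : suites) {
            System.out.println(suite.describe());
        }
    }
}
```

Because the shared tests obtain the format from the subclass instead of constructing it directly, each subclass runs the full suite against its own mode.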

mikemccand · 2020-11-09T18:25:27Z

...ne/core/src/test/org/apache/lucene/codecs/lucene80/TestBestSpeedLucene80DocValuesFormat.java

+/**
+ * Tests Lucene80DocValuesFormat
+ */
+public class TestBestSpeedLucene80DocValuesFormat extends BaseLucene80DocValuesFormatTestCase {


Do we also have a dedicated TestBestCompressedLucene80DocValuesFormat?

Oh, never mind, I see you opened a follow-on issue for this: https://issues.apache.org/jira/browse/LUCENE-9602

You should see a TestBestCompressedLucene80DocValuesFormat file as well in this PR. I opened LUCENE-9602 specifically for backward compatibility: to make sure we check indices created by BEST_COMPRESSION into our source tree after every release, so that we have good bw compatibility coverage.

mikemccand

Thanks @jpountz!



jpountz requested review from dsmiley, mikemccand and msokolov November 9, 2020 17:57

jpountz added 2 commits November 9, 2020 18:58

CHANGES

c54f2ce

iter

0a30f4b

mikemccand reviewed Nov 9, 2020


mikemccand approved these changes Nov 9, 2020


jpountz added 3 commits November 10, 2020 10:50

Merge branch 'master' into lucene9378

94b4540

Improve codec randomization.

8cdd7a1

iter

65bb945

mikemccand approved these changes Nov 10, 2020


iter

6aca71f

mikemccand approved these changes Nov 12, 2020


jpountz merged commit 06877b2 into apache:master Nov 12, 2020

jpountz deleted the lucene9378 branch November 12, 2020 15:10

asfimport mentioned this pull request Aug 24, 2022

Remove compression option on doc values [LUCENE-9843] apache/lucene#10882

Closed
