LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues #2213

Closed

Conversation

jaisonbi
Contributor

Description

Elasticsearch's keyword field uses SortedSet DocValues, and in our applications "keyword" is the most frequently used field type.
LUCENE-7081 added prefix compression to the doc values terms dictionary. We can do better by adding LZ4 compression on top of the existing prefix compression. In one of our applications, the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).

This feature is only for high-cardinality fields.

Tests

See Jira LUCENE-9663 for details.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

Contributor

@bruno-roustant bruno-roustant left a comment

Nice improvement. Thanks for the benchmarks.

@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
meta.writeLong(data.getFilePointer() - start); // ordsLength
}

addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
int valuesCount = values.getValueCount();
switch (mode) {
Contributor

@bruno-roustant bruno-roustant Jan 22, 2021

Could we have an "if" instead of a switch? The "if" could call the right method on the SortedSetDocValues singleton.
(For a switch, we would have to handle the "default" case with an exception, but I don't think that's useful here.)

Contributor Author

Yes, we should use "if" instead of "switch", thanks :)
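
For illustration, a minimal sketch of the suggested "if" (using the names visible in this diff; the final code may differ):

if (Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode) {
  addCompressedTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
} else {
  addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
}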

@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
writeTermsIndex(values);
}

private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
Contributor

This method shares many lines with addTermsDict(), so as a reviewer I had to compare them line by line to find the diff.
IMO it would be clearer to keep only one method and insert some compression-conditional code in addTermsDict().

Contributor Author

I will try to optimize this method...thanks for the comment.

LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
int compressedLength = (int) (data.getFilePointer() - before);
// Corner case: Compressed length might be bigger than un-compressed length.
maxBlockLength = Math.max(maxBlockLength, compressedLength);
Contributor

This line could be grouped with line 839 with a double Math.max.

Contributor Author

ok..it looks better:)
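
A sketch of the grouped form (assuming the earlier line tracks the uncompressed length, as the surrounding diff suggests):

// Covers the corner case where the compressed block is bigger than the uncompressed one.
maxBlockLength = Math.max(maxBlockLength, Math.max(uncompressedLength, compressedLength));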

final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
final int suffixLength = term.length - prefixLength;
assert suffixLength > 0; // terms are unique
int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
Contributor

The first time I read this line I thought it was reserved in the output stream, but no, it's in the buffer. Maybe it would be clearer to just put suffixLength + 11 directly in the next line.

Contributor Author

makes sense, no need to introduce the new variable "reservedSize"...
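
A sketch of the inlined form (maybeGrowBuffer is the helper discussed later in this review):

// Reserve suffixLength + 11 bytes in the buffer: 1 header byte + 2 vints (up to 5 bytes each).
bufferedOutput = maybeGrowBuffer(bufferedOutput, suffixLength + 11);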

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
if (pos + termLength >= originalLength - 1) {
int newLength = (originalLength + termLength) << 1;
Contributor

This is the job of ArrayUtil.grow to double the size and pad if needed. We can simply call ArrayUtil.grow(termsDictBuffer, termsDictBuffer.length + termLength) and remove line 903.

Contributor Author

@jaisonbi jaisonbi Jan 26, 2021

Suppose the original termsDictBuffer length is 32 and the term length is 3: the new length after calling ArrayUtil.grow will be 40, so it only adds 8 bytes to the original array.

I'm just worried that 16 KB is too small for some long-value, high-cardinality fields, so it would call ArrayUtil.grow too many times. Keeping line 903 would make it faster. What do you think?

Contributor

I try to be consistent with the approach taken in Lucene: don't systematically double the buffer size, but let a common tool apply a chosen heuristic uniformly. Maybe @jpountz @msokolov you have an opinion to share here.

Contributor Author

Just copied my comment from another conversation:

I re-tested this method to understand the growth behavior: suppose the original length is 16 * 1024 and the termLength is 112, then the new length will be 18560.
So removing line 851 makes sense to me.

So the statement above, "...it will call ArrayUtil.grow too many times...", is incorrect. I've removed line 851.

Contributor

I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow? It looks like there is still a line 851 :) So -- I'll just say +1 to using Lucene's existing array-growing utility. It has an exponential growth that is < 2x, and unless this use case has some very special requirement, we should stick with that. If you feel the need to try something different, please substantiate it with tests on real data that show the value.

Contributor Author

I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow?

In the old solution:

int newLength = (originalLength + termLength) << 1;
ArrayUtil.grow(termsDictBuffer, newLength);

I doubled the array length before calling ArrayUtil.grow because I was worried it might call ArrayUtil.grow many times for long-value, high-cardinality fields.
After some tests, I found that ArrayUtil.grow grows the array heuristically, so I removed the line int newLength = (originalLength + termLength) << 1; in the latest commit :)

@@ -962,6 +1078,21 @@ public SortedDocValues getSorted(FieldInfo field) throws IOException {
addressesWriter.finish();
meta.writeLong(data.getFilePointer() - start); // addressesLength

addTermsDict(values);
long valuesCount = values.getValueCount();
Contributor

Could we factor this code? It is nearly identical to the new code in doAddSortedField().

Contributor Author

will remove this:)

if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
// This is a LZ4 compressed block.
entry.compressed = true;
entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE >> 16;
Contributor

This could simply be (my preferred option):

entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
entry.compressorCode = Lucene80DocValuesFormat.TERMS_DICT_COMPRESSOR_LZ4_CODE;

Or, if we want to prepare future support for another compression (which seems not obvious to me because we would probably need more code changes), line 279 becomes:

if (termsDictBlockCode > 1 << 16) { // with a 1 << 16 constant meaning "compression code"

and

entry.termsDictBlockShift = termsDictBlockCode >> 16;
entry.compressorCode = termsDictBlockCode & 0xFFFF;

Contributor Author

Will choose the first option; it's easier to read...
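
For reference, a sketch of the first option in context (the else branch for the uncompressed case is an assumption):

if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
  // This is an LZ4 compressed block.
  entry.compressed = true;
  entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
} else {
  entry.termsDictBlockShift = termsDictBlockCode;
}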


boolean compressed;
// Reserved for support other compressors.
int compressorCode;
Contributor

I'm not sure we need this compressorCode field. The current code does not use it and this future seems too fuzzy.

Contributor Author

will remove this..just thought we could support more types of compression algorithms here...

@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
}

private static class TermsDict extends BaseTermsEnum {
static final int PADDING_LENGTH = 7;
Contributor

What is PADDING_LENGTH?

Contributor Author

Just borrowed this from CompressionMode$LZ4_DECOMPRESSOR... it says adding 7 padding bytes can help decompression run faster...

Contributor

Ok, in this case can we rename it LZ4_DECOMPRESSOR_PADDING and add this comment about the decompression speed?

Contributor Author

ok..will rename and add this comment:)
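
The renamed constant might look like this (a sketch; the rationale in the comment is borrowed from CompressionMode$LZ4_DECOMPRESSOR):

// Pad the block buffer with 7 extra bytes so LZ4 can copy 8 bytes at a time
// without bounds checks, which speeds up decompression.
static final int LZ4_DECOMPRESSOR_PADDING = 7;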

@jaisonbi
Contributor Author

@bruno-roustant I've updated the patch according to your comments, thanks a lot:)

@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
private void addTermsDict(SortedSetDocValues values) throws IOException {
final long size = values.getValueCount();
meta.writeVLong(size);
meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
boolean compress =
Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode
Contributor

Is it possible to enable compression just for SortedDocValues but not BinaryDocValues? Compressed BinaryDocValues has proved harmful for our service, but we still want to try this one out...

Contributor Author

I didn't find a good solution to this problem, since terms dict compression cannot currently be enabled by default... There are two options:

  1. Introduce more configuration/modes; this will make things more complicated.
  2. Optimize BinaryDocValues compression.

We will try this feature in our cluster first and see the actual performance. So can we resolve this in another issue, or should we introduce more configuration in this patch?

Contributor

@zhaih could you be more specific about this? Could you post something in the Jira issue discussion to explain how this compression is harmful for BinaryDocValues? It needs to be more visible than just in this review.

Contributor Author

Terms dict compression and BinaryDocValues compression currently share the same configuration... it means these two compressions are enabled and disabled together. But BinaryDocValues compression can cause a performance reduction; please check the discussions in LUCENE-9378 :)

Contributor

Sorry for the late response. I agree we could solve it in a follow-up issue. And I could still test this via a customized PerFieldDocValuesFormat, thank you!

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
if (pos + termLength >= originalLength - 1) {
int targetLength = (originalLength + termLength) << 1;
Contributor

The idea is to remove this line 851 (doubling the size) and let ArrayUtil.oversize() inside ArrayUtil.grow() decide the appropriate size growth (exponential growth with less memory waste than the classic 2x).
The following line would become something like:
termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength)

Contributor Author

I was just worried that relying on ArrayUtil.grow would be slower, so I re-tested this method to understand the growth behavior: suppose the original length is 16 * 1024 and the termLength is 112, then the new length will be 18560.

So removing line 851 makes sense to me; will change it immediately. Thanks @bruno-roustant
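
Putting the thread together, maybeGrowBuffer after the change might look like the sketch below (re-wrapping bufferedOutput around the grown array is an assumption, since ByteArrayDataOutput writes into a fixed array):

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
  int pos = bufferedOutput.getPosition();
  if (pos + termLength >= termsDictBuffer.length - 1) {
    // Let ArrayUtil.grow/oversize pick the growth heuristic (exponential, < 2x).
    termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength);
    // Re-wrap the grown array, keeping the current write position.
    bufferedOutput = new ByteArrayDataOutput(termsDictBuffer, pos, termsDictBuffer.length - pos);
  }
  return bufferedOutput;
}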

@zhaih
Contributor

zhaih commented Jan 27, 2021

@bruno-roustant Sorry I didn't make it clear. I didn't mean this is harmful for BinaryDocValues in general; I mean compression of BinaryDocValues makes our searches slower.
But we do want to test this out, so I want to know whether it's possible to configure the codec so that it only enables compression of SortedDocValues without enabling the other compressions. It seems that for now they are bound together in the doc values compression mode.

@jaisonbi
Contributor Author

I also want to use different configurations, so we can enable terms dict compression only. I plan to remove the enum Mode from Lucene80DocValuesFormat and add a new config class as below:

public enum CompressionMode {
    NONE,
    LZ4
  }

  public static class DocValuesConfig {
    // Compression mode for terms dict from SortedSet/Sorted DocValues.
    private CompressionMode termsDictCompressionMode;
    // Compression mode for binary DocValues.
    private CompressionMode binaryDocValueCompressionMode;

    public DocValuesConfig() {
      this(CompressionMode.NONE, CompressionMode.NONE);
    }

    public DocValuesConfig(CompressionMode termsDictCompressionMode, CompressionMode binaryDocValueCompressionMode) {
      this.termsDictCompressionMode = termsDictCompressionMode;
      this.binaryDocValueCompressionMode = binaryDocValueCompressionMode;
    }

    public CompressionMode getBinaryDocValueCompressionMode() {
      return binaryDocValueCompressionMode;
    }

    public boolean isTermsDictCompressionEnabled() {
      return CompressionMode.LZ4 == this.termsDictCompressionMode;
    }

    public boolean isBinaryDocValueCompressionEnabled() {
      return CompressionMode.LZ4 == this.binaryDocValueCompressionMode;
    }
  }

Before I commit the new changes, please share your thoughts on this. @bruno-roustant @zhaih
Thanks.

@msokolov
Contributor

@jaisonbi I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode? Also, I think we do have a customization route via PerFieldDocValuesFormat. It does require users to create a custom Codec, so perhaps a bit expert for most, but if we begin to add this kind of customization support as a convenience, it probably ought to go in FieldType?

@jaisonbi
Contributor Author

I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode?

Just want to add different configurations so users can enable terms dict compression only (without enabling binary DocValues compression)... Yes, I planned to add it to DocValuesFormat, and remove Mode from DocValuesFormat...

Will check the route via PerFieldDocValuesFormat... Thanks @msokolov

@bruno-roustant
Contributor

+1 to customize via PerFieldDocValuesFormat. This approach is the preferred design, as discussed earlier in LUCENE-9378 https://github.com/apache/lucene-solr/pull/2069/files#r520035432.

@zhaih
Contributor

zhaih commented Feb 2, 2021

I see, so I think for now I could test it via a customized PerFieldDocValuesFormat; I'll give the PerFieldDocValuesFormat route a try then.

Though IMO I would prefer a simpler configuration (as proposed by @jaisonbi) over customizing via PerFieldDocValuesFormat in the future, if these two compressions show different performance characteristics. If my understanding is correct, to enable only terms dict compression using PerFieldDocValuesFormat we would need to enumerate all SSDV field names in that class? That doesn't sound very maintainable if fields are regularly added and deleted. Please correct me if I'm wrong, as I'm not very familiar with the codec part...

@jaisonbi
Contributor Author

jaisonbi commented Feb 2, 2021

If I understood correctly, the PerFieldDocValuesFormat route needs a change to how SortedSetDocValuesField is used.
The idea is to add another constructor that enables terms dict compression, as below:

public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
    super(name, compression ? COMPRESSION_TYPE : TYPE);
    fieldsData = bytes;
}

And below is the definition of COMPRESSION_TYPE:

public static final FieldType COMPRESSION_TYPE = new FieldType();
static {
  COMPRESSION_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
  // add one new attribute for telling PerFieldDocValuesFormat that terms dict compression is enabled for this field
  COMPRESSION_TYPE.putAttribute("docvalue.sortedset.compression", "true");
  COMPRESSION_TYPE.freeze();
}

Not sure if I've got it right :)

@msokolov @bruno-roustant

@bruno-roustant
Contributor

I expect you don't need to change Lucene code, but can just write a new custom codec (with a specific name) which provides a custom DocValuesFormat. It extends PerFieldDocValuesFormat and implements the method
DocValuesFormat getDocValuesFormatForField(String field).
This method provides either a standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat (with a specific name to write in the index) extending Lucene80DocValuesFormat with BEST_COMPRESSION mode.
The choice can be made either based on a config (e.g. a file) which lists all compressed DocValues fields, or based on a naming convention, as sketched below.
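
A sketch of such a custom codec (all names here are hypothetical; it assumes the Lucene80DocValuesFormat(Mode) constructor introduced in LUCENE-9378, and the codec name would need to be registered via SPI):

import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;

public class MyPerFieldCompressionCodec extends FilterCodec {

  private final DocValuesFormat docValuesFormat = new PerFieldDocValuesFormat() {
    private final DocValuesFormat plain =
        new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_SPEED);
    private final DocValuesFormat compressed =
        new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);

    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
      // Hypothetical naming convention: compress fields whose name ends with ".compressed".
      return field.endsWith(".compressed") ? compressed : plain;
    }
  };

  public MyPerFieldCompressionCodec() {
    super("MyPerFieldCompressionCodec", new Lucene87Codec());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return docValuesFormat;
  }
}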

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

Thanks for the detailed suggestion:) @bruno-roustant

Will add a new custom codec supporting both compressed and uncompressed modes. I'll name it "Lucene90ExpertCodec"; in this codec, users will have more compression choices, like terms dict compression and binary DocValues compression.

One doubt: PerFieldDocValuesFormat provides a single method for getting the format according to the field name:
public abstract DocValuesFormat getDocValuesFormatForField(String field);
so we would need to use the approach @bruno-roustant suggested:

The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.

We would need to maintain the config file, and the field name list may change... Regarding the "naming convention" approach, we would need to impose rules on field name definitions (please correct me if I am wrong)... I am afraid it's a heavy burden for users who want to enable terms-dict compression for existing indices.

There are two other options:

  1. Use a global switch in "Lucene90ExpertCodec", so terms-dict compression / binary DocValues compression can be enabled easily.
  2. Build on PerFieldDocValuesFormat, but add one more method:
    public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);
    We could add a new attribute in FieldInfo to identify the compression-enabled fields.

Contributor

@bruno-roustant bruno-roustant left a comment

To be clear, I'm not suggesting that we add a new Codec in Lucene, but that you create a new custom Codec on your side, as a Lucene expert user. If you think you cannot create this custom codec because of some Lucene limitations, then I suggest that
1- We complete this PR for the scope of adding LZ4 compression to SortedSet/Sorted DocValues. We remove the latest commit about the configuration. This is the public scope of the Jira description which has been discussed.
2- Additional scope about configuration is proposed with a new Jira issue for more public visibility and discussion (it is required), such as
2a - enabling a switch to differentiate Terms-Dict / Binary DocValues compression
2b - improving PerFieldDocValuesFormat with an additional getDocValuesFormatForField(FieldInfo field) method

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

To be clear, I'm not suggesting that we add a new Codec in Lucene, but that you create a new custom Codec on your side, as a Lucene expert user

I think this improvement is also valuable to other Lucene/ES users, so I hope I can complete this PR :-).. I just committed the new changes, but some checks are still running... In the latest commit, I removed the compression mode from Lucene80DocValuesFormat and added separate switches for terms-dict compression and binary DocValues compression.

Thanks very much @bruno-roustant

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

1- We complete this PR for the scope of adding LZ4 compression to SortedSet/Sorted DocValues. We remove the latest commit about the configuration. This is the public scope of the Jira description which has been discussed.

ok...I will revert the new commits, and create another JIRA for flexible configurations. @bruno-roustant

@bruno-roustant
Contributor

@jaisonbi I didn't want to close this PR. I wanted to merge the first part related to LZ4 compression, and let you push a PR on another jira issue with the remaining commits you proposed.
Do you want to reopen the PR?
Or do you think the commits must not be separated?

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

Sorry, I did something wrong while doing the revert... will create a new PR. @bruno-roustant
