LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues #2213

Closed

Conversation

jaisonbi
Contributor

Description

Elasticsearch's keyword field uses SortedSet DocValues, and in our applications "keyword" is the most frequently used field type.
LUCENE-7081 added prefix compression to the doc values terms dictionary. We can do better by adding LZ4 compression on top of the existing prefix compression. In one of our applications, the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).

This feature is only for high-cardinality fields.

Tests

See Jira LUCENE-9663 for details.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

Contributor

@bruno-roustant bruno-roustant left a comment

Nice improvement. Thanks for the benchmarks.

@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
meta.writeLong(data.getFilePointer() - start); // ordsLength
}

addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
int valuesCount = values.getValueCount();
switch (mode) {
Contributor

@bruno-roustant bruno-roustant Jan 22, 2021

Could we have an "if" instead of a switch? The "if" could call the right method on the SortedSetDocValues singleton.
(For a switch, we would have to handle the "default" case with an exception, but I don't think that's useful here.)

Contributor Author

Yes, we should use "if" instead of "switch", thanks :)
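
For illustration, a minimal sketch of the suggested "if" (using the names visible in this diff; the final code may differ):

if (Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode) {
  addCompressedTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
} else {
  addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
}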

@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
writeTermsIndex(values);
}

private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
Contributor

This method shares many lines with addTermsDict(), so as a reviewer I had to compare them line by line to find the diff.
IMO it would be clearer to keep only one method and insert some compression-conditional code in addTermsDict().

Contributor Author

I will try to optimize this method...thanks for the comment.

LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
int compressedLength = (int) (data.getFilePointer() - before);
// Corner case: Compressed length might be bigger than un-compressed length.
maxBlockLength = Math.max(maxBlockLength, compressedLength);
Contributor

This line could be grouped with line 839 with a double Math.max.

Contributor Author

ok..it looks better:)
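
A sketch of the grouped form (assuming the earlier line tracks the uncompressed length, as the surrounding diff suggests):

// Covers the corner case where the compressed block is bigger than the uncompressed one.
maxBlockLength = Math.max(maxBlockLength, Math.max(uncompressedLength, compressedLength));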

final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
final int suffixLength = term.length - prefixLength;
assert suffixLength > 0; // terms are unique
int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
Contributor

The first time I read this line I thought it was reserved in the output stream, but no, it's in the buffer. Maybe it would be clearer to just put suffixLength + 11 directly in the next line.

Contributor Author

makes sense, no need to introduce the new variable "reservedSize"...
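
A sketch of the inlined form (maybeGrowBuffer is the helper discussed later in this review):

// Reserve suffixLength + 11 bytes in the buffer: 1 header byte + 2 vints (up to 5 bytes each).
bufferedOutput = maybeGrowBuffer(bufferedOutput, suffixLength + 11);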

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
if (pos + termLength >= originalLength - 1) {
int newLength = (originalLength + termLength) << 1;
Contributor

This is the job of ArrayUtil.grow to double the size and pad if needed. We can simply call ArrayUtil.grow(termsDictBuffer, termsDictBuffer.length + termLength) and remove line 903.

Contributor Author

@jaisonbi jaisonbi Jan 26, 2021

Suppose the original termsDictBuffer length is 32 and the term length is 3: the new length after calling ArrayUtil.grow will be 40, so it only adds 8 bytes to the original array.

I'm just worried that 16 KB is too small for some long-value, high-cardinality fields, so it would call ArrayUtil.grow too many times. Keeping line 903 would make it faster. What do you think?

Contributor

I try to be consistent with the approach taken in Lucene: don't systematically double the buffer size, but let a common tool apply a chosen heuristic uniformly. Maybe @jpountz @msokolov you have an opinion to share here.

Contributor Author

Just copied my comment from another conversation:

I re-tested this method to understand the growth behavior: suppose the original length is 16 * 1024 and the termLength is 112, then the new length will be 18560.
So removing line 851 makes sense to me.

So the statement above, "...it will call ArrayUtil.grow too many times...", is incorrect. I've removed line 851.

Contributor

I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow? It looks like there is still a line 851 :) So -- I'll just say +1 to using Lucene's existing array-growing utility. It has an exponential growth that is < 2x, and unless this use case has some very special requirement, we should stick with that. If you feel the need to try something different, please substantiate it with tests on real data that show the value.

Contributor Author

I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow?

In the old solution:

int newLength = (originalLength + termLength) << 1;
ArrayUtil.grow(termsDictBuffer, newLength);

I doubled the array length before calling ArrayUtil.grow because I was worried it might call ArrayUtil.grow many times for long-value, high-cardinality fields.
After some tests, I found that ArrayUtil.grow grows the array heuristically, so I removed the line int newLength = (originalLength + termLength) << 1; in the latest commit :)

@@ -962,6 +1078,21 @@ public SortedDocValues getSorted(FieldInfo field) throws IOException {
addressesWriter.finish();
meta.writeLong(data.getFilePointer() - start); // addressesLength

addTermsDict(values);
long valuesCount = values.getValueCount();
Contributor

Could we factor this code? It is nearly identical to the new code in doAddSortedField().

Contributor Author

will remove this:)

if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
// This is a LZ4 compressed block.
entry.compressed = true;
entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE >> 16;
Contributor

This could simply be (my preferred option):

entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
entry.compressorCode = Lucene80DocValuesFormat.TERMS_DICT_COMPRESSOR_LZ4_CODE;

Or, if we want to prepare future support for another compression (which seems not obvious to me because we would probably need more code changes), line 279 becomes:

if (termsDictBlockCode > 1 << 16) { // with a 1 << 16 constant meaning "compression code"

and

entry.termsDictBlockShift = termsDictBlockCode >> 16;
entry.compressorCode = termsDictBlockCode & 0xFFFF;

Contributor Author

Will choose the first option; it's easier to read...
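
For reference, a sketch of the first option in context (the else branch for the uncompressed case is an assumption):

if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
  // This is an LZ4 compressed block.
  entry.compressed = true;
  entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
} else {
  entry.termsDictBlockShift = termsDictBlockCode;
}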


boolean compressed;
// Reserved for support other compressors.
int compressorCode;
Contributor

I'm not sure we need this compressorCode field. The current code does not use it and this future seems too fuzzy.

Contributor Author

will remove this..just thought we could support more types of compression algorithms here...

@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
}

private static class TermsDict extends BaseTermsEnum {
static final int PADDING_LENGTH = 7;
Contributor

What is PADDING_LENGTH?

Contributor Author

Just borrowed this from CompressionMode$LZ4_DECOMPRESSOR... it says adding 7 padding bytes can help decompression run faster...

Contributor

Ok, in this case can we rename it LZ4_DECOMPRESSOR_PADDING and add this comment about the decompression speed?

Contributor Author

ok..will rename and add this comment:)
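
The renamed constant might look like this (a sketch; the rationale in the comment is borrowed from CompressionMode$LZ4_DECOMPRESSOR):

// Pad the block buffer with 7 extra bytes so LZ4 can copy 8 bytes at a time
// without bounds checks, which speeds up decompression.
static final int LZ4_DECOMPRESSOR_PADDING = 7;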

@jaisonbi
Contributor Author

@bruno-roustant I've updated the patch according to your comments, thanks a lot:)

@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
private void addTermsDict(SortedSetDocValues values) throws IOException {
final long size = values.getValueCount();
meta.writeVLong(size);
meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
boolean compress =
Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode
Contributor

Is it possible to enable compression just for SortedDocValues but not BinaryDocValues? Compressed BinaryDocValues has proved harmful for our service, but we still want to try this one out...

Contributor Author

I didn't find a good solution to this problem, since terms dict compression cannot currently be enabled by default... There are two options:

  1. Introduce more configuration/modes; this will make things more complicated.
  2. Optimize BinaryDocValues compression.

We will try this feature in our cluster first and see the actual performance. So can we resolve this in another issue, or should we introduce more configuration in this patch?

Contributor

@zhaih could you be more specific about this? Could you post something in the Jira issue discussion to explain how this compression is harmful for BinaryDocValues? It needs to be more visible than just in this review.

Contributor Author

Terms dict compression and BinaryDocValues compression currently share the same configuration... it means these two compressions are enabled and disabled together. But BinaryDocValues compression can cause a performance reduction; please check the discussions in LUCENE-9378 :)

Contributor

Sorry for the late response. I agree we could solve it in a follow-up issue. And I could still test this via a customized PerFieldDocValuesFormat, thank you!

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
if (pos + termLength >= originalLength - 1) {
int targetLength = (originalLength + termLength) << 1;
Contributor

The idea is to remove this line 851 (doubling the size) and let ArrayUtil.oversize() inside ArrayUtil.grow() decide the appropriate size growth (exponential growth with less memory waste than the classic 2x).
The following line would become something like:
termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength)

Contributor Author

I was just worried that relying on ArrayUtil.grow would be slower, so I re-tested this method to understand the growth behavior: suppose the original length is 16 * 1024 and the termLength is 112, then the new length will be 18560.

So removing line 851 makes sense to me; will change it immediately. Thanks @bruno-roustant
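
Putting the thread together, maybeGrowBuffer after the change might look like the sketch below (re-wrapping bufferedOutput around the grown array is an assumption, since ByteArrayDataOutput writes into a fixed array):

private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
  int pos = bufferedOutput.getPosition();
  if (pos + termLength >= termsDictBuffer.length - 1) {
    // Let ArrayUtil.grow/oversize pick the growth heuristic (exponential, < 2x).
    termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength);
    // Re-wrap the grown array, keeping the current write position.
    bufferedOutput = new ByteArrayDataOutput(termsDictBuffer, pos, termsDictBuffer.length - pos);
  }
  return bufferedOutput;
}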

@zhaih
Contributor

zhaih commented Jan 27, 2021

@bruno-roustant Sorry I didn't make it clear. I didn't mean this is harmful for BinaryDocValues in general; I mean compression of BinaryDocValues makes our searches slower.
But we do want to test this out, so I want to know whether it's possible to configure the codec so that it only enables compression of SortedDocValues without enabling the other compressions. It seems that for now they are bound together in the doc values compression mode.

@jaisonbi
Contributor Author

I also want to use different configurations, so we can enable terms dict compression only. I plan to remove the enum Mode from Lucene80DocValuesFormat and add a new config class as below:

public enum CompressionMode {
    NONE,
    LZ4
  }

  public static class DocValuesConfig {
    // Compression mode for terms dict from SortedSet/Sorted DocValues.
    private CompressionMode termsDictCompressionMode;
    // Compression mode for binary DocValues.
    private CompressionMode binaryDocValueCompressionMode;

    public DocValuesConfig() {
      this(CompressionMode.NONE, CompressionMode.NONE);
    }

    public DocValuesConfig(CompressionMode termsDictCompressionMode, CompressionMode binaryDocValueCompressionMode) {
      this.termsDictCompressionMode = termsDictCompressionMode;
      this.binaryDocValueCompressionMode = binaryDocValueCompressionMode;
    }

    public CompressionMode getBinaryDocValueCompressionMode() {
      return binaryDocValueCompressionMode;
    }

    public boolean isTermsDictCompressionEnabled() {
      return CompressionMode.LZ4 == this.termsDictCompressionMode;
    }

    public boolean isBinaryDocValueCompressionEnabled() {
      return CompressionMode.LZ4 == this.binaryDocValueCompressionMode;
    }
  }

Before I commit the new changes, please share your thoughts on this. @bruno-roustant @zhaih
Thanks.

@msokolov
Contributor

@jaisonbi I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode? Also, I think we do have a customization route via PerFieldDocValuesFormat. It does require users to create a custom Codec, so perhaps a bit expert for most, but if we begin to add this kind of customization support as a convenience, it probably ought to go in FieldType?

@jaisonbi
Contributor Author

I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode?

Just want to add different configurations so users can enable terms dict compression only (without enabling binary DocValues compression)... Yes, I planned to add it to DocValuesFormat, and remove Mode from DocValuesFormat...

Will check the route via PerFieldDocValuesFormat... Thanks @msokolov

@bruno-roustant
Contributor

+1 to customize via PerFieldDocValuesFormat. This approach is the preferred design, as discussed earlier in LUCENE-9378 https://github.com/apache/lucene-solr/pull/2069/files#r520035432.

@zhaih
Contributor

zhaih commented Feb 2, 2021

I see, so I think for now I could test it via a customized PerFieldDocValuesFormat; I'll give the PerFieldDocValuesFormat route a try then.

Though IMO I would prefer a simpler configuration (as proposed by @jaisonbi) over customizing via PerFieldDocValuesFormat in the future, if these two compressions show different performance characteristics. If my understanding is correct, to enable only terms dict compression using PerFieldDocValuesFormat we would need to enumerate all SSDV field names in that class? That doesn't sound very maintainable if fields are regularly added and deleted. Please correct me if I'm wrong, as I'm not very familiar with the codec part...

@jaisonbi
Contributor Author

jaisonbi commented Feb 2, 2021

If I understood correctly, the PerFieldDocValuesFormat route needs a change to how SortedSetDocValuesField is used.
The idea is to add another constructor that enables terms dict compression, as below:

public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
    super(name, compression ? COMPRESSION_TYPE : TYPE);
    fieldsData = bytes;
}

And below is the definition of COMPRESSION_TYPE:

public static final FieldType COMPRESSION_TYPE = new FieldType();
static {
  COMPRESSION_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
  // add one new attribute for telling PerFieldDocValuesFormat that terms dict compression is enabled for this field
  COMPRESSION_TYPE.putAttribute("docvalue.sortedset.compression", "true");
  COMPRESSION_TYPE.freeze();
}

Not sure if I've got it right :)

@msokolov @bruno-roustant

@bruno-roustant
Contributor

I expect you don't need to change Lucene code, but can just write a new custom codec (with a specific name) which provides a custom DocValuesFormat. It extends PerFieldDocValuesFormat and implements the method
DocValuesFormat getDocValuesFormatForField(String field).
This method provides either a standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat (with a specific name to write in the index) extending Lucene80DocValuesFormat with BEST_COMPRESSION mode.
The choice can be made either based on a config (e.g. a file) which lists all compressed DocValues fields, or based on a naming convention, as sketched below.
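
A sketch of such a custom codec (all names here are hypothetical; it assumes the Lucene80DocValuesFormat(Mode) constructor introduced in LUCENE-9378, and the codec name would need to be registered via SPI):

import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;

public class MyPerFieldCompressionCodec extends FilterCodec {

  private final DocValuesFormat docValuesFormat = new PerFieldDocValuesFormat() {
    private final DocValuesFormat plain =
        new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_SPEED);
    private final DocValuesFormat compressed =
        new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);

    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
      // Hypothetical naming convention: compress fields whose name ends with ".compressed".
      return field.endsWith(".compressed") ? compressed : plain;
    }
  };

  public MyPerFieldCompressionCodec() {
    super("MyPerFieldCompressionCodec", new Lucene87Codec());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return docValuesFormat;
  }
}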

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

Thanks for the detailed suggestion:) @bruno-roustant

Will add a new custom codec supporting both compressed and uncompressed modes. I'll name it "Lucene90ExpertCodec"; in this codec, users will have more compression choices, like terms dict compression and binary DocValues compression.

One doubt: PerFieldDocValuesFormat provides a single method for getting the format according to the field name:
public abstract DocValuesFormat getDocValuesFormatForField(String field);
so we would need to use the approach @bruno-roustant suggested:

The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.

We would need to maintain the config file, and the field name list may change... Regarding the "naming convention" approach, we would need to impose rules on field name definitions (please correct me if I am wrong)... I am afraid it's a heavy burden for users who want to enable terms-dict compression for existing indices.

There are two other options:

  1. Use a global switch in "Lucene90ExpertCodec", so terms-dict compression / binary DocValues compression can be enabled easily.
  2. Build on PerFieldDocValuesFormat, but add one more method:
    public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);
    We could add a new attribute in FieldInfo to identify the compression-enabled fields.

Contributor

@bruno-roustant bruno-roustant left a comment

To be clear, I'm not suggesting that we add a new Codec in Lucene, but that you create a new custom Codec on your side, as a Lucene expert user. If you think you cannot create this custom codec because of some Lucene limitations, then I suggest that
1- We complete this PR for the scope of adding LZ4 compression to SortedSet/Sorted DocValues. We remove the latest commit about the configuration. This is the public scope of the Jira description which has been discussed.
2- Additional scope about configuration is proposed with a new Jira issue for more public visibility and discussion (it is required), such as
2a - enabling a switch to differentiate Terms-Dict / Binary DocValues compression
2b - improving PerFieldDocValuesFormat with an additional getDocValuesFormatForField(FieldInfo field) method

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

To be clear, I'm not suggesting that we add a new Codec in Lucene, but that you create a new custom Codec on your side, as a Lucene expert user

I think this improvement is also valuable to other Lucene/ES users, so I hope I can complete this PR :-).. I just committed the new changes, but some checks are still running... In the latest commit, I removed the compression mode from Lucene80DocValuesFormat and added separate switches for terms-dict compression and binary DocValues compression.

Thanks very much @bruno-roustant

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

1- We complete this PR for the scope of adding LZ4 compression to SortedSet/Sorted DocValues. We remove the latest commit about the configuration. This is the public scope of the Jira description which has been discussed.

ok...I will revert the new commits, and create another JIRA for flexible configurations. @bruno-roustant

@bruno-roustant
Contributor

@jaisonbi I didn't want to close this PR. I wanted to merge the first part related to LZ4 compression, and let you push a PR on another jira issue with the remaining commits you proposed.
Do you want to reopen the PR?
Or do you think the commits must not be separated?

@jaisonbi
Contributor Author

jaisonbi commented Feb 4, 2021

Sorry, I did something wrong while doing the revert... will create a new PR. @bruno-roustant
