LUCENE-9378: Disable compression on binary values whose length is less than 32. #1543
Conversation
Disable compression on binary values whose length is less than 32. This commit disables compression on short binary values, and also switches from "fast" compression to "high" compression for long values. The reasoning is that "high" compression tends to insert fewer, longer back references, which makes decompression slightly faster.
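For context, a minimal sketch of the writer-side switch being described, based on the LZ4 helpers visible in the diffs later in this thread; the class wrapper and method are illustrative, not the actual Lucene80DocValuesConsumer code:

import java.io.IOException;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

// Illustrative sketch only. In the real consumer the hash tables are long-lived
// fields rather than locals allocated per block.
class CompressionChoiceSketch {
  void writeBlock(byte[] block, int uncompressedBlockLength, DataOutput data) throws IOException {
    // Previous behavior: "fast" LZ4 compression.
    // LZ4.compress(block, 0, uncompressedBlockLength, data, new LZ4.FastCompressionHashTable());

    // This PR, for long values: "high" compression inserts fewer, longer back
    // references, which makes decompression slightly faster.
    LZ4.compress(block, 0, uncompressedBlockLength, data, new LZ4.HighCompressionHashTable());
  }
}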
Here are results on wikimedium10m
One small comment but otherwise LGTM
// dictionaries that allow decompressing one value at once instead of
// forcing 32 values to be decompressed even when you only need one?
if (uncompressedBlockLength >= 32 * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
Maybe create a constant for this 32 ("MIN_COMPRESSABLE_FIELD_LENGTH"?), otherwise it might be confused with the 32 we happen to use for the max number of values in a block.
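A tiny sketch of the suggested change; the constant name here is the reviewer's suggestion, and the PR later settles on BINARY_LENGTH_COMPRESSION_THRESHOLD instead:

// Hypothetical naming per the suggestion above; distinct from the block size of 32 docs.
private static final int MIN_COMPRESSABLE_FIELD_LENGTH = 32;

if (uncompressedBlockLength >= MIN_COMPRESSABLE_FIELD_LENGTH * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
}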
Thanks Adrien - speedup looks good for the BDVSort case, but I wonder if it has really recovered to the status quo ante. Were you able to run a before/after with 8.4?
I guess I'm concerned that there can still be perf degradation for longer, highly-compressible fields. Still, perhaps we can extend this approach by providing a per-field configuration if that becomes needed, so let's merge and make some progress!
Here are updated results where the baseline is master with Mark's commit (which enabled compression of binary doc values) reverted, and the patch is this pull request. The bottleneck is now reading the lengths of all values in a given block.
Our internal benchmarks show slowdowns even as we increase the threshold here, although we can nearly recover the previous query-time performance (-9% QPS) by disabling compression below 128 bytes. I'm beginning to wonder what the case is for compressing BDV at all. Do we see large BinaryDocValues fields that are not, or only rarely, decoded at query time? I think if such fields participate in the search in any significant way (i.e. for all hits), we are going to see these slowdowns. Maybe the bytes threshold is really just a proxy for binary doc values used as stored fields?
Hi @msokolov, I'm looking into improving the encoding of lengths, which is the next bottleneck for binary doc values. We are using binary doc values to run regex queries on a high-cardinality field. An ngram index helps find good candidates and binary doc values are then used for verification. Field values are typically files, URLs, ... which can have significant redundancy. I'm ok with making compression less aggressive, though I think it would be a pity to disable it entirely and never take advantage of redundancy. You mentioned slowdowns, but this actually depends on the query, e.g. I'm seeing an almost 2x speedup when sorting a …
Here are updated results for the latest version of this PR. Baseline is master with binary DV compression reverted, and the contender is this pull request. I also introduced a new … How hard would it be to run this patch on your internal benchmarks @msokolov? If this looks good or at least like a good step in the right direction, I'll work on improving tests.
Thanks, @jpountz we'll definitely try it out. Probably won't be until early next week as I'm off today, but maybe someone else can pick it up sooner.
Thanks @jpountz -- we'll test on Amazon's product search use case. Do we know why faceting and sorting see such massive speedups? Is it because of more efficient representation of the per-doc length? Anyways, it looks awesome.
I ran our internal benchmarks on the latest changes. Throughput and latency remained almost the same compared to the first version of this PR. I am in the process of collecting some detailed stats regarding fields that are getting compressed and their query-time access patterns to understand more.
@mikemccand With …
@jpountz I am still confused about the above benchmarks. If baseline was master with …
@mikemccand I think that there are two things that get combined: …
Thank you @jpountz for working to recover the lost performance here!
The change looks great; I left some small comments.
We (Amazon product search) are still somewhat mystified ... @gandhi-viral will give more specifics, but we have tried multiple options (use this PR as-is, use this PR but force-disable all compression by increasing the threshold, vary the threshold from the 32-byte average length, use this PR but only compress one of our BINARY fields, or only compress all but one of our BINARY fields) to understand the non-trivial red-line QPS performance loss we see here.
We use BINARY DV fields for multiple faceting fields, some large, some tiny, and then one biggish (~90 bytes average) one to hold metadata for each document that we load/decode on each hit.
@@ -218,6 +218,10 @@ Optimizations
* LUCENE-9087: Build always trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
  (Ignacio Vera)

* LUCENE-9378: Disabled compression on short binary values, as compression
Maybe say "Disable doc values compression on short binary values, ..."? (To make it clear we are talking about doc values and not maybe stored fields.)
@@ -64,6 +63,8 @@
/** writer for {@link Lucene80DocValuesFormat} */
final class Lucene80DocValuesConsumer extends DocValuesConsumer implements Closeable {

  private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;
Maybe add a comment? This threshold is applied to the average length across all (32) docs in each block?
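A sketch of the kind of comment being asked for; the wording is illustrative, not the javadoc that was actually merged:

/**
 * Compression is only applied to a block of (up to 32) binary values when the
 * average value length in that block is at least this many bytes; blocks of
 * shorter values are stored without compression.
 */
private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;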
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
// Write offset to this block to temporary offsets file
totalChunks++;
long thisBlockStartPointer = data.getFilePointer();

// Optimisation - check if all lengths are same
I wonder if removing this optimization is maybe hurting our usage ... not certain.
With this change, when all lengths are the same, do we now write 32 zero bytes?
If all docs are the same length, then numBytes would be 0 below and we only encode the average length, so this case is still optimized.
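A simplified illustration of why equal lengths cost nothing extra; this is not the actual Lucene80DocValuesConsumer code, just the idea of encoding per-document offsets as deltas from the block's average length, with a hypothetical helper standing in for the real flush logic:

// Returns how many bytes per document are needed to encode length deltas for a block.
// Cumulative offsets are compared against docId * averageLength; if every value has the
// same length, all deltas are 0, so numBytes is 0 and only the average is written.
static int bytesPerLengthDelta(int[] lengths) {
  long total = 0;
  for (int length : lengths) {
    total += length;
  }
  int avg = (int) (total / lengths.length);
  long maxDelta = 0;
  long offset = 0;
  for (int i = 0; i < lengths.length; i++) {
    offset += lengths[i];
    maxDelta = Math.max(maxDelta, Math.abs(offset - (long) avg * (i + 1)));
  }
  if (maxDelta == 0) {
    return 0;                  // all lengths equal: no per-document bytes at all
  } else if (maxDelta <= Byte.MAX_VALUE) {
    return Byte.BYTES;
  } else if (maxDelta <= Short.MAX_VALUE) {
    return Short.BYTES;
  } else {
    return Integer.BYTES;
  }
}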
Ahhh OK thanks for the clarification.
if (uncompressedBlockLength >= BINARY_LENGTH_COMPRESSION_THRESHOLD * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
} else {
  LZ4.compress(block, 0, uncompressedBlockLength, data, noCompHt);
Hmm, do we know that our new LZ4.NoCompressionHashTable is actually really close to doing nothing? I don't understand LZ4 well enough to know that e.g. returning -1 from the int get(int offset) method is really a no-op overall...
I verified this. In that case the compressed stream only consists of a length as a variable-length integer followed by the raw bytes. I also checked whether it helped to just call IndexOutput#writeBytes, but the benchmark didn't suggest any performance improvement (or regression), so I kept this, which keeps a single format to handle on the read side.
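For reference, one possible shape of the alternative mentioned above (bypassing LZ4 and writing the length plus raw bytes directly); this is a hypothetical sketch, not what the PR does, since keeping the LZ4 call means the read side only has one format to handle:

// Hypothetical write path that skips LZ4 entirely for incompressible blocks.
// data is an org.apache.lucene.store.IndexOutput.
void writeUncompressedBlock(IndexOutput data, byte[] block, int uncompressedBlockLength)
    throws IOException {
  data.writeVInt(uncompressedBlockLength);             // length as a variable-length integer
  data.writeBytes(block, 0, uncompressedBlockLength);  // raw bytes, no back references
}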
switch (numBytes) {
  case Integer.BYTES:
    startDelta = deltas.getInt(docInBlockId * Integer.BYTES);
    endDelta = deltas.getInt((docInBlockId + 1) * Integer.BYTES);
Doesn't that mean we need to write an extra value (33 total)? I tried to find where we are doing that (above) but could not.
The trick I'm using is that I'm reading 32 values starting at offset 1. This helps avoid a condition for the first value of the block, but we're still writing/reading only 32 values.
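A simplified sketch of that trick (illustrative only, not the actual Lucene80DocValuesProducer code): the 32 stored deltas are decoded into slots 1..32 of a 33-slot array, slot 0 stays 0, so the start of value i is always deltas[i] and its end deltas[i + 1], with no special case for the first document of the block:

// storedDeltas holds the 32 deltas read from the index for the current block;
// avgLength is the block's average value length.
static int valueLength(long[] storedDeltas, long avgLength, int docInBlockId) {
  long[] deltas = new long[33];                      // deltas[0] stays 0
  System.arraycopy(storedDeltas, 0, deltas, 1, 32);  // stored values land in slots 1..32
  long start = avgLength * docInBlockId + deltas[docInBlockId];
  long end = avgLength * (docInBlockId + 1) + deltas[docInBlockId + 1];
  return (int) (end - start);
}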
Aha! Sneaky :)
Red-line QPS (throughput) based on our internal benchmarking is still unfortunately suffering (-49%) with the latest PR. We were able to isolate one particular field, a metadata field averaging ~90 bytes, which is causing most of our regression. After disabling compression on that particular field, we are at -8% red-line QPS compared to using Lucene 8.4 BDVs. Looking further into the access pattern for that field, we see that num_access / num_blocks_decompressed = 1.51, so we are decompressing a whole block for every ~1.5 hits. By temporarily using … Thank you @jpountz for your help.
@gandhi-viral That would work for me, but I'd like to make sure we're talking about the same thing (a rough sketch follows this list):
- Lucene86DocValuesConsumer gets a ctor argument to configure the threshold.
- Lucene86DocValuesFormat keeps 32 as the default value.
- You would create your own DocValuesFormat that would reuse Lucene86DocValuesProducer and create a Lucene86DocValuesConsumer with a high threshold for compression of binary values.
- You would enable this format by overriding getDocValuesFormatForField in Lucene86Codec.
- This would mean that your indices would no longer have the backward-compatibility guarantees of the default codec (N-1), but maybe you don't care since you're re-building your indices from scratch on a regular basis?
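A rough sketch of the custom format described above. Everything here is hypothetical: the Lucene86DocValuesConsumer constructor with a threshold argument is exactly what is being proposed in this thread (it does not exist yet), and the format name and codec/extension strings are placeholders:

// Hypothetical per-field format that writes binary doc values with a very high
// compression threshold (i.e. effectively uncompressed) while reusing the stock producer.
public class NoCompBinaryDocValuesFormat extends DocValuesFormat {

  // Well above the proposed default of 32, so no realistic block ever gets compressed.
  private static final int COMPRESSION_THRESHOLD = 1024;

  public NoCompBinaryDocValuesFormat() {
    super("NoCompBinary");  // custom format name: outside the default codec's back-compat guarantees
  }

  @Override
  public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    // Assumes the proposed ctor argument for the binary compression threshold.
    return new Lucene86DocValuesConsumer(
        state, "NoCompBinaryData", "ncbd", "NoCompBinaryMeta", "ncbm", COMPRESSION_THRESHOLD);
  }

  @Override
  public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
    // Reuses the stock producer, per the proposal above.
    return new Lucene86DocValuesProducer(
        state, "NoCompBinaryData", "ncbd", "NoCompBinaryMeta", "ncbm");
  }
}

It would then be enabled by subclassing Lucene86Codec and overriding getDocValuesFormatForField to return this format for the affected fields.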
Yes, that's what I had in mind too. Currently, we are doing a similar thing after … You are right about backward-compatibility guarantees not being an issue for our use case, since we re-build our indices on each software deployment.
Hmm, could we add the parameter also to …? It is true that for us (Amazon product search) in particular it would be OK to forego backwards compatibility, but I think we shouldn't push that on others who might want to customize this / make their own Codec? At read time (…
Thinking more about the above proposal, specifically this line, which is OK for our usage but I think not so great for other users who would like to reduce or turn off the …
We lose backwards compatibility because we would have to create our own named … But couldn't we instead just subclass Lucene's default codec, and override getDocValuesFormatForField to subclass Lucene80DocValuesFormat (oh, I see, yeah we cannot do that -- this class is final). So yeah, we would lose backwards compatibility, but it's a trivially small piece of code to carry that custom … But then I wonder why not just add a …
To me we only guarantee backward compatibility for users of the default codec. With the approach you mentioned, indices would be backward compatible, but I'm seeing this as accidental rather than something we guarantee.
I wanted to look into whether we could avoid this, as it would boil down to maintaining two doc-values formats, but this might be the best way forward, since it looks like the heuristics we tried out above don't work well enough to disable compression for use cases where it hurts more than it helps.
+1. I'm afraid whether compression is a good idea for BDV or not is a very application-specific tradeoff.
See https://issues.apache.org/jira/browse/LUCENE-9378
This commit disables compression on short binary values, and also
switches from "fast" compression to "high" compression for long values.
The reasoning is that "high" compression tends to insert fewer, longer
back references, which makes decompression slightly faster.