LUCENE-9378: Disable compression on binary values whose length is less than 32. #1543
Conversation
Disable compression on binary values whose length is less than 32. This commit disables compression on short binary values, and also switches from "fast" compression to "high" compression for long values. The reasoning is that "high" compression tends to insert fewer, longer back references, which makes decompression slightly faster.
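For context, a minimal sketch of the writer-side switch being described, based on the LZ4 helpers visible in the diffs later in this thread; the class wrapper and method are illustrative, not the actual Lucene80DocValuesConsumer code:

import java.io.IOException;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

// Illustrative sketch only. In the real consumer the hash tables are long-lived
// fields rather than locals allocated per block.
class CompressionChoiceSketch {
  void writeBlock(byte[] block, int uncompressedBlockLength, DataOutput data) throws IOException {
    // Previous behavior: "fast" LZ4 compression.
    // LZ4.compress(block, 0, uncompressedBlockLength, data, new LZ4.FastCompressionHashTable());

    // This PR, for long values: "high" compression inserts fewer, longer back
    // references, which makes decompression slightly faster.
    LZ4.compress(block, 0, uncompressedBlockLength, data, new LZ4.HighCompressionHashTable());
  }
}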
Here are results on wikimedium10m
One small comment but otherwise LGTM
// dictionaries that allow decompressing one value at once instead of
// forcing 32 values to be decompressed even when you only need one?
if (uncompressedBlockLength >= 32 * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
Maybe create a constant for this 32 ("MIN_COMPRESSABLE_FIELD_LENGTH"?), otherwise it might be confused with the 32 we happen to use for the max number of values in a block.
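A tiny sketch of the suggested change; the constant name here is the reviewer's suggestion, and the PR later settles on BINARY_LENGTH_COMPRESSION_THRESHOLD instead:

// Hypothetical naming per the suggestion above; distinct from the block size of 32 docs.
private static final int MIN_COMPRESSABLE_FIELD_LENGTH = 32;

if (uncompressedBlockLength >= MIN_COMPRESSABLE_FIELD_LENGTH * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
}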
Thanks Adrien - speedup looks good for the BDVSort case, but I wonder if it has really recovered to the status quo ante. Were you able to run a before/after with 8.4?
I guess I'm concerned that there can still be perf degradation for longer, highly-compressible fields. Still, perhaps we can extend this approach by providing a per-field configuration if that becomes needed, so let's merge and make some progress!
Here are updated results where the baseline is master with Mark's commit (which enabled compression of binary doc values) reverted, and the patch is this pull request. The bottleneck is now reading the lengths of all values in a given block.
Our internal benchmarks show slowdowns even as we increase the threshold here, although we can nearly recover the previous query-time performance (-9% QPS) by disabling compression below 128 bytes. I'm beginning to wonder what the case is for compressing BDV at all. Do we see large BinaryDocValues fields that are not, or only rarely, decoded at query time? I think if such fields participate in the search in any significant way (i.e. for all hits), we are going to see these slowdowns. Maybe the bytes threshold is really just a proxy for binary doc values used as stored fields?
Hi @msokolov, I'm looking into improving the encoding of lengths, which is the next bottleneck for binary doc values. We are using binary doc values to run regex queries on a high-cardinality field. An ngram index helps find good candidates and binary doc values are then used for verification. Field values are typically files, URLs, ... which can have significant redundancy. I'm ok with making compression less aggressive, though I think it would be a pity to disable it entirely and never take advantage of redundancy. You mentioned slowdowns, but this actually depends on the query, e.g. I'm seeing an almost 2x speedup when sorting a …
Here are updated results for the latest version of this PR. Baseline is master with binary DV compression reverted, and the contender is this pull request. I also introduced a new … How hard would it be to run this patch on your internal benchmarks @msokolov? If this looks good or at least like a good step in the right direction, I'll work on improving tests.
Thanks, @jpountz we'll definitely try it out. Probably won't be until early next week as I'm off today, but maybe someone else can pick it up sooner.
Thanks @jpountz -- we'll test on Amazon's product search use case. Do we know why faceting and sorting see such massive speedups? Is it because of more efficient representation of the per-doc length? Anyways, it looks awesome.
I ran our internal benchmarks on the latest changes. Throughput and latency remained almost the same compared to the first version of this PR. I am in the process of collecting some detailed stats regarding fields that are getting compressed and their query-time access patterns to understand more.
@mikemccand With …
@jpountz I am still confused about the above benchmarks. If baseline was master with …
@mikemccand I think that there are two things that get combined: …
Thank you @jpountz for working to recover the lost performance here!
The change looks great; I left some small comments.
We (Amazon product search) are still somewhat mystified ... @gandhi-viral will give more specifics, but we have tried multiple options (use this PR as-is, use this PR but force-disable all compression by increasing the threshold, vary the threshold from the 32-byte average length, use this PR but only compress one of our BINARY fields, or only compress all but one of our BINARY fields) to understand the non-trivial red-line QPS performance loss we see here.
We use BINARY DV fields for multiple faceting fields, some large, some tiny, and then one biggish (~90 bytes average) one to hold metadata for each document that we load/decode on each hit.
@@ -218,6 +218,10 @@ Optimizations
* LUCENE-9087: Build always trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
  (Ignacio Vera)

* LUCENE-9378: Disabled compression on short binary values, as compression
Maybe say "Disable doc values compression on short binary values, ..."? (To make it clear we are talking about doc values and not maybe stored fields.)
@@ -64,6 +63,8 @@
/** writer for {@link Lucene80DocValuesFormat} */
final class Lucene80DocValuesConsumer extends DocValuesConsumer implements Closeable {

  private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;
Maybe add a comment? This threshold is applied to the average length across all (32) docs in each block?
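A sketch of the kind of comment being asked for; the wording is illustrative, not the javadoc that was actually merged:

/**
 * Compression is only applied to a block of (up to 32) binary values when the
 * average value length in that block is at least this many bytes; blocks of
 * shorter values are stored without compression.
 */
private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;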
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
// Write offset to this block to temporary offsets file
totalChunks++;
long thisBlockStartPointer = data.getFilePointer();

// Optimisation - check if all lengths are same
I wonder if removing this optimization is maybe hurting our usage ... not certain.
With this change, when all lengths are the same, do we now write 32 zero bytes?
If all docs are the same length, then numBytes would be 0 below and we only encode the average length, so this case is still optimized.
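A simplified illustration of why equal lengths cost nothing extra; this is not the actual Lucene80DocValuesConsumer code, just the idea of encoding per-document offsets as deltas from the block's average length, with a hypothetical helper standing in for the real flush logic:

// Returns how many bytes per document are needed to encode length deltas for a block.
// Cumulative offsets are compared against docId * averageLength; if every value has the
// same length, all deltas are 0, so numBytes is 0 and only the average is written.
static int bytesPerLengthDelta(int[] lengths) {
  long total = 0;
  for (int length : lengths) {
    total += length;
  }
  int avg = (int) (total / lengths.length);
  long maxDelta = 0;
  long offset = 0;
  for (int i = 0; i < lengths.length; i++) {
    offset += lengths[i];
    maxDelta = Math.max(maxDelta, Math.abs(offset - (long) avg * (i + 1)));
  }
  if (maxDelta == 0) {
    return 0;                  // all lengths equal: no per-document bytes at all
  } else if (maxDelta <= Byte.MAX_VALUE) {
    return Byte.BYTES;
  } else if (maxDelta <= Short.MAX_VALUE) {
    return Short.BYTES;
  } else {
    return Integer.BYTES;
  }
}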
Ahhh OK thanks for the clarification.
if (uncompressedBlockLength >= BINARY_LENGTH_COMPRESSION_THRESHOLD * numDocsInCurrentBlock) {
  LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
} else {
  LZ4.compress(block, 0, uncompressedBlockLength, data, noCompHt);
Hmm, do we know that our new LZ4.NoCompressionHashTable is actually really close to doing nothing? I don't understand LZ4 well enough to know that e.g. returning -1 from the int get(int offset) method is really a no-op overall...
I verified this. In that case the compressed stream only consists of a length as a variable-length integer followed by the raw bytes. I also checked whether it helped to just call IndexOutput#writeBytes, but the benchmark didn't suggest any performance improvement (or regression), so I kept this, which keeps a single format to handle on the read side.
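For reference, one possible shape of the alternative mentioned above (bypassing LZ4 and writing the length plus raw bytes directly); this is a hypothetical sketch, not what the PR does, since keeping the LZ4 call means the read side only has one format to handle:

// Hypothetical write path that skips LZ4 entirely for incompressible blocks.
// data is an org.apache.lucene.store.IndexOutput.
void writeUncompressedBlock(IndexOutput data, byte[] block, int uncompressedBlockLength)
    throws IOException {
  data.writeVInt(uncompressedBlockLength);             // length as a variable-length integer
  data.writeBytes(block, 0, uncompressedBlockLength);  // raw bytes, no back references
}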
switch (numBytes) {
  case Integer.BYTES:
    startDelta = deltas.getInt(docInBlockId * Integer.BYTES);
    endDelta = deltas.getInt((docInBlockId + 1) * Integer.BYTES);
Doesn't that mean we need to write an extra value (33 total)? I tried to find where we are doing that (above) but could not.
The trick I'm using is that I'm reading 32 values starting at offset 1. This helps avoid a condition for the first value of the block, but we're still writing/reading only 32 values.
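A simplified sketch of that trick (illustrative only, not the actual Lucene80DocValuesProducer code): the 32 stored deltas are decoded into slots 1..32 of a 33-slot array, slot 0 stays 0, so the start of value i is always deltas[i] and its end deltas[i + 1], with no special case for the first document of the block:

// storedDeltas holds the 32 deltas read from the index for the current block;
// avgLength is the block's average value length.
static int valueLength(long[] storedDeltas, long avgLength, int docInBlockId) {
  long[] deltas = new long[33];                      // deltas[0] stays 0
  System.arraycopy(storedDeltas, 0, deltas, 1, 32);  // stored values land in slots 1..32
  long start = avgLength * docInBlockId + deltas[docInBlockId];
  long end = avgLength * (docInBlockId + 1) + deltas[docInBlockId + 1];
  return (int) (end - start);
}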
Aha! Sneaky :)
Red-line QPS (throughput) based on our internal benchmarking is still unfortunately suffering (-49%) with the latest PR. We were able to isolate one particular field, a metadata field averaging ~90 bytes, which is causing most of our regression. After disabling compression on that particular field, we are at -8% red-line QPS compared to using Lucene 8.4 BDVs. Looking further into the access pattern for that field, we see that num_access / num_blocks_decompressed = 1.51, so we are decompressing a whole block for every ~1.5 hits. By temporarily using … Thank you @jpountz for your help.
@gandhi-viral That would work for me, but I'd like to make sure we're talking about the same thing (a rough sketch follows this list):
- Lucene86DocValuesConsumer gets a ctor argument to configure the threshold.
- Lucene86DocValuesFormat keeps 32 as the default value.
- You would create your own DocValuesFormat that would reuse Lucene86DocValuesProducer and create a Lucene86DocValuesConsumer with a high threshold for compression of binary values.
- You would enable this format by overriding getDocValuesFormatForField in Lucene86Codec.
- This would mean that your indices would no longer have the backward-compatibility guarantees of the default codec (N-1), but maybe you don't care since you're re-building your indices from scratch on a regular basis?
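A rough sketch of the custom format described above. Everything here is hypothetical: the Lucene86DocValuesConsumer constructor with a threshold argument is exactly what is being proposed in this thread (it does not exist yet), and the format name and codec/extension strings are placeholders:

// Hypothetical per-field format that writes binary doc values with a very high
// compression threshold (i.e. effectively uncompressed) while reusing the stock producer.
public class NoCompBinaryDocValuesFormat extends DocValuesFormat {

  // Well above the proposed default of 32, so no realistic block ever gets compressed.
  private static final int COMPRESSION_THRESHOLD = 1024;

  public NoCompBinaryDocValuesFormat() {
    super("NoCompBinary");  // custom format name: outside the default codec's back-compat guarantees
  }

  @Override
  public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    // Assumes the proposed ctor argument for the binary compression threshold.
    return new Lucene86DocValuesConsumer(
        state, "NoCompBinaryData", "ncbd", "NoCompBinaryMeta", "ncbm", COMPRESSION_THRESHOLD);
  }

  @Override
  public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
    // Reuses the stock producer, per the proposal above.
    return new Lucene86DocValuesProducer(
        state, "NoCompBinaryData", "ncbd", "NoCompBinaryMeta", "ncbm");
  }
}

It would then be enabled by subclassing Lucene86Codec and overriding getDocValuesFormatForField to return this format for the affected fields.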
Yes, that's what I had in mind too. Currently, we are doing a similar thing after … You are right about backward-compatibility guarantees not being an issue for our use case, since we re-build our indices on each software deployment.
Hmm, could we add the parameter also to …? It is true that for us (Amazon product search) in particular it would be OK to forego backwards compatibility, but I think we shouldn't push that on others who might want to customize this / make their own Codec? At read time (…
Thinking more about the above proposal, specifically this line, which is OK for our usage but I think not so great for other users who would like to reduce or turn off the …
We lose backwards compatibility because we would have to create our own named … But couldn't we instead just subclass Lucene's default codec, and override getDocValuesFormatForField to subclass Lucene80DocValuesFormat (oh, I see, yeah we cannot do that -- this class is final). So yeah, we would lose backwards compatibility, but it's a trivially small piece of code to carry that custom … But then I wonder why not just add a …
To me we only guarantee backward compatibility for users of the default codec. With the approach you mentioned, indices would be backward compatible, but I'm seeing this as accidental rather than something we guarantee.
I wanted to look into whether we could avoid this, as it would boil down to maintaining two doc-values formats, but this might be the best way forward, since it looks like the heuristics we tried out above don't work well enough to disable compression for use cases where it hurts more than it helps.
+1. I'm afraid whether compression is a good idea for BDV or not is a very application-specific tradeoff.
See https://issues.apache.org/jira/browse/LUCENE-9378
This commit disables compression on short binary values, and also
switches from "fast" compression to "high" compression for long values.
The reasoning is that "high" compression tends to insert fewer, longer
back references, which makes decompression slightly faster.