
Use group-varint encoding for the tail of postings #12782

Merged: 13 commits into apache:main, Nov 20, 2023

Conversation

@easyice (Contributor) commented Nov 8, 2023

As discussed in issue #12717.

The read performance of group-varint is 14-30% faster than vint; size (16-248) is the number of ints to be read. Feel free to close the PR if the performance improvement is not enough :)

Benchmark                (size)   Mode  Cnt   Score   Error   Units
GroupVInt.readGroupVInt      16  thrpt    5  30.743 ± 5.054  ops/us
GroupVInt.readGroupVInt      32  thrpt    5  14.495 ± 0.606  ops/us
GroupVInt.readGroupVInt      64  thrpt    5   6.930 ± 4.679  ops/us
GroupVInt.readGroupVInt     128  thrpt    5   3.593 ± 0.687  ops/us
GroupVInt.readGroupVInt     248  thrpt    5   2.356 ± 0.073  ops/us
GroupVInt.readVInt           16  thrpt    5  21.437 ± 1.102  ops/us
GroupVInt.readVInt           32  thrpt    5  10.482 ± 3.620  ops/us
GroupVInt.readVInt           64  thrpt    5   5.966 ± 0.707  ops/us
GroupVInt.readVInt          128  thrpt    5   2.750 ± 1.668  ops/us
GroupVInt.readVInt          248  thrpt    5   1.606 ± 0.042  ops/us
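For context, here is a minimal, self-contained sketch of the group-varint idea being benchmarked. This is an illustration only, not the committed Lucene code, and the flag layout (value i's byte length stored in bits 2i..2i+1) is an assumption of the sketch: each group is one flag byte followed by four little-endian values of 1-4 bytes each, where each 2-bit flag field records a value's byte length minus one.

// Illustrative group-varint roundtrip; not Lucene's actual implementation.
public class GroupVarintSketch {

  static int encodeGroup(int[] values, int off, byte[] dest, int destOff) {
    final int flagOff = destOff++; // reserve room for the flag byte
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[off + i];
      // number of bytes (1..4) needed to hold v, treated as unsigned
      final int numBytes = (39 - Integer.numberOfLeadingZeros(v | 1)) >>> 3;
      flag |= (numBytes - 1) << (i * 2);
      for (int j = 0; j < numBytes; j++) {
        dest[destOff++] = (byte) v;
        v >>>= 8;
      }
    }
    dest[flagOff] = (byte) flag;
    return destOff; // new write offset
  }

  static int decodeGroup(byte[] src, int srcOff, int[] values, int off) {
    final int flag = Byte.toUnsignedInt(src[srcOff++]);
    for (int i = 0; i < 4; i++) {
      final int numBytes = ((flag >>> (i * 2)) & 0x03) + 1;
      int v = 0;
      for (int j = 0; j < numBytes; j++) {
        v |= (src[srcOff++] & 0xFF) << (j * 8);
      }
      values[off + i] = v;
    }
    return srcOff; // new read offset
  }

  public static void main(String[] args) {
    int[] input = {5, 300, 70_000, 20_000_000};
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 * 4 value bytes
    int len = encodeGroup(input, 0, buf, 0);
    int[] output = new int[4];
    decodeGroup(buf, 0, output, 0);
    System.out.println(len + " bytes: " + java.util.Arrays.toString(output));
  }
}

The payoff over plain vint is that per-value lengths come from one flag byte instead of a continuation bit in every byte, which removes most decode branches.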

@jpountz (Contributor) left a comment

Thanks for looking into it! I left some questions.

readVInts(docs, 0, limit);
return;
}
int groupValues = limit / 4 * 4;
Contributor:

We can do this with a single instruction I believe?

Suggested change
int groupValues = limit / 4 * 4;
int groupValues = limit & 0xFFFFFFFC;

Contributor Author:

nice idea :)
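For non-negative limit the two forms agree, since clearing the two low bits rounds down to the nearest multiple of 4; they differ only for negative values, which cannot occur here. A quick self-check, as a sketch:

// Sanity check: for limit >= 0, masking off the two low bits
// (0xFFFFFFFC == ~3) equals limit / 4 * 4.
for (int limit = 0; limit <= 1024; limit++) {
  assert limit / 4 * 4 == (limit & 0xFFFFFFFC);
}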

for (int i = 0; i < groupValues; i++) {
cur = i % 4;
if (cur == 0) {
groupLengths = flagToLengths[Byte.toUnsignedInt(bytes[offset++])];
Contributor:

I wonder if the flagToLengths table approach is best on Java because of bounds checks vs. recomputing the length from the flag using shifts/masks.

Member:

It also looks scary big, like 4KB? It could have a negative impact on the CPU cache that may not show up in a JMH benchmark.

if (cur == 0) {
groupLengths = flagToLengths[Byte.toUnsignedInt(bytes[offset++])];
}
docs[i] = (int) BitUtil.VH_LE_INT.get(bytes, offset) & MASKS[groupLengths[cur]];
Contributor:

And likewise for masks, I wonder if the table lookup is actually better than recomputing the mask.
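A sketch of what recomputing both from the flag byte could look like, replacing the flagToLengths and MASKS tables with shifts. This is an illustration under the same assumed flag layout as above, not necessarily the code that was committed, and it assumes the buffer keeps at least 3 padding bytes after the last value, since the var-handle read always touches 4 bytes:

// Decode one group of 4 ints, deriving length and mask from the flag byte
// with shifts instead of table lookups. Uses org.apache.lucene.util.BitUtil.
static int decodeGroup(byte[] bytes, int offset, long[] docs, int docsOff) {
  final int flag = Byte.toUnsignedInt(bytes[offset++]);
  for (int j = 0; j < 4; j++) {
    final int numBytes = ((flag >>> (j * 2)) & 0x03) + 1; // 1..4 bytes
    final long mask = (1L << (numBytes << 3)) - 1;        // low numBytes*8 bits
    docs[docsOff + j] = (int) BitUtil.VH_LE_INT.get(bytes, offset) & mask;
    offset += numBytes;
  }
  return offset;
}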

* Encode integers using group-varint. It uses VInt to encode tail values that are not enough for a
* group
*/
class GroupVintWriter {
Contributor:

Use an upper-case I for consistency with DataInput#readVInt?

Suggested change
class GroupVintWriter {
class GroupVIntWriter {

Contributor Author:

+1, sorry for my mistake ;)

@easyice (Contributor Author) commented Nov 10, 2023

@jpountz @rmuir Thanks for your suggestions, they are very helpful! I will run the benchmark for recomputing the length vs. table lookup.

@easyice (Contributor Author) commented Nov 11, 2023

@jpountz You are right, recomputing the length is faster than the table lookup. Here is the benchmark for reading ints where each value takes 4 bytes:

GroupVInt.readGroupVInt                  16  thrpt    5  11.822 ± 0.187  ops/us
GroupVInt.readGroupVInt                  32  thrpt    5   7.558 ± 0.209  ops/us
GroupVInt.readGroupVInt                  64  thrpt    5   3.556 ± 0.344  ops/us
GroupVInt.readGroupVInt                 128  thrpt    5   1.786 ± 0.145  ops/us
GroupVInt.readGroupVInt                 248  thrpt    5   0.972 ± 0.025  ops/us
GroupVInt.readGroupVIntWithoutTable      16  thrpt    5  19.787 ± 2.848  ops/us
GroupVInt.readGroupVIntWithoutTable      32  thrpt    5  10.162 ± 1.491  ops/us
GroupVInt.readGroupVIntWithoutTable      64  thrpt    5   5.141 ± 0.121  ops/us
GroupVInt.readGroupVIntWithoutTable     128  thrpt    5   2.247 ± 0.017  ops/us
GroupVInt.readGroupVIntWithoutTable     248  thrpt    5   1.183 ± 0.014  ops/us
GroupVInt.readVInt                       16  thrpt    5  12.679 ± 0.405  ops/us
GroupVInt.readVInt                       32  thrpt    5   6.519 ± 0.247  ops/us
GroupVInt.readVInt                       64  thrpt    5   3.218 ± 0.804  ops/us
GroupVInt.readVInt                      128  thrpt    5   1.762 ± 0.096  ops/us
GroupVInt.readVInt                      248  thrpt    5   0.887 ± 0.035  ops/us

But I found that group-varint encoding is faster than vint only when a value takes up 4 bytes. Here is the benchmark for reading 64 int values; numBytesOfInt is the number of bytes each int value takes:

Benchmark                            (numBytesOfInt)  (size)   Mode  Cnt  Score   Error   Units
GroupVInt.readGroupVIntWithoutTable                1      64  thrpt    5  5.099 ± 0.147  ops/us
GroupVInt.readGroupVIntWithoutTable                2      64  thrpt    5  4.982 ± 0.632  ops/us
GroupVInt.readGroupVIntWithoutTable                3      64  thrpt    5  5.194 ± 0.163  ops/us
GroupVInt.readGroupVIntWithoutTable                4      64  thrpt    5  4.923 ± 0.092  ops/us
GroupVInt.readVInt                                 1      64  thrpt    5  8.433 ± 0.287  ops/us
GroupVInt.readVInt                                 2      64  thrpt    5  6.309 ± 0.155  ops/us
GroupVInt.readVInt                                 3      64  thrpt    5  5.196 ± 0.213  ops/us
GroupVInt.readVInt                                 4      64  thrpt    5  3.300 ± 0.302  ops/us

Group-varint decoding takes roughly constant time regardless of value width, while vint decoding gets faster as values take up fewer bytes, so the actual payoff depends on factors like maxDoc.

@jpountz (Contributor) commented Nov 13, 2023

Could you check in your benchmark under lucene/benchmark-jmh so that we could play with it?

@jpountz (Contributor) commented Nov 13, 2023

At least in theory, group varint could be made faster than vints even with single-byte integers, because a single check on flag == 0 would tell us that all 4 integers have a single byte. Now, I don't know if we should do it; this doesn't sound like the most common case for doc IDs.

Your change keeps mixing doc IDs and frequencies. I wonder if we should write them in separate varint blocks?
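For illustration, the flag == 0 shortcut could look like the following, under the same assumed flag layout as the sketches above (discussed here but, as noted, possibly not worth doing):

// Hypothetical fast path: a zero flag byte means all four 2-bit length
// fields are zero, i.e. each of the four values occupies exactly one byte.
int flag = Byte.toUnsignedInt(bytes[offset++]);
if (flag == 0) {
  for (int j = 0; j < 4; j++) {
    docs[i + j] = bytes[offset + j] & 0xFF;
  }
  offset += 4;
} else {
  // general per-value decoding as before
}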

@easyice (Contributor Author) commented Nov 14, 2023

Thank you @jpountz, I pushed the benchmark code and added a new comparison of ByteArrayDataInput vs. ByteBufferIndexInput. For readVInt, ByteBufferIndexInput is a bit slower than ByteArrayDataInput. With some minor optimizations, group-varint is faster than vint when values take more than 1 byte.

> Now, I don't know if we should do it

+1

> I wonder if we should write them in separate varint blocks?

It's a good idea; it can use less memory when decoding.

Benchmark                                   (numBytesPerInt)  (size)   Mode  Cnt   Score   Error   Units
GroupVIntBenchmark.byteArrayReadGroupVInt                  1      64  thrpt    5   8.113 ± 1.135  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  2      64  thrpt    5   6.343 ± 0.058  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  3      64  thrpt    5   6.339 ± 0.162  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  4      64  thrpt    5   6.268 ± 0.743  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       1      64  thrpt    5  20.325 ± 0.896  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       2      64  thrpt    5   7.303 ± 0.350  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       3      64  thrpt    5   4.333 ± 0.261  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       4      64  thrpt    5   3.236 ± 0.030  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 1      64  thrpt    5   8.063 ± 0.890  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 2      64  thrpt    5   6.518 ± 0.203  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 3      64  thrpt    5   6.367 ± 0.362  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 4      64  thrpt    5   6.526 ± 0.245  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      1      64  thrpt    5  19.794 ± 1.177  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      2      64  thrpt    5   6.081 ± 0.144  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      3      64  thrpt    5   4.139 ± 0.102  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      4      64  thrpt    5   3.112 ± 0.049  ops/us

Commits added:

  • Bulk decoding rather than one by one.
  • Write docs and freqs separately instead of interleaved.
  • Write freqs as regular vints, as the benchmark suggests single-byte vints are fast and freqs are often small.
  • Remove `len`/`numGroup` to save space.
  • Read directly from the directory instead of using an intermediate buffer. This helps save memory copies.
@jpountz (Contributor) commented Nov 17, 2023

Thanks @easyice. I took some time to look into the benchmark and improve a few things, hopefully you don't mind. Here is the output of the benchmark on my machine now:

Benchmark                                   (numBytesPerInt)  (size)   Mode  Cnt   Score   Error   Units
GroupVIntBenchmark.byteArrayReadGroupVInt                  1      64  thrpt    5  24.483 ± 0.345  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  2      64  thrpt    5  23.346 ± 0.288  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  3      64  thrpt    5  16.318 ± 0.062  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                  4      64  thrpt    5  24.748 ± 0.993  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       1      64  thrpt    5  17.767 ± 0.081  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       2      64  thrpt    5   7.256 ± 0.013  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       3      64  thrpt    5   5.546 ± 0.449  ops/us
GroupVIntBenchmark.byteArrayReadVInt                       4      64  thrpt    5   4.475 ± 0.021  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 1      64  thrpt    5  21.812 ± 0.485  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 2      64  thrpt    5  20.623 ± 1.454  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 3      64  thrpt    5  13.601 ± 0.299  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                 4      64  thrpt    5  22.649 ± 0.662  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      1      64  thrpt    5  22.147 ± 0.083  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      2      64  thrpt    5   8.072 ± 0.116  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      3      64  thrpt    5   4.554 ± 0.394  ops/us
GroupVIntBenchmark.byteBufferReadVInt                      4      64  thrpt    5   4.145 ± 0.674  ops/us

The benchmark used to read directly from the in-memory byte[] by calling rewind(); I changed that to force it to read from the directory, to make the comparison a bit more fair.

buffer[bufferOffset++] = v;
}

public void reset(int numValues) {
Contributor:

Let's remove the numValues parameter since it looks like we can't actually rely on it?

Contributor Author:

+1. It's simpler now.

byteBufferVIntIn.seek(0);
for (int i = 0; i < size; i++) {
values[i] = byteBufferVIntIn.readVInt();
}
Contributor:

Do we need to pass the values array to a Blackhole object to make sure that the JVM doesn't optimize away some of the decoding logic?

Contributor Author:

Thank you very much for your guidance! ...
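For reference, the usual JMH pattern is to either return the decoded array from the benchmark method or consume it explicitly, so the JIT cannot eliminate the decode loop as dead code. A minimal sketch; the state fields (byteBufferVIntIn, size, values) mirror the snippet above and are assumed to be set up elsewhere:

import java.io.IOException;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

@Benchmark
public void byteBufferReadVInt(Blackhole bh) throws IOException {
  byteBufferVIntIn.seek(0);
  for (int i = 0; i < size; i++) {
    values[i] = byteBufferVIntIn.readVInt();
  }
  bh.consume(values); // keeps the JIT from optimizing the loop away
}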

}

/** only readValues or nextInt can be called after reset */
public void readValues(long[] docs, int limit) throws IOException {
Contributor:

I'm not sure we need both a reset() and readValues(), maybe readValues() could take a DataInput directly?

Contributor Author:

+1

return;
}
encodeValues(buffer, bufferOffset);
}
Contributor:

What about making the API look more like the reader API, i.e. replacing reset/add/flush with a single writeValues(DataOutput, long[] values, int limit) API?

Contributor Author:

+1.

// encode each group
while ((limit - off) >= 4) {
// the maximum size of one group is 4 ints + 1 byte flag.
bytes = ArrayUtil.grow(bytes, byteOffset + 17);
Contributor:

Could we write the data to out here instead of growing the buffer?

Contributor Author:

+1, but a 17-byte array for a single group is still required, because DataOutput cannot write an integer using a specified number of bytes. Is that okay?
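As a sketch, the agreed approach could encode each group into a small reusable scratch array and hand it to the DataOutput in one call. encodeGroup here is a hypothetical helper that fills scratch and returns the encoded length; DataOutput#writeBytes(byte[], int, int) is existing Lucene API:

// Per-group scratch: 1 flag byte + 4 * 4 value bytes = 17 bytes max.
final byte[] scratch = new byte[17];
while ((limit - off) >= 4) {
  final int len = encodeGroup(values, off, scratch); // hypothetical helper
  out.writeBytes(scratch, 0, len);
  off += 4;
}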

@easyice (Contributor Author) commented Nov 18, 2023

Wow, what an incredible speedup! I would not have expected bulk decoding with direct reads to be so much faster than reading from an array. Thank you for your time @jpountz, and I'm sorry I didn't try this approach.

@easyice (Contributor Author) left a comment

@jpountz Thanks a lot for the great suggestions. :)


@easyice easyice marked this pull request as ready for review November 18, 2023 17:43
@easyice (Contributor Author) commented Nov 20, 2023

I ran some rounds of wikimediumall (sometimes there is noise). It shows a bit of a speedup:

.doc files were 0.4% larger overall (5.45GB to 5.47GB)

Round 1

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort       67.15      (6.4%)       67.35      (5.2%)    0.3% ( -10% -   12%) 0.872
             MedIntervalsOrdered        7.03      (4.2%)        7.07      (4.4%)    0.5% (  -7% -    9%) 0.731
             LowIntervalsOrdered        4.38      (4.3%)        4.40      (4.4%)    0.5% (  -7% -    9%) 0.690
                    OrNotHighMed      178.75      (3.7%)      179.82      (5.8%)    0.6% (  -8% -   10%) 0.697
                          IntNRQ        7.32      (8.5%)        7.36      (8.6%)    0.6% ( -15% -   19%) 0.821
                HighSloppyPhrase        8.50      (3.5%)        8.56      (4.1%)    0.7% (  -6% -    8%) 0.547
                        HighTerm      268.55      (3.7%)      270.62      (5.1%)    0.8% (  -7% -    9%) 0.583
                   OrHighNotHigh      250.85      (4.3%)      253.02      (4.4%)    0.9% (  -7% -    9%) 0.530
               HighTermMonthSort      894.77      (5.0%)      904.83      (6.6%)    1.1% ( -10% -   13%) 0.545
                    OrHighNotMed      188.96      (4.5%)      191.13      (5.2%)    1.1% (  -8% -   11%) 0.455
                      OrHighHigh       23.02      (6.2%)       23.31      (6.8%)    1.3% ( -11% -   15%) 0.543
                     AndHighHigh       18.21      (3.1%)       18.47      (5.2%)    1.4% (  -6% -   10%) 0.302
               HighTermTitleSort       64.91      (4.1%)       65.87      (5.5%)    1.5% (  -7% -   11%) 0.337
                    OrHighNotLow      164.86      (3.3%)      167.51      (5.5%)    1.6% (  -7% -   10%) 0.266
           HighTermDayOfYearSort      141.25      (6.0%)      143.54      (3.5%)    1.6% (  -7% -   11%) 0.298
                    HighSpanNear        2.93      (4.9%)        2.97      (5.2%)    1.6% (  -8% -   12%) 0.309
            HighIntervalsOrdered        1.22      (5.2%)        1.24      (4.7%)    1.7% (  -7% -   12%) 0.294
                      HighPhrase        4.27      (5.0%)        4.34      (6.0%)    1.7% (  -8% -   13%) 0.320
                 MedSloppyPhrase       19.72      (3.3%)       20.08      (4.8%)    1.8% (  -6% -   10%) 0.163
                        Wildcard       44.12      (2.2%)       44.94      (2.4%)    1.9% (  -2% -    6%) 0.011
                     MedSpanNear        5.58      (4.5%)        5.68      (5.3%)    1.9% (  -7% -   12%) 0.234
                   OrNotHighHigh      122.46      (2.9%)      124.76      (5.2%)    1.9% (  -6% -   10%) 0.158
                     LowSpanNear        3.35      (3.8%)        3.42      (4.6%)    1.9% (  -6% -   10%) 0.147
                         MedTerm      216.60      (4.3%)      220.85      (6.6%)    2.0% (  -8% -   13%) 0.264
                 LowSloppyPhrase       14.57      (2.9%)       14.86      (4.0%)    2.0% (  -4% -    9%) 0.072
                       OrHighMed       34.95      (5.3%)       35.66      (6.2%)    2.0% (  -8% -   14%) 0.264
                       OrHighLow      269.80      (4.0%)      275.59      (6.2%)    2.1% (  -7% -   12%) 0.196
                       LowPhrase       80.24      (4.3%)       82.45      (6.7%)    2.8% (  -7% -   14%) 0.122
                       MedPhrase       40.89      (5.0%)       42.10      (6.9%)    3.0% (  -8% -   15%) 0.122
                         LowTerm      233.14      (6.5%)      240.96      (6.7%)    3.4% (  -9% -   17%) 0.108
                      AndHighLow      279.72      (3.6%)      290.01      (5.9%)    3.7% (  -5% -   13%) 0.017
                          Fuzzy2       34.50      (2.5%)       35.87      (4.3%)    4.0% (  -2% -   11%) 0.000
                      AndHighMed       50.26      (4.2%)       52.31      (5.3%)    4.1% (  -5% -   14%) 0.007
                    OrNotHighLow      209.13      (4.1%)      219.47      (5.7%)    4.9% (  -4% -   15%) 0.002
                        PKLookup       84.26      (3.3%)       88.56      (4.7%)    5.1% (  -2% -   13%) 0.000
                          Fuzzy1       38.26      (2.6%)       40.23      (3.2%)    5.1% (   0% -   11%) 0.000
                         Respell       21.14      (2.3%)       22.52      (2.8%)    6.6% (   1% -   11%) 0.000
                         Prefix3       96.26      (4.8%)      104.41      (3.6%)    8.5% (   0% -   17%) 0.000

Round 2

                           TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ       13.42      (5.4%)       12.79      (2.5%)   -4.7% ( -11% -    3%) 0.000
            HighIntervalsOrdered        4.01      (4.3%)        3.96      (4.5%)   -1.2% (  -9% -    8%) 0.409
             LowIntervalsOrdered        8.26      (3.7%)        8.25      (4.0%)   -0.2% (  -7% -    7%) 0.892
                HighSloppyPhrase        1.22      (2.9%)        1.22      (3.0%)   -0.2% (  -5% -    5%) 0.862
                    HighSpanNear        2.90      (3.0%)        2.91      (4.5%)    0.4% (  -6% -    8%) 0.755
                     LowSpanNear       14.61      (3.9%)       14.71      (5.1%)    0.7% (  -7% -   10%) 0.639
                      TermDTSort       76.54      (4.7%)       77.12      (5.4%)    0.8% (  -8% -   11%) 0.632
                     MedSpanNear        4.14      (3.2%)        4.18      (4.5%)    0.9% (  -6% -    8%) 0.476
                      OrHighHigh        9.30      (8.7%)        9.40      (6.9%)    1.0% ( -13% -   18%) 0.676
                 LowSloppyPhrase       20.69      (6.1%)       20.91      (5.3%)    1.1% (  -9% -   13%) 0.560
                 MedSloppyPhrase        5.21      (2.6%)        5.27      (2.6%)    1.1% (  -4% -    6%) 0.190
             MedIntervalsOrdered       18.39      (3.0%)       18.71      (3.7%)    1.7% (  -4% -    8%) 0.105
               HighTermMonthSort      865.08      (5.2%)      880.31      (4.7%)    1.8% (  -7% -   12%) 0.263
                        Wildcard       67.89      (3.6%)       69.42      (3.8%)    2.3% (  -4% -    9%) 0.053
                    OrHighNotLow      121.81      (6.0%)      124.59      (3.3%)    2.3% (  -6% -   12%) 0.132
           HighTermDayOfYearSort      144.72      (5.2%)      148.06      (3.9%)    2.3% (  -6% -   12%) 0.115
               HighTermTitleSort       63.45      (4.3%)       65.04      (4.4%)    2.5% (  -5% -   11%) 0.069
                       LowPhrase       24.95      (4.4%)       25.70      (5.5%)    3.0% (  -6% -   13%) 0.056
                       MedPhrase       79.79      (6.4%)       82.22      (5.8%)    3.0% (  -8% -   16%) 0.113
                       OrHighMed       31.22      (7.5%)       32.20      (7.3%)    3.1% ( -10% -   19%) 0.179
                   OrHighNotHigh       81.94      (5.7%)       84.83      (3.8%)    3.5% (  -5% -   13%) 0.022
                     AndHighHigh       15.36      (7.6%)       15.94      (5.4%)    3.7% (  -8% -   18%) 0.073
                      HighPhrase       17.98      (4.2%)       18.66      (3.5%)    3.7% (  -3% -   12%) 0.002
                   OrNotHighHigh      157.83      (5.0%)      163.97      (3.6%)    3.9% (  -4% -   13%) 0.005
                    OrNotHighMed      114.61      (6.2%)      119.24      (5.2%)    4.0% (  -6% -   16%) 0.025
                    OrHighNotMed      132.35      (5.3%)      137.75      (3.7%)    4.1% (  -4% -   13%) 0.005
                        HighTerm      197.10      (6.8%)      205.28      (5.2%)    4.2% (  -7% -   17%) 0.030
                        PKLookup       82.78      (3.0%)       86.77      (6.1%)    4.8% (  -4% -   14%) 0.002
                         Prefix3      193.36      (5.7%)      202.80      (4.7%)    4.9% (  -5% -   16%) 0.003
                         MedTerm      255.38      (5.8%)      268.76      (6.2%)    5.2% (  -6% -   18%) 0.006
                      AndHighMed       47.21      (7.3%)       49.78     (10.1%)    5.5% ( -11% -   24%) 0.050
                         Respell       29.43      (2.5%)       31.07      (3.9%)    5.6% (   0% -   12%) 0.000
                          Fuzzy2       35.54      (2.8%)       37.52      (5.2%)    5.6% (  -2% -   14%) 0.000
                         LowTerm      265.21      (7.4%)      280.08      (8.5%)    5.6% (  -9% -   23%) 0.026
                    OrNotHighLow      186.93      (6.7%)      197.72      (4.6%)    5.8% (  -5% -   18%) 0.001
                      AndHighLow      170.86      (5.8%)      182.37      (3.9%)    6.7% (  -2% -   17%) 0.000
                       OrHighLow      186.98      (5.1%)      200.07      (4.8%)    7.0% (  -2% -   17%) 0.000
                          Fuzzy1       39.35      (2.5%)       42.58      (5.1%)    8.2% (   0% -   16%) 0.000

Round 3

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
             MedIntervalsOrdered       18.38      (4.2%)       18.37      (4.6%)   -0.0% (  -8% -    9%) 0.975
                    HighSpanNear        4.73      (2.7%)        4.77      (2.5%)    0.9% (  -4% -    6%) 0.261
                     MedSpanNear        6.79      (2.7%)        6.86      (3.3%)    1.0% (  -4% -    7%) 0.306
                      TermDTSort       78.98      (7.1%)       79.85      (6.5%)    1.1% ( -11% -   15%) 0.609
            HighIntervalsOrdered        6.24      (4.5%)        6.32      (5.5%)    1.4% (  -8% -   11%) 0.391
                      OrHighHigh       17.95      (4.7%)       18.19      (6.7%)    1.4% (  -9% -   13%) 0.454
             LowIntervalsOrdered        8.38      (3.0%)        8.50      (4.1%)    1.5% (  -5% -    8%) 0.189
                HighSloppyPhrase        3.26      (4.7%)        3.31      (4.8%)    1.6% (  -7% -   11%) 0.285
           HighTermDayOfYearSort      131.00      (5.2%)      133.44      (3.5%)    1.9% (  -6% -   11%) 0.185
               HighTermMonthSort      903.94      (5.6%)      921.63      (5.9%)    2.0% (  -9% -   14%) 0.284
                       OrHighMed       26.33      (4.4%)       26.85      (6.1%)    2.0% (  -8% -   13%) 0.233
                   OrNotHighHigh       87.49      (7.2%)       89.27      (5.2%)    2.0% (  -9% -   15%) 0.304
                     AndHighHigh       11.17      (3.6%)       11.41      (5.6%)    2.2% (  -6% -   11%) 0.146
                 MedSloppyPhrase        7.89      (3.2%)        8.07      (3.8%)    2.4% (  -4% -    9%) 0.033
                        HighTerm      199.48      (7.5%)      205.16      (7.7%)    2.8% ( -11% -   19%) 0.238
                      AndHighMed       30.06      (5.6%)       30.92      (6.9%)    2.9% (  -9% -   16%) 0.150
                    OrNotHighLow      218.73      (6.3%)      225.25      (7.3%)    3.0% (  -9% -   17%) 0.166
                       OrHighLow      172.98      (5.5%)      178.21      (6.5%)    3.0% (  -8% -   15%) 0.114
                       MedPhrase       18.46      (6.0%)       19.02      (5.1%)    3.1% (  -7% -   15%) 0.080
                         MedTerm      237.91      (8.6%)      245.24      (7.2%)    3.1% ( -11% -   20%) 0.219
                   OrHighNotHigh      120.17      (6.9%)      124.04      (4.2%)    3.2% (  -7% -   15%) 0.075
                      HighPhrase       53.66      (5.0%)       55.43      (4.7%)    3.3% (  -6% -   13%) 0.031
               HighTermTitleSort       63.82      (4.8%)       66.00      (5.8%)    3.4% (  -6% -   14%) 0.043
                 LowSloppyPhrase       20.55      (3.2%)       21.26      (3.7%)    3.4% (  -3% -   10%) 0.002
                     LowSpanNear       61.77      (5.5%)       63.92      (5.2%)    3.5% (  -6% -   15%) 0.041
                          IntNRQ       14.57      (8.5%)       15.09     (11.2%)    3.6% ( -14% -   25%) 0.257
                         Respell       23.67      (2.7%)       24.57      (4.2%)    3.8% (  -3% -   11%) 0.001
                    OrHighNotMed      152.63      (6.5%)      158.68      (5.5%)    4.0% (  -7% -   17%) 0.037
                      AndHighLow      283.84      (4.6%)      295.74      (5.9%)    4.2% (  -6% -   15%) 0.012
                          Fuzzy2       32.91      (3.9%)       34.37      (5.7%)    4.4% (  -4% -   14%) 0.004
                        PKLookup       82.30      (4.9%)       86.30      (6.2%)    4.9% (  -5% -   16%) 0.006
                       LowPhrase       28.29      (6.0%)       29.69      (5.8%)    5.0% (  -6% -   17%) 0.008
                    OrHighNotLow      194.24      (7.5%)      204.23      (6.3%)    5.1% (  -8% -   20%) 0.019
                    OrNotHighMed      127.26      (7.7%)      133.91      (7.5%)    5.2% (  -9% -   22%) 0.030
                         LowTerm      207.34      (8.6%)      218.60     (11.3%)    5.4% ( -13% -   27%) 0.088
                          Fuzzy1       43.40      (2.4%)       46.00      (4.5%)    6.0% (   0% -   13%) 0.000
                        Wildcard       37.62      (3.0%)       39.87      (3.7%)    6.0% (   0% -   13%) 0.000
                         Prefix3       34.35      (6.3%)       37.91      (6.5%)   10.4% (  -2% -   24%) 0.000

@jpountz (Contributor) left a comment

Thanks for running these macro benchmarks, it's good to see that this change is translating into noticeable speedups. I see that fuzzy, wildcard and prefix queries get speedups with very low p values, which is what I'd have expected given that these queries need to visit many low-doc-frequency terms. Overall, the bigger size on disk looks worth the query speedup to me. We made a similar trade-off when switching from PFOR back to FOR for doc blocks.

Thanks for adding a unit test too. The change looks good to me, I just left a minor suggestion. Can you add a CHANGES entry?

public void testEncodeDecode() throws IOException {
long[] values = new long[ForUtil.BLOCK_SIZE];
long[] restored = new long[ForUtil.BLOCK_SIZE];
final int iterations = RandomNumbers.randomIntBetween(random(), 50, 1000);
Contributor:

We have an atLeast helper for this kind of thing; it automatically gives more iterations to nightly runs.

Suggested change
final int iterations = RandomNumbers.randomIntBetween(random(), 50, 1000);
final int iterations = atLeast(100);

Contributor Author:

Wow, it's very nice, thank you @jpountz

bytes[byteOffset++] = (byte) (v & 0xFF);
v >>>= 8;
}
bytes[byteOffset++] = (byte) v;
Contributor:

I wonder if we can simplify the above loop by using a do...while loop instead of a regular while loop, something like

do {
     bytes[byteOffset++] = (byte) (v & 0xFF);
     v >>>= 8;
} while (v != 0);

Then we don't need the extra write to bytes after the loop?

Contributor Author:

It's simpler, thank you :)

@jpountz jpountz merged commit d0f63ec into apache:main Nov 20, 2023
4 checks passed
@jpountz jpountz added this to the 9.9.0 milestone Nov 20, 2023
jpountz added a commit that referenced this pull request Nov 20, 2023
Co-authored-by: Adrien Grand <jpountz@gmail.com>
slow-J pushed a commit to slow-J/lucene that referenced this pull request Nov 20, 2023
Co-authored-by: Adrien Grand <jpountz@gmail.com>
@jpountz (Contributor) commented Nov 22, 2023

There seems to be a speedup on prefix queries in nightly benchmarks, I'll add an annotation.

@jpountz (Contributor) commented Nov 22, 2023

Also the size increase is hardly noticeable.

@jpountz (Contributor) commented Nov 22, 2023

For reference, I computed the most frequent flag values on wikibigall, which are the values that might be worth optimizing for:

  • 0x55 (four 2-byte ints): 29.6%
  • 0xaa (four 3-byte ints): 6.5%
  • every other combination is below 3.5%

Now broken down by number of bytes per int:

  • 1 byte: 13.8%
  • 2 bytes: 60.1%
  • 3 bytes: 26.1%
  • 4 bytes: 0%

@easyice (Contributor Author) commented Nov 22, 2023

It's very important as a reference! Thanks a lot!

@jpountz (Contributor) commented Nov 22, 2023

I opened a PR to feed some of this data into the micro benchmark to make it more realistic: #12833.

@wjp719 (Contributor) commented Feb 22, 2024

@easyice Hi, I suspect that data encoded with group-varint differs from the old encoding. Is this compatible with old index format data?

@easyice (Contributor Author) commented Feb 22, 2024

> @easyice Hi, I suspect that data encoded with group-varint differs from the old encoding. Is this compatible with old index format data?

This change was released in Lucene 9.9.0, which uses a new postings format, Lucene99PostingsFormat. If you read an old index, Lucene will use the matching postings format class to decode it, so it is compatible with old indexes.

Have you hit specific errors? Could you share a more detailed message? Thanks!

See https://lucene.apache.org/core/9_9_0/changes/Changes.html#v9.9.0.optimizations

@wjp719 (Contributor) commented Feb 22, 2024

> Have you hit specific errors? Could you share a more detailed message? Thanks!

I have no errors; I didn't realize the new format was used. Thanks.
