Use group-varint encoding for the tail of postings #12782
Conversation
Thanks for looking into it! I left some questions.
```java
  readVInts(docs, 0, limit);
  return;
}
int groupValues = limit / 4 * 4;
```
We can do this with a single instruction I believe?
```diff
- int groupValues = limit / 4 * 4;
+ int groupValues = limit & 0xFFFFFFFC;
```
nice idea :)
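To make the suggestion concrete, here is a small sketch (not code from the patch; the helper names are hypothetical) showing that the mask form computes the same rounding as the divide-and-multiply form for non-negative counts. Note that the two forms differ for negative inputs, which cannot happen for `limit` here.

```java
// Sketch: rounding a non-negative count down to a multiple of 4.
// All three forms are equivalent when limit >= 0.
class RoundDownSketch {
  static int viaDivide(int limit) {
    return limit / 4 * 4; // integer division truncates, then multiply back
  }

  static int viaMask(int limit) {
    return limit & 0xFFFFFFFC; // clear the two low bits with a single AND
  }

  static int viaComplement(int limit) {
    return limit & ~3; // same mask, arguably more readable
  }
}
```

`limit & ~3` is yet another spelling of the same single-instruction AND.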
```java
for (int i = 0; i < groupValues; i++) {
  cur = i % 4;
  if (cur == 0) {
    groupLengths = flagToLengths[Byte.toUnsignedInt(bytes[offset++])];
```
I wonder if the `flagToLengths` table approach is best on Java because of bounds checks vs. recomputing the length from the flag using shifts/masks.
It also looks scarily big, around 4KB (256 flags × 4 lengths × 4 bytes)? It could have a negative impact on the CPU cache that may not show up in a JMH benchmark.
```java
  if (cur == 0) {
    groupLengths = flagToLengths[Byte.toUnsignedInt(bytes[offset++])];
  }
  docs[i] = (int) BitUtil.VH_LE_INT.get(bytes, offset) & MASKS[groupLengths[cur]];
```
And likewise for masks, I wonder if the table lookup is actually better than recomputing the mask.
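As a sketch of the alternative (hypothetical helper names, not the patch's code): the mask for a value of 1–4 bytes can be recomputed from the byte length with one shift and one subtraction, avoiding the table load and its bounds check.

```java
// Sketch: MASKS table lookup vs. recomputing the mask from the byte length.
class MaskSketch {
  static final int[] MASKS = {0, 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

  static int viaTable(int raw, int numBytes) {
    return raw & MASKS[numBytes]; // one load, plus an array bounds check
  }

  static int viaCompute(int raw, int numBytes) {
    // (1L << (numBytes * 8)) - 1 yields 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF
    return raw & (int) ((1L << (numBytes << 3)) - 1);
  }
}
```

The `long` intermediate avoids the int-shift wraparound at `numBytes == 4` (a 32-bit shift of an `int` would be a no-op in Java).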
```java
/**
 * Encode integers using group-varint. It uses VInt to encode tail values that
 * are not enough for a group.
 */
class GroupVintWriter {
```
Use an uppercase I for consistency with `DataInput#readVInt`?
```diff
- class GroupVintWriter {
+ class GroupVIntWriter {
```
+1, sorry for my mistake ;)
@jpountz You are right, recomputing the length is faster than the table lookup. Here is the benchmark when reading the ints, where each value takes 4 bytes:
But I found that group-varint encoding is faster than vint only when the values take up 4 bytes. Here is the benchmark for reading 64 int values.
Group-varint decoding takes constant time, while vint decoding is faster when the values take up fewer bytes, so the actual payoff depends on factors like the distribution of value sizes.
Could you check in your benchmark?
At least in theory, group-varint could be made faster than vints even with single-byte integers, because a single check on the flag byte can replace the per-byte continuation checks.

Your change keeps mixing doc IDs and frequencies. I wonder if we should write them in separate varint blocks?
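For readers unfamiliar with the format, here is a sketch of the classic group-varint flag-byte layout (this is the textbook scheme, not necessarily byte-for-byte what the patch does): one flag byte holds four 2-bit fields, each storing `length - 1` for one of the four values, so the lengths can be recovered with shifts and masks.

```java
// Sketch: classic group-varint flag byte. Four values per group; each 2-bit
// field in the flag stores (byteLength - 1), so lengths range over 1..4.
class FlagSketch {
  // Recover the byte length of value i (0..3) from the flag byte.
  static int lengthOf(int flag, int i) {
    return ((flag >>> (i << 1)) & 0x3) + 1;
  }

  // Build the flag byte from four byte lengths in 1..4.
  static int flagFor(int len0, int len1, int len2, int len3) {
    return (len0 - 1) | ((len1 - 1) << 2) | ((len2 - 1) << 4) | ((len3 - 1) << 6);
  }
}
```

The "single check" idea falls out of this layout: if the whole flag byte is zero, all four values are single-byte and can be copied without any per-value branching.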
Thank you @jpountz, I pushed the benchmark code and added a new comparison between
+1
It's a good idea; it can use less memory when decoding.
Thanks @easyice. I took some time to look into the benchmark and improve a few things, hopefully you don't mind:

- Bulk decoding rather than one by one.
- Write docs and freqs separately instead of interleaved.
- Write freqs as regular vints, as the benchmark suggests single-byte vints are fast and freqs are often small.
- Remove `len`/`numGroup` to save space.
- Read directly from the directory instead of using an intermediate buffer. This helps save memory copies.

Here is the output of the benchmark on my machine now:
The benchmark used to read directly from the in-memory byte[] by calling
```java
  buffer[bufferOffset++] = v;
}

public void reset(int numValues) {
```
Let's remove the `numValues` parameter since it looks like we can't actually rely on it?
+1. It's simpler now.
```java
byteBufferVIntIn.seek(0);
for (int i = 0; i < size; i++) {
  values[i] = byteBufferVIntIn.readVInt();
}
```
Do we need to pass the `values` array to a `Blackhole` object to make sure that the JVM doesn't optimize away some of the decoding logic?
Thank you very much for your guidance! ...
```java
}

/** only readValues or nextInt can be called after reset */
public void readValues(long[] docs, int limit) throws IOException {
```
I'm not sure we need both a reset() and readValues(); maybe readValues() could take a `DataInput` directly?
+1
```java
    return;
  }
  encodeValues(buffer, bufferOffset);
}
```
What about making the API look more like the reader API, i.e. replace reset/add/flush with a single `writeValues(DataOutput, long[] values, int limit)` API?
+1.
```java
// encode each group
while ((limit - off) >= 4) {
  // the maximum size of one group is 4 ints + 1 byte flag.
  bytes = ArrayUtil.grow(bytes, byteOffset + 17);
```
Could we write the data to `out` here instead of growing the buffer?
+1, but a 17-byte array for a single group is still required, because `DataOutput` cannot write an integer using a specified number of bytes. Is that okay?
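To show why 17 bytes of scratch suffice: a group is at most 4 values × 4 bytes plus one flag byte. Below is a hedged sketch of encoding one group into such a scratch buffer (hypothetical helper, not the patch's actual code); the caller would then copy `scratch[0..returned)` to the `DataOutput` in one call.

```java
// Sketch: encode one group of 4 ints into a scratch buffer.
// Layout assumed: flag byte first, then each value little-endian using only
// as many bytes as needed; the flag packs (length - 1) in 2 bits per value.
class GroupEncodeSketch {
  /** Encodes 4 ints starting at values[off] into scratch (length >= 17);
   *  returns the number of bytes written, at most 1 + 4 * 4 = 17. */
  static int encodeGroup(int[] values, int off, byte[] scratch) {
    int pos = 1; // reserve scratch[0] for the flag byte
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[off + i];
      int len = 1;
      scratch[pos++] = (byte) v; // always emit at least one byte
      while ((v >>>= 8) != 0) {  // emit remaining bytes, LSB first
        scratch[pos++] = (byte) v;
        len++;
      }
      flag |= (len - 1) << (i << 1); // 2 bits of (length - 1) per value
    }
    scratch[0] = (byte) flag;
    return pos;
  }
}
```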
Wow, what an incredible speedup! I would not have expected bulk decoding with direct reads to be so much faster than reading from an array. Thank you for your time @jpountz, and I'm sorry I didn't try this approach.
@jpountz Thanks a lot for the great suggestions. :)
I ran some rounds of wikimediumall (sometimes there is noise); it looks like a bit of a speedup:
Round 1
Round 2
Round 3
Thanks for running these macro benchmarks, it's good to see that this change is translating into noticeable speedups. I see that fuzzy, wildcard and prefix queries get speedups with very low p values, which is what I'd have expected given that these queries need to visit many low-doc-frequency terms. Overall, the bigger size on disk looks worth the query speedup to me. We made a similar trade-off when switching from PFOR back to FOR for doc blocks.
Thanks for adding a unit test too. The change looks good to me, I just left a minor suggestion. Can you add a CHANGES entry?
```java
public void testEncodeDecode() throws IOException {
  long[] values = new long[ForUtil.BLOCK_SIZE];
  long[] restored = new long[ForUtil.BLOCK_SIZE];
  final int iterations = RandomNumbers.randomIntBetween(random(), 50, 1000);
```
We have an `atLeast` helper for this kind of thing; it automatically gives more iterations to nightly runs.
```diff
- final int iterations = RandomNumbers.randomIntBetween(random(), 50, 1000);
+ final int iterations = atLeast(100);
```
Wow, that's very nice. Thank you @jpountz!
```java
  bytes[byteOffset++] = (byte) (v & 0xFF);
  v >>>= 8;
}
bytes[byteOffset++] = (byte) v;
```
I wonder if we can simplify the above loop by using a do...while loop instead of a regular while loop, something like:

```java
do {
  bytes[byteOffset++] = (byte) (v & 0xFF);
  v >>>= 8;
} while (v != 0);
```

Then we don't need the extra write to `bytes` after the loop?
It's simpler, thank you :)
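As a standalone illustration of the suggestion (hypothetical helper, not the committed code): the do...while guarantees at least one byte is emitted, so the trailing write after the loop disappears, and a zero value still encodes to one byte.

```java
// Sketch: emit a value's bytes LSB-first until no bits remain.
// The do...while emits at least one byte, so v == 0 yields one byte.
class ByteLoopSketch {
  static int writeDoWhile(long v, byte[] dest) {
    int n = 0;
    do {
      dest[n++] = (byte) (v & 0xFF);
      v >>>= 8;
    } while (v != 0);
    return n; // number of bytes written
  }
}
```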
Co-authored-by: Adrien Grand <jpountz@gmail.com>
There seems to be a speedup on prefix queries in nightly benchmarks; I'll add an annotation.
Also, the size increase is hardly noticeable.
For reference, I computed the most frequent
Now broken down by number of bytes per int:
It's very important as a reference, thanks a lot!
I opened a PR to feed some of this data into the micro benchmark to make it more realistic: #12833.
@easyice Hi, I suspect that the data produced by group-varint encoding is different from the old encoding, so is this compatible with the old index format?
This change was released in Lucene 9.9.0, which uses a new version of the postings format. Have you got specific errors? Could you give some more details? Thanks! See https://lucene.apache.org/core/9_9_0/changes/Changes.html#v9.9.0.optimizations
I have no errors; I didn't realize the new format was used. Thanks.
As discussed in issue #12717, the read performance of group-varint is 14-30% faster than vint, at the cost of some extra size on disk. 16-248 is the number of ints that will be read. Feel free to close the PR if the performance improvement is not enough :)