HBASE-28256 Enhance ByteBufferUtils.readVLong to read 8 bytes at a time #5576
Conversation
🎊 +1 overall
This message was automatically generated.
In general, if we choose to use VLong, it usually means the value will be small, for example only 1 or 2 bytes, so I wonder whether your JMH benchmark covers the most common scenario? For a random long value encoded as VLong, I think reading 8 bytes at once will be faster, but what if the values are often only 1 or 2 bytes? Thanks.
The benchmark does cover a variety of vLong sizes. The microbenchmarks above have breakdowns for the following vLongs: 9, 512, 2146483640, and 1700104028981, which are encoded as 1, 3, 5, and 7 bytes respectively (so a wide range). In practice, ByteBufferUtils.readVLong is only used in the HBase codebase to decode the memstoreTs for the CopyKey, Diff, Fast Diff, Prefix, and RIV1 DBE encodings, so performance when decoding a 7-byte vLong (an actual epoch-millis timestamp is 6 bytes, but there is a 1-byte overhead for the vLong encoding) is the most important gauge of how this will affect performance. I agree that generally when vLong is used, the caller believes that a fair amount of the data will be smaller than 7 or 8 bytes, so the vLong encoding saves space. However, in practice the ByteBufferUtils.readVLong method isn't used that way.
From what the benchmarks show, there is absolutely no regression in the 1-byte case, as the code is identical up to that point. When a 2-byte vLong is read, the benchmarks show a small performance penalty. It seems like the cost of reading an 8-byte word on a modern machine is about the same as reading a single byte, so the performance hit isn't drastic (at least as benchmarked on my machine). Again, because of how this particular method is used to decode vLongs in the HBase codebase, we should focus on the performance of the 1700104028981 vLong decoding case, as it most closely represents decoding a timestamp. In any case, if we were to start using vLongs more widely, the performance of the 2-byte case for the optimized method isn't really much worse than the current behavior.
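To make the discussion above concrete, here is a minimal sketch of the two decode strategies being compared. The class and method names (`WideVLongRead`, `readVLongSlow`, `readVLongWide`) are hypothetical, not HBase's actual code; the format mirrors Hadoop's `WritableUtils` vLong encoding, and the real patch also handles HBase's `ByteBuff` in addition to `java.nio.ByteBuffer`.

```java
import java.nio.ByteBuffer;

public class WideVLongRead {
  // Byte-at-a-time decode of a Hadoop-format vLong (the pre-patch behavior):
  // the first byte is either the value itself (for -112..127) or a
  // sign+length marker followed by the magnitude in big-endian order.
  static long readVLongSlow(ByteBuffer in) {
    byte firstByte = in.get();
    if (firstByte >= -112) return firstByte;
    int len = (firstByte < -120) ? (-119 - firstByte) : (-111 - firstByte);
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return firstByte < -120 ? ~i : i;
  }

  // The 8-bytes-at-a-time idea: when at least 8 bytes remain after the
  // size byte, read one long and shift away the unused low-order bytes,
  // replacing up to six single-byte reads with one wide read.
  static long readVLongWide(ByteBuffer in) {
    byte firstByte = in.get();
    if (firstByte >= -112) return firstByte;
    int len = (firstByte < -120) ? (-119 - firstByte) : (-111 - firstByte);
    long i;
    if (in.remaining() >= 8) {
      long raw = in.getLong(in.position());   // one big-endian 8-byte read
      in.position(in.position() + (len - 1)); // consume only len - 1 bytes
      i = raw >>> (8 * (9 - len));            // keep the top len - 1 bytes
    } else {
      i = 0; // near the end of the buffer, fall back to single-byte reads
      for (int idx = 0; idx < len - 1; idx++) {
        i = (i << 8) | (in.get() & 0xFF);
      }
    }
    return firstByte < -120 ? ~i : i;
  }

  public static void main(String[] args) {
    // 1700104028981 needs a 7-byte vLong; pad the buffer so the wide
    // path is taken, as it typically is for a timestamp inside a block.
    long v = 1700104028981L;
    ByteBuffer buf = ByteBuffer.allocate(16);
    // Minimal writer for this demo (positive, multi-byte values only).
    long tmp = v;
    int len = -112;
    while (tmp != 0) { tmp >>>= 8; len--; }
    buf.put((byte) len);
    for (int idx = -(len + 112); idx != 0; idx--) {
      buf.put((byte) (v >>> ((idx - 1) * 8)));
    }
    while (buf.position() < 16) buf.put((byte) 0); // padding
    buf.flip();
    System.out.println(readVLongSlow(buf.duplicate())); // prints 1700104028981
    System.out.println(readVLongWide(buf.duplicate())); // prints 1700104028981
  }
}
```

Both paths agree on every input; the wide path just trades several bounds-checked single-byte reads for one `getLong` plus a shift when the buffer has enough slack.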
hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
So the vint column in your table is the vlong value you read/write? Where 9 would be 1 byte, and 512 would be 2 bytes?
Yes! The vint column in the table is the vlong value for the read/write (sorry for the confusing name).
Yeah, 9 can be encoded with 1 byte, but 512 requires 3 bytes: since 512 can't fit completely inside a single byte, the first byte is used just to indicate size and sign, and the next two bytes hold 512.
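The size behavior described here can be sketched with a small helper that computes the encoded length of a Hadoop-style vLong. The class and method names below (`VLongSize`, `vLongSize`) are hypothetical; the logic mirrors the semantics of Hadoop's `WritableUtils.getVIntSize`.

```java
public class VLongSize {
  // Number of bytes the Hadoop-style vLong encoding uses for a value:
  // values in [-112, 127] fit in the single size/value byte; everything
  // else costs one size byte plus the big-endian magnitude bytes.
  static int vLongSize(long i) {
    if (i >= -112 && i <= 127) return 1;
    if (i < 0) i = ~i; // negative values store the complement's magnitude
    int dataBits = 64 - Long.numberOfLeadingZeros(i);
    return (dataBits + 7) / 8 + 1; // magnitude bytes + 1 size byte
  }

  public static void main(String[] args) {
    // The four values from the benchmark table above.
    System.out.println(vLongSize(9L));             // 1
    System.out.println(vLongSize(512L));           // 3
    System.out.println(vLongSize(2146483640L));    // 5
    System.out.println(vLongSize(1700104028981L)); // 7
  }
}
```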
Looking at the code again, for a one-byte vlong we still read only one byte first, so there should not be a big difference. But for 512 (2 or 3 bytes?), the microbenchmark results show some performance loss?
It does look like there's a bit of a regression in the 512 case (3 bytes). I'm not as worried about that given that this codepath is only used in the HBase codebase to read memstore timestamps. I can add a comment to the method signature to warn to use
It seems like some of the concern or confusion here is related to the generic name readVLong. If that method were used for a good vlong use case, the performance at small byte sizes would matter. As it stands, it's only used in one bad use case: the memstore ts. It's questionable whether a timestamp should be a vlong in the first place. Since this is only used for a timestamp use case, maybe we can move on from the small-byte concerns. Would it make people feel better if we renamed the method to (Edit: it may not be fair to call it a bad use case. What I mean is that if we care about saving bytes, it feels like a timestamp could have a special encoding given the range of possible values, rather than vlong, which is designed for all values.)
I think renaming it to readVLongTimestamp is good.
That sounds good, I'll go down that path. @Apache9 & @bbeaudreault, I have a question before proceeding with those changes: in branch-2, ByteBufferUtils is IA.Public, while in branch-3/master it's IA.Private (see HBASE-22044). That means we can rename the method without worry in branch-3/master, but to backport to branch-2 (which I see no reason not to do) we would probably need to either keep the existing readVLong methods in branch-2 and make the readVLongTimestamp methods completely separate, or just have readVLong delegate to the new readVLongTimestamp methods. Any preference on how I approach these changes for branch-2 compatibility?
For branch-2 in that case, I'd probably just create a new readVLongTimestamp method and annotate the old readVLong method as Deprecated. If a use-case comes up down the line, we can un-deprecate it.
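The deprecate-and-delegate suggestion for branch-2 might look something like the sketch below. This is hypothetical (the discussion later moves away from the rename), and the decode body is a simplified stand-in mirroring the Hadoop vLong format, not HBase's actual implementation.

```java
import java.nio.ByteBuffer;

public class ByteBufferUtilsSketch {
  // Hypothetical optimized method under the new name. Body shown as a
  // plain byte-at-a-time Hadoop-format vLong decode for brevity.
  public static long readVLongTimestamp(ByteBuffer in) {
    byte firstByte = in.get();
    if (firstByte >= -112) return firstByte;
    int len = (firstByte < -120) ? (-119 - firstByte) : (-111 - firstByte);
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return firstByte < -120 ? ~i : i;
  }

  /**
   * Kept for branch-2 source/binary compatibility since ByteBufferUtils
   * is IA.Public there.
   * @deprecated use {@link #readVLongTimestamp(ByteBuffer)} instead.
   */
  @Deprecated
  public static long readVLong(ByteBuffer in) {
    return readVLongTimestamp(in); // old name simply delegates
  }
}
```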
I've done the
@@ -226,7 +226,7 @@ public static KeyValue nextShallowCopy(final ByteBuffer bb, final boolean includ
     int kvLength = (int) KeyValue.getKeyValueDataStructureSize(keyLength, valueLength, tagsLength);
     KeyValue keyValue = new KeyValue(bb.array(), underlyingArrayOffset, kvLength);
     if (includesMvccVersion) {
-      long mvccVersion = ByteBufferUtils.readVLong(bb);
+      long mvccVersion = ByteBufferUtils.readVLongTimestamp(bb);
FWIW, I don't think the mvcc version is a timestamp. Just taking a look at a random server that just opened a region (where it logs the next mvcc version), the value was 564069676.
Maybe we keep readVLong and only use the new method for the actual memstoreTs?
It seems like mvcc version and memstoreTs are sort of used interchangeably for the same thing. So I guess the above indicates that this patch is not going to only affect "timestamp-like" values. That said I do think that in most production environments the mvcc is probably pretty large (as above)
Yeah it looks like mvcc, memstoreTs, and sequenceId are all pretty much used interchangeably from a quick read. I've dug up a bit of literature on the topic in HBASE-13389. To summarize: MVCC values stick around for a while and the performance hit associated with that is known. They're not timestamps, but they're also not likely to be super small.
This was a known issue, and the performance impact was previously addressed in HBASE-14186. That patch affects only HFiles without a data block encoding applied, and it works by eagerly reading 4 bytes, then 2 bytes, and then falling back to single-byte reads, as opposed to the approach taken in this PR, where we attempt to eagerly read 8 bytes. It's hard to say which approach is better, but I'm thinking we should at least apply similar optimizations to data block encodings.
I'm not sure which approach is better between HBASE-14186 and this one. This approach has the benefit that in the average case (where the remaining size of the buffer contains 8 bytes) it does fewer operations on the underlying ByteBuff after checking the size byte, regardless of the size of the vLong (3 ops). The ByteBuff operations are typically slow in flamegraphs not because the actual methods are expensive to evaluate, but because checkRefCount() is typically quite expensive.
I can run some microbenchmarks comparing the current approach, an HBASE-14186-like approach, and the approach taken in this patch against each other for vLongs of all sizes.
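For reference, the HBASE-14186-style stepped decode mentioned above can be sketched as follows. The class and method names (`SteppedVLongRead`, `readVLongStepped`) are hypothetical; the helper logic mirrors Hadoop's `WritableUtils` vLong format rather than HBase's actual private code.

```java
import java.nio.ByteBuffer;

public class SteppedVLongRead {
  // HBASE-14186-style decode: after the size byte, read an int, then a
  // short, then a single byte, so a 7-byte vLong costs two buffer reads
  // for its 6-byte magnitude instead of six single-byte reads.
  static long readVLongStepped(ByteBuffer in) {
    byte firstByte = in.get();
    if (firstByte >= -112) return firstByte;
    int remaining =
      ((firstByte < -120) ? (-119 - firstByte) : (-111 - firstByte)) - 1;
    long i = 0;
    while (remaining >= 4) {  // 4 bytes at a time (ByteBuffer is big-endian)
      i = (i << 32) | (in.getInt() & 0xFFFFFFFFL);
      remaining -= 4;
    }
    if (remaining >= 2) {     // then 2 bytes
      i = (i << 16) | (in.getShort() & 0xFFFF);
      remaining -= 2;
    }
    if (remaining >= 1) {     // then the final single byte
      i = (i << 8) | (in.get() & 0xFF);
    }
    return firstByte < -120 ? ~i : i;
  }

  public static void main(String[] args) {
    // 512 encoded as {size byte -114, 0x02, 0x00}
    System.out.println(
      readVLongStepped(ByteBuffer.wrap(new byte[] { -114, 2, 0 }))); // prints 512
  }
}
```

Unlike the 8-byte wide read, this variant never touches bytes beyond the encoded value, so it needs no padding check or end-of-buffer fallback path.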
Oh, if the actual problem is the refcnt stuff, Duo and I had an idea for that a while ago: https://issues.apache.org/jira/browse/HBASE-27730. It was never implemented only because I was focused on the snapshot scan problem and we hadn't actually seen evidence of the problem elsewhere.
The benchmarks are about halfway through running, so the results aren't totally settled yet. Just wanted to reiterate what a good find that was, @bbeaudreault, as the results when using ByteBuffs with the NONE recycler look substantially faster so far (in the 50% range).
I haven't had time to implement the idea in HBASE-27730, but it should be a pretty easy change. Would you want to tackle it?
Is the optimization here still worth it on top of the recycler fix?
Just posted the raw benchmarking results in the JIRA in this comment. I haven't had the chance to go over them fully yet, but my preliminary read is that the optimization here and the recycler fix are both worth it. It seems like they each yield around a 50% performance improvement over our current setup.
I can tackle https://issues.apache.org/jira/browse/HBASE-27730 after this issue
I also think, after a quick pass of the benchmarking numbers, that the approach taken in https://issues.apache.org/jira/browse/HBASE-14186 is likely the one we should adopt here.
Based on the most recent benchmarking results (see the JIRA comment), I converted this PR to use an approach adapted from HBASE-14186. This approach has microbenchmarks that outperform our current setup. Additionally, there's no regression in the smaller vLong range like there is with the first approach I used in this patch, and performance in the higher vLong range is similar to or slightly worse than the approach I originally presented. However, after I implement HBASE-27730, this newer approach drastically outperforms my original one. Additionally, since there is no performance regression, I think it's safe to simply replace the body of the existing readVLong method with the new implementation. Thus, this patch can be backported to branch-3 & branch-2.
The approach that I benchmarked from HBASE-14186 (and ultimately updated this PR to use) had a bug, which I fixed in efcb066. Fixing the bug required making the method potentially slower, so I'm going to rerun the benchmarks for the fixed implementation.
Updated benchmarks can be found in this JIRA comment. The HBASE-14186 approach is now much closer in performance to the readVLongTimestamp approach after the bug fix. I think the numbers generally show that the HBASE-14186 approach is at least as good as, if not better than, our existing approach for almost all vLong sizes.
hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
LGTM. I will merge on Tuesday unless I hear Duo has any remaining concerns. This looks like a big win.
… time (#5576) Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
What

This PR updates ByteBufferUtils.readVLong to attempt to read 8 bytes from the underlying buffer (if able to) when reading vLongs, instead of a single byte at a time, to increase performance.

Implementation Notes

Previously, these methods relied on a ByteVisitor interface which abstracted the byte get() method between a Java ByteBuffer and an HBase ByteBuff. To make these updates, I needed access to a larger variety of methods from both ByteBuff and ByteBuffer. Instead of continuing to use the ByteVisitor abstraction, I split the implementations. They share a bit of common code but have distinct differences now. I believe splitting the call site from ByteVisitor into separate ones for ByteBuffer and ByteBuff enables the JIT compiler to generate slightly better code, as the call sites are now each bimorphic.

Benchmarking

I wrote a JMH benchmarking suite (I'll attach the code to the JIRA) that measures the performance of reading various vLongs from both on- and off-heap buffers, with and without padding (padding meaning that the buffer is at least 9 bytes long, so that the vectorized read path can be applied). That is to say, for the unoptimized readVLong path, padding shouldn't incur any performance penalty, but with the optimized method only the padded benchmarks should show a substantial performance improvement (and they do show a pretty nice improvement).
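As a very rough stand-in for that JMH setup (the real suite is attached to the JIRA), the padded-buffer measurement idea can be sketched with a naive timing loop. All names here (`ReadVLongBench`, `meanNanosPerDecode`) are hypothetical; a trustworthy measurement needs JMH's warmup, forking, and blackholes rather than raw System.nanoTime.

```java
import java.nio.ByteBuffer;

public class ReadVLongBench {
  // Byte-at-a-time decode of a Hadoop-format vLong; stands in for the
  // method under test.
  static long readVLong(ByteBuffer in) {
    byte firstByte = in.get();
    if (firstByte >= -112) return firstByte;
    int len = (firstByte < -120) ? (-119 - firstByte) : (-111 - firstByte);
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return firstByte < -120 ? ~i : i;
  }

  // Time `iters` decodes of the same buffer and return mean ns/op.
  static long meanNanosPerDecode(ByteBuffer buf, int iters) {
    long sink = 0;
    long start = System.nanoTime();
    for (int n = 0; n < iters; n++) {
      buf.position(0);
      sink += readVLong(buf);
    }
    long elapsed = System.nanoTime() - start;
    // Crude guard against dead-code elimination (JMH's Blackhole does
    // this properly).
    if (sink == Long.MIN_VALUE) System.out.println(sink);
    return elapsed / iters;
  }

  public static void main(String[] args) {
    // 512 encoded as a 3-byte vLong, padded to 12 bytes so an 8-byte
    // wide read past the size byte would stay in bounds.
    ByteBuffer padded = ByteBuffer.wrap(
      new byte[] { -114, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });
    System.out.println("ns/op: " + meanNanosPerDecode(padded, 1_000_000));
  }
}
```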
In terms of how this microbenchmark translates to better seek performance, I'm seeing a consistent 20% performance improvement in the TestDataBlockEncoders test with this patch vs. without it.

HBASE-28256