New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance optimizations: Merged all LittleEndianDataInputStream functionality into ByteBufferInputStream #953
Conversation
…nputStream Deprecated LittleEndianDataInputStream Optmized performance of: - ByteBitPackingValuesReader - PlainValuesReader - RunLengthBitPackingHybridDecoder - Optimized performance of readInt, readLong, and related methods
I forgot to add this to a comment in the code: |
...mn/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPackingValuesReader.java
Outdated
Show resolved
Hide resolved
Undid whitespace changes
Fixed whitespace change
Undid whitespace changes
Undid whitespace changes
Undid whitespace changes
Undid whitespace change
Undid whitespace changes
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java
Outdated
Show resolved
Hide resolved
@Override | ||
public void skipFully(long n) throws IOException { | ||
try { | ||
buffer.position(buffer.position() + (int)n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is just trying to avoid the checks that are being done in skip
. I don't think that's a good idea. This should delegate to skip instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this specifically because it's performance-critical. I did a bunch of profiling, and skips are among the operations that have to have minimal overhead. Delegating to skip() would introduce a bunch of checks that the JIT isn't going to be smart enough to remove.
@@ -174,4 +248,63 @@ public boolean markSupported() { | |||
public int available() { | |||
return buffer.remaining(); | |||
} | |||
|
|||
@Override | |||
public byte readByte() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes from here on out look like what you're really trying to do because we want to read directly from the stream. Can you remove all the other changes that aren't needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're referring to. All of the methods beyond this point are absolutely necessary. We need to be able to read ints and longs and such from the bytebuffer, and this is the only way to get them.
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think most of these changes are good, but there are a few things that should be done before committing this:
- Revert any unnecessary changes, like new constructors and style changes that are non-functional (e.g. using
x++
instead ofx += 1
) - Separate the
ByteBufferInputStream
additions into a dedicated PR with tests - Make real changes to
PlainValuesReader
rather than keeping both input streams and changing the reference toin2
- Update for project style
Made ByteBuffer exceptions mode specific
Changed 255 to 0xFF
Reverted whitespace change
Changed ++ to += 1
Added blank line after control flow blocks (except in a few places where it would add a non-functional change to code I didn't edit).
Thanks for reviewing my PR. I made all the cosmetic changes you asked for. I'm not sure why you're asking to separate the ByteBufferInputStream additions into its own PR, since the PR was all about improving performance by moving functionality from LittleEndianDataInputStream into ByteBufferInputStream. The changes to PlainValuesReader rely on all of those changes. The only reason I kept the reference to LittleEndianDataInputStream in PlainValuesReader is because otherwise the build fails with a compatibility break against 1.12.0. I'm going to go ahead with the change in the hopes that that doesn't cause a check failure. |
Got rid of reference to LittleEndianByteBufferInputStream
Also, you mentioned tests. Since I'm not making any functional changes, I'm not sure what to test for. The new code should behave exactly as the old version, just a bit faster. |
Modified skip() and skipFully() to handle negative and out-of-range arguments. Made EOF exceptions preserve any error message.
This PR is all performance optimization. In benchmarking with Trino, we find query performance to improve from 5% to 15%, depending on the query, and that includes all the I/O time from S3.
The main modification is to merge all of LittleEndianDataInputStream functionality into ByteBufferInputStream, which yields the following benefits:
This also includes and enables performance optimizations to:
Context:
I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I have more I would like to contribute.