New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1087: [C++] Handle unloaded seek positions when seeking in an uncompressed chunk #1008
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @stiga-huang .
cc @wgtmac and @williamhyun |
According to @stiga-huang 's investigation result, I added a milestone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, some comments about comments
c++/src/Compression.cc
Outdated
@@ -533,24 +545,37 @@ DIAGNOSTIC_PUSH | |||
} | |||
|
|||
/** There are three possible scenarios when seeking a position: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now it is four possible scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
c++/src/Compression.cc
Outdated
outputBuffer = outputBufferStart + posInChunk; | ||
return; | ||
} | ||
// Case 2: The position is outside the decompressed buffer. Skip bytes to seek. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing as it can only happen in the uncompressed case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the method comments about this.
c++/src/Compression.cc
Outdated
* read yet. | ||
* 3. It is already read from the input stream, but has not been decompressed yet, ie. | ||
* it's not in the output stream. | ||
* 4. It is not read yet from the input stream. | ||
*/ | ||
void DecompressionStream::seek(PositionProvider& position) { | ||
size_t seekedPosition = position.current(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really related to the change, but I think that it would be clearer if this would be renamed, e.g. to startOfSeekedChunk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this confused me too. I renamed it to seekedHeaderPosition
so it's similar to headerPosition
.
c++/src/Compression.cc
Outdated
* 3. It is not read yet from the inputstream. | ||
* 1. The chunk of the seeked position is already read and decompressed into the output | ||
* stream, ie. chunk header is read and chunk contents are in the output stream. | ||
* 2. The chunk of the seeked position is partially read. This only happens for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause is in line 495 where DecompressionStream::Next only reads availableSize but not the full uncompressed chunk. So another fix is to enforce read full uncompressed chunk there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also thought about that but finally chose this fix for simplicity. Loading the whole uncompressed chunk at once might also improve the performance, we can do it in follow-up tasks. :)
Gentle ping, @stiga-huang . |
Sorry to be late here. Thank you guys for the feekback! It also took me some time to understand the existing codes. So I updated some more comments to improve the readability. Please let me know if we need to update more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @stiga-huang !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @stiga-huang , @wgtmac , and csringhofer .
Merged to main/1.7.
…ompressed chunk (#1008) ### What changes were proposed in this pull request? This PR fixes an unhandled case when seeking in an uncompressed chunk. ### Why are the changes needed? The bug causes position overflow and fails the reader when encountered the unhandled case. Some background: * Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed. * Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk. * Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet. The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it. ### How was this patch tested? Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved. (cherry picked from commit d175525) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ompressed chunk (apache#1008) ### What changes were proposed in this pull request? This PR fixes an unhandled case when seeking in an uncompressed chunk. ### Why are the changes needed? The bug causes position overflow and fails the reader when encountered the unhandled case. Some background: * Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed. * Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk. * Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet. The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it. ### How was this patch tested? Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved. (cherry picked from commit d175525) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ompressed chunk (apache#1008) ### What changes were proposed in this pull request? This PR fixes an unhandled case when seeking in an uncompressed chunk. ### Why are the changes needed? The bug causes position overflow and fails the reader when encountered the unhandled case. Some background: * Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed. * Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk. * Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet. The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it. ### How was this patch tested? Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.
What changes were proposed in this pull request?
This PR fixes an unhandled case when seeking in an uncompressed chunk.
Why are the changes needed?
The bug causes position overflow and fails the reader when encountered the unhandled case. Some background:
The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it.
How was this patch tested?
Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.