Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1087: [C++] Handle unloaded seek positions when seeking in an uncompressed chunk #1008

Merged
merged 4 commits into from Jan 18, 2022

Conversation

stiga-huang
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes an unhandled case when seeking in an uncompressed chunk.

Why are the changes needed?

The bug causes position overflow and fails the reader when encountered the unhandled case. Some background:

  • Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed.
  • Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk.
  • Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet.

The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it.

How was this patch tested?

Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.

@github-actions github-actions bot added the CPP label Jan 11, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @stiga-huang .

@dongjoon-hyun
Copy link
Member

cc @wgtmac and @williamhyun

@dongjoon-hyun dongjoon-hyun added this to the 1.7.3 milestone Jan 12, 2022
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 12, 2022

According to @stiga-huang 's investigation result, I added a milestone 1.7.3 label on this PR.

Copy link

@csringhofer csringhofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, some comments about comments

@@ -533,24 +545,37 @@ DIAGNOSTIC_PUSH
}

/** There are three possible scenarios when seeking a position:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now it is four possible scenarios

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

c++/src/Compression.cc Show resolved Hide resolved
outputBuffer = outputBufferStart + posInChunk;
return;
}
// Case 2: The position is outside the decompressed buffer. Skip bytes to seek.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing as it can only happen in the uncompressed case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the method comments about this.

* read yet.
* 3. It is already read from the input stream, but has not been decompressed yet, ie.
* it's not in the output stream.
* 4. It is not read yet from the input stream.
*/
void DecompressionStream::seek(PositionProvider& position) {
size_t seekedPosition = position.current();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really related to the change, but I think that it would be clearer if this would be renamed, e.g. to startOfSeekedChunk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this confused me too. I renamed it to seekedHeaderPosition so it's similar to headerPosition.

* 3. It is not read yet from the inputstream.
* 1. The chunk of the seeked position is already read and decompressed into the output
* stream, ie. chunk header is read and chunk contents are in the output stream.
* 2. The chunk of the seeked position is partially read. This only happens for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause is in line 495 where DecompressionStream::Next only reads availableSize but not the full uncompressed chunk. So another fix is to enforce read full uncompressed chunk there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also thought about that but finally chose this fix for simplicity. Loading the whole uncompressed chunk at once might also improve the performance, we can do it in follow-up tasks. :)

c++/src/Compression.cc Outdated Show resolved Hide resolved
@dongjoon-hyun
Copy link
Member

Gentle ping, @stiga-huang .

@stiga-huang
Copy link
Contributor Author

Sorry to be late here. Thank you guys for the feekback! It also took me some time to understand the existing codes. So I updated some more comments to improve the readability. Please let me know if we need to update more.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @stiga-huang !

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @stiga-huang , @wgtmac , and csringhofer .
Merged to main/1.7.

@dongjoon-hyun dongjoon-hyun merged commit d175525 into apache:main Jan 18, 2022
dongjoon-hyun pushed a commit that referenced this pull request Jan 18, 2022
…ompressed chunk (#1008)

### What changes were proposed in this pull request?

This PR fixes an unhandled case when seeking in an uncompressed chunk.

### Why are the changes needed?

The bug causes position overflow and fails the reader when encountered the unhandled case. Some background:
* Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed.
* Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk.
* Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet.

The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it.

### How was this patch tested?

Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.

(cherry picked from commit d175525)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
stiga-huang added a commit to stiga-huang/orc that referenced this pull request May 4, 2022
…ompressed chunk (apache#1008)

### What changes were proposed in this pull request?

This PR fixes an unhandled case when seeking in an uncompressed chunk.

### Why are the changes needed?

The bug causes position overflow and fails the reader when encountered the unhandled case. Some background:
* Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed.
* Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk.
* Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet.

The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it.

### How was this patch tested?

Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.

(cherry picked from commit d175525)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…ompressed chunk (apache#1008)

### What changes were proposed in this pull request?

This PR fixes an unhandled case when seeking in an uncompressed chunk.

### Why are the changes needed?

The bug causes position overflow and fails the reader when encountered the unhandled case. Some background:
* Compressed streams are compressed in chunks. If the compressed size of a chunk is larger than the original size, the original (uncompressed) chunk will be kept. The chunk header records the chunk length and whether it's compressed.
* Seek position in a compressed stream is encoded into 2 numbers: pos in the input stream and pos in the chunk. The first number locates the chunk header. The second number locates the position in the decompressed chunk.
* Compressed chunks are decompressed in a whole so the whole chunk is in the output buffer. Uncompressed chunks don't need decompression so the input buffer is used directly. However, the chunk could be read in pieces depending on the block size of the input stream. So the seek position might not be loaded yet. 

The unhandled case is: the seek position is in the current chunk but posInChunk is not loaded yet. We should skip the remaining bytes to seek to it.

### How was this patch tested?

Added a unit test in TestDecompression.cc. Verified the issue described in the JIRA is resolved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants