-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-458: [C++][Parquet] Add support for reading/writing DataPageV2 format #6481
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot speak to the correctness of this, but some comments.
cpp/src/parquet/column_reader.cc
Outdated
@@ -143,6 +181,9 @@ class SerializedPageReader : public PageReader { | |||
|
|||
void InitDecryption(); | |||
|
|||
void DecompressPage(int compressed_len, int uncompressed_len, | |||
std::shared_ptr<Buffer>& page_buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This signature would be better:
std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const std::shared_ptr<Buffer>& page_buffer);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip, I went for the following signature:
std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const uint8_t* page_buffer);
cpp/src/parquet/column_reader.cc
Outdated
int32_t levels_length = | ||
header.repetition_levels_byte_length + header.definition_levels_byte_length; | ||
uint8_t* decompressed = decompression_buffer_->mutable_data(); | ||
memcpy(decompressed, page_buffer->data(), levels_length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the memcpy
could be avoided at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be avoided but would require some invasive refactoring to separate extracting the levels and the values from the pages. I might be wrong but my understanding is that typically the levels are much smaller than the values so I wouldn't expect this memcpy
to cost too much.
int64_t encoded_size = level_encoder_.len() + sizeof(int32_t); | ||
|
||
if (include_length_prefix) { | ||
reinterpret_cast<int32_t*>(dest_buffer->mutable_data())[0] = level_encoder_.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure big-endian machines love this ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting point...the Parquet RLE spec explicitly says this length should be stored as little endian:
length := length of the in bytes stored as 4 bytes little endian (unsigned int32)
I wonder if this is one reason why this length prefix is not part of the DataPageV2
format?
@hatemhelal It looks like you need to resolve the conflicts with git master. Hopefully |
The appveyor build failure is due to ARROW-8132 |
I will review this when I can, have been pretty backlogged / distracted last couple weeks |
Looks good from my side but I like to have @wesm to have a final look as I have not touched the code base for a longer time. |
Yes I'll review as soon as I can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good to me, thanks for working on this. I left a few nitpick comments and asked a question about the interpretation of the compressed_len/uncompressed_len
(I looked at parquet.thrift but wanted to double check)
int32_t levels_length = | ||
header.repetition_levels_byte_length + header.definition_levels_byte_length; | ||
uint8_t* decompressed = decompression_buffer_->mutable_data(); | ||
memcpy(decompressed, page_buffer, levels_length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know for sure that uncompressed_len
includes the size of the uncompressed encoded levels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my interpretation of this comment in parquet.thrift :
/** Uncompressed page size in bytes (not including this header) **/
2: required i32 uncompressed_page_size
I think the parquet-mr project seems to also follow this interpretation based on my superficial read of this source.
Happy to ask about this on the dev@parquet list to be sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like we're in the clear here
cpp/src/parquet/column_writer.cc
Outdated
// TODO(PARQUET-594) crc checksum | ||
|
||
if (page.type() == PageType::DATA_PAGE) { | ||
const DataPageV1& v1_page = dynamic_cast<const DataPageV1&>(page); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use checked_cast
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good tip, I wasn't aware of that utility.
The GLib failure here doesn't look like it could be related to this patch
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
thanks @hatemhelal |
This patch adds support for reading and writing the Parquet DataPageV2 format. Currently, this change will make all V2 Parquet files use the DataPageV2 format. I'm still working on some basic unittests and would welcome any feedback on this in the meantime.