PARQUET-458: [C++][Parquet] Add support for reading/writing DataPageV2 format #6481

Closed
wants to merge 9 commits into from

hatemhelal
Contributor

This patch adds support for reading and writing the Parquet DataPageV2 format. Currently, this change makes all V2 Parquet files use the DataPageV2 format. I'm still working on some basic unit tests and would welcome any feedback in the meantime.

Member

@pitrou left a comment

I cannot speak to the correctness of this, but some comments.

cpp/src/parquet/column_reader.cc (outdated, resolved)
@@ -143,6 +181,9 @@ class SerializedPageReader : public PageReader {

void InitDecryption();

void DecompressPage(int compressed_len, int uncompressed_len,
                    std::shared_ptr<Buffer>& page_buffer);
Member

This signature would be better:

std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const std::shared_ptr<Buffer>& page_buffer);

Contributor Author

hatemhelal commented Mar 4, 2020

Thanks for the tip, I went for the following signature:

std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const uint8_t* page_buffer);
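To illustrate the shape of the revised signature (raw pointer in, decompressed buffer returned), here is a minimal sketch. It is not the code from this patch: `ToyBuffer`, `DecompressPageSketch`, and the (count, byte) run-length "codec" are hypothetical stand-ins so the example compiles without Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::Buffer, used only to keep the sketch self-contained.
using ToyBuffer = std::vector<uint8_t>;

// Toy "codec": the input is a sequence of (count, byte) pairs. This is NOT a
// real Parquet codec; it only gives the function something to decompress.
// The result is returned rather than written into an in/out parameter.
std::shared_ptr<ToyBuffer> DecompressPageSketch(int compressed_len, int uncompressed_len,
                                                const uint8_t* page_buffer) {
  assert(compressed_len >= 0 && compressed_len % 2 == 0);
  auto out = std::make_shared<ToyBuffer>();
  out->reserve(static_cast<std::size_t>(uncompressed_len));
  for (int i = 0; i < compressed_len; i += 2) {
    // Append page_buffer[i] copies of the byte page_buffer[i + 1].
    out->insert(out->end(), static_cast<std::size_t>(page_buffer[i]), page_buffer[i + 1]);
  }
  // The caller-supplied size must match what the codec actually produced.
  assert(static_cast<int>(out->size()) == uncompressed_len);
  return out;
}
```

With this shape the caller keeps its input untouched and simply takes ownership of the returned buffer.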

int32_t levels_length =
    header.repetition_levels_byte_length + header.definition_levels_byte_length;
uint8_t* decompressed = decompression_buffer_->mutable_data();
memcpy(decompressed, page_buffer->data(), levels_length);
Member

I wonder if the memcpy could be avoided at some point.

Contributor Author

I think it could be avoided, but that would require some invasive refactoring to separate extracting the levels from extracting the values in each page. I might be wrong, but my understanding is that the levels are typically much smaller than the values, so I wouldn't expect this memcpy to cost much.
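As a rough illustration of the refactoring mentioned above, the sketch below splits a V2 page into non-owning views so that the levels never need to be copied and only the values section would be handed to a codec. `ByteSpan`, `V2PageSections`, and `SplitV2PageSketch` are hypothetical names, not part of this patch, and the layout (repetition levels, then definition levels, then values) is assumed from the DataPageV2 format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical non-owning view; not an Arrow or Parquet type.
struct ByteSpan {
  const uint8_t* data;
  int32_t size;
};

// The three sections of a V2 page, each pointing into the original page bytes.
struct V2PageSections {
  ByteSpan repetition_levels;
  ByteSpan definition_levels;
  ByteSpan compressed_values;
};

// Splits a page without copying: the returned views alias `page`, so they
// must not outlive the buffer that owns those bytes.
V2PageSections SplitV2PageSketch(const uint8_t* page, int32_t page_len,
                                 int32_t rep_levels_len, int32_t def_levels_len) {
  assert(rep_levels_len >= 0 && def_levels_len >= 0);
  assert(rep_levels_len + def_levels_len <= page_len);
  const int32_t levels_len = rep_levels_len + def_levels_len;
  return V2PageSections{{page, rep_levels_len},
                        {page + rep_levels_len, def_levels_len},
                        {page + levels_len, page_len - levels_len}};
}
```

The trade-off is lifetime management: a reader built this way has to keep the raw page buffer alive for as long as the level decoders use the views.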

int64_t encoded_size = level_encoder_.len() + sizeof(int32_t);

if (include_length_prefix) {
  reinterpret_cast<int32_t*>(dest_buffer->mutable_data())[0] = level_encoder_.len();
Member

I'm sure big-endian machines love this ;-)

Contributor Author

This is an interesting point... the Parquet RLE spec explicitly says this length should be stored as little endian:

length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)

I wonder if this is one reason why this length prefix is not part of the DataPageV2 format?
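For reference, one common way to make such a length prefix independent of the host byte order is to write the four bytes explicitly instead of casting the destination to `int32_t*`. This is only a sketch; `StoreLittleEndian32` and `LoadLittleEndian32` are hypothetical helper names, not functions from this patch.

```cpp
#include <cassert>
#include <cstdint>

// Writes `value` as 4 little-endian bytes regardless of host byte order.
// Byte-wise stores also avoid any alignment requirement on `dest`.
inline void StoreLittleEndian32(uint32_t value, uint8_t* dest) {
  assert(dest != nullptr);
  dest[0] = static_cast<uint8_t>(value & 0xFF);
  dest[1] = static_cast<uint8_t>((value >> 8) & 0xFF);
  dest[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
  dest[3] = static_cast<uint8_t>((value >> 24) & 0xFF);
}

// Reads 4 little-endian bytes back into a host-order integer.
inline uint32_t LoadLittleEndian32(const uint8_t* src) {
  assert(src != nullptr);
  return static_cast<uint32_t>(src[0]) | (static_cast<uint32_t>(src[1]) << 8) |
         (static_cast<uint32_t>(src[2]) << 16) | (static_cast<uint32_t>(src[3]) << 24);
}
```

On a little-endian host this produces the same bytes as the `reinterpret_cast` store; on a big-endian host it still produces the byte order the spec asks for.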

cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
Member

pitrou commented Mar 12, 2020

@hatemhelal It looks like you need to resolve the conflicts with git master. Hopefully git rebase will do.

@hatemhelal
Contributor Author

> I cannot speak to the correctness of this, but some comments.

Thanks for reviewing @pitrou! Would be good to get some feedback on the correctness of these changes. @wesm @xhochy any chance you could take a look at this or suggest who could?

@hatemhelal
Contributor Author

The AppVeyor build failure is due to ARROW-8132.

Member

wesm commented Mar 17, 2020

I will review this when I can; I have been pretty backlogged and distracted the last couple of weeks.

Member

xhochy commented Mar 19, 2020

Looks good from my side, but I would like @wesm to take a final look, as I have not touched the code base for a long time.

Member

wesm commented Mar 19, 2020

Yes, I'll review as soon as I can.

@hatemhelal
Contributor Author

@xhochy, thanks for having a look!

@wesm, no immediate rush to get this reviewed. FYI, I'm changing jobs (great timing...) and have some time off until mid-April. Would be nice to wrap this up before I start at the new job and this completely leaks from my brain.

Member

@wesm left a comment

This looks mostly good to me; thanks for working on this. I left a few nitpick comments and asked a question about the interpretation of compressed_len/uncompressed_len (I looked at parquet.thrift but wanted to double-check).

int32_t levels_length =
    header.repetition_levels_byte_length + header.definition_levels_byte_length;
uint8_t* decompressed = decompression_buffer_->mutable_data();
memcpy(decompressed, page_buffer, levels_length);
Member

Do we know for sure that uncompressed_len includes the size of the uncompressed encoded levels?

Contributor Author

That was my interpretation of this comment in parquet.thrift:

  /** Uncompressed page size in bytes (not including this header) **/
  2: required i32 uncompressed_page_size

I think the parquet-mr project seems to also follow this interpretation based on my superficial read of this source.

Happy to ask about this on the dev@parquet list to be sure.

Member

Sounds like we're in the clear here
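To make the size accounting in this thread concrete, here is a sketch under the interpretation discussed above: `uncompressed_len` covers the levels plus the decompressed values, so the values occupy `uncompressed_len - levels_length` bytes after decompression. `DecompressFn` and `AssembleV2PageSketch` are hypothetical names; the codec is passed in as a callback so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical codec callback: decompresses `in_len` bytes from `in` into `out`
// (capacity `out_len`) and returns the number of bytes written. Not an Arrow API.
using DecompressFn =
    std::function<int(const uint8_t* in, int in_len, uint8_t* out, int out_len)>;

// Builds the full uncompressed page: levels copied verbatim, values decompressed.
std::vector<uint8_t> AssembleV2PageSketch(const uint8_t* page, int compressed_len,
                                          int uncompressed_len, int levels_length,
                                          const DecompressFn& decompress) {
  assert(levels_length >= 0 && levels_length <= compressed_len);
  assert(levels_length <= uncompressed_len);
  std::vector<uint8_t> out(static_cast<std::size_t>(uncompressed_len));
  // The levels section is stored uncompressed, so it is copied as-is.
  if (levels_length > 0) {
    std::memcpy(out.data(), page, static_cast<std::size_t>(levels_length));
  }
  // Only the bytes after the levels go through the codec.
  const int values_uncompressed = uncompressed_len - levels_length;
  const int written = decompress(page + levels_length, compressed_len - levels_length,
                                 out.data() + levels_length, values_uncompressed);
  // If uncompressed_len did not include the levels, this check would fail.
  assert(written == values_uncompressed);
  (void)written;
  return out;
}
```

If the header field instead excluded the levels, the output buffer would need `uncompressed_len + levels_length` bytes, which is exactly the ambiguity the question was probing.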

cpp/src/parquet/column_reader.cc (outdated, resolved)
// TODO(PARQUET-594) crc checksum

if (page.type() == PageType::DATA_PAGE) {
  const DataPageV1& v1_page = dynamic_cast<const DataPageV1&>(page);
Member

Use checked_cast here

Contributor Author

Good tip, I wasn't aware of that utility.
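For readers unfamiliar with the idea, a "checked cast" is commonly implemented as a `dynamic_cast` in debug builds and a `static_cast` in release builds. The sketch below shows that pattern with hypothetical names (`CheckedCastSketch`, `PageSketch`, `DataPageV1Sketch`); it is an assumption that Arrow's utility follows the same approach, and its actual implementation may differ in detail.

```cpp
#include <cassert>
#include <type_traits>
#include <typeinfo>

// Hypothetical helper modelled on the checked-cast idea, for reference casts only.
template <typename To, typename From>
To CheckedCastSketch(From& value) {
  static_assert(std::is_reference<To>::value, "sketch only handles reference casts");
#ifdef NDEBUG
  // Release builds: no runtime type check, no RTTI cost.
  return static_cast<To>(value);
#else
  // Debug builds: a mismatched type throws std::bad_cast instead of
  // silently producing an invalid reference.
  return dynamic_cast<To>(value);
#endif
}

// Hypothetical page hierarchy standing in for the Page / DataPageV1 classes.
struct PageSketch {
  virtual ~PageSketch() = default;
};
struct DataPageV1Sketch : PageSketch {
  int num_values = 0;
};
```

A call site then looks like `CheckedCastSketch<const DataPageV1Sketch&>(page)`, used only after the page type has already been checked, as in the `page.type() == PageType::DATA_PAGE` branch above.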

cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
Member

wesm commented Mar 25, 2020

The GLib failure here doesn't look like it could be related to this patch:

2020-03-25T11:45:58.2512570Z [51/60] Linking target example/read-stream.
2020-03-25T11:45:58.6864290Z [52/60] Generating Arrow-1.0.gir with a custom command.
2020-03-25T11:45:58.6865640Z g-ir-scanner: link: clang -o /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectr1vuir_9/Arrow-1.0 -DARROW_NO_DEPRECATED_API /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectr1vuir_9/Arrow-1.0.o -L. -Wl,-rpath,. -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/Cellar/glib/2.64.1/lib -Wl,-rpath,/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -Wl,-rpath,/usr/local/opt/gettext/lib -larrow-glib -larrow -lgobject-2.0 -lglib-2.0 -lintl -lgio-2.0 -L/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -lgio-2.0 -lgobject-2.0 -lgmodule-2.0 -lglib-2.0 -lintl
2020-03-25T11:45:59.1816140Z [53/60] Generating Arrow-1.0.typelib with a custom command.
2020-03-25T11:45:59.7802850Z [54/60] Compiling C++ object 'parquet-glib/13ba34e@@parquet-glib@sha/arrow-file-reader.cpp.o'.
2020-03-25T11:46:00.5846690Z [55/60] Generating Gandiva-1.0.gir with a custom command.
2020-03-25T11:46:00.5847780Z Package arrow-glib was not found in the pkg-config search path.
2020-03-25T11:46:00.5848590Z Perhaps you should add the directory containing `arrow-glib.pc'
2020-03-25T11:46:00.5849160Z to the PKG_CONFIG_PATH environment variable
2020-03-25T11:46:00.5850080Z No package 'arrow-glib' found
2020-03-25T11:46:00.5850480Z 
2020-03-25T11:46:00.5852590Z g-ir-scanner: link: clang -o /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectt9heu8g4/Gandiva-1.0 -DARROW_NO_DEPRECATED_API /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectt9heu8g4/Gandiva-1.0.o -L. -Wl,-rpath,. -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/gandiva-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/gandiva-glib -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/Cellar/glib/2.64.1/lib -Wl,-rpath,/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -Wl,-rpath,/usr/local/opt/gettext/lib -lgandiva-glib -lgandiva -larrow -lgobject-2.0 -lglib-2.0 -lintl -lgio-2.0 -L/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -lgio-2.0 -lgobject-2.0 -lgmodule-2.0 -lglib-2.0 -lintl
2020-03-25T11:46:00.7353160Z [56/60] Compiling C++ object 'parquet-glib/13ba34e@@parquet-glib@sha/arrow-file-writer.cpp.o'.
2020-03-25T11:46:00.8392710Z [57/60] Linking target parquet-glib/libparquet-glib.100.dylib.
2020-03-25T11:46:00.8731900Z [58/60] Generating Gandiva-1.0.typelib with a custom command.
2020-03-25T11:46:02.3266270Z [59/60] Generating Parquet-1.0.gir with a custom command.
2020-03-25T11:46:02.3266940Z Package arrow-glib was not found in the pkg-config search path.
2020-03-25T11:46:02.3267400Z Perhaps you should add the directory containing `arrow-glib.pc'
2020-03-25T11:46:02.3267530Z to the PKG_CONFIG_PATH environment variable
2020-03-25T11:46:02.3267930Z No package 'arrow-glib' found
2020-03-25T11:46:02.3268010Z 

Member

@wesm left a comment

+1

Member

wesm commented Mar 25, 2020

thanks @hatemhelal
