PARQUET-458: [C++][Parquet] Add support for reading/writing DataPageV2 format #6481

hatemhelal · 2020-02-24T16:06:01Z

This patch adds support for reading and writing the Parquet DataPageV2 format. Currently, this change will make all V2 Parquet files use the DataPageV2 format. I'm still working on some basic unit tests and would welcome any feedback on this in the meantime.

github-actions · 2020-02-24T16:16:30Z

https://issues.apache.org/jira/browse/PARQUET-458

pitrou

I cannot speak to the correctness of this, but some comments.

pitrou · 2020-03-04T10:11:52Z

cpp/src/parquet/column_reader.cc

@@ -143,6 +181,9 @@ class SerializedPageReader : public PageReader {

  void InitDecryption();

+  void DecompressPage(int compressed_len, int uncompressed_len,
+                      std::shared_ptr<Buffer>& page_buffer);


This signature would be better:

std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const std::shared_ptr<Buffer>& page_buffer);

Thanks for the tip, I went for the following signature:

std::shared_ptr<Buffer> DecompressPage(int compressed_len, int uncompressed_len, const uint8_t* page_buffer);
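The difference between the two styles discussed above can be illustrated with a minimal sketch. It assumes a hypothetical `Buffer` alias in place of `arrow::Buffer`, and the "decompression" is just a copy; the function names are made up for illustration and are not the ones in the patch.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::Buffer, used only in this sketch.
using Buffer = std::vector<uint8_t>;

// Out-parameter style: the caller's shared_ptr is rebound as a side effect,
// which is easy to miss when reading the call site.
inline void DecompressPageOutParam(const uint8_t* src, int len,
                                   std::shared_ptr<Buffer>& page_buffer) {
  // Placeholder for real decompression: copy the input bytes.
  page_buffer = std::make_shared<Buffer>(src, src + len);
}

// Return-value style: the input is read-only and the result is explicit.
inline std::shared_ptr<Buffer> DecompressPageReturning(const uint8_t* src, int len) {
  // Placeholder for real decompression: copy the input bytes.
  return std::make_shared<Buffer>(src, src + len);
}
```

Both functions produce the same bytes; the second makes the data flow visible at the call site, which is presumably why it was suggested.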

pitrou · 2020-03-04T10:17:53Z

cpp/src/parquet/column_reader.cc

+    int32_t levels_length =
+        header.repetition_levels_byte_length + header.definition_levels_byte_length;
+    uint8_t* decompressed = decompression_buffer_->mutable_data();
+    memcpy(decompressed, page_buffer->data(), levels_length);


I wonder if the memcpy could be avoided at some point.

I think it could be avoided, but that would require some invasive refactoring to separate extracting the levels from extracting the values. I might be wrong, but my understanding is that the levels are typically much smaller than the values, so I wouldn't expect this memcpy to cost too much.
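For readers following the memcpy discussion, here is a rough sketch of the layout being handled: in a DataPageV2 page the encoded levels sit uncompressed at the front, and only the values section after them goes through the codec. The function name `AssembleV2Page` and the `DecompressFn` callback are assumptions made for this sketch and do not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical codec callback: decompress src[0, src_len) into dst[0, dst_len).
using DecompressFn =
    std::function<void(const uint8_t* src, int32_t src_len, uint8_t* dst, int32_t dst_len)>;

// Build the fully uncompressed page: levels are copied verbatim, then the
// values section is decompressed directly behind them.
inline std::vector<uint8_t> AssembleV2Page(const uint8_t* page, int32_t compressed_len,
                                           int32_t uncompressed_len, int32_t levels_length,
                                           const DecompressFn& decompress) {
  assert(levels_length >= 0);
  assert(levels_length <= compressed_len && levels_length <= uncompressed_len);
  std::vector<uint8_t> out(static_cast<std::size_t>(uncompressed_len));
  if (levels_length > 0) {
    // This is the copy discussed above: proportional to the levels only.
    std::memcpy(out.data(), page, static_cast<std::size_t>(levels_length));
  }
  decompress(page + levels_length, compressed_len - levels_length,
             out.data() + levels_length, uncompressed_len - levels_length);
  return out;
}
```

The copy touches only `levels_length` bytes, which matches the expectation that its cost is small relative to decompressing the values.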

pitrou · 2020-03-04T10:22:56Z

cpp/src/parquet/column_writer.cc

-  int64_t encoded_size = level_encoder_.len() + sizeof(int32_t);
+
+  if (include_length_prefix) {
+    reinterpret_cast<int32_t*>(dest_buffer->mutable_data())[0] = level_encoder_.len();


I'm sure big-endian machines love this ;-)

This is an interesting point...the Parquet RLE spec explicitly says this length should be stored as little endian:

length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)

I wonder if this is one reason why this length prefix is not part of the DataPageV2 format?
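One endianness-independent way to write such a prefix is to emit the bytes explicitly rather than reinterpreting the buffer as `int32_t*`. The sketch below is only an illustration; the helper names `StoreLittleEndian32` and `LoadLittleEndian32` are hypothetical and are not part of the patch or of Arrow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: write a 32-bit length as 4 little-endian bytes,
// independent of the host byte order.
inline void StoreLittleEndian32(uint32_t value, uint8_t* dest) {
  assert(dest != nullptr);
  dest[0] = static_cast<uint8_t>(value & 0xFF);
  dest[1] = static_cast<uint8_t>((value >> 8) & 0xFF);
  dest[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
  dest[3] = static_cast<uint8_t>((value >> 24) & 0xFF);
}

// Hypothetical helper: read the same 4-byte little-endian prefix back.
inline uint32_t LoadLittleEndian32(const uint8_t* src) {
  assert(src != nullptr);
  return static_cast<uint32_t>(src[0]) | (static_cast<uint32_t>(src[1]) << 8) |
         (static_cast<uint32_t>(src[2]) << 16) | (static_cast<uint32_t>(src[3]) << 24);
}
```

Because the shifts operate on values rather than on memory, the resulting byte order is the same on little-endian and big-endian hosts, and there is no alignment requirement on `dest`.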

pitrou · 2020-03-12T15:50:53Z

@hatemhelal It looks like you need to resolve the conflicts with git master. Hopefully git rebase will do.

hatemhelal · 2020-03-16T11:51:25Z

> I cannot speak to the correctness of this, but some comments.

Thanks for reviewing @pitrou! Would be good to get some feedback on the correctness of these changes. @wesm @xhochy any chance you could take a look at this or suggest who could?

hatemhelal · 2020-03-16T12:28:04Z

The AppVeyor build failure is due to ARROW-8132.

wesm · 2020-03-17T19:36:49Z

I will review this when I can; I have been pretty backlogged and distracted for the last couple of weeks.

xhochy · 2020-03-19T10:27:40Z

Looks good from my side, but I would like @wesm to have a final look, as I have not touched the code base for a long time.

wesm · 2020-03-19T17:20:55Z

Yes I'll review as soon as I can

hatemhelal · 2020-03-19T20:25:00Z

@xhochy, thanks for having a look!

@wesm, no immediate rush to get this reviewed. FYI, I'm changing jobs (great timing...) and have some time off until mid-April. Would be nice to wrap this up before I start at the new job and this completely leaks from my brain.

wesm

This looks mostly good to me, thanks for working on this. I left a few nitpick comments and asked a question about the interpretation of compressed_len/uncompressed_len (I looked at parquet.thrift but wanted to double-check).

wesm · 2020-03-19T22:01:05Z

cpp/src/parquet/column_reader.cc

+    int32_t levels_length =
+        header.repetition_levels_byte_length + header.definition_levels_byte_length;
+    uint8_t* decompressed = decompression_buffer_->mutable_data();
+    memcpy(decompressed, page_buffer, levels_length);


Do we know for sure that uncompressed_len includes the size of the uncompressed encoded levels?

That was my interpretation of this comment in parquet.thrift :

/** Uncompressed page size in bytes (not including this header) **/
2: required i32 uncompressed_page_size

The parquet-mr project also seems to follow this interpretation, based on my superficial read of its source.

Happy to ask about this on the dev@parquet list to be sure.

Sounds like we're in the clear here
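Under the interpretation settled on above (both page sizes in the header cover the level bytes as well as the values), the sizes of the values section follow by subtraction. The sketch below shows that arithmetic with basic consistency checks; `ValueSectionSizes` and `ComputeValueSectionSizes` are hypothetical names introduced here, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct ValueSectionSizes {
  int32_t compressed = 0;    // bytes of compressed values after the levels
  int32_t uncompressed = 0;  // bytes of values once decompressed
};

// Assumes both page sizes include the (uncompressed) level bytes.
// Returns false if the header fields are mutually inconsistent.
inline bool ComputeValueSectionSizes(int32_t compressed_page_size,
                                     int32_t uncompressed_page_size,
                                     int32_t repetition_levels_byte_length,
                                     int32_t definition_levels_byte_length,
                                     ValueSectionSizes* out) {
  assert(out != nullptr);
  if (repetition_levels_byte_length < 0 || definition_levels_byte_length < 0) {
    return false;
  }
  // Sum in 64 bits so two large int32 lengths cannot overflow.
  const int64_t levels_length = static_cast<int64_t>(repetition_levels_byte_length) +
                                definition_levels_byte_length;
  if (levels_length > compressed_page_size || levels_length > uncompressed_page_size) {
    return false;
  }
  out->compressed = static_cast<int32_t>(compressed_page_size - levels_length);
  out->uncompressed = static_cast<int32_t>(uncompressed_page_size - levels_length);
  return true;
}
```

Rejecting headers whose level lengths exceed either page size guards against reading past the page buffer when a file is corrupt.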

wesm · 2020-03-19T22:03:22Z

cpp/src/parquet/column_writer.cc

    // TODO(PARQUET-594) crc checksum

+    if (page.type() == PageType::DATA_PAGE) {
+      const DataPageV1& v1_page = dynamic_cast<const DataPageV1&>(page);


Use checked_cast here

Good tip, I wasn't aware of that utility.
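For context, Arrow ships a `checked_cast` utility (in `arrow/util/checked_cast.h`) that behaves like `dynamic_cast` in debug builds and like `static_cast` in release builds. The sketch below re-creates that idea in a self-contained form under the name `checked_cast_sketch`; it is an approximation for illustration, not Arrow's implementation, and the `Page`/`DataPageV1` types are simplified stand-ins for the real classes.

```cpp
#include <type_traits>

// Sketch of the idea: verify the downcast when assertions are enabled,
// skip the RTTI cost when NDEBUG is defined.
template <typename To, typename From>
inline To checked_cast_sketch(From& value) {
  static_assert(std::is_reference<To>::value, "this sketch handles reference casts only");
#ifdef NDEBUG
  return static_cast<To>(value);
#else
  return dynamic_cast<To>(value);  // throws std::bad_cast on a wrong type
#endif
}

// Simplified stand-ins for the page classes mentioned in the diff.
struct Page {
  virtual ~Page() = default;
};

struct DataPageV1 : Page {
  explicit DataPageV1(int n) : num_values(n) {}
  int num_values;
};
```

A call such as `checked_cast_sketch<const DataPageV1&>(page)` keeps the safety net during development while avoiding the `dynamic_cast` overhead on the hot write path in optimized builds.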

wesm · 2020-03-25T20:55:31Z

The GLib failure here doesn't look like it could be related to this patch

2020-03-25T11:45:58.2512570Z [51/60] Linking target example/read-stream.
2020-03-25T11:45:58.6864290Z [52/60] Generating Arrow-1.0.gir with a custom command.
2020-03-25T11:45:58.6865640Z g-ir-scanner: link: clang -o /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectr1vuir_9/Arrow-1.0 -DARROW_NO_DEPRECATED_API /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectr1vuir_9/Arrow-1.0.o -L. -Wl,-rpath,. -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/Cellar/glib/2.64.1/lib -Wl,-rpath,/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -Wl,-rpath,/usr/local/opt/gettext/lib -larrow-glib -larrow -lgobject-2.0 -lglib-2.0 -lintl -lgio-2.0 -L/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -lgio-2.0 -lgobject-2.0 -lgmodule-2.0 -lglib-2.0 -lintl
2020-03-25T11:45:59.1816140Z [53/60] Generating Arrow-1.0.typelib with a custom command.
2020-03-25T11:45:59.7802850Z [54/60] Compiling C++ object 'parquet-glib/13ba34e@@parquet-glib@sha/arrow-file-reader.cpp.o'.
2020-03-25T11:46:00.5846690Z [55/60] Generating Gandiva-1.0.gir with a custom command.
2020-03-25T11:46:00.5847780Z Package arrow-glib was not found in the pkg-config search path.
2020-03-25T11:46:00.5848590Z Perhaps you should add the directory containing `arrow-glib.pc'
2020-03-25T11:46:00.5849160Z to the PKG_CONFIG_PATH environment variable
2020-03-25T11:46:00.5850080Z No package 'arrow-glib' found
2020-03-25T11:46:00.5850480Z 
2020-03-25T11:46:00.5852590Z g-ir-scanner: link: clang -o /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectt9heu8g4/Gandiva-1.0 -DARROW_NO_DEPRECATED_API /Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/tmp-introspectt9heu8g4/Gandiva-1.0.o -L. -Wl,-rpath,. -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/gandiva-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/gandiva-glib -L/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -Wl,-rpath,/Users/runner/runners/2.165.2/work/arrow/arrow/build/c_glib/arrow-glib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/lib -Wl,-rpath,/usr/local/lib -L/usr/local/Cellar/glib/2.64.1/lib -Wl,-rpath,/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -Wl,-rpath,/usr/local/opt/gettext/lib -lgandiva-glib -lgandiva -larrow -lgobject-2.0 -lglib-2.0 -lintl -lgio-2.0 -L/usr/local/Cellar/glib/2.64.1/lib -L/usr/local/opt/gettext/lib -lgio-2.0 -lgobject-2.0 -lgmodule-2.0 -lglib-2.0 -lintl
2020-03-25T11:46:00.7353160Z [56/60] Compiling C++ object 'parquet-glib/13ba34e@@parquet-glib@sha/arrow-file-writer.cpp.o'.
2020-03-25T11:46:00.8392710Z [57/60] Linking target parquet-glib/libparquet-glib.100.dylib.
2020-03-25T11:46:00.8731900Z [58/60] Generating Gandiva-1.0.typelib with a custom command.
2020-03-25T11:46:02.3266270Z [59/60] Generating Parquet-1.0.gir with a custom command.
2020-03-25T11:46:02.3266940Z Package arrow-glib was not found in the pkg-config search path.
2020-03-25T11:46:02.3267400Z Perhaps you should add the directory containing `arrow-glib.pc'
2020-03-25T11:46:02.3267530Z to the PKG_CONFIG_PATH environment variable
2020-03-25T11:46:02.3267930Z No package 'arrow-glib' found
2020-03-25T11:46:02.3268010Z

wesm

+1

wesm · 2020-03-25T20:56:31Z

thanks @hatemhelal

pitrou reviewed Mar 4, 2020

hatemhelal force-pushed the parquet-458 branch from c57a41c to f5f56cb Compare March 12, 2020 20:54

hatemhelal force-pushed the parquet-458 branch from f5f56cb to 3cf340b Compare March 16, 2020 12:02

hatemhelal force-pushed the parquet-458 branch from 3cf340b to d38aee1 Compare March 16, 2020 20:39

wesm self-requested a review March 17, 2020 19:36

wesm reviewed Mar 19, 2020

hatemhelal and others added 9 commits March 25, 2020 11:11

Support Parquet DataPageV2 format

e2656e3

fix narrowing warning on windows

10d3409

fix another narrowing warning on windows

09a7655

fix wrong pointer passed to definition level decoder

b7c3920

Update DataPage* tests to cover stats

ae7d6e6

code review fixes

a45e8a4

remove vestigial todo

e8a5920

format after rebase

563d94b

additional code review readability fixes

c0e4d31

hatemhelal force-pushed the parquet-458 branch from d38aee1 to c0e4d31 Compare March 25, 2020 11:34

wesm approved these changes Mar 25, 2020

wesm closed this in 809d40a Mar 25, 2020

kevingurney deleted the parquet-458 branch February 7, 2022 19:30

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Implement support for DataPageV2 #42256

Closed
