This repository has been archived by the owner on May 10, 2024. It is now read-only.

PARQUET-1372: Add an API to allow writing RowGroups based on size #484

Closed
wants to merge 17 commits into from

Conversation

majetideepak

@majetideepak majetideepak commented Aug 13, 2018

I split the changes into multiple commits to ease the review.
Used the example program to test the new API.
I will add unit tests once we converge on the API after review.
Thanks to @AnatoliShein for collaborating with the API design.

#include <cassert>
#include <fstream>
#include <iostream>
#include <list>


nit: Doesn't look like you're using anything from list in here. Maybe I am missing something implicit.

Member

Could be result of copy-paste, IWYU could confirm

Author

Will check and fix!

// All the values are compressed and stored in memory
// Values are written to the file on Close

class PARQUET_EXPORT RowGroupWriter2 {


A name like "FixedSizeRowGroupWriter", really anything other than 2, would be a little more intuitive in the public API.


Agreed about the name. Something like ContinuousRowGroupWriter might work a bit better though since we now allow writing to each column multiple times, and this way suffix "2" can be replaced in all other locations with something like "_cont". "Repeated" might work too.

Member

Agree that RowGroupWriter2 is not a good name.

Why do we need a new class? It's not clear to me that it's needed

Author

The current RowGroupWriter class has `virtual ColumnWriter* NextColumn() = 0;`, which does not fit this use case: the goal is to be able to write to any column chunk any number of times (cycling around all the column chunks) and in any order. The current RowGroupWriter requires us to write a column chunk in one go and then move to the immediately next column.
To support this use case we need an external API along the lines of:
`virtual ColumnWriter* get_column(int i) = 0;`
The new RowGroupWriter2 has different semantics in this respect.

Author

I am not tied to this name. Will change to FixedSizeRowGroupWriter

virtual int num_columns() const = 0;
virtual int64_t num_rows() const = 0;
virtual int64_t current_compressed_bytes() const = 0;
virtual ColumnWriter* get_column(int i) const = 0;


If the ColumnWriter can indirectly touch the RowGroupWriter's internal state it'd be best to return a const ColumnWriter* from the const accessor.

Author

will do!

@@ -102,6 +144,8 @@ class PARQUET_EXPORT ParquetFileWriter {

virtual RowGroupWriter* AppendRowGroup() = 0;

virtual RowGroupWriter2* AppendRowGroup2() = 0;


Consider a subclass for writing fixed size rowgroups that overrides AppendRowGroup or take a param in the ParquetFileWriter ctor that determines how it wants to pack rowgroups. That way whatever client code is pushing tuples in doesn't have to care about the on-disk layout after the writer is created.

Member

Agreed. I think these details should be handled in the WriterProperties and the API of RowGroupWriter should be able to indicate to the caller whether it is "safe" to continue writing, or whether the current row group needs to be terminated and flushed

Author

@majetideepak majetideepak Aug 15, 2018

This will restrict clients from being able to loosely or tightly bound a RowGroup size.
Example: clients might want to write a row at a time, a batch at a time, or both adaptively to achieve their target RowGroup size.
Client requirements may also vary between a strict upper bound and a loose upper bound.

Edit: I might have misunderstood your comments. If the proposal is to use the current RowGroupWriter and extend it, then it all boils down to how we can extend the current RowGroupWriter API to write multiple column chunks.

Member

@wesm wesm left a comment

It's great to add this functionality, but I think we should spend some energy to determine what is the best API for end users and results in more maintainable code. We also need some unit tests


std::cout << "Parquet Writing and Reading Complete" << std::endl;

return 0;
}
Member

Is there a way to reuse code between this file and the other read/write example?

Author

The two examples are quite different. The other example writes a single row group with a predetermined number of rows. I can move the schema creation part to a common header.

@@ -294,6 +295,8 @@ ColumnWriter::ColumnWriter(ColumnChunkMetaDataBuilder* metadata,
num_buffered_encoded_values_(0),
rows_written_(0),
total_bytes_written_(0),
current_compressed_bytes_(0),
flush_on_close_(flush_on_close),
Member

Would it be better to make this part of WriterProperties?

Author

It will depend on what RowGroupWriter API we provide to users. If parquet-cpp takes responsibility for bounding RowGroup sizes (which I think is out of its scope, based on the example above), then this can be part of WriterProperties.

WriteDictionaryPage();
}

flush_on_close_ = false;
Member

This seems a tad hacky to me; I think there is a missing use of abstraction here which is the management of pages to be written to the file. The Parquet Java library has notions that provide for multiple implementations of this:

https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/column/page

We don't need to necessarily do it in this patch, but using the existing PageWriter (or something similar if what is there now doesn't have the right features) abstraction with APIs to provide for

a) an implementation that accumulates pages in memory
b) an implementation that writes pages immediately

At some point we will want to make page writes asynchronous for better write performance so this refactoring will probably be desired at some point anyway

Author

The PageWriter class is responsible for writing compressed data pages out. But the choice between writing compressed pages immediately and accumulating them is handled more efficiently as part of the ColumnWriter.


if (flush_on_close_) {
current_compressed_bytes_ += page.size() + sizeof(format::DictionaryPageHeader);
saved_dictionary_page_.push_back(std::move(page));
Member

This aligns with my comments above

@@ -209,6 +220,8 @@ class PARQUET_EXPORT ColumnWriter {

std::vector<CompressedDataPage> data_pages_;

std::vector<DictionaryPage> saved_dictionary_page_;
Member

Should the pages all be kept in the same collection?

Author

Will do!


int64_t num_rows() const;

// Only considers the size of the compressed pages + page headers in all the columns
// Some values might still be buffered and not written to a page yet
Member

See note above re: this comment


AES_GCM_V1 = 0,
AES_GCM_CTR_V1 = 1
};
enum type { AES_GCM_V1 = 0, AES_GCM_CTR_V1 = 1 };
Member

clang-format artifact?

Author

@majetideepak majetideepak Aug 15, 2018

Yes! I used CLANG_FORMAT_VERSION 6.0.

<< " while previous column had " << num_rows_;
throw ParquetException(ss.str());
// verify when only one column is written at a time
if (!row_group_by_size_ && column_writers_.size() > 0 && column_writers_[0]) {
Author

This if-else is one of the reasons to use a new RowGroupWriter class in my previous version of this.

}

ColumnWriter* get_column(int i) override {
if (!row_group_by_size_) {
Author

This restriction is the other misfit in using a single RowGroupWriter API

@majetideepak
Author

@wesm, @xhochy I made changes based on the feedback. I can quickly add unit tests if we agree on the API.

@@ -15,7 +15,11 @@
# specific language governing permissions and limitations
# under the License.

include_directories(SYSTEM . )
Member

Please use https://cmake.org/cmake/help/v3.3/command/target_include_directories.html so that this does not spread to the main library.


@@ -50,8 +54,16 @@ class PARQUET_EXPORT RowGroupWriter {
virtual int64_t num_rows() const = 0;

virtual ColumnWriter* NextColumn() = 0;
// to be used only when row_group_by_size = true
virtual ColumnWriter* get_column(int i) = 0;
Member

I would just name this column()

@@ -69,11 +81,17 @@ class PARQUET_EXPORT RowGroupWriter {

int num_columns() const;

// to be used only when row_group_by_size = true
ColumnWriter* get_column(int i);
Member

column()

@wesm
Member

wesm commented Aug 23, 2018

Sorry for the delay in my follow up review. This is a priority for me tomorrow (Thursday) so we can release

Member

@wesm wesm left a comment

I have a high level comment. The "row_groups_by_size" logic is up to the user, so this is a little bit misleading (unless I've misunderstood the code). Really what this is doing is buffering pages in memory and allowing you to decide whether to keep writing data before flushing to the file -- the decision may be based on how big the row group is. There are other reasons why you might want to do this, such as waiting until a row group is "ready" before issuing the write to an underlying file system. Since this is more general than writing by size, i.e. a type of buffering mode, we should name the APIs and parameters accordingly



/*
* This example describes writing and reading Parquet Files in C++ and serves as a
* reference to the API.
* The file contains all the physical data types supported by Parquet.
* This example uses the RowGroupWriter API that supports writing RowGroups optimized for memory consumption
Member

Seems maybe clang-format isn't hitting this file

@@ -100,7 +118,7 @@ class PARQUET_EXPORT ParquetFileWriter {
/// \note Deprecated since 1.3.0
RowGroupWriter* AppendRowGroup(int64_t num_rows);

virtual RowGroupWriter* AppendRowGroup() = 0;
virtual RowGroupWriter* AppendRowGroup(bool row_group_by_size = false) = 0;
Member

Adding a new method AppendFixedSizeRowGroup() would result in more readable code than AppendRowGroup(true)

@@ -50,8 +54,16 @@ class PARQUET_EXPORT RowGroupWriter {
virtual int64_t num_rows() const = 0;

virtual ColumnWriter* NextColumn() = 0;
// to be used only when row_group_by_size = true
Member

Clarify in a comment for NextColumn() that ColumnWriter* objects become invalid when NextColumn is called unless writing fixed size row groups

@@ -39,6 +39,10 @@ class GroupNode;

} // namespace schema

// RowGroupWriter implementation that optimizes memory requirement
// All columns must be written one after the other
// Writing to a new column prevents modification to the previous column
Member

Add \brief here and use /// so these comments will show up in doxygen. We can make sure the Parquet docs show up well in http://arrow.apache.org/docs/cpp/ after the merge

int RowGroupWriter::current_column() { return contents_->current_column(); }

int RowGroupWriter::num_columns() const { return contents_->num_columns(); }

int64_t RowGroupWriter::num_rows() const { return contents_->num_rows(); }

inline void throwRowsMisMatchError(int col, int64_t prev, int64_t curr) {
Member

ThrowRowsMismatchError

@wesm
Member

wesm commented Aug 23, 2018

I suggest maybe calling this AppendBufferedRowGroup or something

Member

@wesm wesm left a comment

+1. Thanks @majetideepak!

Member

@xhochy xhochy left a comment

+1, LGTM

@xhochy xhochy closed this in 80e110c Aug 25, 2018