Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Interface total_bytes_written is Confusing #33652

Closed
mapleFU opened this issue Jan 13, 2023 · 6 comments · Fixed by #33897
Closed

[C++][Parquet] Interface total_bytes_written is Confusing #33652

mapleFU opened this issue Jan 13, 2023 · 6 comments · Fixed by #33897

Comments

@mapleFU
Copy link
Member

mapleFU commented Jan 13, 2023

Describe the enhancement requested

Hi, all, when using parquet's total_bytes_written, I found it really confusing.

In RowGroupWriter:

class PARQUET_EXPORT RowGroupWriter {
 public:
  // Forward declare a virtual class 'Contents' to aid dependency injection and more
  // easily create test fixtures
  // An implementation of the Contents class is defined in the .cc file
  struct Contents {
    virtual ~Contents() = default;
    virtual int num_columns() const = 0;
    virtual int64_t num_rows() const = 0;

    // to be used only with ParquetFileWriter::AppendRowGroup
    virtual ColumnWriter* NextColumn() = 0;
    // to be used only with ParquetFileWriter::AppendBufferedRowGroup
    virtual ColumnWriter* column(int i) = 0;

    virtual int current_column() const = 0;
    virtual void Close() = 0;

    // total bytes written by the page writer
    virtual int64_t total_bytes_written() const = 0;
    // total bytes still compressed but not written
    virtual int64_t total_compressed_bytes() const = 0;

    virtual bool buffered() const = 0;
  };
  1. Syntax of total_compressed_bytes() is clear
  2. total_bytes_written means "total bytes written by page writer".

And in ColumnWriter, the interface says:

  /// \brief The total size of the compressed pages + page headers. Some values
  /// might be still buffered and not written to a page yet
  virtual int64_t total_compressed_bytes() const = 0;

  /// \brief The total number of bytes written as serialized data and
  /// dictionary pages to the ColumnChunk so far
  virtual int64_t total_bytes_written() const = 0;

In ColumnWriterImpl, it's:

  // Records the total number of uncompressed bytes written by the serializer
  int64_t total_bytes_written_;

  // Records the current number of compressed bytes in a column
  int64_t total_compressed_bytes_;

So, in fact, the total_bytes_written() in RowGroup just means the uncompressed bytes it output? Is there any interface for compressed bytes? In non-buffer mode, we can just call sink_.Tell(), but in buffer mode, how can we get compressed-output length?

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Jan 18, 2023

@pitrou @wjones127 Hi, would you mind me add a compressed_written_bytes api? Or did I miss some api?

@wjones127
Copy link
Member

wjones127 commented Jan 23, 2023

I'm don't think you are missing any API, although I am not sure the use case for exposing it during the write. Why not call Tell() on the sink after flushing?

@mapleFU
Copy link
Member Author

mapleFU commented Jan 24, 2023

I'm don't think you are missing any API, although I am not sure the use case for exposing it during the write. Why not call Tell() on the sink after flushing?

Because we have "buffered" page writer.

@wgtmac
Copy link
Member

wgtmac commented Jan 24, 2023

I'm don't think you are missing any API, although I am not sure the use case for exposing it during the write. Why not call Tell() on the sink after flushing?

I think compressed_written_bytes would be useful if we want to limit compressed size of row group if we buffer row group in the parquet writer.

For now we can only get size of flushed row groups via Tell() and raw (uncompressed) size of buffered row group via total_bytes_written(). As we cannot precisely estimate the compression ratio, it is not easy to limit the actual row group size after compression.

@mapleFU
Copy link
Member Author

mapleFU commented Jan 24, 2023

@wgtmac @wjones127

To detail explain this problem, let me assume a schema: (a: required int, b: required int).

Assume the data below:

a:  | Dict Page | Page 1 |  Page 2 | Unbuffered Written Values |
b:  | Page 1 |  Page 2 | Page 3 | Unbuffered Written Values |

So, we have a RowGroupWriter, and two ColumnWriter here. Each column writer holds a PageWriter.

There are two kinds of page writer:

  1. SerializedPageWriter, which just write "compressed page" to sink.
  2. BufferedPageWriter, which buffer the writing pages, and write multiple compressed page to sink.

So, assume we want to get "currently buffered value size" and "Unbuffered estimated size", sink_.Tell() will not be enough, because we have buffered page writer, which haven't write to sink.

We still have interface in this way, yes, total_bytes_written. But, as you can see, it's just "uncompressed size". For example, assume column b uses DELTA_BINARY_PACKED, it size would be 50B, but the total_bytes_written may be 5KB, which cannot tell use the size it written

@wjones127
Copy link
Member

As we cannot precisely estimate the compression ratio, it is not easy to limit the actual row group size after compression.

That makes sense. It sounds like you want to be able to get the data size while writing and not wait until you have finished and flushed all data. That makes sense. 👍

wjones127 pushed a commit that referenced this issue Feb 27, 2023
…#33897)

### Rationale for this change

### What changes are included in this PR?

Talked in #33652 . Main issue is that `total_bytes_written` is confusing, because it only tells the size of "uncompressed" bytes size written. For buffered page writer, the actually written size cannot be known until all column chunk is written and call `sink_.Tell()`.

I'd like to:
* [x] Add interface for PageWriter
* [x] Add interface for ColumnWriter
* [x] Add interface for RowGroupWriter
* [x] Testing them

### Are these changes tested?

I'd like to test it later.

### Are there any user-facing changes?

User can get excatly size written by: `total_compressed_bytes_written() + total_compressed_bytes()` for ColumnWriter before finish writing.

* Closes: #33652

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 12.0.0 milestone Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants