[C++][Parquet] Interface total_bytes_written is Confusing #33652
Comments
@pitrou @wjones127 Hi, would you mind if I added a |
I don't think you are missing any API, although I am not sure of the use case for exposing it during the write. Why not call |
Because we have a "buffered" page writer. |
I think For now we can only get the size of flushed row groups via |
To explain this problem in detail, let me assume a schema: Assume the data below:
So, we have a There are two kinds of page writer:
So, assume we want to get "currently buffered value size" and "Unbuffered estimated size", We still have interface in this way, yes, |
That makes sense. It sounds like you want to be able to get the data size while writing and not wait until you have finished and flushed all data. That makes sense. 👍 |
…#33897)

### Rationale for this change

### What changes are included in this PR?

Discussed in #33652. The main issue is that `total_bytes_written` is confusing, because it only reports the number of uncompressed bytes written. For a buffered page writer, the size actually written cannot be known until every column chunk has been written and `sink_.Tell()` is called.

I'd like to:

* [x] Add interface for PageWriter
* [x] Add interface for ColumnWriter
* [x] Add interface for RowGroupWriter
* [x] Testing them

### Are these changes tested?

I'd like to test it later.

### Are there any user-facing changes?

Users can get the exact size written so far with `total_compressed_bytes_written() + total_compressed_bytes()` on a ColumnWriter before writing finishes.

* Closes: #33652

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
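To make the relationship between the three counters concrete, here is a minimal, self-contained sketch. It is a toy model and not Arrow's implementation: `ToyPage`, `ToyColumnWriter`, `BufferPage`, `FlushBufferedPages`, and `CompressedSizeSoFar` are hypothetical names, and the assumption that buffered pages only count toward `total_bytes_written()` once they are flushed is a simplification made for illustration. Only the accessor names `total_bytes_written()`, `total_compressed_bytes()`, and `total_compressed_bytes_written()` are taken from the discussion above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a finished data page (not an Arrow type).
struct ToyPage {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

// Toy model of a column writer that buffers pages before flushing them.
// Assumption: the counters behave as described in the commit message above.
class ToyColumnWriter {
 public:
  // Keep a finished page in memory; nothing reaches the sink yet.
  void BufferPage(const ToyPage& page) {
    buffered_.push_back(page);
    total_compressed_bytes_ += page.compressed_size;
  }

  // Hand every buffered page to the (imaginary) page writer.
  void FlushBufferedPages() {
    for (const ToyPage& page : buffered_) {
      total_bytes_written_ += page.uncompressed_size;
      total_compressed_bytes_written_ += page.compressed_size;
    }
    total_compressed_bytes_ = 0;
    buffered_.clear();
  }

  // Uncompressed bytes of pages already handed to the page writer.
  int64_t total_bytes_written() const { return total_bytes_written_; }
  // Compressed bytes of pages that are still buffered.
  int64_t total_compressed_bytes() const { return total_compressed_bytes_; }
  // Compressed bytes of pages already handed to the page writer.
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

  // The sum suggested in the commit message: flushed plus buffered.
  int64_t CompressedSizeSoFar() const {
    return total_compressed_bytes_written_ + total_compressed_bytes_;
  }

 private:
  std::vector<ToyPage> buffered_;
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

In this model, `total_bytes_written()` alone says nothing about pages that are still buffered, which is the confusion the issue describes; the sum of the two compressed counters covers both the flushed and the buffered pages.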
Describe the enhancement requested
Hi, all, when using parquet's `total_bytes_written`, I found it really confusing.

In `RowGroupWriter`:

`total_compressed_bytes()` is clear

And in `ColumnWriter`, the interface says:

In `ColumnWriterImpl`, it's:

So, in fact, the `total_bytes_written()` in `RowGroup` just means the uncompressed bytes it output? Is there any interface for compressed bytes? In non-buffer mode, we can just call `sink_.Tell()`, but in buffer mode, how can we get compressed-output length?

Component(s)
C++, Parquet