The pq.write_to_dataset (legacy implementation) supports the row_group_size/chunk_size keyword to specify the row group size of the written parquet files.
Now that we made use_legacy_dataset=False the default, this keyword doesn't work anymore.
This is because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword. The ParquetFileWriteOptions class doesn't support this keyword.
On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen in arrow/cpp/src/parquet/arrow/writer.h.
The dataset API now has a max_rows_per_group keyword, I see, but that doesn't necessarily relate directly to Parquet row groups?
It is more generic and controls how many rows are written in one go, but that effectively makes it a maximum Parquet row group size as well (since a row group needs to be written in one go)? In that sense the parquet row_group_size keyword could be translated into max_rows_per_group to preserve the intended use case?
Weston Pace / @westonpace:
Your understanding is correct. I think max_rows_per_group is the correct choice here. Each call to Write (i.e. one go) results in
It might also be useful to set min_rows_per_group to row_group_size, but that would be a change in behavior, so maybe we shouldn't do that as well (the legacy behavior would just write tiny groups in this case).
Code reference for the description above: arrow/cpp/src/parquet/arrow/writer.h, lines 62 to 71 in 76d064c
See discussion: #12811 (comment)
Reporter: Alenka Frim / @AlenkaF
Assignee: Alenka Frim / @AlenkaF
Note: This issue was originally created as ARROW-16240. Please see the migration documentation for further details.