
[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False #31636

Closed
Tracked by #31529
asfimport opened this issue Apr 19, 2022 · 3 comments


The pq.write_to_dataset (legacy implementation) supports the row_group_size/chunk_size keyword to specify the row group size of the written parquet files.

Now that we made use_legacy_dataset=False the default, this keyword doesn't work anymore.

This is because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword: the ParquetFileWriteOptions class doesn't expose it.
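For illustration, a minimal sketch of the affected call (the output directory and column names are hypothetical, not from the report):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021, 2021], "value": [1.0, 2.0, 3.0]})

# With the legacy implementation this keyword controlled the Parquet row group
# size; with use_legacy_dataset=False (now the default) it is no longer honored.
pq.write_to_dataset(
    table,
    root_path="dataset_root",      # hypothetical output directory
    partition_cols=["year"],
    row_group_size=64 * 1024,      # a.k.a. chunk_size
)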

On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen in the FileWriter declaration:

static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                            std::shared_ptr<::arrow::io::OutputStream> sink,
                            std::shared_ptr<WriterProperties> properties,
                            std::shared_ptr<ArrowWriterProperties> arrow_properties,
                            std::unique_ptr<FileWriter>* writer);

virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

/// \brief Write a Table to Parquet.
virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;
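The same split is visible from Python; as a rough sketch (not verbatim library internals), writer-level options are fixed when the ParquetWriter is constructed, while row_group_size is a per-call argument of write_table:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100_000))})

# Options like compression map to WriterProperties / ArrowWriterProperties and
# are set at construction time ...
with pq.ParquetWriter("example.parquet", table.schema, compression="snappy") as writer:
    # ... while row_group_size is passed per write_table call, mirroring the
    # chunk_size argument of the C++ FileWriter::WriteTable shown above.
    writer.write_table(table, row_group_size=64 * 1024)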

See discussion: #12811 (comment)

Reporter: Alenka Frim / @AlenkaF
Assignee: Alenka Frim / @AlenkaF

PRs and other links:

Note: This issue was originally created as ARROW-16240. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
cc @westonpace do you remember whether it has been discussed before how the row_group_size/chunk_size setting from Parquet fits into the dataset API?

The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily relate directly to Parquet row groups?
It is more generic about how many rows are written in one go, but that effectively also makes it a maximum Parquet row group size (since a row group needs to be written in one go). In that sense the parquet row_group_size keyword could be translated into that keyword to preserve the intended use case?
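A rough sketch of the translation being suggested, assuming ds.write_dataset's max_rows_per_group option; the shim function itself is hypothetical, not pyarrow's actual wiring:

import pyarrow.dataset as ds

def write_to_dataset_sketch(table, root_path, row_group_size=None, **kwargs):
    # Hypothetical shim: map the Parquet-specific row_group_size onto the
    # generic dataset option that caps how many rows go into each write
    # (and thus into each Parquet row group).
    if row_group_size is not None:
        kwargs["max_rows_per_group"] = row_group_size
    ds.write_dataset(table, root_path, format="parquet", **kwargs)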


Weston Pace / @westonpace:
Your understanding is correct. I think max_rows_per_group is the correct choice here. Each call to Write (i.e. one "go") results in


parquet_writer_->WriteTable(*table, batch->num_rows())

so it will create a new parquet row group.

It might also be useful to set min_rows_per_group to row_group_size, but that would be a change in behavior, so maybe we shouldn't do that (the legacy behavior would just write tiny groups in this case).
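For example, a sketch of what setting both bounds at the ds.write_dataset level would look like (the output directory and values are illustrative, not from this thread):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(1_000_000))})

# max_rows_per_group caps the rows handed to each write (and thus each Parquet
# row group); additionally setting min_rows_per_group would batch smaller
# writes up to the target size, which is the behavior change noted above.
ds.write_dataset(
    table,
    "dataset_root",                # hypothetical output directory
    format="parquet",
    max_rows_per_group=64 * 1024,
    min_rows_per_group=64 * 1024,
)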


Krisztian Szucs / @kszucs:
Issue resolved by pull request #12955
