
[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False #31636

Closed
Tracked by #31529
asfimport opened this issue Apr 19, 2022 · 3 comments


The pq.write_to_dataset (legacy implementation) supports the row_group_size/chunk_size keyword to specify the row group size of the written parquet files.

Now that we made use_legacy_dataset=False the default, this keyword doesn't work anymore.

This is because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword: the ParquetFileWriteOptions class doesn't expose it.
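For illustration, a minimal sketch of the affected call (the output directory and column names are hypothetical, not from the report):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021, 2021], "value": [1.0, 2.0, 3.0]})

# With the legacy implementation this keyword controlled the Parquet row group
# size; with use_legacy_dataset=False (now the default) it is no longer honored.
pq.write_to_dataset(
    table,
    root_path="dataset_root",      # hypothetical output directory
    partition_cols=["year"],
    row_group_size=64 * 1024,      # a.k.a. chunk_size
)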

On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen in the FileWriter declaration:

static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                            std::shared_ptr<::arrow::io::OutputStream> sink,
                            std::shared_ptr<WriterProperties> properties,
                            std::shared_ptr<ArrowWriterProperties> arrow_properties,
                            std::unique_ptr<FileWriter>* writer);

virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

/// \brief Write a Table to Parquet.
virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;
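The same split is visible from Python; as a rough sketch (not verbatim library internals), writer-level options are fixed when the ParquetWriter is constructed, while row_group_size is a per-call argument of write_table:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100_000))})

# Options like compression map to WriterProperties / ArrowWriterProperties and
# are set at construction time ...
with pq.ParquetWriter("example.parquet", table.schema, compression="snappy") as writer:
    # ... while row_group_size is passed per write_table call, mirroring the
    # chunk_size argument of the C++ FileWriter::WriteTable shown above.
    writer.write_table(table, row_group_size=64 * 1024)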

See discussion: #12811 (comment)

Reporter: Alenka Frim / @AlenkaF
Assignee: Alenka Frim / @AlenkaF

PRs and other links:

Note: This issue was originally created as ARROW-16240. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
cc @westonpace do you remember whether it has been discussed before how the row_group_size/chunk_size setting from Parquet fits into the dataset API?

The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily relate directly to Parquet row groups?
It is more generic about how many rows are written in one go, but that effectively also makes it a maximum Parquet row group size (since a row group needs to be written in one go). In that sense the parquet row_group_size keyword could be translated into that keyword to preserve the intended use case?
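A rough sketch of the translation being suggested, assuming ds.write_dataset's max_rows_per_group option; the shim function itself is hypothetical, not pyarrow's actual wiring:

import pyarrow.dataset as ds

def write_to_dataset_sketch(table, root_path, row_group_size=None, **kwargs):
    # Hypothetical shim: map the Parquet-specific row_group_size onto the
    # generic dataset option that caps how many rows go into each write
    # (and thus into each Parquet row group).
    if row_group_size is not None:
        kwargs["max_rows_per_group"] = row_group_size
    ds.write_dataset(table, root_path, format="parquet", **kwargs)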


Weston Pace / @westonpace:
Your understanding is correct. I think max_rows_per_group is the correct choice here. Each call to Write (i.e. one "go") results in


parquet_writer_->WriteTable(*table, batch->num_rows())

so it will create a new parquet row group.

It might also be useful to set min_rows_per_group to row_group_size, but that would be a change in behavior, so maybe we shouldn't do that (the legacy behavior would just write tiny groups in this case).
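For example, a sketch of what setting both bounds at the ds.write_dataset level would look like (the output directory and values are illustrative, not from this thread):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(1_000_000))})

# max_rows_per_group caps the rows handed to each write (and thus each Parquet
# row group); additionally setting min_rows_per_group would batch smaller
# writes up to the target size, which is the behavior change noted above.
ds.write_dataset(
    table,
    "dataset_root",                # hypothetical output directory
    format="parquet",
    max_rows_per_group=64 * 1024,
    min_rows_per_group=64 * 1024,
)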


Krisztian Szucs / @kszucs:
Issue resolved by pull request #12955
