Support parquet content-defined chunking options #21408

@alamb

Is your feature request related to a problem or challenge?

This issue is filed to track the changes proposed in #21110.

DataFusion currently does not expose the new Parquet Content-Defined Chunking (CDC) support added in parquet-rs by apache/arrow-rs#9450. Traditional Parquet writing splits data pages at fixed sizes, so inserting or deleting a row causes subsequent pages to shift and can force nearly all bytes to be re-uploaded in content-addressable storage systems.

CDC instead determines page boundaries using a rolling hash over column values, so unchanged data can produce identical pages across writes. This can reduce storage and upload costs and improve deduplication behavior for rewritten datasets.
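To make the difference concrete, here is a minimal, self-contained sketch that contrasts fixed-size splitting with content-defined splitting on a raw byte stream. It is only an illustration of the general technique, not the parquet-rs implementation: the gear-style rolling hash, the `mask_bits` parameter, the default sizes, and the function names (`fixed_chunks`, `cdc_chunks`, `shared_count`) are all assumptions made for this example. It also works on bytes rather than column values and does not model `norm_level`.

```python
import random

# Hypothetical gear table: one pseudo-random 64-bit value per byte value.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    """Split at fixed offsets, the way fixed-size page splitting behaves."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, min_size: int = 256, max_size: int = 4096,
               mask_bits: int = 10) -> list:
    """Split where a rolling hash of the most recent bytes hits a pattern.

    A cut is made once the chunk holds at least min_size bytes and the low
    mask_bits bits of the hash are zero, or when max_size is reached.
    """
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Gear-style update: the low mask_bits bits depend only on the
        # last mask_bits bytes, so boundaries follow content, not offsets.
        h = ((h << 1) + GEAR[b]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_count(old_chunks: list, new_chunks: list) -> int:
    """Count chunks of the new version that already exist in the old one."""
    seen = set(old_chunks)
    return sum(1 for c in new_chunks if c in seen)


if __name__ == "__main__":
    rng = random.Random(42)
    old = bytes(rng.getrandbits(8) for _ in range(200_000))
    new = old[:10] + b"\x00" + old[10:]  # insert one byte near the start

    print("fixed-size chunks reused:",
          shared_count(fixed_chunks(old), fixed_chunks(new)))
    print("content-defined chunks reused:",
          shared_count(cdc_chunks(old), cdc_chunks(new)))
```

In this toy setup, a single inserted byte shifts every fixed-size chunk after it, while the content-defined boundaries realign shortly after the edit, so most later chunks are byte-identical to ones already stored. That reuse is what a content-addressable store can exploit.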

Describe the solution you'd like

Expose the Parquet CDC writer options in DataFusion so users can enable the feature when writing Parquet files.

This should cover the configuration surface introduced upstream in parquet-rs, including:

  • enabling content-defined chunking with the default settings
  • configuring explicit CDC parameters such as min_chunk_size, max_chunk_size, and norm_level

The implementation and rationale are largely derived from apache/arrow-rs#9450, and this issue exists to track carrying those changes through in DataFusion via #21110.

Describe alternatives you've considered

Continue using the existing fixed-size Parquet page splitting behavior and do not expose CDC-related writer options in DataFusion.

That preserves current behavior, but it means users cannot take advantage of the improved page stability and deduplication characteristics now available in parquet-rs.

Additional context

Most of the technical content above is intentionally derived from apache/arrow-rs#9450, with the additional context that this issue tracks the corresponding DataFusion work in #21110.

Labels

enhancement
