Is your feature request related to a problem?
ArrowWriter must buffer every column's compressed pages for an entire row group before it can splice them into contiguous column chunks at flush, so peak memory is ≈ Σ(compressed bytes of all column chunks) and grows with row group size. On wide, skewed schemas (e.g. ~400 columns, some columns far larger than others) this can consume >=12 GB of memory just to write Parquet. The only existing lever is flushing smaller row groups via in_progress_size, which trades away compression and read-time page/row-group pruning.
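As a rough illustration of the memory model described above, here is a minimal Rust sketch of the arithmetic. It is not code from the parquet crate: the function names `peak_buffered_bytes` and `peak_with_row_group_split` are hypothetical, and the second function assumes compressed size scales linearly with row count, which real data only approximates.

```rust
/// Hypothetical model (not a parquet crate API): a writer that holds every
/// column's compressed pages until the row group is flushed buffers, at peak,
/// the sum of all compressed column chunk sizes.
fn peak_buffered_bytes(compressed_chunk_bytes: &[u64]) -> u64 {
    compressed_chunk_bytes.iter().sum()
}

/// Hypothetical model of the existing lever: writing the same rows as `parts`
/// smaller row groups. Assumes each chunk's compressed size shrinks linearly
/// with its row count, so each chunk is divided by `parts` (rounded up).
fn peak_with_row_group_split(compressed_chunk_bytes: &[u64], parts: u64) -> u64 {
    assert!(parts > 0, "parts must be at least 1");
    compressed_chunk_bytes
        .iter()
        .map(|bytes| (bytes + parts - 1) / parts)
        .sum()
}
```

Under this model the peak falls only in proportion to how much smaller the row groups get, and the largest columns still dominate the total, which is why shrinking row groups is an expensive way to bound memory.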
Describe the solution you'd like
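Some way for ArrowWriter to buffer a row group's compressed pages somewhere other than memory, so that peak memory no longer grows with row group size.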
Describe alternatives you've considered
Reducing row group size to limit buffering, but this sacrifices encoding efficiency and read performance and doesn't address the underlying coupling where one column's size forces the page layout of others.
Additional context
Related issues:
ArrowWriter#5484