Description
Is your feature request related to a problem?
When writing ZSTD-compressed JSON (NDJSON) files using DataFrame::write_json(), DataFusion uses the default ZSTD compression level (3). There's no way to configure a higher compression level (e.g., level 9) for better compression ratios.
Currently, JsonOptions only exposes:
```rust
pub struct JsonOptions {
    pub compression: CompressionTypeVariant,
    pub schema_infer_max_rec: Option<usize>,
}
```
The underlying async-compression crate supports configurable levels via ZstdEncoder::with_quality(writer, Level::Precise(level)), but DataFusion hardcodes the default:
```rust
// In file_compression_type.rs
CompressionTypeVariant::ZSTD => Box::new(ZstdEncoder::new(w))
```
Describe the solution you'd like
Add an optional compression_level field to JsonOptions and CsvOptions:
```rust
pub struct JsonOptions {
    pub compression: CompressionTypeVariant,
    pub compression_level: Option<u32>, // NEW
    pub schema_infer_max_rec: Option<usize>,
}
```
When specified, pass the level through to the encoder creation in FileCompressionType:
```rust
CompressionTypeVariant::ZSTD => match compression_level {
    Some(level) => Box::new(ZstdEncoder::with_quality(w, Level::Precise(level as i32))),
    None => Box::new(ZstdEncoder::new(w)), // Default level 3
}
```
Use Case
- Storage optimization: Level 9 provides ~10-15% better compression than level 3
- Cost savings: Smaller files = lower S3/GCS storage costs
- Bandwidth: Faster transfers for compressed data over networks
- Consistency with Parquet: Parquet already supports a parquet.compression_level option
ZSTD Level Reference
| Level | Compression Ratio | Speed (MB/s) |
|---|---|---|
| 1 | ~2.8:1 | ~500 |
| 3 | ~3.2:1 | ~350 |
| 9 | ~3.6:1 | ~90 |
| 19 | ~4.0:1 | ~10 |
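Since the proposed field is an Option<u32> that gets cast to i32, a range check before building the encoder might be worth considering. The sketch below assumes the standard ZSTD level range of 1 to 22 used by the reference zstd library. The helper name checked_zstd_level and the constants are hypothetical and do not exist in DataFusion:

```rust
// Assumed bounds for standard (non-negative) ZSTD levels; hypothetical constants.
pub const ZSTD_MIN_LEVEL: u32 = 1;
pub const ZSTD_MAX_LEVEL: u32 = 22;

/// Hypothetical helper: converts a user-supplied level into the i32 that
/// `Level::Precise` expects, rejecting values outside the assumed range.
pub fn checked_zstd_level(level: u32) -> Result<i32, String> {
    if (ZSTD_MIN_LEVEL..=ZSTD_MAX_LEVEL).contains(&level) {
        // The cast cannot overflow because level <= 22.
        Ok(level as i32)
    } else {
        Err(format!(
            "ZSTD compression level {} is outside the supported range {}..={}",
            level, ZSTD_MIN_LEVEL, ZSTD_MAX_LEVEL
        ))
    }
}
```

With this in place, the Some(level) arm could surface a configuration error instead of forwarding an arbitrary value to the encoder. Whether to reject or clamp out-of-range levels is a design choice left open here.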
Additional context
This would align JSON/CSV writers with Parquet, which already supports compression level configuration. The change is backward-compatible since the field is optional and defaults to the current behavior.
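The backward-compatibility argument can be illustrated with a minimal, self-contained sketch. The types below are simplified stand-ins that mirror the names in this issue, not DataFusion's real definitions, and effective_zstd_level is a hypothetical helper introduced only for illustration:

```rust
// Simplified stand-in for illustration; NOT DataFusion's real enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CompressionTypeVariant {
    GZIP,
    ZSTD,
    #[default]
    UNCOMPRESSED,
}

// Simplified stand-in carrying the proposed optional field.
#[derive(Debug, Clone, Default)]
pub struct JsonOptions {
    pub compression: CompressionTypeVariant,
    pub compression_level: Option<u32>, // proposed; None keeps today's behavior
    pub schema_infer_max_rec: Option<usize>,
}

/// ZSTD's default level, as stated in this issue.
pub const ZSTD_DEFAULT_LEVEL: i32 = 3;

/// Hypothetical helper: the level a ZSTD encoder would end up using.
/// Returns None when the options do not select ZSTD at all.
pub fn effective_zstd_level(opts: &JsonOptions) -> Option<i32> {
    match opts.compression {
        CompressionTypeVariant::ZSTD => Some(
            opts.compression_level
                .map(|level| level as i32)
                .unwrap_or(ZSTD_DEFAULT_LEVEL),
        ),
        _ => None,
    }
}
```

Existing callers that build options with ..Default::default() never set compression_level, so under these assumptions they would keep getting level 3, while new callers can opt in with compression_level: Some(9).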
Related Parquet issues for reference:
- Update Default Parquet Write Compression to Sensible Default #7691 - Update Default Parquet Write Compression
- Update Default Parquet Write Compression #7692 - Parquet compression level configuration