Add Compression Level Configuration for JSON/CSV Output #18947

@Smotrov

Is your feature request related to a problem?

When writing ZSTD-compressed JSON (NDJSON) files using DataFrame::write_json(), DataFusion uses the default ZSTD compression level (3). There's no way to configure a higher compression level (e.g., level 9) for better compression ratios.

Currently, JsonOptions only exposes:

pub struct JsonOptions {
    pub compression: CompressionTypeVariant,
    pub schema_infer_max_rec: Option<usize>,
}

The underlying async-compression crate supports configurable levels via ZstdEncoder::with_quality(writer, Level::Precise(level)), but DataFusion hardcodes the default:

// In file_compression_type.rs
CompressionTypeVariant::ZSTD => Box::new(ZstdEncoder::new(w))

Describe the solution you'd like

Add an optional compression_level field to JsonOptions and CsvOptions:

pub struct JsonOptions {
    pub compression: CompressionTypeVariant,
    pub compression_level: Option<u32>,  // NEW
    pub schema_infer_max_rec: Option<usize>,
}

When specified, pass the level through to the encoder creation in FileCompressionType:

CompressionTypeVariant::ZSTD => match compression_level {
    Some(level) => Box::new(ZstdEncoder::with_quality(w, Level::Precise(level as i32))),
    None => Box::new(ZstdEncoder::new(w)),  // Default level 3
}
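
One detail the match above leaves open is validation: `level as i32` accepts any `u32` without checking it. Below is a minimal sketch of a range check, assuming the regular zstd level range of 1 to 22 (level 0 means "use the default" in zstd, and negative "fast" levels cannot be expressed as a `u32`). `resolve_zstd_level` and the two constants are hypothetical names for illustration, not existing DataFusion API.

```rust
/// Lowest regular zstd level; 0 is excluded because zstd treats it as "use the default".
const ZSTD_MIN_LEVEL: u32 = 1;
/// Highest regular zstd level in the reference implementation.
const ZSTD_MAX_LEVEL: u32 = 22;

/// Hypothetical helper: validates an optional user-supplied level and converts it
/// to the `i32` that `Level::Precise` expects. `None` keeps the encoder default.
fn resolve_zstd_level(level: Option<u32>) -> Result<Option<i32>, String> {
    match level {
        // No level configured: caller falls back to `ZstdEncoder::new(w)`.
        None => Ok(None),
        // In range: the cast is lossless because the value is at most 22.
        Some(l) if (ZSTD_MIN_LEVEL..=ZSTD_MAX_LEVEL).contains(&l) => Ok(Some(l as i32)),
        // Out of range: report it instead of passing a surprising value downstream.
        Some(l) => Err(format!(
            "invalid zstd compression level {}: expected {}..={}",
            l, ZSTD_MIN_LEVEL, ZSTD_MAX_LEVEL
        )),
    }
}
```

Rejecting out-of-range values when the options are parsed would surface configuration mistakes early, though clamping to the valid range would be an equally reasonable design choice.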

Use Case

  • Storage optimization: Level 9 provides ~10-15% better compression than level 3
  • Cost savings: Smaller files = lower S3/GCS storage costs
  • Bandwidth: Faster transfers for compressed data over networks
  • Consistency with Parquet: Parquet already supports a parquet.compression_level option

ZSTD Level Reference

| Level | Compression Ratio | Speed (MB/s) |
|-------|-------------------|--------------|
| 1     | ~2.8:1            | ~500         |
| 3     | ~3.2:1            | ~350         |
| 9     | ~3.6:1            | ~90          |
| 19    | ~4.0:1            | ~10          |
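
As a quick cross-check of the "~10-15%" estimate against these figures, the sketch below converts two compression ratios into a relative ratio gain and a file-size reduction. The inputs are the approximate values listed above, not new measurements, and `ratio_gain` and `size_reduction` are hypothetical helper names.

```rust
/// Relative improvement in compression ratio when moving from `base` to `better`.
fn ratio_gain(base: f64, better: f64) -> f64 {
    better / base - 1.0
}

/// Fraction by which the compressed output shrinks for the same input.
/// Compressed size is proportional to 1 / ratio, so the new size is
/// `base / better` of the old one.
fn size_reduction(base: f64, better: f64) -> f64 {
    1.0 - base / better
}
```

With the level 3 and level 9 ratios (3.2 and 3.6), `ratio_gain(3.2, 3.6)` is 0.125 (12.5%) and `size_reduction(3.2, 3.6)` is about 0.111 (11.1% smaller files), both within the range quoted under Use Case.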

Additional context

This would align JSON/CSV writers with Parquet, which already supports compression level configuration. The change is backward-compatible since the field is optional and defaults to the current behavior.

Related Parquet issues for reference:
