[SPARK-57723][CORE] Enable `spark.rdd.compress` by default like PySpark by dongjoon-hyun · Pull Request #56824 · apache/spark

dongjoon-hyun · 2026-06-26T20:18:12Z

What changes were proposed in this pull request?

This PR aims to enable spark.rdd.compress by default for non-Python languages. Because we are supporting more and more languages via Spark Connect, we had better match PySpark's behavior for feature parity across all languages. Currently, only Python has this advantage in the Apache Spark language ecosystem.

spark/python/pyspark/core/context.py

Line 88 in c3e9ea2

"spark.rdd.compress": True,

Scala/Java (main repository)
https://github.com/apache/spark-connect-swift
https://github.com/apache/spark-connect-rust
https://github.com/apache/spark-connect-go

Why are the changes needed?

Since Apache Spark 2.0, PySpark has enabled spark.rdd.compress by default to compress serialized RDD partitions, stably saving memory and disk at a small CPU cost. Because PySpark has long defaulted this to true, making it the core default applies the same proven default across all languages.
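The memory-versus-CPU trade-off described above can be illustrated with a small standalone sketch. It does not use Spark: `pickle` and `zlib` from the Python standard library stand in for Spark's serializer and for the codec selected by `spark.io.compression.codec`, and the sample partition and helper names are hypothetical.

```python
import pickle
import zlib


def serialized_sizes(partition):
    """Return (raw_bytes, compressed_bytes) for one partition-like list.

    pickle and zlib are stand-ins for Spark's serializer and codec;
    they are assumptions made only for this illustration.
    """
    raw = pickle.dumps(partition, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(raw)
    return len(raw), len(compressed)


def roundtrip(partition):
    """Serialize, compress, decompress, and deserialize a partition."""
    raw = pickle.dumps(partition, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(zlib.decompress(zlib.compress(raw)))


# Hypothetical partition with repetitive keys, which tends to compress well.
partition = [("user-%d" % (i % 50), i % 7) for i in range(10_000)]

raw_size, compressed_size = serialized_sizes(partition)
print(f"raw={raw_size} bytes, compressed={compressed_size} bytes")
```

The extra `zlib.compress` and `zlib.decompress` calls are the CPU cost; the smaller byte count is the memory and disk saving. Actual ratios in Spark depend on the data, the serializer, and the configured codec.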

[SPARK-2652] [PySpark] Turning some default configs for PySpark #1568

For the record, Python is the first user-facing language shown on the Apache Spark website, and its behavior is the same.

https://spark.apache.org

Does this PR introduce any user-facing change?

For PySpark users, no.
For non-PySpark users, yes. Serialized RDD partitions are now compressed by default. To restore the previous behavior, set spark.rdd.compress to false.
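As a minimal sketch of restoring the previous behavior for a Scala or Java application, the helper below assembles a `spark-submit` command line that passes the setting through `--conf`. The helper name, the jar name `my-app.jar`, and the class `com.example.Main` are hypothetical placeholders, not part of this PR.

```python
import shlex


def spark_submit_command(app_jar, main_class, rdd_compress):
    """Build a spark-submit argument list with spark.rdd.compress set explicitly."""
    conf_value = "true" if rdd_compress else "false"
    return [
        "spark-submit",
        "--class", main_class,
        "--conf", f"spark.rdd.compress={conf_value}",
        app_jar,
    ]


# Hypothetical application; rdd_compress=False opts out of the new default.
cmd = spark_submit_command("my-app.jar", "com.example.Main", rdd_compress=False)
print(shlex.join(cmd))
```

The same key can also be set in `spark-defaults.conf` or on a `SparkConf` before the context is created.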

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

dongjoon-hyun · 2026-06-26T20:27:56Z

cc @HyukjinKwon , @cloud-fan , @viirya , @MaxGekk

dongjoon-hyun · 2026-06-26T20:48:38Z

Thank you, @MaxGekk . I'll make sure that all CIs pass in this PR.

dongjoon-hyun · 2026-06-26T21:44:24Z

Thank you, @viirya !

Closes #56824 from dongjoon-hyun/SPARK-57723.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0ac3189)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2026-06-27T01:28:13Z

Merged to master/4.x for Apache Spark 4.3.0.

[SPARK-57723][CORE] Enable spark.rdd.compress by default like PySpark

fb177ec

MaxGekk approved these changes Jun 26, 2026


viirya approved these changes Jun 26, 2026


Fix BlockManagerSuite

dee6901

dongjoon-hyun closed this in 0ac3189 Jun 27, 2026

dongjoon-hyun deleted the SPARK-57723 branch June 27, 2026 01:28

dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-57723

3 participants
