[SPARK-57723][CORE] Enable spark.rdd.compress by default like PySpark #56824

Closed
dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-57723

@dongjoon-hyun dongjoon-hyun commented Jun 26, 2026

Member

What changes were proposed in this pull request?

This PR enables spark.rdd.compress by default for non-Python languages. Because Spark Connect supports more and more languages, the core default should match PySpark's behavior for feature parity across all languages. Currently, only Python benefits from this default in the Apache Spark language ecosystem.

"spark.rdd.compress": True,

Why are the changes needed?

Since Apache Spark 2.0, PySpark has enabled spark.rdd.compress by default to compress serialized RDD partitions, saving memory and disk at a small CPU cost. Because PySpark has long defaulted this to true, making it the core default applies the same proven default across all languages.
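
As a rough illustration of the trade-off described above (smaller serialized data in exchange for some CPU work), here is a minimal standalone sketch. It does not use Spark at all: `pickle` stands in for Spark's serializer, `zlib` stands in for the codec Spark would pick via spark.io.compression.codec, and the `partition` records are invented for the example.

```python
import pickle
import zlib

# Hypothetical "partition": repetitive records, which tend to compress well.
partition = [{"id": i, "status": "ok", "region": "us-east"} for i in range(10_000)]

# Serialize first, then compress the serialized bytes.
# pickle and zlib are stand-ins here, not what Spark itself uses.
serialized = pickle.dumps(partition)
compressed = zlib.compress(serialized)

# Reading the data back costs a decompression step before deserialization.
restored = pickle.loads(zlib.decompress(compressed))

print("serialized bytes:", len(serialized))
print("compressed bytes:", len(compressed))
```

The actual savings depend on the data and the codec, so the sizes printed here say nothing about real Spark workloads.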

For the record, Python is the first user-facing language shown on the Apache Spark website, and it already behaves this way.

[Image: Screenshot 2026-06-26 at 13 37 43]

Does this PR introduce any user-facing change?

  • For PySpark users, no.
  • For non-PySpark users, yes. Serialized RDD partitions are now compressed by default. To restore the previous behavior, set spark.rdd.compress to false.
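
The effect of the opt-out above can be modeled with a small sketch, assuming the usual rule that an explicitly set value takes precedence over a built-in default. The dictionaries and the `effective_conf` helper are hypothetical and are not Spark's configuration code.

```python
# Illustrative model only; these dictionaries are not Spark's configuration registry.
CORE_DEFAULTS_BEFORE = {"spark.rdd.compress": "false"}  # core default before this PR
CORE_DEFAULTS_AFTER = {"spark.rdd.compress": "true"}    # core default after this PR


def effective_conf(defaults, user_conf):
    """Return the settings in effect: explicit user values win over defaults."""
    merged = dict(defaults)
    merged.update(user_conf)
    return merged


# No explicit setting: behavior follows the new default (compression on).
no_override = effective_conf(CORE_DEFAULTS_AFTER, {})

# Explicit opt-out: the effective value matches the pre-change default.
opt_out = effective_conf(CORE_DEFAULTS_AFTER, {"spark.rdd.compress": "false"})

print(no_override)
print(opt_out)
```

In a real deployment the value would typically be supplied through the normal configuration channels, for example `--conf spark.rdd.compress=false` on `spark-submit`.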

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

@dongjoon-hyun

Member Author

cc @HyukjinKwon , @cloud-fan , @viirya , @MaxGekk

@dongjoon-hyun

Member Author

Thank you, @MaxGekk . I'll make sure that all CIs pass in this PR.

@dongjoon-hyun

Member Author

Thank you, @viirya !

dongjoon-hyun added a commit that referenced this pull request Jun 27, 2026
### What changes were proposed in this pull request?

This PR enables `spark.rdd.compress` by default for **non-Python** languages. Because `Spark Connect` supports more and more languages, the core default should match PySpark's behavior for feature parity across all languages. Currently, only `Python` benefits from this default in the Apache Spark language ecosystem.

https://github.com/apache/spark/blob/c3e9ea2299c0fb064e913b2c8463ec777ef6cdb4/python/pyspark/core/context.py#L88

- Scala/Java (main repository)
- https://github.com/apache/spark-connect-swift
- https://github.com/apache/spark-connect-rust
- https://github.com/apache/spark-connect-go

### Why are the changes needed?

Since Apache Spark 2.0, PySpark has enabled `spark.rdd.compress` by default to compress serialized RDD partitions, saving memory and disk at a small CPU cost. Because `PySpark` has long defaulted this to `true`, making it the core default applies the same proven default across all languages.

- #1568

For the record, Python is the first user-facing language shown on the Apache Spark website, and it already behaves this way.

- https://spark.apache.org

<img width="1071" height="347" alt="Screenshot 2026-06-26 at 13 37 43" src="https://github.com/user-attachments/assets/cc92c785-9763-44cc-9277-c1b40bf2a0c3" />

### Does this PR introduce _any_ user-facing change?

- For PySpark users, no.
- For non-PySpark users, yes. Serialized RDD partitions are now compressed by default. To restore the previous behavior, set `spark.rdd.compress` to `false`.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

Closes #56824 from dongjoon-hyun/SPARK-57723.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0ac3189)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun

Member Author

Merged to master/4.x for Apache Spark 4.3.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-57723 branch June 27, 2026 01:28