
[SPARK-35181][CORE] Use zstd for spark.io.compression.codec by default #32286

Closed · wants to merge 3 commits

Conversation

@dongjoon-hyun (Member) commented Apr 22, 2021

What changes were proposed in this pull request?

This PR aims to use `zstd` as `spark.io.compression.codec` instead of `lz4` in order to reduce disk IO and traffic during shuffle processing and worker decommission storage migration (between executors and to external storage).

  • Since SPARK-29434 and SPARK-29576, Apache Spark 3.0+ uses ZSTD for `spark.shuffle.mapStatus.compression.codec` by default instead of GZIP.
  • Since SPARK-34503, Apache Spark 3.2 uses ZSTD for `spark.eventLog.compression.codec` by default instead of LZ4.
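Users who want to keep the previous behavior (for example, for CPU-bound jobs, as discussed below) can pin the codec explicitly. A minimal sketch; the property name is the real Spark config this PR changes, while the submit command line and application script are illustrative placeholders:

```
# Pin the previous codec for a single application at submit time
# (my_app.py is a placeholder):
#   spark-submit --conf spark.io.compression.codec=lz4 my_app.py

# Or cluster-wide, by adding one line to conf/spark-defaults.conf:
spark.io.compression.codec   lz4
```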

Why are the changes needed?

To reduce the disk footprint. For the TPCDS 3TB case, zstd yields 44% less shuffle write size and 43% less shuffle read size.
For some cases, query execution with zstd IO is 20% faster than with lz4 IO.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

@SparkQA

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42304/

@SparkQA

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42304/

@MaxGekk (Member) left a comment

> to reduce the disk IOs and traffic during shuffle

@dongjoon-hyun What about CPU load? If the user's job is CPU bound, then zstd can introduce a performance regression in the case where zstd consumes more CPU.

@SparkQA

SparkQA commented Apr 22, 2021

Test build #137777 has finished for PR 32286 at commit b8266fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Seems OK for consistency and maybe better speed; we should mention it in a migration guide.

@@ -52,6 +52,8 @@ class AdaptiveQueryExecSuite

import testImplicits._

override protected def sparkConf = super.sparkConf.set("spark.io.compression.codec", "lz4")
Member

Should this be defined more like a method with braces, etc., or is that how similar code does it?

Member Author

Sorry for the late reply.

For this one, we can use the simple form like the other places:

$ git grep 'override protected def sparkConf' | grep super
sql/core/src/test/scala/org/apache/spark/sql/AggregateHashMapSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/AggregateHashMapSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/AggregateHashMapSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/connector/FileDataSourceV2FallBackSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf.set(SQLConf.USE_V1_SOURCE_LIST, "")
sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala:  override protected def sparkConf = super.sparkConf.set("spark.io.compression.codec", "lz4")
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableRecoverPartitionsSuite.scala:  override protected def sparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableRecoverPartitionsSuite.scala:  override protected def sparkConf = super.sparkConf
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcPartitionDiscoverySuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf.set(SQLConf.USE_V1_SOURCE_LIST, "")
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala:  override protected def sparkConf: SparkConf = super.sparkConf

Or, like the following~

  override protected def sparkConf: SparkConf =
    super
      .sparkConf
      .set(SQLConf.USE_V1_SOURCE_LIST, "parquet")

@SparkQA

SparkQA commented May 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42828/

@SparkQA

SparkQA commented May 9, 2021

Test build #138306 has finished for PR 32286 at commit e89681b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented May 9, 2021

retest this please

@SparkQA

SparkQA commented May 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42830/

@SparkQA

SparkQA commented May 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42830/

@SparkQA

SparkQA commented May 9, 2021

Test build #138308 has finished for PR 32286 at commit e89681b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali
Contributor

LucaCanali commented May 21, 2021

I have tested this with a few runs of TPCDS query q64, which is shuffle-intensive. Indeed, I see a very important reduction of shuffle write and read bytes (about -40%, as reported by @dongjoon-hyun). shuffleWriteTime and shuffleFetchWaitTime are also improved (about -25% and -75%) in my test.
I also measured an increase of CPU usage, though, of about 10% in my test. (Note: q64 in this test, performed using TPCDS scale 1500 and Spark 3.2.0-SNAPSHOT from master 20210510, reads 473 GB of shuffle data when using lz4, which is compressed to 290 GB with zstd. I would imagine the CPU overhead to be proportional to the amount of shuffle data that is compressed/decompressed.)
Overall, the query execution time in my setup/test was basically unchanged, considering the test measurement noise.
It still seems worth doing this because of the large improvement in the shuffle footprint; however, the increase in CPU usage should be noted, as pointed out by @MaxGekk.
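The trade-off measured above (smaller shuffle data for more CPU) can be sketched with Python's standard library. Since lz4 and zstd bindings are not in the stdlib, zlib at level 1 and lzma stand in for a fast codec and a high-ratio codec respectively; this is an illustrative assumption, not Spark's actual codec path:

```python
# Illustrate the compression-ratio vs. CPU-time trade-off.
# zlib level 1 plays the "fast" codec (lz4-like role);
# lzma plays the "high-ratio" codec (zstd-like role).
import time
import zlib
import lzma

payload = b"shuffle block " * 100_000  # highly compressible sample data


def bench(name, compress):
    """Compress the payload once, reporting size and wall-clock time."""
    start = time.perf_counter()
    out = compress(payload)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out)} bytes in {elapsed:.4f}s")
    return len(out)


fast_size = bench("fast codec (zlib level 1)", lambda d: zlib.compress(d, 1))
slow_size = bench("high-ratio codec (lzma)", lambda d: lzma.compress(d))

# The higher-ratio codec shrinks the data more but burns more CPU time,
# mirroring the smaller-shuffle-footprint / higher-CPU result reported above.
assert slow_size <= fast_size
```

The absolute numbers depend on the machine and data; only the direction of the trade-off (smaller output, more CPU) carries over to the lz4-vs-zstd comparison.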

@srowen (Member) left a comment

I think this is probably a good idea, to trade a little more CPU for less I/O and storage. @dongjoon-hyun I'm a little concerned about changing a fairly fundamental default. Would you put this in 3.2.0? Not out of the question IMHO; it just needs to be in a migration guide, I think, and is maybe worth one more ping to dev@ to solicit input.

@dongjoon-hyun
Member Author

Thank you for your feedback, @LucaCanali and @srowen.

@SparkQA

SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44473/

@SparkQA

SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44473/

@SparkQA

SparkQA commented Jun 17, 2021

Test build #139946 has finished for PR 32286 at commit a5d26a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44581/

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44581/

@SparkQA

SparkQA commented Jun 21, 2021

Test build #140053 has finished for PR 32286 at commit ee50d33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44609/

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44609/

@SparkQA

SparkQA commented Jun 21, 2021

Test build #140081 has finished for PR 32286 at commit 0a745b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) left a comment

Just some minor queries; I like the change.

@@ -66,6 +66,7 @@ class CoalesceShufflePartitionsSuite extends SparkFunSuite with BeforeAndAfterAl
.setAppName("test")
.set(UI_ENABLED, false)
.set(IO_ENCRYPTION_ENABLED, enableIOEncryption)
.set(IO_COMPRESSION_CODEC.key, "lz4")
Contributor

Curious, does this test require it to be lz4?
Same for AdaptiveQueryExecSuite below - why not rely on the new default value of zstd?

@dongjoon-hyun
Member Author

Thank you for the review, @mridulm. The change is required because the UT depends on results based on the intermediate statistics of the query.

@mridulm
Contributor

mridulm commented Jun 27, 2021

Thanks for clarifying, @dongjoon-hyun!
I am guessing this is due to the use of spark.sql.adaptive.advisoryPartitionSizeInBytes?
Sounds good to continue using lz4 to preserve current behavior.
We can always modify the test in a later PR to adapt targetPostShuffleInputSize for the new codec.

@dongjoon-hyun
Member Author

Thank you. Sure, @mridulm.

@mridulm
Contributor

mridulm commented Jun 28, 2021

Looks good to me (pending other review comments, of course).
Why is this still a draft, btw? Are we still testing this or waiting for other feedback/eval?

@dongjoon-hyun
Member Author

Thank you for your comments, @mridulm. I'm watching the stability of GitHub Actions.
As you know, ZStandard 1.5.0 recently landed on the master branch, and it seems to increase memory usage and cause GitHub Actions UT failures.

@SparkQA

SparkQA commented Jun 29, 2021

Test build #140387 has started for PR 32286 at commit 70cb336.

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44910/

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140554 has finished for PR 32286 at commit 15c5818.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45066/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45066/

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 12, 2021
@github-actions github-actions bot closed this Oct 13, 2021