[SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf #26611

Ngone51 · 2019-11-20T16:54:26Z

What changes were proposed in this pull request?

Add a new conf named spark.shuffle.mapStatus.compression.codec for user to decide which codec should be used(default by zstd) for MapStatus compression.

Why are the changes needed?

We already have this functionality for broadcast/rdd/shuffle/shuflleSpill,
so it might be better to have the same functionality for MapStatus as well.

Does this PR introduce any user-facing change?

Yes, user now could use spark.shuffle.mapStatus.compression.codec to decide which codec should be used during MapStatus compression.

How was this patch tested?

N/A

dongjoon-hyun · 2019-11-20T17:24:19Z

cc @dbtsai , @tgravescs , @gatorsmile

dongjoon-hyun · 2019-11-20T17:25:57Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

@@ -1016,6 +1016,13 @@ package object config {
      .booleanConf
      .createWithDefault(true)

+  private[spark] val MAP_STATUS_COMPRESS =
+    ConfigBuilder("spark.shuffle.mapStatus.compress")


Ur, @gatorsmile , is this what you asked?

dongjoon-hyun

@Ngone51 .
We compress this always. Previously, it was Gzip. Now, master branch uses ZStd.
In this PR, it seems that you are switching to lz4 which is the worst compression codec in this case.

IIRC, @gatorsmile ask to a configuration to choose the codec. Not to choose the shouldCompress or Not. I'm a little confused.

dongjoon-hyun · 2019-11-20T17:26:59Z

core/src/main/scala/org/apache/spark/MapOutputTracker.scala

-    val objOut = new ObjectOutputStream(codec.compressedOutputStream(out))
+    val shouldCompress = conf.get(MAP_STATUS_COMPRESS)
+    val codec = if (shouldCompress) {
+      Some(CompressionCodec.createCodec(conf, conf.get(IO_COMPRESSION_CODEC.key, "zstd")))


This is a regression, @Ngone51 . IO_COMPRESSION_CODEC is lz4 by default.

hmmm... actually, I'm indicated to use zstd by default in this way. There's difference between conf.get(IO_COMPRESSION_CODEC.key, "zstd") and conf.get(IO_COMPRESSION_CODEC), right ? Or something wrong with the former one ?

SparkQA · 2019-11-20T19:26:28Z

Test build #114172 has finished for PR 26611 at commit e1cd83f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 · 2019-11-21T01:30:51Z

Not to choose the shouldCompress or Not.

Yeah. Actually, I have the same mind. Let me revert this.

cloud-fan · 2019-11-21T04:18:22Z

can we follow

private[spark] val EVENT_LOG_COMPRESSION_CODEC =
    ConfigBuilder("spark.eventLog.compression.codec")
      .doc("The codec used to compress event log. By default, Spark provides four codecs: " +
        "lz4, lzf, snappy, and zstd. You can also use fully qualified class names to specify " +
        "the codec. If this is not given, spark.io.compression.codec will be used.")
      .fallbackConf(IO_COMPRESSION_CODEC)

to create a new config for map status? then we can set the default value correctly.

SparkQA · 2019-11-21T04:25:16Z

Test build #114186 has finished for PR 26611 at commit 3d35431.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-21T04:38:19Z

Yes. @cloud-fan 's example is what we need, @Ngone51 ~

cloud-fan · 2019-11-21T06:35:20Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

+      .doc("The codec used to compress MapStatus, which is generated by ShuffleMapTask. " +
+        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
+        "use fully qualified class names to specify the codec. If this is not given, " +
+        "spark.io.compression.codec will be used.")


we can remove this sentence if we don't fallback to spark.io.compression.codec

SparkQA · 2019-11-21T08:05:02Z

Test build #114209 has finished for PR 26611 at commit b4a2daa.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-21T08:05:02Z

Test build #114212 has finished for PR 26611 at commit 6920dfc.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-21T08:27:51Z

retest this please

SparkQA · 2019-11-21T11:24:24Z

Test build #114222 has finished for PR 26611 at commit 6920dfc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-11-21T12:57:39Z

When would you want to set this separately? i'm just curious how often varying the codec for all these cases helps

cloud-fan · 2019-11-21T13:54:41Z

Personally I like fewer configs as well. Shall we use a fixed codec for event log too (added in #24921)? We should be consistent internally.

dongjoon-hyun · 2019-11-21T23:11:53Z

@cloud-fan and @srowen . IIUC, @gatorsmile asked this as a way to fallback to gzip if we face any issues at ZStd codec.

I also prefer a single best default implementation without configuration, but there is a reason we used gzip here before instead of the general lz4. It's because of the broadcasting. And, zstd was able to supersede the previous gzip because it shows faster and smaller result in the microbenchmark.

dongjoon-hyun · 2019-11-21T23:13:31Z

Please confirm the reason of your request, @gatorsmile .

cloud-fan · 2019-11-22T02:40:51Z

If it's justified for event log, then it's justified for map status as well. I think we should also give a different value for the event log codec.

A possible use case: people develop a custom codec for RDD/shuffle and set it with spark.io.compression.codec. But this codec doesn't work well with event log/map status, so ideally codec of these two should not be affected by spark.io.compression.codec

cloud-fan · 2019-11-22T02:41:16Z

We can also mark these configs as interval if we don't expect users to change it.

core/src/main/scala/org/apache/spark/internal/config/package.scala

dongjoon-hyun · 2019-11-23T19:26:04Z

Yes. The previous my comment is about MapStatus.

If it's justified for event log, then it's justified for map status as well.

Also, yes. #24921 allows it for event log codec.

I think we should also give a different value for the event log codec.

Thanks.

cloud-fan · 2019-11-25T04:14:16Z

LGTM

SparkQA · 2019-11-25T05:19:23Z

Test build #114370 has finished for PR 26611 at commit b64e68d.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-25T05:20:23Z

All test passed. The failure is CRAN incoming feasibility again.

 checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 26] do not match the length of object [0]
Execution halted

cc @viirya , @HyukjinKwon .

dongjoon-hyun

+1, LGTM. Merged to master.
Thank you all.

AmplabJenkins · 2019-11-25T05:21:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/114370/
Test FAILed.

Ngone51 · 2019-11-25T08:22:32Z

Thanks all!

HyukjinKwon

LGTM too

tgravescs · 2019-11-25T14:39:02Z

I was out of the office last week, it seems odd to me that we added this config but then we left it internal, what was the reasoning? If we think it's something worth changing, why not expose it. if a user has a problem with Zstd they have no way of knowing about this config unless they go through code or some other spark expert tells them about it.

srowen · 2019-11-25T15:09:41Z

It's an interesting philosophical question. If it's internal, why can users set it at all? I think it means something more like "developer API": use it if you know what you're doing, and it's subject to change without notice. It's not documented because, well, the audience is often reading the source code or consulting contributors directly for advanced, niche needs. These also often exist as safety-valve flags that are really not intended for use ever, unless something unforeseen goes wrong and we have to advertise a quick fix. So yeah, the semantics are a little fuzzy. Given that I generally prefer fewer configs anyway, anything we can do to keep this limited to the niche use cases that might need it probably helps avoid the info overload that is the Spark config page.

tgravescs · 2019-11-25T15:26:47Z

Yeah I definitely get that and we have left configs internal in the past, it just seemed here the only reason I saw was in case there is issue with Zstd and if we really think that could happen or has happened then maybe it shouldn't be internal. if there hasn't been then did we really need this config at all like was mentioned in original pr discussion.

Anyway, I'm fine with it just wondering if more reasoning behind it. We can always open it up later.

HyukjinKwon · 2019-11-25T23:20:43Z

Using difficult compression codec usually has its general purpose I think such as the efficiency given the pattern of the data. I dont believe zstd is always better in all cases. However, yeah I agree that it is important to describe the reason in general, and it better had to be clarified.

Ngone51 added 2 commits November 21, 2019 00:32

MapOutputStatus compress conf

1fb6dca

decompress

e1cd83f

dongjoon-hyun reviewed Nov 20, 2019

View reviewed changes

dongjoon-hyun requested changes Nov 20, 2019

View reviewed changes

dongjoon-hyun added the SPARK CORE label Nov 20, 2019

revert the config

3d35431

ADD CONF

b4a2daa

cloud-fan reviewed Nov 21, 2019

View reviewed changes

reword conf doc

6920dfc

gatorsmile reviewed Nov 22, 2019

View reviewed changes

core/src/main/scala/org/apache/spark/internal/config/package.scala Show resolved Hide resolved

dongjoon-hyun changed the title ~~[SPARK-29939][CORE] Add a conf for CompressionCodec for Ser/Deser of MapOutputStatus~~ [SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf Nov 23, 2019

mark the conf as internal

b64e68d

dongjoon-hyun approved these changes Nov 25, 2019

View reviewed changes

dongjoon-hyun closed this in 456cfe6 Nov 25, 2019

HyukjinKwon reviewed Nov 25, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf #26611

[SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf #26611

Ngone51 commented Nov 20, 2019 •

edited

dongjoon-hyun commented Nov 20, 2019

dongjoon-hyun Nov 20, 2019 •

edited

dongjoon-hyun left a comment

dongjoon-hyun Nov 20, 2019

Ngone51 Nov 21, 2019

SparkQA commented Nov 20, 2019

Ngone51 commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

SparkQA commented Nov 21, 2019

dongjoon-hyun commented Nov 21, 2019

cloud-fan Nov 21, 2019

Ngone51 Nov 21, 2019

SparkQA commented Nov 21, 2019

SparkQA commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

SparkQA commented Nov 21, 2019

srowen commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

dongjoon-hyun commented Nov 21, 2019 •

edited

dongjoon-hyun commented Nov 21, 2019

cloud-fan commented Nov 22, 2019

cloud-fan commented Nov 22, 2019

dongjoon-hyun commented Nov 23, 2019 •

edited

cloud-fan commented Nov 25, 2019

SparkQA commented Nov 25, 2019

dongjoon-hyun commented Nov 25, 2019

dongjoon-hyun left a comment

AmplabJenkins commented Nov 25, 2019

Ngone51 commented Nov 25, 2019

HyukjinKwon left a comment

tgravescs commented Nov 25, 2019

srowen commented Nov 25, 2019

tgravescs commented Nov 25, 2019

HyukjinKwon commented Nov 25, 2019 •

edited

[SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf #26611

[SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf #26611

Conversation

Ngone51 commented Nov 20, 2019 • edited

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

dongjoon-hyun commented Nov 20, 2019

dongjoon-hyun Nov 20, 2019 • edited

Choose a reason for hiding this comment

dongjoon-hyun left a comment

Choose a reason for hiding this comment

dongjoon-hyun Nov 20, 2019

Choose a reason for hiding this comment

Ngone51 Nov 21, 2019

Choose a reason for hiding this comment

SparkQA commented Nov 20, 2019

Ngone51 commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

SparkQA commented Nov 21, 2019

dongjoon-hyun commented Nov 21, 2019

cloud-fan Nov 21, 2019

Choose a reason for hiding this comment

Ngone51 Nov 21, 2019

Choose a reason for hiding this comment

SparkQA commented Nov 21, 2019

SparkQA commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

SparkQA commented Nov 21, 2019

srowen commented Nov 21, 2019

cloud-fan commented Nov 21, 2019

dongjoon-hyun commented Nov 21, 2019 • edited

dongjoon-hyun commented Nov 21, 2019

cloud-fan commented Nov 22, 2019

cloud-fan commented Nov 22, 2019

dongjoon-hyun commented Nov 23, 2019 • edited

cloud-fan commented Nov 25, 2019

SparkQA commented Nov 25, 2019

dongjoon-hyun commented Nov 25, 2019

dongjoon-hyun left a comment

Choose a reason for hiding this comment

AmplabJenkins commented Nov 25, 2019

Ngone51 commented Nov 25, 2019

HyukjinKwon left a comment

Choose a reason for hiding this comment

tgravescs commented Nov 25, 2019

srowen commented Nov 25, 2019

tgravescs commented Nov 25, 2019

HyukjinKwon commented Nov 25, 2019 • edited

Ngone51 commented Nov 20, 2019 •

edited

dongjoon-hyun Nov 20, 2019 •

edited

dongjoon-hyun commented Nov 21, 2019 •

edited

dongjoon-hyun commented Nov 23, 2019 •

edited

HyukjinKwon commented Nov 25, 2019 •

edited