Conversation

@sryza
Contributor

@sryza sryza commented Jun 5, 2015

No description provided.

@JoshRosen
Contributor

I noticed this bottleneck while running some Hive compatibility tests in Spark SQL; I'm kicking off a test run that only runs that suite to see how much of a difference this makes by itself. More generally, I've noticed some performance issues in SQL in places where we didn't broadcast configurations; I'm going to work on gradually tackling these and creating JIRAs for them in the next couple of weeks.

@sryza sryza changed the title SPARK-8135. In SerializableWritable, don't load defaults when instant… SPARK-8135. Don't load defaults when reconstituting Hadoop Configurations Jun 6, 2015
@JoshRosen
Contributor

Is there a fast way to clone Configuration objects that also avoids loading defaults? If so, we may be able to build on that and the changes in this patch to remove a bunch of code for broadcasting Hadoop configurations.

@sryza
Contributor Author

sryza commented Jun 6, 2015

Where do we end up cloning Configuration objects? With these changes, we avoid loading defaults when we reconstitute Configuration objects from bytes. Are there hot paths where we create Configuration objects from other live Configuration objects?
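To make the pattern under discussion concrete, here is a small Python analogue (the PR's actual classes are Scala wrappers around Hadoop's Java `Configuration`; a dict stands in for it here, and `DEFAULTS`/`load_defaults` are illustrative names, not Spark or Hadoop APIs). The wrapped object is excluded from default serialization, only its explicit entries cross the wire, and reconstitution rebuilds the object without re-reading defaults, mirroring `new Configuration(false)`:

```python
import pickle

# Hypothetical stand-in for the values parsed from core-site.xml etc.
DEFAULTS = {"optionA": "default"}

def load_defaults():
    # In Hadoop this re-reads and parses the XML config files --
    # the expensive step the PR avoids on deserialization.
    return dict(DEFAULTS)

class SerializableConfiguration:
    """Analogue of the PR's Scala SerializableConfiguration wrapper."""

    def __init__(self, value):
        self.value = value

    def __getstate__(self):
        # Only the materialized entries cross the wire,
        # like `value.write(out)` in the real class.
        return {"entries": dict(self.value)}

    def __setstate__(self, state):
        # Reconstitute WITHOUT calling load_defaults(), mirroring
        # `new Configuration(false)` on the receiving side.
        self.value = state["entries"]

def roundtrip(obj):
    """Simulate shipping the wrapper to an executor and back."""
    return pickle.loads(pickle.dumps(obj))
```

Note that defaults the driver already materialized still survive the round trip, since they are ordinary entries by the time the object is serialized.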

@JoshRosen
Contributor

Actually, I think that I was misremembering: there's one spot where we optionally clone a broadcasted configuration to work around some thread-safety issue, but that sharing / cloning won't be necessary if we can directly broadcast configurations as part of the tasks rather than as their own broadcast variables.

One correctness question: does this change behavior if an executor consumes a deserialized configuration? If I have an option which inherited environmental defaults on the driver, are those defaults serialized across the wire and used on the executors? I'm just trying to think of whether there is any possibility that this could break things.

@SparkQA

SparkQA commented Jun 6, 2015

Test build #34338 has finished for PR 6679 at commit ca65543.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sryza
Contributor Author

sryza commented Jun 6, 2015

There could be a behavior change in situations where the executor somehow has a different Hadoop configuration file than the driver, but I think it's the right change. I started to explain this abstractly, but it's easier to just put down some examples:

Example 1
core-site.xml on the driver contains optionA->value1
core-site.xml on the executor contains optionA->value2
Old behavior: on the executor, conf.get("optionA") returns value1
New behavior: same as old behavior

Example 2
core-site.xml on the driver does not contain optionA
core-site.xml on the executor contains optionA->value1
Old behavior: on the executor, conf.get("optionA") returns value1
New behavior: on the executor, conf.get("optionA") returns null

I can't find the JIRA, but I believe there was a recent change by @vanzin that made it so that the executor would use a copy of the Hadoop configuration files used on the driver. When that is the case, neither example 1 nor example 2 can occur.
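A toy model of the two behaviors in "Example 2" above, using plain dicts (these names and lookup functions are illustrative, not Spark APIs):

```python
# Driver's core-site.xml has no optionA; the executor's local one does.
driver_entries = {}                        # what gets serialized across the wire
executor_defaults = {"optionA": "value1"}  # executor's local core-site.xml

def get_old(key):
    # Old behavior: deserialization re-loaded defaults on the executor,
    # so local files backed any key missing from the serialized entries.
    return driver_entries.get(key, executor_defaults.get(key))

def get_new(key):
    # New behavior: only the serialized entries exist on the executor.
    return driver_entries.get(key)
```

In "Example 1", where the driver's file also defines the key, the serialized entry wins under both behaviors, which is why only Example 2 can diverge.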

@sryza
Contributor Author

sryza commented Jun 6, 2015

In light of this change, do you think we should remove the broadcasting of Configurations? While we avoid the much larger cost of reading and parsing XML for each task, we would still pay the cost of turning bytes into Configuration objects.

@JoshRosen
Contributor

If we want to remove broadcasting then let's do it in a smaller followup patch to help incrementalize the review and testing.


@sryza
Contributor Author

sryza commented Jun 6, 2015

Makes sense. In that case, this should be ready for review.

@SparkQA

SparkQA commented Jun 6, 2015

Test build #34342 has finished for PR 6679 at commit 71bdc51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

@sryza, I think @vanzin's PR that you were referring to is #4142? It looks like that PR only affects YARN mode, though. Do we have to worry about either of those two example scenarios under standalone mode or Mesos or are we somehow immune to these issues when running on those resource managers? Just wanted to check to be extra-sure that we understand the possible impact of this change.

@sryza
Contributor Author

sryza commented Jun 8, 2015

Ah, yeah, that was the change I was referring to.

I'm not sure about the Mesos deployment model, but in standalone mode at least, it would be possible to manufacture a situation where the behavior changes: the "Example 2" outlined above. This difference relies on config files with different "client" configs being distributed to the client and to the cluster nodes. I would argue that changing behavior in this way is OK, basically because the current behavior is wrong: client configs should come from the client, not from the cluster.

@JoshRosen
Contributor

For your "example 2" scenario, where there's a node-only property that's not overridden by the client, do you think that we might rely on such node-specific properties anywhere in Spark? I'm just wondering whether we should load the cluster node defaults once then overlay the client config on top of them as opposed to only using the client conf.

@vanzin
Contributor

vanzin commented Jun 10, 2015

FYI, the change I made, which you're referencing, only affects the YARN AM. The executors still rely on the configuration broadcast from the driver, even with that change.

That being said, the change LGTM. I'm OK with the slight change in semantics; IMO the new semantics are more correct and easier to reason about than the previous ones. In general, when Hadoop is configured, the nodes doing computation aren't even required to have "client" configuration.

@sryza
Contributor Author

sryza commented Jun 10, 2015

I definitely don't think we rely on it in Spark. On Cloudera setups, as well as presumably Hortonworks and MapR setups, client configurations are synchronized globally across nodes, so this discrepancy couldn't occur.

@JoshRosen
Contributor

Alright, in that case I think we should go ahead and merge this. If it turns out that someone was depending on the existing behavior and complains about it, we can add an internal configuration flag to fall back to the old behavior (at the cost of a slight performance hit).

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34608 has finished for PR 6679 at commit 71bdc51.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

=========================================================================
Running Apache RAT checks
=========================================================================
Attempting to fetch rat
rm: cannot remove `/home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar': No such file or directory
Our attempt to download rat locally to /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar failed. Please install rat manually.

Looks like a Jenkins environment issue of some sort? I'll investigate.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34611 has finished for PR 6679 at commit 71bdc51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34622 timed out for PR 6679 at commit 71bdc51 after a configured wait of 175m.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34642 has finished for PR 6679 at commit 71bdc51.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza
Contributor Author

sryza commented Jun 11, 2015

retest this please

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34660 has finished for PR 6679 at commit 71bdc51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

Contributor

@sryza can you move this into util package?

Contributor

ideally this class should go into util too... but i guess it is too late for that

Contributor

To clarify, this is marked @DeveloperApi and has been for a bunch of releases, which is why we don't want to rename or move it.

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34897 has finished for PR 6679 at commit 254a793.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza sryza force-pushed the sandy-spark-8135 branch from 254a793 to 3f1b865 Compare June 16, 2015 00:15
@SparkQA

SparkQA commented Jun 16, 2015

Test build #34962 has finished for PR 6679 at commit 3f1b865.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza sryza force-pushed the sandy-spark-8135 branch from 3f1b865 to c5554ff Compare June 19, 2015 00:08
@sryza
Contributor Author

sryza commented Jun 19, 2015

@JoshRosen this should be ready for merge

@JoshRosen
Contributor

@sryza SGTM. I'll merge once this passes tests.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35197 has finished for PR 6679 at commit c5554ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

LGTM, so I'm going to merge this into master. Per our discussion, if someone asks, we'll add a fallback path guarded by a flag in case anyone is adversely affected by this change and cannot change their deployment environment.

@asfgit asfgit closed this in 43f50de Jun 19, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…tions

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6679 from sryza/sandy-spark-8135 and squashes the following commits:

c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
@watermen
Contributor

This commit causes a NullPointerException in standalone mode; after reverting it, everything works.

spark-sql> select count(*) from src;
15/06/26 14:27:39 ERROR TaskSetManager: Task 5 in stage 1.0 failed 4 times; aborting job
15/06/26 14:27:39 ERROR SparkSQLDriver: Failed in [select count(*) from src_100]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 20, 9.96.1.54): java.lang.NullPointerException
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:658)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:175)

@JoshRosen
Contributor

@watermen, @sryza and I are investigating this over at https://issues.apache.org/jira/browse/SPARK-8623

@sryza
Contributor Author

sryza commented Jun 26, 2015

Hi @watermen, thanks for reporting this. Does the error occur every time or just occasionally? What Hadoop version are you running?

@watermen
Contributor

@sryza Hadoop 2.4.1, every time, in standalone mode. Thanks for your work!
