Conversation

@sryza
Contributor

@sryza sryza commented Jun 5, 2015

No description provided.

@JoshRosen
Contributor

I noticed this bottleneck while running some Hive compatibility tests in Spark SQL; I'm kicking off a test run that only runs that suite to see how much of a difference this makes by itself. More generally, I've noticed some performance issues in SQL in places where we didn't broadcast configurations; I'm going to work on gradually tackling these and creating JIRAs for them in the next couple of weeks.

@sryza sryza changed the title SPARK-8135. In SerializableWritable, don't load defaults when instant… SPARK-8135. Don't load defaults when reconstituting Hadoop Configurations Jun 6, 2015
@JoshRosen
Contributor

Is there a fast way to clone Configuration objects that also avoids loading defaults? If so, we may be able to build on that and the changes in this patch to remove a bunch of code for broadcasting Hadoop configurations.

@sryza
Contributor Author

sryza commented Jun 6, 2015

Where do we end up cloning Configuration objects? With these changes, we avoid loading defaults when we reconstitute Configuration objects from bytes. Are there hot paths where we create Configuration objects from other live Configuration objects?
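To make the pattern under discussion concrete, here is a small Python analogue (the PR's actual classes are Scala wrappers around Hadoop's Java `Configuration`; a dict stands in for it here, and `DEFAULTS`/`load_defaults` are illustrative names, not Spark or Hadoop APIs). The wrapped object is excluded from default serialization, only its explicit entries cross the wire, and reconstitution rebuilds the object without re-reading defaults, mirroring `new Configuration(false)`:

```python
import pickle

# Hypothetical stand-in for the values parsed from core-site.xml etc.
DEFAULTS = {"optionA": "default"}

def load_defaults():
    # In Hadoop this re-reads and parses the XML config files --
    # the expensive step the PR avoids on deserialization.
    return dict(DEFAULTS)

class SerializableConfiguration:
    """Analogue of the PR's Scala SerializableConfiguration wrapper."""

    def __init__(self, value):
        self.value = value

    def __getstate__(self):
        # Only the materialized entries cross the wire,
        # like `value.write(out)` in the real class.
        return {"entries": dict(self.value)}

    def __setstate__(self, state):
        # Reconstitute WITHOUT calling load_defaults(), mirroring
        # `new Configuration(false)` on the receiving side.
        self.value = state["entries"]

def roundtrip(obj):
    """Simulate shipping the wrapper to an executor and back."""
    return pickle.loads(pickle.dumps(obj))
```

Note that defaults the driver already materialized still survive the round trip, since they are ordinary entries by the time the object is serialized.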

@JoshRosen
Contributor

Actually, I think that I was misremembering: there's one spot where we optionally clone a broadcasted configuration to work around some thread-safety issue, but that sharing / cloning won't be necessary if we can directly broadcast configurations as part of the tasks rather than as their own broadcast variables.

One correctness question: does this change behavior if an executor consumes a deserialized configuration? If I have an option which inherited environmental defaults on the driver, are those defaults serialized across the wire and used on the executors? I'm just trying to think of whether there is any possibility that this could break things.

@SparkQA

SparkQA commented Jun 6, 2015

Test build #34338 has finished for PR 6679 at commit ca65543.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sryza
Contributor Author

sryza commented Jun 6, 2015

There could be a behavior change in situations where the executor somehow has a different Hadoop configuration file than the driver, but I think it's the right change. I started to explain this abstractly, but it's easier to just put down some examples:

Example 1
core-site.xml on the driver contains optionA->value1
core-site.xml on the executor contains optionA->value2
Old behavior: on the executor, conf.get("optionA") returns value1
New behavior: same as old behavior

Example 2
core-site.xml on the driver does not contain optionA
core-site.xml on the executor contains optionA->value1
Old behavior: on the executor, conf.get("optionA") returns value1
New behavior: on the executor, conf.get("optionA") returns null

I can't find the JIRA, but I believe there was a recent change by @vanzin that made it so that the executor would use a copy of the Hadoop configuration files used on the driver. When that is the case, neither example 1 nor example 2 can occur.
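A toy model of the two behaviors in "Example 2" above, using plain dicts (these names and lookup functions are illustrative, not Spark APIs):

```python
# Driver's core-site.xml has no optionA; the executor's local one does.
driver_entries = {}                        # what gets serialized across the wire
executor_defaults = {"optionA": "value1"}  # executor's local core-site.xml

def get_old(key):
    # Old behavior: deserialization re-loaded defaults on the executor,
    # so local files backed any key missing from the serialized entries.
    return driver_entries.get(key, executor_defaults.get(key))

def get_new(key):
    # New behavior: only the serialized entries exist on the executor.
    return driver_entries.get(key)
```

In "Example 1", where the driver's file also defines the key, the serialized entry wins under both behaviors, which is why only Example 2 can diverge.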

@sryza
Contributor Author

sryza commented Jun 6, 2015

In light of this change, do you think we should remove the broadcasting of Configurations? While we avoid the much larger cost of reading and parsing XML for each task, we would still pay the cost of turning bytes into Configuration objects.

@JoshRosen
Contributor

If we want to remove broadcasting then let's do it in a smaller followup patch to help incrementalize the review and testing.


@sryza
Contributor Author

sryza commented Jun 6, 2015

Makes sense. In that case, this should be ready for review.

@SparkQA

SparkQA commented Jun 6, 2015

Test build #34342 has finished for PR 6679 at commit 71bdc51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

@sryza, I think @vanzin's PR that you were referring to is #4142? It looks like that PR only affects YARN mode, though. Do we have to worry about either of those two example scenarios under standalone mode or Mesos or are we somehow immune to these issues when running on those resource managers? Just wanted to check to be extra-sure that we understand the possible impact of this change.

@sryza
Contributor Author

sryza commented Jun 8, 2015

Ah, yeah, that was the change I was referring to.

I'm not sure about the Mesos deployment model, but in standalone mode at least, it would be possible to manufacture a situation where the behavior changes: the "Example 2" outlined above. This difference relies on config files with different "client" configs being distributed to the client and to the cluster nodes. I would argue that changing behavior in this way is OK, basically because the current behavior is wrong: client configs should come from the client, not from the cluster.

@JoshRosen
Contributor

For your "example 2" scenario, where there's a node-only property that's not overridden by the client, do you think that we might rely on such node-specific properties anywhere in Spark? I'm just wondering whether we should load the cluster node defaults once then overlay the client config on top of them as opposed to only using the client conf.

@vanzin
Contributor

vanzin commented Jun 10, 2015

FYI, the change I made, which you're referencing, only affects the YARN AM. The executors still rely on the configuration broadcast from the driver, even with that change.

That being said, the change LGTM. I'm OK with the slight change in semantics; IMO the new semantics are more correct and easier to reason about than the previous ones. In general, when Hadoop is configured, the nodes doing computation aren't even required to have "client" configuration.

@sryza
Contributor Author

sryza commented Jun 10, 2015

I definitely don't think we rely on it in Spark. On Cloudera setups, as well as presumably Hortonworks and MapR setups, client configurations are synchronized globally across nodes, so this discrepancy couldn't occur.

@JoshRosen
Contributor

Alright, in that case I think we should go ahead and merge this. If it turns out that someone was depending on the existing behavior and complains about it, we can add an internal configuration flag to fall back to the old behavior (at the cost of a slight performance hit).

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34608 has finished for PR 6679 at commit 71bdc51.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

=========================================================================
Running Apache RAT checks
=========================================================================
Attempting to fetch rat
rm: cannot remove `/home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar': No such file or directory
Our attempt to download rat locally to /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar failed. Please install rat manually.

Looks like a Jenkins environment issue of some sort? I'll investigate.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34611 has finished for PR 6679 at commit 71bdc51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34622 timed out for PR 6679 at commit 71bdc51 after a configured wait of 175m.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34642 has finished for PR 6679 at commit 71bdc51.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza
Contributor Author

sryza commented Jun 11, 2015

retest this please

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34660 has finished for PR 6679 at commit 71bdc51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

Contributor

@sryza can you move this into util package?

Contributor

ideally this class should go into util too... but i guess it is too late for that

Contributor

To clarify, this is marked @DeveloperApi and has been for a bunch of releases, which is why we don't want to rename or move it.

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34897 has finished for PR 6679 at commit 254a793.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza sryza force-pushed the sandy-spark-8135 branch from 254a793 to 3f1b865 Compare June 16, 2015 00:15
@SparkQA

SparkQA commented Jun 16, 2015

Test build #34962 has finished for PR 6679 at commit 3f1b865.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@sryza sryza force-pushed the sandy-spark-8135 branch from 3f1b865 to c5554ff Compare June 19, 2015 00:08
@sryza
Contributor Author

sryza commented Jun 19, 2015

@JoshRosen this should be ready for merge

@JoshRosen
Contributor

@sryza SGTM. I'll merge once this passes tests.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35197 has finished for PR 6679 at commit c5554ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableConfiguration(@transient var value: Configuration) extends Serializable
    • class SerializableJobConf(@transient var value: JobConf) extends Serializable

@JoshRosen
Contributor

LGTM, so I'm going to merge this into master. Per our discussion, if someone asks, we'll add a fallback path guarded by a flag in case anyone is adversely affected by this change and cannot change their deployment environment.

@asfgit asfgit closed this in 43f50de Jun 19, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…tions

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6679 from sryza/sandy-spark-8135 and squashes the following commits:

c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
@watermen
Contributor

This commit causes a NullPointerException in standalone mode; after reverting it, everything works.

spark-sql> select count(*) from src;
15/06/26 14:27:39 ERROR TaskSetManager: Task 5 in stage 1.0 failed 4 times; aborting job
15/06/26 14:27:39 ERROR SparkSQLDriver: Failed in [select count(*) from src_100]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 20, 9.96.1.54): java.lang.NullPointerException
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:658)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:175)

@JoshRosen
Contributor

@watermen, @sryza and I are investigating this over at https://issues.apache.org/jira/browse/SPARK-8623

@sryza
Contributor Author

sryza commented Jun 26, 2015

Hi @watermen, thanks for reporting this. Does the error occur every time or just occasionally? What Hadoop version are you running?

@watermen
Contributor

@sryza Hadoop 2.4.1, every time, in standalone mode. Thanks for your work!
