SPARK-8135. Don't load defaults when reconstituting Hadoop Configurations #6679
Conversation
I noticed this bottleneck while running some Hive compatibility tests in Spark SQL; I'm kicking off a test run which only runs that suite to see how much of a difference this makes by itself. More generally, I've noticed some perf issues in SQL in places where we didn't broadcast configurations; I'm going to work on gradually tackling these and creating JIRAs for them in the next couple of weeks.
Is there a fast way to clone Configuration objects that also avoids loading defaults? If so, we may be able to build on that and the changes in this patch to remove a bunch of code for broadcasting Hadoop configurations.
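As an aside on the cloning question: Hadoop's `Configuration` copy constructor copies the source conf's already-materialized in-memory properties rather than re-parsing XML, which makes cloning a loaded conf cheap. A minimal sketch, with the caveat that the exact lazy-loading behavior can vary across Hadoop versions:

```scala
import org.apache.hadoop.conf.Configuration

object CloneConfDemo {
  def main(args: Array[String]): Unit = {
    val original = new Configuration(false)
    original.set("example.key", "example.value")
    // The copy constructor clones the materialized properties map rather
    // than re-parsing *-site.xml files.
    val copy = new Configuration(original)
    assert(copy.get("example.key") == "example.value")
  }
}
```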
Where do we end up cloning Configuration objects? With these changes, we avoid loading defaults when we reconstitute Configuration objects from bytes. Are there hot paths where we create Configuration objects from other live Configuration objects?
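For context, a minimal sketch of the mechanism this patch changes, modeled on Spark's `SerializableWritable` wrapper (renamed here, with Spark's internal error handling omitted). The key line is `new Configuration(false)`, which skips re-reading `*-site.xml` files on every deserialization:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ObjectWritable, Writable}

class SerializableWritableSketch[T <: Writable](@transient var t: T) extends Serializable {
  def value: T = t

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    new ObjectWritable(t).write(out) // serializes the Writable's own fields
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    val ow = new ObjectWritable()
    ow.setConf(new Configuration(false)) // don't load defaults from XML files
    ow.readFields(in)
    t = ow.get().asInstanceOf[T]
  }
}
```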
Actually, I think I was misremembering: there's one spot where we optionally clone a broadcasted configuration to work around a thread-safety issue, but that sharing/cloning won't be necessary if we can broadcast configurations directly as part of the tasks rather than as their own broadcast variables. One correctness question: does this change behavior when an executor consumes a deserialized configuration? If I have an option that inherited environmental defaults on the driver, are those defaults serialized across the wire and used on the executors? I'm just trying to think through whether there's any possibility that this could break things.
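On the wire question: `Configuration.write` emits every resolved property, defaults included, so the driver's resolved values do travel across the wire. A small round-trip sketch (`io.file.buffer.size` is a standard key from Hadoop's `core-default.xml`, chosen for illustration):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.hadoop.conf.Configuration

object ConfWireDemo {
  def main(args: Array[String]): Unit = {
    val driverSide = new Configuration(true) // defaults resolved on the "driver"
    val bytes = new ByteArrayOutputStream()
    driverSide.write(new DataOutputStream(bytes)) // writes every resolved property

    val executorSide = new Configuration(false) // no local defaults loaded
    executorSide.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    // Prints the driver-side value: the defaults traveled over the wire.
    println(executorSide.get("io.file.buffer.size"))
  }
}
```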
Test build #34338 has finished for PR 6679 at commit
There could be a behavior change in situations where the executor somehow has different Hadoop configuration files than the driver, but I think it's the right change. I started to explain this abstractly, but it might be easier to just put down some examples: Example 1, Example 2. I can't find the JIRA, but I believe there was a recent change by @vanzin that made it so that the executor would use a copy of the Hadoop configuration files used on the driver. When that is the case, neither example 1 nor example 2 can occur.
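To make the `loadDefaults` distinction concrete, a small runnable illustration (this is not one of the examples referenced above; `io.file.buffer.size` ships in Hadoop's `core-default.xml`):

```scala
import org.apache.hadoop.conf.Configuration

object LoadDefaultsDemo {
  def main(args: Array[String]): Unit = {
    val withDefaults = new Configuration(true)     // old path: loads core-default.xml, core-site.xml, ...
    val withoutDefaults = new Configuration(false) // new path: starts empty

    println(withDefaults.get("io.file.buffer.size"))    // e.g. "4096", from local defaults
    println(withoutDefaults.get("io.file.buffer.size")) // null: nothing was loaded
  }
}
```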
In light of this change, do you think we should remove the broadcasting of Configurations? While we avoid the much larger cost of reading and parsing XML for each task, we would still pay the cost of turning bytes into Configuration objects.
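For readers following along, a sketch of the broadcast pattern being discussed, assuming an existing `SparkContext`. It mirrors how Spark wraps a Hadoop conf in `SerializableWritable` before broadcasting (e.g., in `HadoopRDD`), though exact call sites differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}

// Broadcast the conf once; each task then deserializes bytes back into a
// Configuration instead of parsing XML from scratch.
def readWithBroadcastConf(sc: SparkContext, hadoopConf: Configuration): Array[String] = {
  val confBroadcast = sc.broadcast(new SerializableWritable(hadoopConf))
  sc.parallelize(1 to 4).map { _ =>
    val conf = confBroadcast.value.value // the reconstituted Configuration
    conf.get("fs.defaultFS", "unset")
  }.collect()
}
```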
If we want to remove broadcasting, let's do it in a smaller follow-up patch to keep the review and testing incremental.
Makes sense. In that case, this should be ready for review.
Test build #34342 has finished for PR 6679 at commit
@sryza, I think @vanzin's PR that you were referring to is #4142? It looks like that PR only affects YARN mode, though. Do we have to worry about either of those two example scenarios under standalone mode or Mesos, or are we somehow immune to these issues when running on those resource managers? Just wanted to check, to be extra sure that we understand the possible impact of this change.
Ah, yeah, that was the change I was referring to. I'm not sure about the Mesos deployment model, but on standalone mode at least, it would be possible to manufacture a situation where the behavior changes; that situation is the "Example 2" outlined above. The difference relies on config files with different "client" configs being distributed to the client and to the cluster nodes. I would argue that changing behavior in this way is OK, basically because the current behavior is wrong: client configs should come from the client, not from the cluster.
For your "example 2" scenario, where there's a node-only property that's not overridden by the client, do you think we might rely on such node-specific properties anywhere in Spark? I'm just wondering whether we should load the cluster node defaults once and then overlay the client config on top of them, as opposed to only using the client conf (see the sketch below).
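A sketch of that overlay alternative, for concreteness. This is not what the patch implements, and `overlayClientOnNodeDefaults` is a hypothetical helper:

```scala
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

// Start from this node's *-site.xml defaults, then let every key the driver
// actually serialized win over the node-local value.
def overlayClientOnNodeDefaults(driverConf: Configuration): Configuration = {
  val merged = new Configuration(true) // loads the executor-local defaults
  driverConf.asScala.foreach(entry => merged.set(entry.getKey, entry.getValue))
  merged
}
```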
FYI, the change I made that you two are referencing only affects the YARN AM; the executors still rely on the configuration broadcast from the driver, even with that change. That being said, the change LGTM. I'm OK with the slight change in semantics: IMO the new semantics are more correct and easier to reason about than the previous ones. In general, when Hadoop is configured, it's not even required to have "client" configuration on the nodes doing computation.
I definitely don't think we rely on it in Spark. On Cloudera setups, as well as presumably Hortonworks and MapR setups, client configurations are synchronized globally across nodes, so this discrepancy couldn't occur.
Alright, in that case I think we should go ahead and merge this. If it does turn out that someone was depending on the existing behavior and complains about it, we can add an internal configuration flag to fall back to the old behavior (at the cost of a slight perf hit).
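For reference, a hypothetical shape for that escape hatch. The flag name `spark.hadoop.conf.loadDefaults` is invented here for illustration; no such flag was added in this PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// The deserialization path would consult the flag instead of hard-coding `false`.
def newHadoopConf(sparkConf: SparkConf): Configuration = {
  val loadDefaults = sparkConf.getBoolean("spark.hadoop.conf.loadDefaults", false)
  new Configuration(loadDefaults)
}
```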
Jenkins, retest this please. |
Test build #34608 has finished for PR 6679 at commit
Looks like a Jenkins environment issue of some sort? I'll investigate.
Jenkins, retest this please. |
Test build #34611 has finished for PR 6679 at commit
Jenkins, retest this please. |
Test build #34622 timed out for PR 6679 at commit
Jenkins, retest this please. |
Test build #34642 has finished for PR 6679 at commit
retest this please |
Test build #34660 has finished for PR 6679 at commit
@sryza can you move this into the util package?
Ideally this class should go into util too... but I guess it is too late for that.
To clarify, this is @DeveloperApi and has been marked as such for a bunch of releases, which is why we don't want to rename or move it.
Test build #34897 has finished for PR 6679 at commit
Test build #34962 has finished for PR 6679 at commit
SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
@JoshRosen this should be ready for merge
@sryza SGTM. I'll merge once this passes tests. |
Test build #35197 has finished for PR 6679 at commit
LGTM, so I'm going to merge this into master. Per our discussion, if someone asks then we'll add a fallback path that's guarded by a flag, in case anyone is adversely affected by this change and cannot change their deployment environment.
SPARK-8135. Don't load defaults when reconstituting Hadoop Configurations

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6679 from sryza/sandy-spark-8135 and squashes the following commits:

c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
Adding this commit causes a NullPointerException in standalone mode; reverting it makes things work again.
@watermen, @sryza and I are investigating this over at https://issues.apache.org/jira/browse/SPARK-8623
Hi @watermen, thanks for reporting this. Does the error occur every time or just occasionally? What Hadoop version are you running?
@sryza Hadoop 2.4.1, every time, standalone mode. Thanks for your work!