[SPARK-22294][Deploy] Reset spark.driver.bindAddress when starting a Checkpoint #19427

ssaavedra · 2017-10-04T11:12:41Z

What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old
driver and executor IP addresses, as the workload can now be taking
place in a different cluster configuration. It follows that the
bindAddress for the master may also have changed. Thus we should not
keep the old one; instead, spark.driver.bindAddress should be added to
the list of properties to reset and recreate from the new environment.

How was this patch tested?

This patch was tested manually on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (this is also how I first encountered the bug). The affected code path is not related to the scheduler, however, and this may have slipped through when merging SPARK-4563.

yssharma · 2017-10-12T00:04:19Z

@ssaavedra Could you also update the Title as [SPARK-XXXXX][component] Title... please.

ssaavedra · 2017-10-12T18:37:24Z

Should I create the appropriate issue in JIRA? I'm not sure if there is any automation which does that.

yssharma · 2017-10-13T01:36:29Z

I don't think there is an automated way. You could create a JIRA ticket and rename this title with the ticket id and component name.

felixcheung · 2017-10-16T00:12:29Z

yes, if you can open an issue in JIRA and update this PR title it should link automatically.

felixcheung · 2017-10-19T16:31:50Z

Jenkins, ok to test

SparkQA · 2017-10-19T18:03:42Z

Test build #82908 has finished for PR 19427 at commit 892555f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ssaavedra · 2017-11-02T16:37:29Z

Is anyone considering this patch? Should I advertise it anywhere else?

zsxwing · 2017-11-02T18:25:32Z

streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala

@@ -62,6 +63,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)

    val newSparkConf = new SparkConf(loadDefaults = false).setAll(sparkConfPairs)
      .remove("spark.driver.host")
+      .remove("spark.driver.bindAddress")


Do we have to remove this? It means we must drop spark.driver.bindAddress if it's not set in the new run.

ssaavedra · 2017-11-05

Yes. If it is not set in the new run, the old value would be meaningless anyway. This property should be determined by each subsequent call to spark-submit. Resuming from a checkpoint means we are re-submitting work, but possibly in a different cluster configuration, so we may want to change the bindAddress, or the new configuration may even want to fall back to the spark.driver.host setting. In either case it makes no sense to keep the old setting. The only exception is a static configuration, and there removing the property does no harm, because the command line that re-launches the job can re-populate it if it needs to stay the same.
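To make the behaviour discussed in this thread concrete, here is a minimal sketch of the reset-and-reload idea. It is not the code in Checkpoint.scala: plain Maps stand in for SparkConf, and the names CheckpointConfSketch, resetOnRecovery and recover are hypothetical. Only the two property keys are taken from the diff above.

```scala
// Hypothetical stand-in for the recovery logic; plain Maps model the configuration.
object CheckpointConfSketch {
  // Keys describing where the driver runs, which should not survive a restart.
  // Illustrative subset: only these two keys appear in the diff above.
  val resetOnRecovery: Seq[String] =
    Seq("spark.driver.host", "spark.driver.bindAddress")

  /** Drop the stale keys from the checkpointed settings, then copy in
    * whatever values the new launch provides for those same keys. */
  def recover(checkpointed: Map[String, String],
              newRun: Map[String, String]): Map[String, String] = {
    val cleaned = checkpointed -- resetOnRecovery
    val reloaded = resetOnRecovery.flatMap(k => newRun.get(k).map(v => k -> v))
    cleaned ++ reloaded
  }

  def main(args: Array[String]): Unit = {
    val checkpointed = Map(
      "spark.app.name" -> "streaming-job",
      "spark.driver.host" -> "10.0.0.5",
      "spark.driver.bindAddress" -> "10.0.0.5")

    // The new run does not set bindAddress, so the old value is dropped.
    val dropped = recover(checkpointed, Map("spark.driver.host" -> "driver-svc"))
    println(dropped.get("spark.driver.bindAddress")) // None

    // The new run sets bindAddress again, so the new value is used.
    val reset = recover(checkpointed, Map("spark.driver.bindAddress" -> "0.0.0.0"))
    println(reset("spark.driver.bindAddress")) // 0.0.0.0
  }
}
```

In this sketch a statically configured deployment loses nothing: passing the same bindAddress on the re-launch command line simply puts the value back.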

zsxwing · 2017-11-10T18:51:26Z

Thanks! LGTM. Merging to master and 2.2.

…Checkpoint

## What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old driver and executor IP addresses, as the workload can now be taking place in a different cluster configuration. It follows that the bindAddress for the master may also have changed. Thus we should not be keeping the old one, and instead be added to the list of properties to reset and recreate from the new environment.

## How was this patch tested?

This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus was how I first encountered the bug too), but it is not a code-path related to the scheduler and this may have slipped through when merging SPARK-4563.

Author: Santiago Saavedra <ssaavedra@openshine.com>

Closes #19427 from ssaavedra/fix-checkpointing-master.

(cherry picked from commit 5ebdcd1)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>


Reset spark.driver.bindAddress when starting a Checkpoint

892555f

jerryshao mentioned this pull request Oct 11, 2017

[SPARK-22243][DStreams]spark.yarn.jars reload from config when Checkpoint recovery #19469

Closed

ssaavedra changed the title ~~Reset spark.driver.bindAddress when starting a Checkpoint~~ [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint Oct 12, 2017

ssaavedra changed the title ~~[SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint~~ [SPARK-22294] [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint Oct 17, 2017

ssaavedra changed the title ~~[SPARK-22294] [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint~~ [SPARK-22294][Deploy] Reset spark.driver.bindAddress when starting a Checkpoint Oct 17, 2017

zsxwing reviewed Nov 2, 2017


asfgit closed this in 5ebdcd1 Nov 10, 2017
