Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22294][Deploy] Reset spark.driver.bindAddress when starting a Checkpoint #19427

Closed
wants to merge 1 commit into from
Closed

[SPARK-22294][Deploy] Reset spark.driver.bindAddress when starting a Checkpoint #19427

wants to merge 1 commit into from

Conversation

ssaavedra
Copy link
Contributor

What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old
driver and executor IP addresses, as the workload can now be taking
place in a different cluster configuration. It follows that the
bindAddress for the master may also have changed. Thus we should not be
keeping the old one, and instead be added to the list of properties to
reset and recreate from the new environment.

How was this patch tested?

This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus was how I first encountered the bug too), but it is not a code-path related to the scheduler and this may have slipped through when merging SPARK-4563.

@yssharma
Copy link

@ssaavedra Could you also update the Title as [SPARK-XXXXX][component] Title... please.

@ssaavedra
Copy link
Contributor Author

Should I create the appropriate issue in JIRA? I'm not sure if there is any automation which does that.

@ssaavedra ssaavedra changed the title Reset spark.driver.bindAddress when starting a Checkpoint [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint Oct 12, 2017
@yssharma
Copy link

I don't think there is an automated way. You could create a JIRA ticket and rename this title with the ticket id and component name.

@felixcheung
Copy link
Member

yes, if you can open an issue in JIRA and update this PR title it should link automatically.

@ssaavedra ssaavedra changed the title [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint [SPARK-22294] [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint Oct 17, 2017
@ssaavedra ssaavedra changed the title [SPARK-22294] [SparkStreaming] Reset spark.driver.bindAddress when starting a Checkpoint [SPARK-22294][Deploy] Reset spark.driver.bindAddress when starting a Checkpoint Oct 17, 2017
@felixcheung
Copy link
Member

felixcheung commented Oct 19, 2017

Jenkins, ok to test

@SparkQA
Copy link

SparkQA commented Oct 19, 2017

Test build #82908 has finished for PR 19427 at commit 892555f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ssaavedra
Copy link
Contributor Author

Is anyone considering this patch? Should I advertise it anywhere else?

@@ -62,6 +63,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)

val newSparkConf = new SparkConf(loadDefaults = false).setAll(sparkConfPairs)
.remove("spark.driver.host")
.remove("spark.driver.bindAddress")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to remove this? It means we must drop spark.driver.bindAddress if it's not set in the new run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If it is not set in the new run, it should still be meaningless anyway. It makes sense to know this property on the subsequent calls to spark-submit. If we are resuming a checkpoint it means we are re-submitting work, but it may be run in a different cluster configuration, and thus we may want to change the bindAddress or this different configuration may even wish to rely on falling back to the spark.driver.host configuration. In any case, it should make no sense to keep the old setting, unless we are running on a static configuration, in which case it is not a caveat to remove this, as the command-line to re-launch the job can still re-populate the property if it needs to keep being the same.

@zsxwing
Copy link
Member

zsxwing commented Nov 10, 2017

Thanks! LGTM. Merging to master and 2.2.

asfgit pushed a commit that referenced this pull request Nov 10, 2017
…Checkpoint

## What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old
driver and executor IP addresses, as the workload can now be taking
place in a different cluster configuration. It follows that the
bindAddress for the master may also have changed. Thus we should not be
keeping the old one, and instead be added to the list of properties to
reset and recreate from the new environment.

## How was this patch tested?

This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus was how I first encountered the bug too), but it is not a code-path related to the scheduler and this may have slipped through when merging SPARK-4563.

Author: Santiago Saavedra <ssaavedra@openshine.com>

Closes #19427 from ssaavedra/fix-checkpointing-master.

(cherry picked from commit 5ebdcd1)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
@asfgit asfgit closed this in 5ebdcd1 Nov 10, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…Checkpoint

## What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old
driver and executor IP addresses, as the workload can now be taking
place in a different cluster configuration. It follows that the
bindAddress for the master may also have changed. Thus we should not be
keeping the old one, and instead be added to the list of properties to
reset and recreate from the new environment.

## How was this patch tested?

This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus was how I first encountered the bug too), but it is not a code-path related to the scheduler and this may have slipped through when merging SPARK-4563.

Author: Santiago Saavedra <ssaavedra@openshine.com>

Closes apache#19427 from ssaavedra/fix-checkpointing-master.

(cherry picked from commit 5ebdcd1)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants