[SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states #2339

Closed
wants to merge 6 commits into from

Conversation

nchammas
Contributor

@nchammas nchammas commented Sep 9, 2014

Instead of waiting arbitrary amounts of time for the cluster to reach a specific state, this patch lets spark-ec2 explicitly wait for a cluster to reach a desired state.

This is useful in a couple of situations:

  • The cluster is launching and you want to wait until SSH is available before installing stuff.
  • The cluster is being terminated and you want to wait until all the instances are terminated before trying to delete security groups.

This patch removes the need for the --wait option and removes some of the time-based retry logic that was being used.
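
The idea described above can be sketched roughly as follows. This is an illustration in Python 3, not the code from this patch: `wait_for_state`, `get_states`, `poll_interval`, and `timeout` are hypothetical names, and `get_states` stands in for whatever call reports the current state of each instance.

```python
import time


def wait_for_state(get_states, desired_state, poll_interval=5.0, timeout=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Block until every state returned by get_states() equals desired_state.

    get_states is a hypothetical callable returning one state string per
    instance. An empty list is treated as "not ready yet".
    Returns the number of polls performed; raises TimeoutError on timeout.
    """
    deadline = clock() + timeout
    polls = 0
    while True:
        polls += 1
        states = get_states()
        if states and all(s == desired_state for s in states):
            return polls
        if clock() >= deadline:
            raise TimeoutError("cluster did not reach %r" % desired_state)
        # Wait a short, fixed interval between polls instead of guessing
        # one long sleep up front.
        sleep(poll_interval)
```

The point of the design is that the wait ends as soon as the observed state matches, so a fast launch is not penalized and a slow one is not cut short.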

@nchammas
Contributor Author

nchammas commented Sep 9, 2014

Depending on what the reviewers think, there are some additional lines that can be removed, like:

  • the wait_for_cluster function
  • the wait_for_instances function
  • the --wait option

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have started for PR 2339 at commit 7969265.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2339 at commit 7969265.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

@JoshRosen and @davies: This PR is ready for review.

    if all(i.state == 'running' for i in cluster_instances) and \
       all(is_ssh_available(host=i.ip_address, opts=opts) for i in cluster_instances):
        print ""  # so that next line of output starts on new line
        return
Contributor

Maybe we could put "break" here, and put 'sys.stdout.write("\n")' at the end of this function.

Contributor Author

Makes sense. I'll do that but put print "" at the end instead of sys.stdout.write("\n"), since we want to flush the output immediately. Does that sound good to you?

Contributor

The long comment reads worse than 'sys.stdout.write("\n")'. With 'print ""', you would also need to flush manually. I don't think we need to flush here, because nothing changes visually until other log output comes in.

Either is fine to me.

Contributor Author

Hmm, I'll get back to you on this. Definitely a minor point either way, but I'd like to get it right. Gotta run for now but will check on this later tonight.

Contributor Author

OK, I see your point. I've made the change per your recommendation.
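
The shape discussed in this thread (break out of the loop, then end the progress line once at the bottom of the function) might look like the sketch below. It is written in Python 3 and is not the code that was committed: `wait_with_progress` and `is_ready` are hypothetical names, and the sleep between attempts is left out for brevity.

```python
import sys


def wait_with_progress(is_ready, max_attempts=10, out=sys.stdout):
    """Write one dot per attempt, then end the line once, after the loop."""
    ready = False
    for _ in range(max_attempts):
        out.write(".")
        # A dot has no trailing newline, so it may sit in the buffer;
        # flush so the progress indicator shows up promptly.
        out.flush()
        if is_ready():
            ready = True
            break  # leave the loop instead of returning from inside it
    # Single exit path: terminate the progress line here. No flush is
    # forced, since nothing visible changes until later output arrives.
    out.write("\n")
    return ready
```

With a single exit path, the newline is written whether the loop ends by success or by running out of attempts.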

@davies
Contributor

davies commented Sep 10, 2014

This patch looks good to me, just some minor comments.

I think we should keep the --wait option (so we do not break users) and deprecate it. The dead code (wait_for_cluster and wait_for_instances) should be removed.

@JoshRosen what do you think?
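
One way to keep an option while deprecating it is sketched below. This is only an illustration in Python 3 using the standard `optparse` module; the function name `parse_args`, the help text, and the warning wording are assumptions, not the script's actual code.

```python
import sys
from optparse import OptionParser


def parse_args(argv, err=sys.stderr):
    """Accept --wait for backward compatibility, but warn that it is ignored."""
    parser = OptionParser(usage="%prog [options] <action>")
    parser.add_option(
        "--wait", type="int", default=None,
        help="DEPRECATED: accepted for backward compatibility and ignored")
    opts, args = parser.parse_args(argv)
    if opts.wait is not None:
        # Existing command lines keep working; the user is only told
        # that the value no longer has any effect.
        err.write("Warning: --wait is deprecated and has no effect.\n")
    return opts, args
```

Defaulting to `None` rather than a number makes it possible to tell "option not given" apart from "option given", so the warning appears only for callers who still pass it.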

@nchammas
Contributor Author

Alright, I've updated things per all the feedback I've gotten with one minor exception, which I will revisit later tonight.

@nchammas
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have started for PR 2339 at commit 26c5ed0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2339 at commit 26c5ed0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have started for PR 2339 at commit 9a9e035.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2339 at commit 9a9e035.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

Flume test failed. My most recent commit just removed a couple of comment lines.

Jenkins, retest this please.

@nchammas
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2339 at commit 9a9e035.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2339 at commit 9a9e035.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2339 at commit 9a9e035.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

@davies This PR is ready for another review. I believe I've covered all the feedback given so far.

@davies
Contributor

davies commented Sep 11, 2014

@nchammas LGTM, thanks!

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2339 at commit 9a9e035.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have started for PR 2339 at commit 0e648bc.

  • This patch merges cleanly.

@nchammas
Contributor Author

I might remove some of the extra line breaks/whitespace here.

Ah, OK, I just took care of that. An earlier version of that text would've put the line at over 100 characters, which is why it originally had those breaks in there.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have started for PR 2339 at commit 8b701d1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2339 at commit 0e648bc.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20832/

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2339 at commit 8b701d1.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20834/

@nchammas
Contributor Author

Hey @pwendell, is there anything else you'd like to change about this PR?

@nchammas
Contributor Author

BTW @shivaram I took your suggestion of checking the status checks before checking SSH itself.
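
The ordering mentioned here (status checks first, SSH second) can be sketched as below. This is a Python 3 illustration under stated assumptions, not the patch's implementation: `is_port_open` is a plain TCP probe standing in for a real SSH check, and `cluster_is_ready`, `status_ok`, and `ssh_ok` are hypothetical names.

```python
import socket


def is_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A stand-in for a real SSH probe; it only shows the port accepts
    connections, not that a login would work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cluster_is_ready(instances, status_ok, ssh_ok):
    """Run the cheap status checks first; probe SSH only once they all pass."""
    if not all(status_ok(i) for i in instances):
        return False
    return all(ssh_ok(i) for i in instances)
```

Gating on the status checks means no SSH connection is attempted against an instance that is already known not to be up.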

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21071/

@nchammas
Contributor Author

nchammas commented Oct 1, 2014

Dunno where this most recent failure came from. The build output just shows a failure to checkout the patch.

@pwendell
Contributor

pwendell commented Oct 1, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2339 at commit 43a69f0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2339 at commit 43a69f0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • println(s"Failed to load main class $childMainClass.")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21163/

@nchammas
Contributor Author

nchammas commented Oct 2, 2014

This patch adds the following public classes (experimental):
println(s"Failed to load main class $childMainClass.")

FYI: I believe I have these phantom new class notes finally sorted out in #2606.

@nchammas
Contributor Author

nchammas commented Oct 6, 2014

@pwendell Is this good to merge in?

@nchammas
Contributor Author

nchammas commented Oct 6, 2014

BTW, related side note: I just used Packer to create an example image from a template, and it looks like Packer follows a similar pattern of first waiting for the instance to come up, and then waiting for SSH to become available.

==> amazon-ebs: Launching a source AWS instance...
    amazon-ebs: Instance ID: i-1fbfcff1
==> amazon-ebs: Waiting for instance (i-1fbfcff1) to become ready...
==> amazon-ebs: Waiting for SSH to become available...
==> amazon-ebs: Connected to SSH!
==> amazon-ebs: Stopping the source instance...
==> amazon-ebs: Waiting for the instance to stop...

Just thought that was cool.

@JoshRosen
Contributor

This looks good to me, so I'm going to merge it. Thanks for doing this!

@asfgit asfgit closed this in 5912ca6 Oct 7, 2014
@nchammas nchammas deleted the spark-ec2-wait-properly branch October 8, 2014 01:05
asfgit pushed a commit that referenced this pull request Nov 29, 2014
This PR re-introduces 0e648bc from PR #2339, which somehow never made it into the codebase.

Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3195 from nchammas/remove-ec2-ssh-backoff and squashes the following commits:

efb29e1 [Nicholas Chammas] Revert "Remove linear backoff."
ef3ca99 [Nicholas Chammas] reuse conn
adb4eaa [Nicholas Chammas] Remove linear backoff.
55caa24 [Nicholas Chammas] Check EC2 status checks before SSH.
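
For readers unfamiliar with the term used in the commit message above, a linear backoff makes the delay before each retry grow by a fixed step. The sketch below (Python 3, hypothetical helper names, illustrative numbers) contrasts it with a constant delay; it does not reproduce the delays used by the script.

```python
def linear_backoff_delays(base_delay, attempts):
    """Delay before retry n is n * base_delay: base, 2*base, 3*base, ..."""
    return [base_delay * n for n in range(1, attempts + 1)]


def constant_delays(delay, attempts):
    """Same delay before every retry."""
    return [delay] * attempts


# Total time spent waiting over four retries with a 5-second base.
linear_total = sum(linear_backoff_delays(5, 4))
constant_total = sum(constant_delays(5, 4))
```

If an earlier check has already confirmed that the instances are up, the later retries are less likely to be needed, which is the reasoning the commit message gives for calling the backoff unnecessary.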