-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-5473] [EC2] Expose SSH failures after status checks pass #4262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c109c8e to
84250a8
Compare
|
Test build #26293 has started for PR 4262 at commit
|
|
The idea of printing the error message sounds good to me. FWIW I actually found the dots useful to figure out how long the script was going to wait before trying again -- Will we just see an SSH error message now in place ? |
I find this useful too, but it doesn't work when we throw in the SSH output, since the dots and SSH output mix. I could look into a solution for that.
Now we'll just see 3 dots followed by the start of a cluster build if everything goes well, and 3 dots followed by SSH output if something is potentially broken (as in the example above). In the former case the wait time until |
|
I'm thinking it would be useful for a separate PR to add some information about how many nodes of the cluster are up during a cluster launch like this: Or maybe just a simple spinner: The original trail of dots also works. In all cases we need to figure out how to do that neatly while allowing SSH output to also be printed to screen. |
|
Test build #26293 has finished for PR 4262 at commit
|
|
Test PASSed. |
|
I haven't tried out this solution, so I am not exactly sure what gets printed (I can do it over the weekend sometime). At a high-level my comment is that every attempt that checks if the cluster is |
|
You can see example output in the PR description. I will look into adding feedback while the script is waiting on the cluster to reach a certain state. |
|
Test build #26439 has started for PR 4262 at commit
|
|
OK, I added the dots back in. It doesn't look great, but it's serviceable: That's what we get in the case of a temporary failure while SSH is still coming up. In the case of a permanent failure, you just get the repeated What do you think? |
|
Test build #26444 has started for PR 4262 at commit
|
|
Test build #26444 has finished for PR 4262 at commit
|
|
Test FAILed. |
|
Test build #26439 has finished for PR 4262 at commit
|
|
Test PASSed. |
|
Test build #26445 has started for PR 4262 at commit
|
|
Test build #26445 has finished for PR 4262 at commit
|
|
Test PASSed. |
ec2/spark_ec2.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll change this to default to print_ssh_output=True.
2c4d310 to
7784669
Compare
|
Test build #26960 has started for PR 4262 at commit
|
7784669 to
2b92534
Compare
|
Test build #26961 has started for PR 4262 at commit
|
|
Test build #26961 has finished for PR 4262 at commit
|
|
Test FAILed. |
|
Test build #26960 has finished for PR 4262 at commit
|
|
Test PASSed. |
|
This PR is ready for a final review. |
|
Test build #26981 has started for PR 4262 at commit
|
|
Test build #26981 has finished for PR 4262 at commit
|
|
Test PASSed. |
|
@shivaram Do you have any more comments on this change? |
|
I just tried out the latest version and it worked fine - LGTM. Thanks @nchammas for the change. |
If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is. This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots. For example: ``` $ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test" Setting up security groups... Searching for existing cluster spark-test... Spark AMI: ami-35b1885c Launching instances... Launched 1 slaves in us-east-1c, regid = r-7dadd096 Launched master in us-east-1c, regid = r-fcadd017 Waiting for cluster to enter 'ssh-ready' state... Warning: SSH connection error. (This could be temporary.) Host: 127.0.0.1 SSH return code: 255 SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory. Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts. Permission denied (publickey). ``` This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`. This is a usability improvement that should be backported to 1.2. Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits: 8bda6ed [Nicholas Chammas] default to print SSH output 2b92534 [Nicholas Chammas] show SSH output after status check pass (cherry picked from commit 4dfe180) Signed-off-by: Sean Owen <sowen@cloudera.com>
If there is some fatal problem with launching a cluster,
spark-ec2just hangs without giving the user useful feedback on what the problem is.This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots.
For example:
This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported here on Stack Overflow and here on the user list, where the users couldn't tell what the problem was because it was being hidden by
spark-ec2.This is a usability improvement that should be backported to 1.2.
Resolves SPARK-5473.