@nchammas
Contributor

If there is some fatal problem with launching a cluster, spark-ec2 just hangs without giving the user useful feedback on what the problem is.

This PR exposes the output of the SSH calls to the user if the SSH test fails for any reason during cluster launch while the instance status checks are all green. It also replaces the growing trail of dots shown while waiting with a fixed 3 dots.

For example:

$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...
Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory.
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
Permission denied (publickey).
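
A minimal sketch of the idea, not the actual `spark_ec2.py` code: run a trivial command over SSH, capture whatever `ssh` prints, and show it in the format above when the call fails. The function names, the `ssh_command` argument, and the use of `true` as the remote command are assumptions made for illustration; the `print_ssh_output` flag takes its name from a review comment later in this thread, and its semantics here are assumed.

```python
import subprocess


def format_ssh_warning(host, returncode, output):
    """Build the multi-line warning shown when an SSH test fails."""
    return (
        "Warning: SSH connection error. (This could be temporary.)\n"
        "Host: {h}\n"
        "SSH return code: {r}\n"
        "SSH output: {o}"
    ).format(h=host, r=returncode, o=output.rstrip())


def is_ssh_available(host, ssh_command, print_ssh_output=True):
    """Run a trivial remote command and report ssh's own output on failure.

    ssh_command is a hypothetical argv prefix, e.g.
    ["ssh", "-i", "/path/identity.pem"].
    """
    proc = subprocess.Popen(
        ssh_command + [host, "true"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # ssh writes its diagnostics to stderr
    )
    output = proc.communicate()[0].decode("utf-8", "replace")
    if proc.returncode != 0 and print_ssh_output:
        print(format_ssh_warning(host, proc.returncode, output))
    return proc.returncode == 0
```

Folding stderr into stdout keeps the sketch short; a caller that needs the two streams separately would capture them with two pipes instead.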

This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported here on Stack Overflow and here on the user list, where the users couldn't tell what the problem was because it was being hidden by spark-ec2.

This is a usability improvement that should be backported to 1.2.

Resolves SPARK-5473.

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26293 has started for PR 4262 at commit 84250a8.

  • This patch merges cleanly.

@nchammas changed the title from "[EC2] Expose SSH failures after status checks pass" to "[SPARK-5473] [EC2] Expose SSH failures after status checks pass" on Jan 29, 2015
@nchammas
Contributor Author

cc @shivaram @JoshRosen

@shivaram
Contributor

The idea of printing the error message sounds good to me. FWIW I actually found the dots useful to figure out how long the script was going to wait before trying again. Will we now just see an SSH error message in their place?

@nchammas
Contributor Author

FWIW I actually found the dots useful to figure out how long the script was going to wait before trying again

I find this useful too, but it doesn't work once we throw in the SSH output, since the dots and the SSH output get interleaved. I could look into a solution for that.

Will we now just see an SSH error message in their place?

Now we'll just see 3 dots followed by the start of a cluster build if everything goes well, and 3 dots followed by SSH output if something is potentially broken (as in the example above). In the former case the wait time until ssh-ready will still be printed as before.

@nchammas
Contributor Author

I'm thinking it would be useful for a separate PR to add some information about how many nodes of the cluster are up during a cluster launch like this:

Waiting for cluster to enter 'ssh-ready' state... ( 1/30)
...
Waiting for cluster to enter 'ssh-ready' state... (29/30)
Waiting for cluster to enter 'ssh-ready' state... (30/30)

Or maybe just a simple spinner:

Waiting for cluster to enter 'ssh-ready' state... /
Waiting for cluster to enter 'ssh-ready' state... -
Waiting for cluster to enter 'ssh-ready' state... \
Waiting for cluster to enter 'ssh-ready' state... |
Waiting for cluster to enter 'ssh-ready' state... /
...

The original trail of dots also works.

In all cases we need to figure out how to do that neatly while allowing SSH output to also be printed to screen.
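
As a rough sketch of the two proposals above (not code from this PR), the helpers below build the counter line and the spinner frames. Both function names are hypothetical. Printing each status on its own line is one simple, assumed way to keep SSH warnings from getting mixed into the progress display.

```python
import itertools


def counter_status(ready, total):
    """Status line with a right-aligned ready count, e.g. '( 1/30)'."""
    width = len(str(total))
    return "Waiting for cluster to enter 'ssh-ready' state... ({r:>{w}}/{t})".format(
        r=ready, w=width, t=total)


def spinner_frames():
    """Endless cycle of spinner characters: slash, dash, backslash, pipe."""
    return itertools.cycle("/-\\|")


if __name__ == "__main__":
    # Each status goes on its own line, so a warning printed between
    # two of them would not garble either one.
    for ready in (1, 29, 30):
        print(counter_status(ready, 30))
```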

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26293 has finished for PR 4262 at commit 84250a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26293/

@nchammas
Contributor Author

@shivaram Does this solution work for you, or do you want the growing dots back?

@pwendell Does this meet your suggestion in #4162?

@shivaram
Contributor

I haven't tried out this solution, so I am not exactly sure what gets printed (I can do it over the weekend sometime). At a high level, my comment is that every attempt that checks whether the cluster is ssh-ready should print some feedback on the screen so the user knows the script is not hung. If that is the case, I'm fine with this solution.

@nchammas
Contributor Author

You can see example output in the PR description.

I will look into adding feedback while the script is waiting on the cluster to reach a certain state.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26439 has started for PR 4262 at commit 6721c25.

  • This patch merges cleanly.

@nchammas
Contributor Author

OK, I added the dots back in. It doesn't look great, but it's serviceable:

$ ./ec2/spark-ec2 -k key -i /good/path/pem --instance-type m3.medium --slaves 40 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 40 slaves in us-east-1c, regid = r-811e646a
Launched master in us-east-1c, regid = r-501c66bb
Waiting for cluster to enter 'ssh-ready' state...........

Warning: SSH connection error. (This could be temporary.)
Host: 54.54.54.54
SSH return code: 255
SSH output: ssh: connect to host 54.54.54.54 port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: 54.54.54.54
SSH return code: 255
SSH output: ssh: connect to host 54.54.54.54 port 22: Connection refused

.
Cluster is now in 'ssh-ready' state. Waited 517 seconds.
Generating cluster's SSH key on master...

That's what we get in the case of a temporary failure while SSH is still coming up.
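
The output pattern above (one dot per attempt, with each warning set off by blank lines) could be produced by a loop along these lines. This is a simplified sketch under stated assumptions, not the PR's implementation: `wait_for_ssh`, the `check` callback, and its `(ok, warning_text)` return value are all hypothetical, and the real script also waits on instance status checks before testing SSH.

```python
import sys
import time


def wait_for_ssh(check, attempts, delay_seconds=0, out=sys.stdout):
    """Poll check() up to `attempts` times, writing one dot per attempt.

    check() is a hypothetical callback returning (ok, warning_text).
    Returns True as soon as ok is true, False if every attempt fails.
    """
    out.write("Waiting for cluster to enter 'ssh-ready' state")
    for _ in range(attempts):
        out.write(".")
        out.flush()  # show the dot immediately rather than at the next newline
        ok, warning = check()
        if ok:
            out.write("\n")
            return True
        # Blank lines keep the warning from running into the dots.
        out.write("\n\n" + warning + "\n\n")
        time.sleep(delay_seconds)
    return False
```

Because every attempt writes a dot before checking, the user gets feedback on each retry even when the warning text repeats.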

In the case of a permanent failure, you just get the repeated . on a separate line followed by the 4-line error report:

$ ./ec2/spark-ec2 -k key -i /wrong/path/pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-283248c3
Launched master in us-east-1c, regid = r-2b3248c0
Waiting for cluster to enter 'ssh-ready' state..............

Warning: SSH connection error. (This could be temporary.)
Host: 54.54.54.54
SSH return code: 255
SSH output: Warning: Identity file /wrong/path/pem not accessible: No such file or directory.
Warning: Permanently added '54.54.54.54' (RSA) to the list of known hosts.
Permission denied (publickey).

.

Warning: SSH connection error. (This could be temporary.)
Host: 54.54.54.54
SSH return code: 255
SSH output: Warning: Identity file /wrong/path/pem not accessible: No such file or directory.
Permission denied (publickey).

.

Warning: SSH conn...

What do you think?

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26444 has started for PR 4262 at commit 07f86df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26444 has finished for PR 4262 at commit 07f86df.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26444/

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26439 has finished for PR 4262 at commit 6721c25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26439/

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26445 has started for PR 4262 at commit 2c4d310.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26445 has finished for PR 4262 at commit 2c4d310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26445/

ec2/spark_ec2.py Outdated
Contributor Author

I think I'll change this to default to print_ssh_output=True.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26960 has started for PR 4262 at commit 7784669.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26961 has started for PR 4262 at commit 2b92534.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26961 has finished for PR 4262 at commit 2b92534.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26961/

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26960 has finished for PR 4262 at commit 7784669.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class GetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26960/

@nchammas
Contributor Author

nchammas commented Feb 7, 2015

This PR is ready for a final review.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26981 has started for PR 4262 at commit 8bda6ed.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26981 has finished for PR 4262 at commit 8bda6ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26981/

@srowen
Member

srowen commented Feb 7, 2015

@shivaram Do you have any more comments on this change?

@shivaram
Contributor

shivaram commented Feb 9, 2015

I just tried out the latest version and it worked fine - LGTM. Thanks @nchammas for the change.

asfgit pushed a commit that referenced this pull request Feb 9, 2015
If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is.

This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots.

For example:

```
$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...
Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory.
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
Permission denied (publickey).
```

This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`.

This is a usability improvement that should be backported to 1.2.

Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473).

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits:

8bda6ed [Nicholas Chammas] default to print SSH output
2b92534 [Nicholas Chammas] show SSH output after status check pass

(cherry picked from commit 4dfe180)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit asfgit closed this in 4dfe180 Feb 9, 2015
@nchammas nchammas deleted the expose-ssh-failure branch February 9, 2015 22:30