AWS cluster deployed but 'Some nodes of the cluster were unreachable' and have 'No IP address' and uncontactable via ssh #490

Closed
swingingsimian opened this Issue Sep 26, 2017 · 12 comments

swingingsimian commented Sep 26, 2017

Hi

I have had some success getting a test cluster up, but the 'start' process hangs for ages, then appears to time out and fails to register an IP address for any of the nodes. The ssh key was created in the AWS console and the public portion was grabbed from a running test instance.

Config is as follows:

[cloud/amazon-us-east-1]
provider=ec2_boto
ec2_url=https://ec2.us-east-1.amazonaws.com
ec2_access_key=REMOVED
ec2_secret_key=REMOVED
ec2_region=us-east-1

[login/ubuntu]
image_user=ubuntu
image_sudo=True
user_key_name=elasticluster
user_key_private=~/.ssh/elasticluster.pem
user_key_public=~/.ssh/elasticluster.pub

[setup/slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_worker

[cluster/slurm-on-ubuntu16]
setup=slurm
frontend_nodes=1
compute_nodes=4
ssh_to=frontend
# Ubuntu 16.04
image_id=ami-cd0f5cb6
cloud=amazon-us-east-1
login=ubuntu
security_group=default

# Testing with free tier
flavor=t2.micro

Starting cluster as follows:

# elasticluster -c /usr/local/src/cegx-cluster/config start slurm-on-ubuntu16
Starting cluster slurm-on-ubuntu16 with:
* 1 frontend nodes.
* 4 compute nodes.
(This may take a while...)
2017-09-26 09:10:41 0fde7a0f3e33 gc3.elasticluster[16] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 09:10:41 0fde7a0f3e33 gc3.elasticluster[16] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 09:10:42 0fde7a0f3e33 gc3.elasticluster[16] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 09:10:42 0fde7a0f3e33 gc3.elasticluster[16] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 09:10:42 0fde7a0f3e33 gc3.elasticluster[16] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!

# Hangs here for several minutes, but cluster appears in AWS console as expected

2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] ERROR Some nodes of the cluster were unreachable within the given 600-seconds timeout: frontend001, compute001, compute002, compute004, compute003
Configuring the cluster.
(this too may take a while...)
2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] WARNING Ignoring node frontend001: No IP address.
2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] WARNING Ignoring node compute001: No IP address.
2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] WARNING Ignoring node compute002: No IP address.
2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] WARNING Ignoring node compute003: No IP address.
2017-09-26 09:21:24 0fde7a0f3e33 gc3.elasticluster[16] WARNING Ignoring node compute004: No IP address.

Your cluster is ready!

Cluster name: slurm-on-ubuntu16
Cluster template: slurm-on-ubuntu16
Default ssh to node: frontend001
- frontend nodes: 1
- compute nodes: 4

To login on the frontend node, run the command:

elasticluster ssh slurm-on-ubuntu16

To upload or download files to the cluster, use the command:

elasticluster sftp slurm-on-ubuntu16

Connecting to the cluster over ssh, either via elasticluster or directly using the key, fails:

$ ssh -vvv -i elasticluster.pem ubuntu@ec2-34-228-66-210.compute-1.amazonaws.com
OpenSSH_7.4p1, LibreSSL 2.5.0
debug1: Reading configuration data /Users/nathanjohnson/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: resolving "ec2-34-228-66-210.compute-1.amazonaws.com" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to ec2-34-228-66-210.compute-1.amazonaws.com [34.228.66.210] port 22.
debug1: connect to address 34.228.66.210 port 22: Connection refused
ssh: connect to host ec2-34-228-66-210.compute-1.amazonaws.com port 22: Connection refused

I'm a bit stumped here, any input greatly appreciated.

Thanks
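
When ssh fails like this, a plain TCP probe from the machine running ElastiCluster can help separate a refused connection from a filtered one. The sketch below uses only the Python standard library; `probe_tcp` is a hypothetical helper written for illustration and is not part of ElastiCluster. As a rule of thumb, "refused" means a host answered but nothing accepted the connection on that port, while "timeout" often points to packets being dropped, for example by a firewall or security group.

```python
import socket


def probe_tcp(host, port=22, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote end actively rejected the connection (TCP RST).
        return "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return "timeout"
    except OSError:
        # Anything else: DNS failure, unreachable network, and so on.
        return "error"
```

Calling `probe_tcp("ec2-34-228-66-210.compute-1.amazonaws.com")` would probe the frontend address quoted above; the result depends on the state of the instance and its security group at that moment.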

Member

riccardomurri commented Sep 26, 2017

I am trying to use your config to start a cluster on AWS EC2 and indeed things do not work as expected, although I get a different error:

  • the VMs are started correctly

  • ElastiCluster's Python code can connect via SSH::

      gc3.elasticluster[5320] INFO Connection to node `frontend001` successful, using IP address 174.129.54.58 to connect.
      gc3.elasticluster[5320] INFO Connection to node `compute003` successful, using IP address 54.221.71.81 to connect.
      gc3.elasticluster[5320] INFO Connection to node `compute004` successful, using IP address 107.22.140.140 to connect.
      gc3.elasticluster[5320] INFO Connection to node `compute001` successful, using IP address 52.201.48.233 to connect.
      gc3.elasticluster[5320] INFO Connection to node `compute002` successful, using IP address 54.174.220.36 to connect.
    
  • Ansible marks all VMs as "unreachable"::

      Configuring the cluster.
      ...
      TASK [Gathering Facts] **********************************************************************************************************************************************************************
      fatal: [compute003]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"54.221.71.81\". Make sure this host can be reached over ssh", "unreachable": true}
      fatal: [frontend001]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"174.129.54.58\". Make sure this host can be reached over ssh", "unreachable": true}
      fatal: [compute004]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"107.22.140.140\". Make sure this host can be reached over ssh", "unreachable": true}
      fatal: [compute002]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"54.174.220.36\". Make sure this host can be reached over ssh", "unreachable": true}
      fatal: [compute001]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"52.201.48.233\". Make sure this host can be reached over ssh", "unreachable": true}
    
  • still, I can connect to the VMs via ssh from the command-line...

Member

riccardomurri commented Sep 26, 2017

If I upgrade to Ansible 2.4.0.0 I get a different error, which is likely the true cause of the failure::

    TASK [Gathering Facts] **********************************************************************************************************************************************************************
    fatal: [compute003]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 54.221.71.81 closed.\r\n", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
    ...

This is related to Ubuntu 16.04 having dropped Python2 (hence, /usr/bin/python) from the base system -- see #304
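
The two failure modes seen so far in this thread (hosts reported as UNREACHABLE, and modules failing because `/usr/bin/python` is missing) can be told apart mechanically from the per-host result that Ansible prints. A minimal sketch, assuming the result has already been cut out of the log as a JSON string; `diagnose` is a hypothetical helper, not part of ElastiCluster or Ansible:

```python
import json


def diagnose(result):
    """Map an Ansible per-host result dict to a short hint (hypothetical helper)."""
    if result.get("unreachable"):
        # Ansible could not talk to the host over SSH at all.
        return "unreachable"
    if "/usr/bin/python: not found" in result.get("module_stdout", ""):
        # SSH worked, but the image has no Python 2 interpreter at the default path.
        return "missing-python"
    return "unknown"


# Payloads copied from the task output quoted in this thread.
failed = json.loads(
    r'{"changed": false, "failed": true, '
    r'"module_stderr": "Shared connection to 54.221.71.81 closed.\r\n", '
    r'"module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", '
    r'"msg": "MODULE FAILURE", "rc": 0}'
)
unreachable = json.loads(
    r'{"changed": false, "msg": "SSH Error: data could not be sent to remote host '
    r'\"54.221.71.81\". Make sure this host can be reached over ssh", "unreachable": true}'
)

print(diagnose(failed))
print(diagnose(unreachable))
```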

Member

riccardomurri commented Sep 26, 2017

Can you please pull the latest "master" code (commit def6864 or later), then pip install "boto>=2.48" and re-try?

I cannot reproduce your exact problem, but my AWS connection did not work without this latest patch.

riccardomurri self-assigned this Sep 26, 2017

riccardomurri added the ec2 label Sep 26, 2017

Author

swingingsimian commented Sep 26, 2017

No joy I'm afraid. I tried with a fresh checkout and pip install "boto>=2.48", but I got the same behaviour.

I see the partial fix in #304 regarding global_var_ansible_python_interpreter, but it seems there are still ongoing gnarly problems with xenial on that ticket. I will jump back to trusty for now. The other factor here which I didn't mention is that this is dockerised rather than virtualenv'd, but I don't expect that's making a difference.

Thanks

Member

riccardomurri commented Sep 26, 2017

I'm not able to reproduce the issue you mentioned at the beginning. Do you still get the "No IP address" error, or does Ansible start now and fail with the "/usr/bin/python not found" error?

Author

swingingsimian commented Sep 26, 2017

Same errors. To be clear, I only updated the checkout; I did not apply the global_var_ansible_python_interpreter fix.

# pip install "boto>=2.48"
Requirement already satisfied: boto>=2.48 in /usr/local/lib/python2.7/dist-packages
root@133fe7b334dc:/# elasticluster -c /usr/local/src/cegx-cluster/config start slurm-on-ubuntu16
Starting cluster `slurm-on-ubuntu16` with:
* 1 frontend nodes.
* 4 compute nodes.
(This may take a while...)
2017-09-26 14:39:52 133fe7b334dc gc3.elasticluster[22] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 14:39:52 133fe7b334dc gc3.elasticluster[22] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 14:39:53 133fe7b334dc gc3.elasticluster[22] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 14:39:53 133fe7b334dc gc3.elasticluster[22] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 14:39:53 133fe7b334dc gc3.elasticluster[22] ERROR Apparently, Amazon does not compute the RSA key fingerprint as we do! We cannot check if the uploaded keypair is correct!
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] ERROR Some nodes of the cluster were unreachable within the given 600-seconds timeout: compute002, compute001, frontend001, compute003, compute004
Configuring the cluster.
(this too may take a while...)
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] WARNING Ignoring node `frontend001`: No IP address.
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] WARNING Ignoring node `compute001`: No IP address.
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] WARNING Ignoring node `compute002`: No IP address.
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] WARNING Ignoring node `compute003`: No IP address.
2017-09-26 14:50:35 133fe7b334dc gc3.elasticluster[22] WARNING Ignoring node `compute004`: No IP address.
Your cluster is ready!

Cluster name:     slurm-on-ubuntu16
Cluster template: slurm-on-ubuntu16
Default ssh to node: frontend001
- frontend nodes: 1
- compute nodes: 4

To login on the frontend node, run the command:

    elasticluster ssh slurm-on-ubuntu16

To upload or download files to the cluster, use the command:

    elasticluster sftp slurm-on-ubuntu16

Member

riccardomurri commented Sep 26, 2017

Author

swingingsimian commented Sep 27, 2017

Thanks

Yes, I see from reading the manual:

The following are the default rules for each default security group:

  • Allows all inbound traffic from other instances associated with the default security group (the security group specifies itself as a source security group in its inbound rules)

  • Allows all outbound traffic from the instance.

This is counterintuitive, though, as I can ssh into a manually started instance just fine. I will add a custom group with explicit external SSH permitted and retest.

Thanks
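
The effect of the default rules quoted above can be illustrated with a toy model. This is a sketch under stated assumptions: the rule dictionaries are a made-up layout (not what boto or the AWS API returns), `allows_inbound` is a hypothetical helper, and the `0.0.0.0/0` source for the custom group is an assumed example of "explicit external SSH permitted".

```python
import ipaddress

# Toy model of EC2 inbound rules; the dict layout is an assumption made for
# this sketch and is not the format used by boto or the AWS API.
DEFAULT_GROUP_RULES = [
    # "All traffic" from instances that are themselves in the default group.
    {"protocol": "all", "source_group": "default"},
]

SSH_GROUP_RULES = [
    # Assumed custom rule: TCP port 22 from any IPv4 address.
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "0.0.0.0/0"},
]


def allows_inbound(rules, port, source_ip, source_groups=()):
    """Return True if any rule admits TCP traffic to `port` from the source."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule["protocol"] not in ("all", "tcp"):
            continue
        if "from_port" in rule and not rule["from_port"] <= port <= rule["to_port"]:
            continue
        if "source_group" in rule:
            # Group-based rules match on group membership, not on IP address.
            if rule["source_group"] in source_groups:
                return True
        elif addr in ipaddress.ip_network(rule["cidr"]):
            return True
    return False


# A machine outside AWS is not a member of any security group.
print(allows_inbound(DEFAULT_GROUP_RULES, 22, "203.0.113.7"))               # False
print(allows_inbound(DEFAULT_GROUP_RULES, 22, "172.31.5.9", ["default"]))   # True
print(allows_inbound(SSH_GROUP_RULES, 22, "203.0.113.7"))                   # True
```

In this model, an unmodified default group admits traffic only from its own members, so a connection from outside AWS is rejected until a rule with an explicit address range is added.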

Author

swingingsimian commented Sep 27, 2017

After schooling myself on security groups, am now using 'ssh-group' instead of 'default' security group, which got me to the known python not found issue:

...
Configuring the cluster.
(this too may take a while...)
[DEPRECATION WARNING]: 'include' for playbook includes. You should use 'import_playbook' instead.
This feature will be removed in version 2.8. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: The use of 'include' for tasks has been deprecated. Use 'import_tasks' for
static inclusions or 'include_tasks' for dynamic inclusions. This feature will be removed in a
future release. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
[DEPRECATION WARNING]: include is kept for backwards compatibility but usage is discouraged. The
module documentation details page may explain more about this rationale.. This feature will be
removed in a future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.

PLAY [Apply local customizations (before)] **********************************************************

TASK [Gathering Facts] ******************************************************************************
fatal: [frontend001]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 34.235.87.21 closed.\r\n", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
fatal: [compute004]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 54.242.232.246 closed.\r\n", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
fatal: [compute003]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 54.91.1.5 closed.\r\n", "module_stdout": "\r\n/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
fatal: [compute001]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 54.197.165.103 closed.\r\n", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
fatal: [compute002]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 54.237.229.174 closed.\r\n", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "rc": 0}
	to retry, use: --limit @/usr/local/src/elasticluster/src/elasticluster/share/playbooks/site.retry

PLAY RECAP ******************************************************************************************
compute001                 : ok=0    changed=0    unreachable=0    failed=1
compute002                 : ok=0    changed=0    unreachable=0    failed=1
compute003                 : ok=0    changed=0    unreachable=0    failed=1
compute004                 : ok=0    changed=0    unreachable=0    failed=1
frontend001                : ok=0    changed=0    unreachable=0    failed=1

2017-09-27 13:24:44 b862cf15f217 gc3.elasticluster[17] ERROR Command `ansible-playbook /usr/local/src/elasticluster/src/elasticluster/share/playbooks/site.yml --inventory=/root/.elasticluster/storage/slurm-on-ubuntu16.inventory --become --become-user=root` failed with exit code 2.
2017-09-27 13:24:44 b862cf15f217 gc3.elasticluster[17] ERROR Check the output lines above for additional information on this error.
2017-09-27 13:24:44 b862cf15f217 gc3.elasticluster[17] ERROR The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup` or fix the playbooks.
2017-09-27 13:24:44 b862cf15f217 gc3.elasticluster[17] WARNING Cluster `slurm-on-ubuntu16` not yet configured. Please, re-run `elasticluster setup slurm-on-ubuntu16` and/or check your configuration

WARNING: YOUR CLUSTER IS NOT READY YET!

Cluster name:     slurm-on-ubuntu16
Cluster template: slurm-on-ubuntu16
Default ssh to node: frontend001
- frontend nodes: 1
- compute nodes: 4

To login on the frontend node, run the command:

    elasticluster ssh slurm-on-ubuntu16

To upload or download files to the cluster, use the command:

    elasticluster sftp slurm-on-ubuntu16

Given the workaround for xenial in #304 involves configuring a python2.7 install and a seemingly unresolved auto-update race condition, I'm still preferring the trusty option here.

Thanks for the help.

Member

riccardomurri commented Sep 27, 2017

> After schooling myself on security groups, am now using 'ssh-group' instead of 'default' security group, which got me to the known python not found issue

Ok, I gather I can close this issue then?

Would you have any suggestion on how to update the documentation so
that other people do not run into the same issue in the future?

> Given the workaround for xenial in #304 involves configuring a python2.7 install and a seemingly unresolved auto-update race condition, I'm still preferring the trusty option here.

Sure, if you have the option of running on Ubuntu "trusty", that's definitely less hassle.

Author

swingingsimian commented Sep 27, 2017

Yup, close this off now.

Re docs, only small changes really.

Add an example such as 'ssh-group' to the security_group docs here:

http://elasticluster.readthedocs.io/en/latest/configure.html?highlight=security_group

'e.g. for EC2 ssh-group'

Also, under user_key_name:

http://elasticluster.readthedocs.io/en/latest/configure.html?highlight=user_key_name

This:
'If the keypair does not exist it will be created by ElastiCluster'

That made me think ElastiCluster would actually run ssh-keygen for me and add the keys. Admittedly, if I'd spent a few more seconds trying to understand the sentence, I would have realised that ElastiCluster does not 'create' the keys but adds them to AWS.

So maybe:
'If the pre-generated keypair does not exist on the cloud platform it will be added by ElastiCluster'

Thanks

Member

riccardomurri commented Sep 29, 2017

Clarifications added to the docs.

Thanks for taking the time to review the docs and suggesting improvements!
