
CfnCluster fails to start a cluster with COMPUTE nodes in a PRIVATE subnet #531

@vbosquier


Hi,

I want to start a cfncluster with the MASTER node in a public subnet and the COMPUTE nodes in a private subnet. I don't want my compute nodes to be exposed: even though every inbound path to the nodes can be closed off through the Security Group, I'm convinced that giving the compute nodes a public IP is an unnecessary security weakness.
My question is:
HOW CAN I START A CLUSTER HAVING THE COMPUTE NODES IN A STRICTLY PRIVATE NETWORK (with no Internet access at all)?

Complementary info:
When I try to start such a cluster, the ComputeFleet fails to initialize and cluster creation fails with the following error:

Status: cfncluster-Test - CREATE_FAILED                             
Cluster creation failed.  Failed events:
  - AWS::CloudFormation::Stack cfncluster-Test The following resource(s) failed to create: [ComputeFleet]. 
  - AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2.  Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

In cloud-init.log, I can see that the script /var/lib/cloud/instance/scripts/part-002 fails on the COMPUTE nodes (there is no error on the MASTER, and no error when the COMPUTE nodes are launched in a public subnet):

2018-08-28 15:36:38,160 - util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-002'] with allowed return codes [0] (shell=False, capture=False)
2018-08-28 15:47:11,727 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
2018-08-28 15:47:11,728 - util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 802, in runparts
    subp(prefix + [exe_path], capture=False)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 1858, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-002']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
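On a node with a long cloud-init.log (typically /var/log/cloud-init.log), the failing part script can be located mechanically by scanning for the "Failed running" WARNING lines seen in the excerpt above. This is only an illustrative sketch: failed_scripts is a hypothetical helper, and the regular expression assumes the exact line format shown in this excerpt, which may differ between cloud-init versions. It reports which script failed and its exit code, not why; since the command ran with capture=False, the script's own output does not appear in this traceback.

```python
import re
from typing import Iterable, List, Tuple

# Assumed line format, taken from the excerpt above:
# 2018-08-28 15:47:11,727 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
# Only WARNING lines are matched, so the duplicate DEBUG line is not counted twice.
FAILED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - \S+\[WARNING\]: "
    r"Failed running (?P<script>\S+) \[(?P<code>\d+)\]"
)


def failed_scripts(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (timestamp, script path, exit code) for each failed cloud-init script."""
    results = []
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            results.append(
                (match.group("ts"), match.group("script"), int(match.group("code")))
            )
    return results
```

For example, failed_scripts(open("/var/log/cloud-init.log")) would list each failing part-NNN script once, with its timestamp and exit code.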

Please find below the configuration I use:

[aws]
aws_region_name = eu-west-1
aws_access_key_id = xxx
aws_secret_access_key = xxx

[global]
update_check = true
sanity_check = true
cluster_template = TorqueCluster

[cluster TorqueCluster]
vpc_settings = VPC-test
key_name = mykey
scheduler = torque
base_os = centos7
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
master_instance_type = m4.xlarge
compute_instance_type = t2.micro

[vpc VPC-test]
vpc_id = vpc-xxx
master_subnet_id = subnet-<public>
compute_subnet_id = subnet-<private>
vpc_security_group_id = sg-xxx
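The sections in this file are linked by name: cluster_template in [global] points to [cluster TorqueCluster], whose vpc_settings in turn points to [vpc VPC-test]. A quick way to double-check which subnets a config actually resolves to is to follow those references with Python's standard configparser module. This is a minimal sketch: resolve_subnets is a hypothetical helper that only follows the references visible in the file, and it does not reproduce CfnCluster's own defaults or sanity checks.

```python
import configparser
from typing import Dict, Optional


def resolve_subnets(text: str) -> Dict[str, Optional[str]]:
    """Follow [global] -> [cluster <template>] -> [vpc <settings>] and return subnet IDs."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    # cluster_template names the "cluster ..." section to use
    template = cfg["global"]["cluster_template"]
    cluster = cfg[f"cluster {template}"]
    # vpc_settings names the "vpc ..." section used by that cluster
    vpc = cfg[f"vpc {cluster['vpc_settings']}"]
    return {
        "master_subnet_id": vpc.get("master_subnet_id"),
        # None if the key is not set in this file
        "compute_subnet_id": vpc.get("compute_subnet_id"),
    }
```

Run against the configuration above, this should return subnet-<public> for the master and subnet-<private> for the compute fleet, confirming that the two node types are pointed at different subnets as intended.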

Thanks in advance for your help.
