Can't Launch Batch Custom AMI with ParallelCluster #829

Closed
zadelman opened this issue Jan 15, 2019 · 20 comments

@zadelman

Environment:

  • AWS ParallelCluster 2.1.0
  • OS: alinux
  • Scheduler: awsbatch
  • Master instance type: t2.micro
  • Compute instance type: optimal

Bug description and how to reproduce:
I'm trying to create a custom AMI with some software pre-installed for doing weather modeling. I was able to successfully create and launch an RHEL-based AMI using the pcluster CLI. The problem with that image was that I don't think I had the container agent set up correctly, because I could never get my job to run...it just stayed in the "RUNNABLE" phase. So I figured I'd try to use the Amazon ECS-optimized Linux AMI as my base image and build my computing platform on top of that. As a test, before installing anything, I just tried to launch a Batch cluster using an AMI from this list: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html

I continue to get a variation of this error:

Cluster creation failed. Failed events:

  • AWS::CloudFormation::Stack AWSBatchStack Resource creation cancelled
  • AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0f70fb2c183c37506

Additional context:
Here's my configuration file:

[aws]
aws_region_name = us-east-2

[cluster awswrf]
scheduler = awsbatch
compute_instance_type = optimal
key_name = ########
vpc_settings = public
ebs_settings = awswrf
master_instance_type = t2.micro
#master_root_volume_size = 40
min_vcpus = 0
max_vcpus = 40
desired_vcpus = 4
cluster_type = ondemand
#custom_ami = ami-009973ece6fe45688
custom_ami = ami-0c3da6571b6cfbe9a

[ebs awswrf]
shared_dir = data
ebs_snapshot_id = snap-00fa1f5bc9a7a9490
volume_type = gp2
volume_size = 500
volume_iops = 1500
encrypted = false
#ebs_volume_id = vol-0385e5d9d5f7b280d

[vpc public]
master_subnet_id = subnet-268aac4e
vpc_id = vpc-c2c6f8aa

[global]
update_check = true
sanity_check = true
cluster_template = awswrf

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

@lukeseawalker
Contributor

Hi @zadelman,
when the scheduler is awsbatch, the custom_ami parameter only applies to the master node.
AWS Batch is Docker based, and at this time we don't allow using custom AMIs for the compute environment.
The only way to have pre-installed software in the AWS Batch cluster is to install it into the shared filesystem, which is then mounted into the Docker containers.

As for the job staying in the "RUNNABLE" phase, it is likely a networking issue; you should double-check the requirements described in this section.
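For example, a minimal sketch of that workflow (the paths and the run.sh script are placeholders; awsbsub is the Batch submission wrapper available on the master node of an awsbatch cluster):

# On the master node: put binaries and scripts into the shared filesystem
# (mounted at /shared by default, or at your configured shared_dir, e.g. /data).
$ cp -r ~/wrf-build /shared/wrf

# Submit a job that runs from the shared filesystem; the shared directory is
# mounted into the container, so the pre-installed software is visible there.
$ awsbsub /shared/wrf/run.sh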

@zadelman
Author

Hi @lukeseawalker, thanks for these tips. In writing this ticket I found out about the --norollback option for the pcluster create command. It turns out my create was failing because I didn't have the mount point right for the EBS storage appliance I was attaching to the master instance. I found that out by tracing the messages in the log files written to /var/log on the master instance. That's a nice debugging feature.

In the end, I was able to get my custom AMI to start up using pcluster. I started with one of the AWS-optimized AMIs from this list. I spun up that AMI, installed all of my software (a Fortran-based weather model called WRF, compiled with PGI), and made a new AMI. I could then start that AMI with pcluster create. Now I'm back to the issue with jobs being stuck in "RUNNABLE."

This software is parallelized with MPI and I'm not clear on how best to get it running. My understanding (which is limited on this subject) is that Docker containers still aren't set up for running MPI jobs. I looked through the link that you provided about the compute_subnet settings, but I don't really know what to do with that information; it sounds totally foreign to me.

Here's what I was hoping to do; maybe you could point me in a better direction. My plan was to create an AMI with all of my modeling software installed (that's done). I want to run a series of MPI jobs (I'm thinking about 75 jobs of 32 to 64 processors each, running at the same time). I thought I could spin up a master node with my software binaries and scripts, and then issue AWS Batch jobs from that master node to spin up compute nodes to run my jobs. When I use the awsbatch scheduler to do this, the jobs are stuck in "RUNNABLE."

Do you have any insights on a better way to use pcluster to accomplish my job? Should I just provision all of my compute nodes using EC2 instead of Batch, and run using another queuing system like slurm? Any thoughts are appreciated. -Zac

@lukeseawalker
Contributor

From your plan it sounds like using a traditional scheduler is the way to go at the moment.
The "build custom AMI" part you already did is good; just a minor comment: it is better to keep the version of the pcluster package aligned with the AMI list you start from. So you should take the AMI from the v2.1.0 list here instead of the 2.0.2 one you used.

Using a traditional scheduler (sge, torque or slurm, the choice is yours), the software you installed into the AMI will be available on the compute nodes as well, where your computation will take place.

The following is just an example of how to run a "helloworld" MPI job with sge:

  • create a file helloworld.sh with the following content
#!/bin/bash

# Run from the submission directory, merge stdout and stderr,
# request 32 slots in the "mpi" parallel environment, use bash as the job shell.
#$ -cwd
#$ -j y
#$ -pe mpi 32
#$ -S /bin/bash

module load mpi/openmpi-x86_64
mpirun -np 32 hostname
  • submit the job
$ qsub helloworld.sh

At this point, ParallelCluster will try to spin up the number of instances needed to fit your job requirements. Please make sure that the max_queue_size configuration and your instance limits are well sized for your job requirements.
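If you choose slurm instead, a roughly equivalent submission would look like the sketch below (the module name and script name are assumptions and may differ in your environment):

#!/bin/bash
#SBATCH --job-name=helloworld
#SBATCH --ntasks=32

module load mpi/openmpi-x86_64
mpirun -np 32 hostname

and you would submit it with sbatch helloworld.sh.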

@zadelman
Author

zadelman commented Feb 6, 2019 via email

@sean-smith
Contributor

@zadelman

When you launch the cluster, launch with:

pcluster create new_cluster --norollback

When creation fails, grab the IP address from the console, SSH in, and check /var/log/cfn-init.log, /var/log/cloud-init-output.log, and /var/log/cloud-init.log.

Post here what errors you find and I can help figure out what's going on. Have you tried using the pcluster createami command? See https://aws-parallelcluster.readthedocs.io/en/latest/commands.html#createami
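For reference, a minimal sketch of that debugging flow (the key path and IP are placeholders; ec2-user is the default user for alinux):

$ pcluster create new_cluster --norollback
$ ssh -i ~/.ssh/your-key.pem ec2-user@<master-public-ip>
$ sudo grep -iE "error|fail" /var/log/cfn-init.log /var/log/cloud-init-output.log /var/log/cloud-init.log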

@zadelman
Author

zadelman commented Feb 15, 2019 via email

@lukeseawalker
Contributor

Hi @zadelman
we could not see any attached logs, I think because you replied to the ticket via mail.
Can you please post them through the GitHub portal?

tnx

@zadelman
Author

Here are the logs, including a chef-stacktrace that may include some useful information.

cfn-init.log
cloud-init.log
cloud-init-output.log
chef-stacktrace.out.log

@lukeseawalker
Contributor

Hi @zadelman
I suspect that you:
1 - created a cluster parallelcluster-ladcowrf
2 - customized the running master server instance and created a new AMI from it
3 - used that AMI to create parallelcluster-ladcowrf2

If this is what happened, it's not the correct way to create your custom AMI. The AMI you want to customize needs to be launched independently from ParallelCluster (e.g. from the EC2 console).

The procedure is described here: Modify an AWS ParallelCluster AMI
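Roughly, the flow described there looks like this (a sketch; the instance ID and AMI name are placeholders, and the create-image call is just one way to snapshot the instance):

# 1. Launch the base ParallelCluster AMI for your version/OS from the EC2 console.
# 2. SSH in and install your software (WRF, compilers, libraries, ...).
# 3. Create a new AMI from that instance, e.g. with the AWS CLI:
$ aws ec2 create-image --instance-id i-0123456789abcdef0 --name "pcluster-2.1.0-alinux-wrf"
# 4. Reference the resulting AMI ID as custom_ami in your cluster configuration.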

@afernandezody

Hello everyone,
My two cents:
I also run MPI-based solvers, and installing my software can take up to several hours, so a customized AMI was also of interest to me. I have been tinkering with it for the last two days and this is what has worked and what hasn't:
i) You must follow the instructions linked by @lukeseawalker thoroughly. I started with a base AMI from a different region and got the same chef error as @zadelman.
ii) I tried CentOS (personal preference) and although the previous error was fixed, the cloud-init chef script still failed with a different error message (logs are attached below).
iii) I switched to ALinux and the cloud-init chef script finally seemed to go through (the logs showed a successful completion), but the deployment still failed with the following error message: "AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement." The information that I was able to find refers to CloudFormation and suggests modifying the JSON template, particularly the Metadata section. However, I'm not sure whether fixing this issue would require some extra inputs in the ParallelCluster configuration (I tried maintain_initial_size with both true and false, but it didn't make any difference).

My sample configuration for a small cluster reads:
[aws]
aws_region_name = us-west-2
aws_access_key_id = XXXX
aws_secret_access_key = XXXX

[cluster odycluster]
vpc_settings = odyvpc
placement_group = DYNAMIC
placement = compute
key_name = XXXX
master_instance_type = t2.micro
compute_instance_type = m5a.large
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = false
cluster_type = ondemand
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
scheduler = sge
base_os = alinux
custom_ami = ami-0de4698be5f2698c8

[vpc odyvpc]
master_subnet_id = subnet-XXXX
vpc_id = vpc-XXXX
vpc_security_group_id = sg-XXXX

[global]
update_check = true
sanity_check = true
cluster_template = odycluster

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Thanks,

boot.log
cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log
yum.log

@lukeseawalker
Contributor

Hi @afernandezody,
from the logs I see that you are using ParallelCluster 2.1.0, but the AMI you are using was created from version 2.0.2.

To create your custom AMI you need to start from the released AMI matching your ParallelCluster version, so you must pick the AMI from this list: https://github.com/aws/aws-parallelcluster/blob/v2.1.0/amis.txt

In your case, for CentOS 7, start from

  • us-west-2: ami-03abc3d448a987e73

and for alinux, start from

  • us-west-2: ami-0bdc62a7fa9026972

@afernandezody

afernandezody commented Mar 1, 2019

Hi @lukeseawalker,
Thank you for your quick reply. I went back and made sure that the original AMI version agrees with that of ParallelCluster; when that is not the case, chef fails. However, even when the versions agree, the deployment still fails with the message "AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement." The logs (attached below) indicate that chef completed successfully, which leads me to believe that something else is missing.

cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log
cloud-init-output.log
supervisord.log

P.S. Any chance that upgrading to 2.2.1 would fix this issue?

@demartinofra
Contributor

demartinofra commented Mar 1, 2019

@afernandezody The issue seems to be with the compute nodes. What would really help identify the root cause here are the logs pulled from the compute instance. Could you try to create the cluster with the --norollback option, log into the compute node (from the master) and pull the logs from there?

Thank you!

@afernandezody

Hi @demartinofra,
I can't provide that info because I cannot ssh from the master to the compute nodes. Apparently the cluster creation has not finished, and when sshing I get the usual:

[centos@ip-172-31-11-40 ~]$ ssh 172.31.7.86
The authenticity of host '172.31.7.86 (172.31.7.86)' can't be established.
ECDSA key fingerprint is SHA256:3bMzUh4bMLRHm2Cr05GI2glu0Qnrwp9++hKk/P+Hipw.
ECDSA key fingerprint is MD5:af:20:62:8f:d0:56:e4:95:ec:43:bd:ea:2b:42:dc:22.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.31.7.86' (ECDSA) to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

The command qhost doesn't show any compute node either.
Thank you.
P.S. I upgraded everything (pcluster + AMIs) to 2.2.1 and that didn't make any difference.

@demartinofra
Contributor

I think ssh is just failing because you either need to forward the agent (and add the key to the agent) when sshing into the master, or copy the ssh key to the master node and use it explicitly (with the -i option) when sshing into the compute node.
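For example (a sketch; the key path and addresses are placeholders, and the user is centos for a CentOS base_os or ec2-user for alinux):

# Option 1: agent forwarding from your workstation
$ ssh-add ~/.ssh/your-key.pem
$ ssh -A centos@<master-public-ip>
$ ssh <compute-private-ip>

# Option 2: copy the key to the master and use it explicitly
$ scp -i ~/.ssh/your-key.pem ~/.ssh/your-key.pem centos@<master-public-ip>:~/.ssh/
$ ssh -i ~/.ssh/your-key.pem centos@<master-public-ip>
$ ssh -i ~/.ssh/your-key.pem <compute-private-ip>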

@afernandezody

afernandezody commented Mar 2, 2019

Hi @demartinofra,
I had some trouble sshing from the master node, but was able to ssh directly into the compute node through its public IPv4 address (which I'm not sure the configuration is meant to allow). Anyhow, the logs are attached. I'm unfamiliar with chef, so there is only so much I can interpret from the logs. It reads like the problem is either with the mounting or with some metadata. I tried the same process after erasing everything from the shared subdirectory (just in case that was interfering with the mounting), so the only modifications versus the original AMI reside in the /opt subdirectory. However, that didn't work either and I got the same error message.
Thanks.

boot.log
cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log

@afernandezody

Hello again,
I was able to isolate the issue and trace it back to the 'vpc_security_group_id = sg-XXXX' setting (the same security group used to create the custom AMI) in the configuration. This somehow seems to conflict with chef. I'm not sure about the details, but I will update my post as I figure out what works and what doesn't.

@enrico-usai
Contributor

Hi @afernandezody ,
thank you for your analysis and for the logs.

The error in the cfn-init.log is:

---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-12-97.us-west-2.compute.internal:/home /home ----
STDOUT: 
STDERR: mount.nfs: Connection timed out

It means the compute node is not able to mount the NFS share exported by the master.
This could depend on your vpc_security_group_id = sg-XXXX setting, since the compute nodes must be able to reach port 2049 (NFS) on the master node.
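For example, a self-referencing inbound rule on that security group would allow NFS traffic between master and compute nodes (a sketch; sg-XXXX is your own group ID):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-XXXX \
    --protocol tcp \
    --port 2049 \
    --source-group sg-XXXX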

Please let us know if it helps.

@afernandezody

Hi Enrico,
I haven't looked into this issue for the last month plus, but I'll check what you're suggesting next week. Thanks.

@afernandezody

afernandezody commented Apr 25, 2019

It finally worked once I double-checked that everything agrees between the AMI and the cluster. Thanks.
