Can't Launch Batch Custom AMI with ParallelCluster #829

Closed
zadelman opened this issue Jan 15, 2019 · 20 comments

@zadelman

Environment:

  • AWS ParallelCluster 2.1.0
  • OS: alinux
  • Scheduler: awsbatch
  • Master instance type: t2.micro
  • Compute instance type: optimal

Bug description and how to reproduce:
I'm trying to create a custom AMI with some software pre-installed for doing weather modeling. I was able to successfully create and launch an RHEL-based AMI using the pcluster CLI. The problem with that image was that I don't think I had the container agent set up correctly, because I could never get my job to run...it just stayed in the "RUNNABLE" phase. So I figured I'd try to use the Amazon ECS-optimized Linux AMI as my base image and build my computing platform on top of that. As a test, before installing anything, I just tried to launch a Batch cluster using an AMI from this list: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html

I continue to get a variation of this error:

Cluster creation failed. Failed events:

  • AWS::CloudFormation::Stack AWSBatchStack Resource creation cancelled
  • AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0f70fb2c183c37506

Additional context:
Here's my configuration file:

[aws]
aws_region_name = us-east-2

[cluster awswrf]
scheduler = awsbatch
compute_instance_type = optimal
key_name = ########
vpc_settings = public
ebs_settings = awswrf
master_instance_type = t2.micro
#master_root_volume_size = 40
min_vcpus = 0
max_vcpus = 40
desired_vcpus = 4
cluster_type = ondemand
#custom_ami = ami-009973ece6fe45688
custom_ami = ami-0c3da6571b6cfbe9a

[ebs awswrf]
shared_dir = data
ebs_snapshot_id = snap-00fa1f5bc9a7a9490
volume_type = gp2
volume_size = 500
volume_iops = 1500
encrypted = false
#ebs_volume_id = vol-0385e5d9d5f7b280d

[vpc public]
master_subnet_id = subnet-268aac4e
vpc_id = vpc-c2c6f8aa

[global]
update_check = true
sanity_check = true
cluster_template = awswrf

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

@lukeseawalker
Contributor

Hi @zadelman,
when the scheduler is awsbatch, the custom_ami parameter only applies to the master node.
AWS Batch is Docker based, and at this time we don't allow using custom AMIs for the compute environment.
The only way to have pre-installed software in the AWS Batch cluster is to install it into the shared filesystem, which is then mounted into the Docker containers.

As for the job staying in the "RUNNABLE" phase, it is likely a networking issue; you should double-check the requirements described in this section.
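For example, a minimal sketch of that workflow (the paths and the run.sh script are placeholders; awsbsub is the Batch submission wrapper available on the master node of an awsbatch cluster):

# On the master node: put binaries and scripts into the shared filesystem
# (mounted at /shared by default, or at your configured shared_dir, e.g. /data).
$ cp -r ~/wrf-build /shared/wrf

# Submit a job that runs from the shared filesystem; the shared directory is
# mounted into the container, so the pre-installed software is visible there.
$ awsbsub /shared/wrf/run.sh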

@zadelman
Author

Hi @lukeseawalker, thanks for these tips. In writing this ticket I found out about the --norollback option for the pcluster create command. It turns out my create was failing because I didn't have the mount point right for the EBS storage appliance I was attaching to the master instance. I found that out by tracing the messages in the log files written to /var/log on the master instance. That's a nice debugging feature.

In the end, I was able to get my custom AMI to start up using pcluster. I started with one of the AWS-optimized AMIs from this list. I spun up that AMI, installed all of my software (a Fortran-based weather model called WRF, compiled with PGI), and made a new AMI. I could then start that AMI with pcluster create. Now I'm back to the issue with jobs being stuck in "RUNNABLE."

This software is parallelized with MPI and I'm not clear on how best to get it running. My understanding (which is limited on this subject) is that Docker containers still aren't set up for running MPI jobs. I looked through the link that you provided about the compute_subnet settings, but I don't really know what to do with that information; it sounds totally foreign to me.

Here's what I was hoping to do; maybe you could point me in a better direction. My plan was to create an AMI with all of my modeling software installed (that's done). I want to run a series of MPI jobs (I'm thinking about 75 jobs of 32 to 64 processors each, running at the same time). I thought I could spin up a master node with my software binaries and scripts, and then issue AWS Batch jobs from that master node to spin up compute nodes to run my jobs. When I use the awsbatch scheduler to do this, the jobs are stuck in "RUNNABLE."

Do you have any insights on a better way to use pcluster to accomplish my job? Should I just provision all of my compute nodes using EC2 instead of Batch, and run using another queuing system like slurm? Any thoughts are appreciated. -Zac

@lukeseawalker
Contributor

From your plan it sounds like using a traditional scheduler is the way to go at the moment.
The "build custom AMI" part you already did is good; just a minor comment: it is better to keep the version of the pcluster package aligned with the AMI list you start from. So you should take the AMI from the v2.1.0 list here instead of the 2.0.2 one you used.

Using a traditional scheduler (sge, torque or slurm, the choice is yours), the software you installed into the AMI will be available on the compute nodes as well, where your computation will take place.

The following is just an example of how to run a "helloworld" MPI job with sge:

  • create a file helloworld.sh with the following content
#!/bin/bash

# Run from the submission directory, merge stdout and stderr,
# request 32 slots in the "mpi" parallel environment, use bash as the job shell.
#$ -cwd
#$ -j y
#$ -pe mpi 32
#$ -S /bin/bash

module load mpi/openmpi-x86_64
mpirun -np 32 hostname
  • submit the job
$ qsub helloworld.sh

At this point, ParallelCluster will try to spin up the number of instances needed to fit your job requirements. Please make sure that the max_queue_size configuration and your instance limits are well sized for your job requirements.
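If you choose slurm instead, a roughly equivalent submission would look like the sketch below (the module name and script name are assumptions and may differ in your environment):

#!/bin/bash
#SBATCH --job-name=helloworld
#SBATCH --ntasks=32

module load mpi/openmpi-x86_64
mpirun -np 32 hostname

and you would submit it with sbatch helloworld.sh.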

@zadelman
Author

zadelman commented Feb 6, 2019 via email

@sean-smith
Contributor

@zadelman

When you launch the cluster, launch with:

pcluster create new_cluster --norollback

When creation fails, grab the IP address from the console, SSH in, and check /var/log/cfn-init.log, /var/log/cloud-init-output.log, and /var/log/cloud-init.log.

Post here what errors you find and I can help figure out what's going on. Have you tried using the pcluster createami command? See https://aws-parallelcluster.readthedocs.io/en/latest/commands.html#createami
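For reference, a minimal sketch of that debugging flow (the key path and IP are placeholders; ec2-user is the default user for alinux):

$ pcluster create new_cluster --norollback
$ ssh -i ~/.ssh/your-key.pem ec2-user@<master-public-ip>
$ sudo grep -iE "error|fail" /var/log/cfn-init.log /var/log/cloud-init-output.log /var/log/cloud-init.log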

@zadelman
Author

zadelman commented Feb 15, 2019 via email

@lukeseawalker
Contributor

Hi @zadelman
we could not see any attached logs, I think because you replied to the ticket via mail.
Can you please post them through the GitHub portal?

tnx

@zadelman
Author

Here are the logs, including a chef-stacktrace that may include some useful information.

cfn-init.log
cloud-init.log
cloud-init-output.log
chef-stacktrace.out.log

@lukeseawalker
Contributor

Hi @zadelman
I suspect that you:
1 - created a cluster parallelcluster-ladcowrf
2 - customized the running master server instance and created a new AMI from it
3 - used that AMI to create parallelcluster-ladcowrf2

If this is what happened, it's not the correct way to create your custom AMI. The AMI you want to customize needs to be launched independently from ParallelCluster (e.g. from the EC2 console).

The procedure is described here: Modify an AWS ParallelCluster AMI
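Roughly, the flow described there looks like this (a sketch; the instance ID and AMI name are placeholders, and the create-image call is just one way to snapshot the instance):

# 1. Launch the base ParallelCluster AMI for your version/OS from the EC2 console.
# 2. SSH in and install your software (WRF, compilers, libraries, ...).
# 3. Create a new AMI from that instance, e.g. with the AWS CLI:
$ aws ec2 create-image --instance-id i-0123456789abcdef0 --name "pcluster-2.1.0-alinux-wrf"
# 4. Reference the resulting AMI ID as custom_ami in your cluster configuration.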

@afernandezody

Hello everyone,
My two cents:
I also run MPI-based solvers, and installing my software can take up to several hours, so a customized AMI was also of interest to me. I have been tinkering with it for the last two days and this is what has worked and what hasn't:
i) You must follow the instructions linked by @lukeseawalker thoroughly. I started with a base AMI from a different region and got the same chef error as @zadelman.
ii) I tried CentOS (personal preference) and although the previous error was fixed, the cloud-init chef script still failed with a different error message (logs are attached below).
iii) I switched to ALinux and the cloud-init chef script finally seemed to go through (the logs showed a successful completion), but the deployment still failed with the following error message: "AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement." The information that I was able to find refers to CloudFormation and suggests modifying the JSON template, particularly the Metadata section. However, I'm not sure whether fixing this issue would require some extra inputs in the ParallelCluster configuration (I tried maintain_initial_size with both true and false, but it didn't make any difference).

My sample configuration for a small cluster reads:
[aws]
aws_region_name = us-west-2
aws_access_key_id = XXXX
aws_secret_access_key = XXXX

[cluster odycluster]
vpc_settings = odyvpc
placement_group = DYNAMIC
placement = compute
key_name = XXXX
master_instance_type = t2.micro
compute_instance_type = m5a.large
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = false
cluster_type = ondemand
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
scheduler = sge
base_os = alinux
custom_ami = ami-0de4698be5f2698c8

[vpc odyvpc]
master_subnet_id = subnet-XXXX
vpc_id = vpc-XXXX
vpc_security_group_id = sg-XXXX

[global]
update_check = true
sanity_check = true
cluster_template = odycluster

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Thanks,

boot.log
cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log
yum.log

@lukeseawalker
Contributor

Hi @afernandezody,
from the logs I see that you are using ParallelCluster 2.1.0, but the AMI you are using was created from version 2.0.2.

To create your custom AMI you need to start from the released AMI matching your ParallelCluster version, so you must pick the AMI from this list: https://github.com/aws/aws-parallelcluster/blob/v2.1.0/amis.txt

In your case, for CentOS 7, start from

  • us-west-2: ami-03abc3d448a987e73

and for alinux, start from

  • us-west-2: ami-0bdc62a7fa9026972

@afernandezody

afernandezody commented Mar 1, 2019

Hi @lukeseawalker,
Thank you for your quick reply. I went back and made sure that the original AMI version agrees with that of ParallelCluster; when that is not the case, chef fails. However, even when the versions agree, the deployment still fails with the message "AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement." The logs (attached below) indicate that chef completed successfully, which leads me to believe that something else is missing.

cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log
cloud-init-output.log
supervisord.log

P.S. Any chance that upgrading to 2.2.1 would fix this issue?

@demartinofra
Contributor

demartinofra commented Mar 1, 2019

@afernandezody The issue seems to be with the compute nodes. What would really help identify the root cause here are the logs pulled from the compute instance. Could you try to create the cluster with the --norollback option, log into the compute node (from the master) and pull the logs from there?

Thank you!

@afernandezody

Hi @demartinofra,
I can't provide that info because I cannot ssh from the master to the compute nodes. Apparently the cluster creation has not finished, and when sshing I get the usual:

[centos@ip-172-31-11-40 ~]$ ssh 172.31.7.86
The authenticity of host '172.31.7.86 (172.31.7.86)' can't be established.
ECDSA key fingerprint is SHA256:3bMzUh4bMLRHm2Cr05GI2glu0Qnrwp9++hKk/P+Hipw.
ECDSA key fingerprint is MD5:af:20:62:8f:d0:56:e4:95:ec:43:bd:ea:2b:42:dc:22.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.31.7.86' (ECDSA) to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

The command qhost doesn't show any compute node either.
Thank you.
P.S. I upgraded everything (pcluster + AMIs) to 2.2.1 and that didn't make any difference.

@demartinofra
Contributor

I think ssh is just failing because you either need to forward the agent (and add the key to the agent) when sshing into the master, or copy the ssh key to the master node and use it explicitly (with the -i option) when sshing into the compute node.
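For example (a sketch; the key path and addresses are placeholders, and the user is centos for a CentOS base_os or ec2-user for alinux):

# Option 1: agent forwarding from your workstation
$ ssh-add ~/.ssh/your-key.pem
$ ssh -A centos@<master-public-ip>
$ ssh <compute-private-ip>

# Option 2: copy the key to the master and use it explicitly
$ scp -i ~/.ssh/your-key.pem ~/.ssh/your-key.pem centos@<master-public-ip>:~/.ssh/
$ ssh -i ~/.ssh/your-key.pem centos@<master-public-ip>
$ ssh -i ~/.ssh/your-key.pem <compute-private-ip>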

@afernandezody

afernandezody commented Mar 2, 2019

Hi @demartinofra,
I had some trouble sshing from the master node, but was able to ssh directly into the compute node through its public IPv4 address (which I'm not sure the configuration is meant to allow). Anyhow, the logs are attached. I'm unfamiliar with chef, so there is only so much I can interpret from the logs. It reads like the problem is either with the mounting or with some metadata. I tried the same process after erasing everything from the shared subdirectory (just in case that was interfering with the mounting), so the only modifications versus the original AMI reside in the /opt subdirectory. However, that didn't work either and I got the same error message.
Thanks.

boot.log
cfn-init.log
cfn-init-cmd.log
cfn-wire.log
cloud-init.log

@afernandezody

Hello again,
I was able to isolate the issue and trace it back to the 'vpc_security_group_id = sg-XXXX' setting (the same security group used to create the custom AMI) in the configuration. This somehow seems to conflict with chef. I'm not sure about the details, but I will update my post as I figure out what works and what doesn't.

@enrico-usai
Contributor

Hi @afernandezody ,
thank you for your analysis and for the logs.

The error in the cfn-init.log is:

---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-12-97.us-west-2.compute.internal:/home /home ----
STDOUT: 
STDERR: mount.nfs: Connection timed out

It means the compute node is not able to mount the NFS share exported by the master.
This could depend on your vpc_security_group_id = sg-XXXX setting, since the compute nodes must be able to reach port 2049 (NFS) on the master node.
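For example, a self-referencing inbound rule on that security group would allow NFS traffic between master and compute nodes (a sketch; sg-XXXX is your own group ID):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-XXXX \
    --protocol tcp \
    --port 2049 \
    --source-group sg-XXXX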

Please let us know if it helps.

@afernandezody

Hi Enrico,
I haven't looked into this issue for the last month plus, but I'll check what you're suggesting next week. Thanks.

@afernandezody

afernandezody commented Apr 25, 2019

It finally worked once I double-checked that everything agrees between the AMI and the cluster. Thanks.
