Can't Launch Batch Custom AMI with ParallelCluster #829
Comments
Hi @zadelman, regarding the job staying in the "RUNNABLE" phase: this is likely a networking issue. Please double-check the requirements described in this section. |
Hi @lukeseawalker, thanks for these tips. In writing this ticket I found out about the --norollback option for the pcluster create command. It turns out my create was failing because I didn't have the mount point right for the EBS storage appliance I was attaching to the master instance. I found that out by tracing the messages in the log files written to /var/log on the master instance. That's a nice debugging feature.
In the end, I was able to get my custom AMI to start up using pcluster. I started with one of the AWS-optimized AMIs from this list. I spun up that AMI, installed all of my software (a Fortran-based weather model called WRF, compiled with PGI), and made a new AMI. I could then start that AMI with pcluster create.
Now I'm back to the issue with jobs being stuck in "RUNNABLE". This software is parallelized with MPI and I'm not clear on how best to get it running. My understanding (which is limited on this subject) is that Docker containers still aren't set up for running MPI jobs. I looked through the link that you provided about the compute_subnet settings. I don't really know what to do with that information; it's totally foreign to me.
Here's what I was hoping to do; maybe you could point me in a better direction. My plan was to create an AMI with all of my modeling software installed (that's done). I want to run a series of MPI jobs (I'm thinking about 75 jobs of 32 to 64 processors each, running at the same time). I thought I could spin up a master node with my software binaries and scripts, and then issue AWS Batch jobs from that master node to spin up compute nodes to run my jobs. When I use the awsbatch scheduler to do this I'm stuck in "RUNNABLE".
Do you have any insights on a better way to use pcluster to accomplish this? Should I just provision all of my compute nodes using EC2 instead of Batch, and run using another queuing system like Slurm? Any thoughts are appreciated. -Zac |
From your plan it sounds like using a traditional scheduler is the way to go at the moment. With a traditional scheduler (sge, torque or slurm, whichever you prefer), the software you installed into the AMI will be available on the compute nodes as well, where your computation will take place. The following is just an example of how to run a "helloworld" MPI job with sge:
At this point, ParallelCluster will try to spin up the number of instances needed to fit your job requirements. Please make sure that the max_queue_size configuration and your instance limits are sized appropriately for your job requirements. |
Hi Luca
Checking back in on this thread. I haven’t been able to get my custom AMI to work. I start with the v2.1.0 AMI template, add all of my software, and save a new AMI, and then pcluster consistently fails to “create” a compute cluster from that new AMI. It would be great to close the loop on this and figure out whether it is possible.
I ended up just tarring up all of my software, putting it in an s3 bucket and then copying it over to a new instance spawned from the v2.1.0 AMI template. There is some upfront work that this requires, but once it’s set up I’ve had reasonable success doing what I want to do. I’m using a post-install script to automate some of this, although there are a few steps that require manual intervention (like installing the PGI compiler).
My concept of creating a custom AMI with all of my software already loaded is where I’d like to be, rather than the s3 bucket approach because it would avoid having to do some manual configuration before the new instance is ready for live operations. I would like to create a platform-as-a-service custom AMI that I can share with my colleagues. It would be great if I can reduce any barriers to using this AMI, such as avoiding the manual configuration steps.
Any thoughts on what I should be looking for in the pcluster error logs to troubleshoot why my custom AMI wasn’t working? I recall seeing error messages about the chef command, but I think I see these in the logs even for clusters that do spin up correctly.
Best,
Zac
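The S3-based workaround described above could be sketched as a post-install script along these lines. This is only a sketch under stated assumptions: the bucket, tarball name, and install directory in the usage example are hypothetical placeholders, and manual steps such as the PGI compiler installation are not covered. How arguments reach a post-install script depends on the ParallelCluster version, so check that before wiring it in.

```shell
#!/bin/sh
# Hypothetical post-install sketch: fetch a software tarball from S3 and unpack it.
# Nothing here is taken from the thread's actual setup; names are placeholders.
set -e

# fetch_tarball S3_URI LOCAL_PATH
# Downloads with the AWS CLI; the instance needs read access to the bucket.
fetch_tarball() {
    aws s3 cp "$1" "$2"
}

# unpack_tarball TARBALL TARGET_DIR
# Extracts a gzip-compressed tarball into TARGET_DIR, creating it if needed.
unpack_tarball() {
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}

# Run the full sequence only when called with exactly two arguments:
#   ./post_install.sh s3://my-example-bucket/wrf-software.tar.gz /opt/wrf
if [ "$#" -eq 2 ]; then
    tmp_tarball="$(mktemp)"
    fetch_tarball "$1" "$tmp_tarball"
    unpack_tarball "$tmp_tarball" "$2"
    rm -f "$tmp_tarball"
fi
```

Keeping the download and the unpack step in separate functions makes it easy to test the unpacking locally without S3 access.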
… On Jan 17, 2019, at 3:52 PM, Luca Carrogu ***@***.***> wrote:
From your plan it sounds like using a traditional scheduler is the way to go at the moment.
The "build custom AMI" part you already did is good. One minor comment: it is better to keep the version of the pcluster package aligned with the AMI list you start from, so you should take the AMI from the v2.1.0 list here <https://github.com/aws/aws-parallelcluster/blob/v2.1.0/amis.txt> instead of the 2.0.2 one you used.
With a traditional scheduler (sge, torque or slurm, whichever you prefer), the software you installed into the AMI will be available on the compute nodes as well, where your computation will take place.
The following is just an example of how to run a "helloworld" MPI job with sge:
Create a file helloworld.sh with the following content:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe mpi 32
#$ -S /bin/bash
module load mpi/openmpi-x86_64
mpirun -np 32 hostname
Submit the job:
$ qsub helloworld.sh
At this point, ParallelCluster will try to spin up the number of instances needed to fit your job requirements. Please make sure that the max_queue_size configuration and your instance limits are sized appropriately for your job requirements.
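As a rough aid for the max_queue_size advice above, the sizing arithmetic can be sketched in shell. The job count and slots per job come from the figures mentioned earlier in this thread (about 75 concurrent jobs of 32 to 64 processes each). The 36 slots per instance is purely an assumed value for illustration; replace it with the core count of the chosen compute instance type. The sketch also assumes slots pack evenly across instances.

```shell
# Sketch: estimate how many compute instances a batch of MPI jobs needs.
# instances_needed JOBS SLOTS_PER_JOB SLOTS_PER_INSTANCE
instances_needed() {
    total_slots=$(( $1 * $2 ))
    # Integer ceiling division: a partly used instance still counts as one.
    echo $(( (total_slots + $3 - 1) / $3 ))
}

# 75 concurrent jobs at 64 slots each, on hypothetical 36-slot instances.
echo "max_queue_size should be at least $(instances_needed 75 64 36)"
```

The account's EC2 instance limit for the chosen instance type needs to be at least as large as the resulting number.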
|
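When you launch the cluster, launch with: pcluster create new_cluster --norollback
When it fails creation, grab the IP address from the console and ssh in, then check /var/log/cfn-init.log, /var/log/cloud-init-output.log, and /var/log/cloud-init.log. Post here what errors you find and I can help figure out what's going on. Have you tried using the --createami flag? See https://aws-parallelcluster.readthedocs.io/en/latest/commands.html#createami |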
Hi Sean
There appears to be an issue with chef, but I’m not sure what the problem is. I’m attaching the logs here from the cluster that I created using the pcluster create command in which I tried to use an AMI that I created from the console.
I’m now trying to create a new AMI using the pcluster createami command. I’m working through some issues on my local machine that were causing this approach to fail (maybe having to do with the wrong version of pip?).
Please let me know if you see anything in these logs that may give me a clue about why my custom AMI won’t launch through pcluster.
Best,
Zac Adelman
Lake Michigan Air Directors Consortium
office: 847-720-7880
mobile: 919-302-8471
www.ladco.org
On Feb 6, 2019, at 2:46 PM, Sean Smith ***@***.***> wrote:
@zadelman <https://github.com/zadelman>
When you launch the cluster, launch with:
pcluster create new_cluster --norollback
When it fails creation, grab the ip address from the console and ssh in, check the /var/log/cfn-init.log, /var/log/cloud-init-output.log, and /var/log/cloud-init.log.
Post here what errors you find and I can help figure out what's going on. Have you tried using the --createami flag? See https://aws-parallelcluster.readthedocs.io/en/latest/commands.html#createami
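Once logged in to the master instance, one quick way to surface candidate errors from the three log files named above is to grep them. This is a minimal sketch; the search pattern is an assumption about what failures tend to look like, not a list of messages ParallelCluster is known to emit.

```shell
# Sketch: print lines that look like errors from the given log files,
# prefixed with file name and line number. Unreadable files are skipped.
scan_logs() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            # -H: show file name, -n: line number, -i: ignore case, -E: extended regex
            grep -HniE 'error|fail|traceback' "$log" || true
        else
            echo "skipping unreadable file: $log" >&2
        fi
    done
}

# Example on the master instance (reading these files may require sudo):
# scan_logs /var/log/cfn-init.log /var/log/cloud-init-output.log /var/log/cloud-init.log
```

The matching lines are only a starting point; the surrounding context in each log usually explains the failure better than the matched line itself.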
|
Hi @zadelman tnx |
Here are the logs, including a chef-stacktrace that may include some useful information. cfn-init.log |
Hi @zadelman If this is what happened, it's not the correct way to create your custom AMI. The AMI you want to customize needs to be launched independently from ParallelCluster (e.g. from the EC2 console). The procedure is described here Modify an AWS ParallelCluster AMI |
Hello everyone, My sample script for a small cluster reads: [cluster odycluster] [vpc odyvpc] [global] [aliases] Thanks, boot.log |
Hi @afernandezody, To create your custom AMI you need to start from the released AMI matching your ParallelCluster version, so you must pick up the AMI from this list https://github.com/aws/aws-parallelcluster/blob/v2.1.0/amis.txt In your case for Centos7, start from
for alinux start from
|
Hi @lukeseawalker, cfn-init.log P.S. Any chance that upgrading to 2.2.1 would fix this issue? |
@afernandezody The issue seems to be with the Compute nodes. What would really help here to identify the root cause are the logs pulled from the compute instance. Could you try to create the cluster with the Thank you! |
Hi @demartinofra, [centos@ip-172-31-11-40 ~]$ ssh 172.31.7.86 The command qhost doesn't show any compute node either. |
I think ssh is failing simply because the key is not available on the master. You either need to add the key to your local ssh-agent and forward the agent when sshing into the master, or copy the ssh key to the master node and pass it explicitly (with the -i option) when sshing into the compute node. |
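The agent-forwarding option described above can be sketched as follows. The function name, key path, and master address are hypothetical placeholders; only the compute node IP comes from the earlier comment. The sketch assumes an ssh-agent is already running on the local machine.

```shell
# Sketch: reach a compute node through the master using SSH agent forwarding.
# hop_to_compute KEY_FILE MASTER_LOGIN COMPUTE_IP
hop_to_compute() {
    # Load the private key into the local ssh-agent.
    ssh-add "$1" || return 1
    # -A forwards the agent to the master; -t allocates a terminal for the nested ssh.
    ssh -A -t "$2" ssh "$3"
}

# Example with a placeholder key path and master address:
# hop_to_compute ~/.ssh/my-key.pem centos@<master-public-ip> 172.31.7.86
```

With forwarding, the private key never has to be copied to the master node, which is the main reason to prefer it over the -i approach.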
Hi @demartinofra, boot.log |
Hello again, |
Hi @afernandezody , The error in the cfn-init.log is:
It means the compute node is not able to mount the NFS share exported by the master. Please let us know if it helps. |
Hi Enrico, |
It finally worked upon double-checking that everything agrees between the AMI and the cluster. Thanks. |
Environment:
Bug description and how to reproduce:
I'm trying to create a custom AMI in which I have some software pre-installed for doing weather modeling. I was able to successfully create and launch an RHEL-based AMI using the pcluster CLI. The problem with that image was that I don't think I had the container agent set up correctly because I could never get my job to run...it just stayed in the "RUNNABLE" phase. So I figured I'd try to use the Amazon ECS-optimized Linux AMI as my base image and build my computing platform on top of that. For a test, before installing anything I just tried to launch a batch instance using an AMI from this list: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
I continue to get a variation of this error:
Cluster creation failed. Failed events:
Additional context:
Here's my configuration file:
[aws]
aws_region_name = us-east-2
[cluster awswrf]
scheduler = awsbatch
compute_instance_type = optimal
key_name = ########
vpc_settings = public
ebs_settings = awswrf
master_instance_type = t2.micro
#master_root_volume_size = 40
min_vcpus = 0
max_vcpus = 40
desired_vcpus = 4
cluster_type = ondemand
#custom_ami = ami-009973ece6fe45688
custom_ami = ami-0c3da6571b6cfbe9a
[ebs awswrf]
shared_dir = data
ebs_snapshot_id = snap-00fa1f5bc9a7a9490
volume_type = gp2
volume_size = 500
volume_iops = 1500
encrypted = false
#ebs_volume_id = vol-0385e5d9d5f7b280d
[vpc public]
master_subnet_id = subnet-268aac4e
vpc_id = vpc-c2c6f8aa
[global]
update_check = true
sanity_check = true
cluster_template = awswrf
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}