
Multiple AZ/subnets support #3576

Closed

zbarr opened this issue Dec 7, 2021 · 23 comments

Comments

@zbarr

zbarr commented Dec 7, 2021

There's an issue open for this but it is specifically for awsbatch whereas I need this for Slurm. Our own proprietary "Slurm with AWS" implementation had this so we've gone backwards a bit in that respect by moving to ParallelCluster. Hoping you guys are able to get to this soon! Documentation indicates it's coming.

@hanwen-pcluste
Contributor

Hello zbarr,

Thank you for the information. Multiple AZ support is a popular feature request. We are discussing this and will let you know any progress.

Cheers,
Hanwen

@devanshkv

Any updates or timelines for the multiple AZ support?

@enrico-usai changed the title from "Multiple AZ support" to "Multiple AZ/subnets support" on Mar 11, 2022
@enrico-usai
Contributor

We don't yet have a timeline to share for this feature enhancement.

A high-level alternative is to use Slurm Federation to manage multiple Slurm clusters. See Workshop: Using AWS ParallelCluster for Research.
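
As a minimal sketch of that alternative, assuming two already-running clusters that share the same slurmdbd/accounting database (the federation and cluster names below are placeholders):

```sh
# Federate two existing Slurm clusters (placeholder names).
sacctmgr add federation myfed clusters=cluster-a,cluster-b

# Verify membership and view jobs across all member clusters.
sacctmgr show federation
squeue --federation
```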

@gns-build-server

gns-build-server commented Mar 30, 2022

I successfully did multi-AZ / multi-subnet with ParallelCluster 2.11.3 running SGE. I am not sure if the issues people are seeing are SLURM-specific, or if there's something else I'm doing right here that AWS doesn't necessarily know about. @enrico-usai, can you weigh in?

We are using:

scheduler = sge
base_os = ubuntu1804
cluster_type = spot

with master_subnet_id and compute_subnet_id set to the same subnet. Create the cluster, then edit the Auto Scaling group manually: under the "Network" section, change the single AZ/subnet listed to a full set of subnets you've created, one in each AZ.
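
The same console edit can be made from the AWS CLI; a rough sketch, where the group name and subnet IDs are placeholders for whatever ParallelCluster created in your account:

```sh
# Hypothetical names: substitute the ASG and subnets from your own cluster.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-pcluster-ComputeFleet-ASG \
  --vpc-zone-identifier "subnet-0aaaa1111,subnet-0bbbb2222,subnet-0cccc3333"
```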

As you can see in the screenshots, simply changing the ASG causes the autoscaler to spot request in multiple AZs, and running jobs span across the AZs:

[Screenshots: EC2 console showing spot requests launched in multiple AZs and jobs running across those AZs — Screen Shot 2022-03-30 at 4 15 09 PM]

@enrico-usai
Contributor

enrico-usai commented Mar 31, 2022

Hi @gns-build-server, thanks for your contribution!

Your approach works with SGE (or Torque) because ParallelCluster uses an ASG to manage the compute fleet, so by manually modifying the ASG you're adding more capabilities to your cluster.

This is not a recommended approach, because you're manually modifying resources created by ParallelCluster, which can cause unexpected issues. It also has some limits; for example, the instance types you can use for the compute fleet must be in the same Availability Zone.

For Slurm (starting from ParallelCluster 2.9.0) this approach cannot work, because we moved away from ASGs and now use Slurm's native cloud-bursting support, which provides better integration with AWS. As part of 2.9.0 we also introduced support for multiple queues and multiple instance types.

With this issue we're tracking the possibility of natively using multiple Availability Zones together with different instance types, for example to overcome capacity limits.
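
For context, multiple instance types within a single queue look roughly like this in a ParallelCluster 3 config; a sketch only, where the queue, resource names, and subnet ID are placeholders, and the single SubnetIds entry reflects the one-AZ limitation being discussed:

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute              # placeholder queue name
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-0aaaa1111     # one subnet, hence one Availability Zone
      ComputeResources:
        - Name: c6i
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
        - Name: c5
          InstanceType: c5.24xlarge
          MinCount: 0
          MaxCount: 100
```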

@gns-build-server

gns-build-server commented Apr 4, 2022

@enrico-usai That's what I was worried about. If there is no roadmap for the functionality to use multiple AZs in ParallelCluster with SLURM, then this is a dealbreaker for us to move to ParallelCluster 3.0, where the SGE scheduler no longer exists.

For us to continue using ParallelCluster without SGE and ASGs, we will absolutely need multiple Availability Zones, because at the scale we are running, we frequently run into capacity limits when stuck in a single AZ, and using SLURM Federation to federate between multiple clusters is a non-starter for our product.

I would like to chime in with the other voices requesting a timeline for the ability to natively use multiple AZs together with the "different instance types" built into ParallelCluster >=2.9.0 in order to overcome capacity limits. It is vital for us to be able to continue using ParallelCluster beyond the version we're currently stuck at, which, according to https://docs.aws.amazon.com/parallelcluster/latest/ug/support-policy.html, will reach its End-Of-Support date on 12/31/2022.

@devanshkv

@gns-build-server +1, @enrico-usai, we are also stuck due to the same limitations.

@enrico-usai
Contributor

enrico-usai commented Apr 5, 2022

@gns-build-server, @devanshkv, @zbarr could you please add more details about your use cases?
Are you trying to cope with capacity issues? Do you need instance types that are in separate AZs?
We'd like to collect more info about users' requirements for this feature. Thanks

@devanshkv

@enrico-usai, this is due to the GPU capacity issues.

@lletourn

lletourn commented Apr 5, 2022

If I can chime in: in our particular use case, we use HPC to run processing pipelines which also contain GPU instances.
To save cost, we split the work into CPU jobs and GPU jobs with dependencies between them, so we only use GPUs for the strict minimum.
Since we don't need low latency between hosts, just A LOT of hosts, using multi-AZ would help us greatly!

@gns-build-server

gns-build-server commented Apr 5, 2022

@enrico-usai, we find that when we run our application (using spot, with 96-vCPU nodes such as m6a.24xlarge and m6i.24xlarge), we are frequently unable to get capacity if we are stuck with a single AZ. We primarily run in us-east-1. Like @lletourn, we are not latency bound, as our application is mostly embarrassingly parallel, and we run at large scale, so capacity frequently becomes an issue. Our clusters will frequently scale to 50-100 96-vCPU nodes when the queue is stacked with jobs, but capacity issues don't only crop up when the queue is stacked. Before implementing the multi-subnet ASG method I described in my post above, we frequently found a cluster unable to get even a single node, even when it was at zero nodes. The cluster would never get the node, or the node would come up for a minute and then immediately get interrupted.

What we found ourselves doing in that case is looking at the Spot Advisor and trying to guess, according to the "% interruption" column for our region, which instance types we should change the cluster to. Even this doesn't work well, because we frequently find that the Spot Advisor indicates low % interruption and we still can't get any capacity for that instance type in a particular AZ. My suspicion is that the Spot Advisor will indicate low % interruption even when one or more AZs in a region are out of capacity, as long as some AZs in the region still have high capacity. So we try another. And then we find that one can't get capacity either. So we try another. And that one can't get capacity either. So we try to outsmart it a bit, and say "Well, there are particular instance types, like m5 versus m5d or m5dn, that AWS clearly owns 'more of', so let's prefer those instance types even when the Spot Advisor shows them as having higher interruption." And then, eventually, we hit on one that might actually have some capacity. For now. Tomorrow, it might be different. This, frankly, makes the Spot Advisor incredibly frustrating to get any useful information out of in a single-AZ situation.

If we were to move to SLURM right now, we could potentially mitigate this a bit by having multiple instance types set even within a single AZ, but we already do that with StarCluster (VanillaImprovements fork), and it doesn't really work well even with multiple instance types set, ever since StarCluster's AZ-choice mechanism was broken by AWS's change to how spot pricing works. (StarCluster always prefers whichever AZ/instance-type combination in your "instance type array" has the cheapest price, modified by whatever weighting factor you assign to prefer certain instance types even at a higher current price. A couple of years ago, AWS changed spot pricing such that an AZ with low capacity no longer has a HIGHER price, which broke this mechanism. It used to be that if an AZ had no spot capacity for an instance type, the price shot up to 10x the on-demand price, and StarCluster would always choose a lower-priced AZ for that instance type, or a different instance type in a lower-priced AZ. After AWS made that change -- without warning customers, I might add -- a purely price-prioritized spot mechanism like StarCluster's will frequently get stuck in a single AZ, because that AZ may be the lowest priced one even though it has zero spot capacity.)

Because of our extensive experience with terrible capacity problems when stuck in a single AZ, even when switching between a handful of instance types appropriate for the application, we are not going to move from SGE to SLURM on ParallelCluster 3.0+ until multi-AZ in SLURM on ParallelCluster is available and works at least as well as it does right now with SGE/ASGs on ParallelCluster, as I outlined in my method above. The urgency now is that ending support for ParallelCluster 2.11.x on 12/31/22 leaves us in limbo with our plans to move to SLURM and ParallelCluster 3.0.x.

For the record, Azure CycleCloud does great with this functionality. Just sayin'.

@zbarr
Author

zbarr commented Apr 5, 2022

@enrico-usai

I haven't chimed in since I opened this up in December, but my anticipation that we would need this was warranted as we're hitting capacity issues now.

Use case: multiple clusters for different groups of users, spinning up jobs of 50-100 Spot instances (for now), ideally across multiple AZs in a region and ideally across multiple generations of the same spec'd resource type.

Problem: we hit capacity limits. We were using 6th-gen Intel c nodes but found that capacity was limited, so we switched to 5th gen. Although multiple "ComputeResources" somewhat works, I found that when it hit capacity limits on one of the resource types, slurm/pcluster wouldn't always grab from the other resource type. That is almost irrelevant though (for this issue), because if I were able to span multiple AZs, I would be less likely to run into these capacity issues in the first place. For now, I have tried to put different clusters (since we have multiple) into different AZs so they don't step on each other, but this manual balancing is proving to be pretty tedious and is burning time on something that should otherwise be automated.

If we had this feature, along with maybe a smoother mechanism for using multiple instance types within the same queue, I would consider this product 100% complete for our use case.

@gns-build-server

gns-build-server commented Apr 5, 2022

Also, based on looking at #3114, I suspect @tanantharaman and @hy714335634 also have a good use case here.

@Chen188

Chen188 commented Apr 8, 2022

I've done some work on multi-AZ support for Slurm based on PCluster 3.1.1. Anyone interested can find the code here and follow the user guide to give it a try.

@enrico-usai
Contributor

Thank you for going deep on this issue; we truly appreciate the efforts and are also looking into some of the great ideas discussed here.
We hear your request and are actively evaluating adding support for it this year. We will provide further updates as we progress.

@gns-build-server

gns-build-server commented Apr 19, 2022

I've done some work on multi-AZ support for Slurm based on PCluster 3.1.1. Anyone interested can find the code here and follow the user guide to give it a try.

Thank you, @Chen188 -- I will try this! I would encourage @enrico-usai to look at it, since of course we want them to incorporate this -- or something like it -- into the main branch rather than us having to run a custom build in our productionized environment.

@lletourn

Any news on this? We are hitting resource limits rather frequently. This morning, for example:
2022-08-18 13:25:20,204 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['gpu-dy-g4dnxlarge-2']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient g4dn.4xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get g4dn.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.

@msherman13

+1 on this. Seems like spot capacity is getting more and more competitive in us-east-1. Since we host our data on S3 in this region, it is important that we can run our cluster in the same region so we don't incur heavy data-transfer costs. The broader the pool of spot inventory we can access, the better this will work. Or Amazon can add more capacity and there won't be a need for this feature...

@pauloestrela

Large-scale HPC without this feature is almost impossible. I've changed my approach to a custom head node with the AWS plugin for Slurm (https://github.com/aws-samples/aws-plugin-for-slurm), which supports multiple subnets. I've also made a fork to add head node high-availability support, more flexible node hostnames, and HTTPS proxy support (https://github.com/pauloestrela/aws-plugin-for-slurm).

@msherman13

@pauloestrela probably not relevant for you since you're on Slurm, but I have managed to get awsbatch working with multiple AZs. I basically bypassed the compute environment and job queue created by pcluster and created my own. The only real difference is to add more subnets to the "computeResources" section of the compute environment (a rough CLI sketch follows at the end of this comment). I'm not sure if this breaks some features of pcluster, but I think for my case it is fine because my compute nodes do not need to communicate with each other.

This method has the added benefit that you can use your own container image instead of the pcluster one (also specified in the compute environment): assuming you use pcluster's entrypoint script (in this GitHub repo) and add all necessary dependencies, it is able to mount the shared volume and integrate successfully.
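
A rough sketch of that setup with the AWS CLI; every name, ID, and role below is a placeholder, not something ParallelCluster creates:

```sh
# Hypothetical example: a managed SPOT compute environment spanning several
# subnets (one per AZ), plus a job queue pointing at it.
aws batch create-compute-environment \
  --compute-environment-name multi-az-ce \
  --type MANAGED \
  --compute-resources '{
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 4096,
    "instanceTypes": ["m6i.24xlarge", "m6a.24xlarge"],
    "subnets": ["subnet-0aaaa1111", "subnet-0bbbb2222", "subnet-0cccc3333"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
    "instanceRole": "ecsInstanceRole"
  }'

aws batch create-job-queue \
  --job-queue-name multi-az-queue \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=multi-az-ce
```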

@Q-Hilbtec

Q-Hilbtec commented Oct 21, 2022 via email

@zbarr
Author

zbarr commented Oct 27, 2022

Coming back here one more time to emphasize how helpful this would be. As this is the most discussed open issue for this project, I think we can all agree that this feature would have a huge impact!

@lukeseawalker
Contributor

Hello all,
I'm happy to announce that ParallelCluster 3.4.0 with multi-AZ support is now available.
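
A minimal sketch of how a queue can span multiple AZs in 3.4.0 by listing several subnets under Networking (all names and subnet IDs below are placeholders):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute                  # placeholder queue name
      CapacityType: SPOT
      Networking:
        SubnetIds:                   # one subnet per Availability Zone
          - subnet-0aaaa1111         # e.g. us-east-1a
          - subnet-0bbbb2222         # e.g. us-east-1b
          - subnet-0cccc3333         # e.g. us-east-1c
      ComputeResources:
        - Name: m6i-24xl
          InstanceType: m6i.24xlarge
          MinCount: 0
          MaxCount: 100
```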
