
Question: Slurm Accounting migration between ParallelCluster versions #6214

Open
christianversloot opened this issue Apr 15, 2024 · 9 comments


@christianversloot

We are in the process of setting up a cluster with AWS ParallelCluster and AWS ParallelCluster UI, and we are writing a plan for upgrading the cluster. Given our knowledge and what we've learned online, upgrading to a new ParallelCluster version would require us to:

  1. Set up a new ParallelCluster UI stack with the target version.
  2. Through that UI, create a new ParallelCluster with the target version (using the same accounting database RDS cluster; see the CLI sketch after this list).
  3. Ensure that the cluster is operational, then delete the stacks of the old cluster and the old UI.
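
For reference, a hypothetical sketch of the same flow with the pcluster v3 CLI (cluster names and the config file are placeholders; in practice we drive this through the UI). Note that the CLI's own version determines the ParallelCluster version of the new cluster, so this would run with the target version's CLI installed:

```bash
# Create the new cluster from a config that points at the same RDS cluster
# (placeholder names throughout).
pcluster create-cluster \
  --cluster-name prod-new \
  --cluster-configuration cluster-config.yaml

# Poll until the cluster reaches CREATE_COMPLETE and is operational.
pcluster describe-cluster --cluster-name prod-new

# Once the new cluster is validated, retire the old one.
pcluster delete-cluster --cluster-name prod-old
```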

The cluster has Slurm Accounting set up. We use a separately deployed Aurora-based RDS cluster, meaning it is not deleted between UI upgrades. However, we've observed that when setting up a new cluster, the newly created cluster's accounting database is tightly coupled to the cluster itself by means of (1) the database name and (2) the table names (slurmdbd prefixes its tables with the cluster name, e.g. <cluster>_job_table).

The problem this creates for our cluster users is that when the new ParallelCluster is created in step 2 above, all accounting data is lost; invisible, if you will, because it sits in a different database within the database cluster.

We have looked into migrating with DMS, but because of the tight coupling between cluster and database (via the table names), this proves quite difficult and potentially error-prone. Unfortunately, dumping the database and then inserting it into the new database instance will also not work for us, either because of the tight coupling or because the new cluster cannot have the same name as the old one (and we cannot have downtime while upgrading).
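
To make the brittleness concrete, a dump-and-rename attempt would look roughly like this (endpoint, user, and database/prefix names are hypothetical); every cluster-name-prefixed table has to be rewritten, and one missed or over-eager substitution corrupts the restore:

```bash
# Hypothetical sketch of the dump-and-rename approach we ruled out.
mysqldump -h "$AURORA_ENDPOINT" -u "$DB_USER" -p old_cluster_db > acct_dump.sql

# Rewrite the cluster-name prefix on every table. Fragile: the old prefix
# can also occur inside stored data, not just in table identifiers.
sed 's/`oldcluster_/`newcluster_/g' acct_dump.sql > acct_dump_renamed.sql

mysql -h "$AURORA_ENDPOINT" -u "$DB_USER" -p new_cluster_db < acct_dump_renamed.sql
```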

Looking around in both the AWS docs and on the internet, I've not found much that points me in the right direction, yet many customers must run into this when upgrading to new ParallelCluster versions. I'd therefore welcome a suggestion on how to handle it. Is there a way to run the accounting database loosely coupled from a ParallelCluster, allowing multiple clusters to be supported within one database (as suggested by the cluster_table table)? Or any other approach that works for many customers? So far we're using the service quite happily, but this seems to be a bit of a roadblock.

Thanks in advance!

@joehellmersNOAA

+1

@gmarciani

Hi @christianversloot, thanks for reaching out and letting us know about your use case.

In general, you can configure the cluster to use a database with an arbitrary name via the Scheduling/SlurmSettings/Database/DatabaseName configuration parameter, introduced in ParallelCluster v3.8.0.
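
For reference, a minimal sketch of the relevant configuration section (the endpoint, user, secret ARN, and database name below are placeholders):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Database:
      Uri: my-aurora-endpoint.example.com:3306  # placeholder Aurora endpoint
      UserName: slurm_admin                     # placeholder admin user
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:111122223333:secret:slurm-db-pw  # placeholder
      DatabaseName: slurm_acct_db               # arbitrary name, available since v3.8.0
```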

However, as of ParallelCluster v3.9.1 your upgrade use case is not supported because:

  1. You cannot have more than one cluster using the same database (it may lead to inconsistencies in the DB). This limitation will be addressed in future releases, but we do not have a date yet.
  2. Even once multiple clusters can share the same DB, you will only be able to share it among ParallelCluster versions that ship cross-compatible versions of the Slurm daemons responsible for the DB (see the sketch after this list).
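
For background on point 2: stock Slurm already supports several clusters sharing one slurmdbd, provided the daemon versions are compatible. Each cluster's slurm.conf sets AccountingStorageType=accounting_storage/slurmdbd and points AccountingStorageHost at the shared daemon; the clusters are then registered once. A minimal sketch (cluster names are placeholders; again, this is not yet supported by ParallelCluster):

```bash
# Register each cluster with the shared slurmdbd's accounting database.
sacctmgr add cluster clusterA
sacctmgr add cluster clusterB

# Both clusters now appear in the shared database (cf. cluster_table).
sacctmgr list cluster
```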

Some follow-up questions to learn more about your use case:

  1. It's clear that you're looking for a general solution; still, what are the specific source and target versions of ParallelCluster you intend to upgrade between?
  2. Is zero downtime during the upgrade a strict requirement or a nice-to-have?

Thank you.

@christianversloot

christianversloot commented Apr 22, 2024

Hi @gmarciani, thanks for your response!

To answer your questions:

  1. Currently we're still building up the cluster, so there is no concrete migration case yet (the jobs so far have been test jobs, for which keeping the accounting data is not necessary). Our idea is that once our ECFlow-based workflow scheduler is fully set up, we will create a cluster with the newest available ParallelCluster (and UI) version, which I believe is now 3.9.1. From that moment on, we'd be interested in a general solution, as the accounting data will then relate to production jobs.
  2. I would need to discuss this with the team, but any downtime stops our entire production flow, which then needs to be re-run afterwards. Last week the team estimated that a 1-hour window for spinning up a new cluster would be survivable, so I imagine a similar amount of downtime is acceptable when upgrading. Let me check this with the team on Wednesday and then get back to you.

@christianversloot

Additionally, even though I know this is not the responsibility of this repository: the availability of DatabaseName needs to be reflected in parallelcluster-ui as well. Currently, if you create a cluster through the UI, you cannot provide the database name. I saw that the option was added not long ago, so I understand why it is not yet present in the UI, but what is the approach here? Should I create a ticket in the UI repo too, or is this something you can put in motion internally? Thanks!

@gmarciani

Thanks for all the valuable information about your use case. We're waiting on the remaining info about the maximum acceptable downtime.

Regarding ParallelCluster UI, I suggest creating an issue in https://github.com/aws/aws-parallelcluster-ui/issues

@christianversloot

Thanks, created the request: aws/aws-parallelcluster-ui#329

@gmarciani

Thank you! In aws/aws-parallelcluster-ui#329 it seems you're planning to use the DatabaseName property to manage the upgrade as soon as it is available in PCUI.

Just to verify we are on the same page: this should not be done until we provide support for an external SlurmDBD in ParallelCluster, which is planned for future releases.

@christianversloot

Yes, understood.

@christianversloot

Hi @gmarciani - we had a discussion within the team and arrived at these downtime allowances:

  1. Generally speaking, a downtime of at most 3 hours is acceptable.
  2. Though not preferred, this can be stretched to 6 hours if really necessary; informing clients may then be required.
  3. Only in exceptional cases is a downtime of up to 1 day acceptable; it severely impacts our deliveries, requires informing clients, and is generally perceived negatively.

Fortunately, since we've thoroughly documented upgrading a cluster between ParallelCluster (and UI) versions, I expect we should be able to stay under 1 to 1.5 hours most of the time.

In other words, stopping the compute fleet in the old cluster and then spinning up a new cluster is OK for us. It would be best if both clusters could be hosted within the same database cluster, either through a different setup (separating database creation from cluster creation) or by allowing two clusters with the same name to co-exist (we don't want to delete the old cluster before setting up the new one).
