diff --git a/docs/config.md b/docs/config.md
index 372ae4ac..89ec2d53 100644
--- a/docs/config.md
+++ b/docs/config.md
@@ -76,8 +76,6 @@ This project creates a ParallelCluster configuration file that is documented in
           - str
     Imds:
         Secured: bool
-    SubmitterSecurityGroupIds:
-        SecurityGroupName: SecurityGroupId
     SubmitterInstanceTags:
         str TagName:
         - TagValues
@@ -249,7 +247,7 @@ See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/lates
 
 See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi) for the custom AMI documentation.
 
-**NOTE**: A CustomAmi must be provided for Rocky8.
+**NOTE**: A CustomAmi must be provided for Rocky8 or Rocky9.
 All other distributions have a default AMI that is provided by ParallelCluster.
 
 #### Architecture
@@ -491,12 +489,6 @@ Additional security groups that will be added to the head node instance.
 
 List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.
 
-### SubmitterSecurityGroupIds
-
-External security groups that should be able to use the cluster.
-
-Rules will be added to allow it to interact with Slurm.
-
 ### SubmitterInstanceTags
 
 Tags of instances that can be configured to submit to the cluster.
diff --git a/docs/deployment-prerequisites.md b/docs/deployment-prerequisites.md
index ad6abf9c..b3dc8901 100644
--- a/docs/deployment-prerequisites.md
+++ b/docs/deployment-prerequisites.md
@@ -99,6 +99,76 @@ The version that has been tested is in the CDK_VERSION variable in the install s
 
 The install script will try to install the prerequisites if they aren't already installed.
 
+## Security Groups for Login Nodes
+
+If you want to allow instances like remote desktops to use the cluster directly, you must define
+three security groups that allow connections between the instance, the Slurm head node, and the Slurm compute nodes.
+We call the instance that connects to the Slurm cluster a login node or a submitter instance.
+
+This documentation refers to the three security groups by the following names, but you can name them whatever you want.
+
+* SlurmSubmitterSG
+* SlurmHeadNodeSG
+* SlurmComputeNodeSG
+
+### Slurm Submitter Security Group
+
+The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
+| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
+| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11
+
+It needs the following outbound rules:
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
+| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
+| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
+| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
+| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd
+
+### Slurm Head Node Security Group
+
+The SlurmHeadNodeSG will be specified in your configuration file for the slurm/SlurmCtl/AdditionalSecurityGroups parameter.
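+
+For example, a minimal sketch of how this parameter might look in your config file (the sg- id is a placeholder for your own SlurmHeadNodeSG id):
+
+```
+slurm:
+  SlurmCtl:
+    AdditionalSecurityGroups:
+      - sg-xxxxxxxxxxxxxxxxx # SlurmHeadNodeSG
+```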
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 2049 | SlurmSubmitterSG | SlurmSubmitter NFS
+| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
+| TCP | 6820-6829 | SlurmSubmitterSG | SlurmSubmitter slurmctld
+| TCP | 6830 | SlurmSubmitterSG | SlurmSubmitter slurmrestd
+
+It needs the following outbound rules:
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
+
+### Slurm Compute Node Security Group
+
+The SlurmComputeNodeSG will be specified in your configuration file for the slurm/InstanceConfig/AdditionalSecurityGroups parameter.
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 6818 | SlurmSubmitterSG | SlurmSubmitter slurmd
+
+It needs the following outbound rules:
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
+| TCP | 6000-7024 | SlurmSubmitterSG | SlurmSubmitter X11
+
 ## Create Configuration File
 
 Before you deploy a cluster you need to create a configuration file.
@@ -108,6 +178,7 @@ Ideally you should version control this file so you can keep track of changes.
 
 The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L445).
 The schema is defined in python, but the actual config file should be in yaml format.
+See [Configuration File Format](config.md) for documentation on all of the parameters.
 
 The following are key parameters that you will need to update.
 If you do not have the required parameters in your config file then the installer script will fail unless you specify the `--prompt` option.
@@ -120,7 +191,6 @@ You should save your selections in the config file.
 | [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
 | [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
 | [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
-| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
 | [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
 | [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)
@@ -137,7 +207,9 @@ all nodes must have the same architecture and Base OS.
| CentOS 7 | x86_64 | RedHat 7 | x86_64 | RedHat 8 | x86_64, arm64 +| RedHat 9 | x86_64, arm64 | Rocky 8 | x86_64, arm64 +| Rocky 9 | x86_64, arm64 You can exclude instances types by family or specific instance type. By default the InstanceConfig excludes older generation instance families. diff --git a/docs/res_integration.md b/docs/res_integration.md index c33b61b2..1892c995 100644 --- a/docs/res_integration.md +++ b/docs/res_integration.md @@ -11,11 +11,12 @@ The intention is to completely automate the deployment of ParallelCluster and se |-----------|-------------|------ | VpcId | VPC id for the RES cluster | vpc-xxxxxx | SubnetId | Subnet in the RES VPC. | subnet-xxxxx -| SubmitterSecurityGroupIds | The security group names and ids used by RES VDIs. The name will be something like *EnvironmentName*-vdc-dcv-host-security-group | *EnvironmentName*-*VDISG*: sg-xxxxxxxx | SubmitterInstanceTags | The tag of VDI instances. | 'res:EnvironmentName': *EnvironmentName*' | ExtraMounts | The mount parameters for the /home directory. This is required for access to the home directory. | | ExtraMountSecurityGroups | Security groups that give access to the ExtraMounts. These will be added to compute nodes so they can access the file systems. +You must also create security groups as described in [Security Groups for Login Nodes](deployment-prerequisites.md#security-groups-for-login-nodes) and specify the SlurmHeadNodeSG in the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter and the SlurmComputeNodeSG in the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter. + When you specify **RESEnvironmentName**, a lambda function will run SSM commands to create a cron job on a RES domain joined instance to update the users_groups.json file every hour. Another lambda function will also automatically configure all running VDI hosts to use the cluster. The following example shows the configuration parameters for a RES with the EnvironmentName=res-eda. @@ -51,11 +52,15 @@ slurm: Database: DatabaseStackName: pcluster-slurm-db-res - SlurmCtl: {} + SlurmCtl: + AdditionalSecurityGroups: + - sg-12345678 # SlurmHeadNodeSG # Configure typical EDA instance types # A partition will be created for each combination of Base OS, Architecture, and Spot InstanceConfig: + AdditionalSecurityGroups: + - sg-23456789 # SlurmComputeNodeSG UseSpot: true NodeCounts: DefaultMaxCount: 10 diff --git a/docs/soca_integration.md b/docs/soca_integration.md index c5b765fd..9b0ee6bd 100644 --- a/docs/soca_integration.md +++ b/docs/soca_integration.md @@ -11,7 +11,8 @@ Set the following parameters in your config file. | Parameter | Description | Value |-----------|-------------|------ | VpcId | VPC id for the SOCA cluster | vpc-xxxxxx -| SubmitterSecurityGroupIds | The ComputeNode security group name and id | *cluster-id*-*ComputeNodeSG*: sg-xxxxxxxx +| slurm/SlurmCtl/AdditionalSecurityGroups | Security group ids that give desktop instances access to the head node and that give the head node access to VPC resources such as file systems. +| slurm/InstanceConfig/AdditionalSecurityGroups | Security group ids that give desktop instances access to the compute nodes and that give compute nodes access to VPC resources such as file systems. | ExtraMounts | Add the mount parameters for the /apps and /data directories. This is required for access to the home directory. | Deploy your slurm cluster. 
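+
+For example, the two AdditionalSecurityGroups parameters might look like the following sketch in your config file (the sg- ids are placeholders for your own security group ids):
+
+```
+slurm:
+  SlurmCtl:
+    AdditionalSecurityGroups:
+      - sg-xxxxxxxxxxxxxxxxx # Gives desktop instances access to the head node
+  InstanceConfig:
+    AdditionalSecurityGroups:
+      - sg-yyyyyyyyyyyyyyyyy # Gives desktop instances access to the compute nodes
+```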
diff --git a/setup.sh b/setup.sh index d70bd50e..3ff93257 100644 --- a/setup.sh +++ b/setup.sh @@ -41,7 +41,16 @@ fi echo "Using python $python_version" # Check nodejs version +# https://nodejs.org/en/about/previous-releases required_nodejs_version=16.20.2 +# required_nodejs_version=18.20.2 +# On Amazon Linux 2 and nodejs 18.20.2 I get the following errors: +# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node) +# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node) +# required_nodejs_version=20.13.1 +# On Amazon Linux 2 and nodejs 20.13.1 I get the following errors: +# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node) +# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node) export JSII_SILENCE_WARNING_DEPRECATED_NODE_VERSION=1 if ! which node &> /dev/null; then echo -e "\nnode not found in your path." @@ -88,7 +97,7 @@ fi echo "Using nodejs version $nodejs_version" # Create a local installation of cdk -CDK_VERSION=2.91.0 # If you change the CDK version here, make sure to also change it in source/requirements.txt +CDK_VERSION=2.111.0 # When you change the CDK version here, make sure to also change it in source/requirements.txt if ! cdk --version &> /dev/null; then echo "CDK not installed. Installing global version of cdk@$CDK_VERSION." if ! npm install -g aws-cdk@$CDK_VERSION; then diff --git a/source/cdk/cdk_slurm_stack.py b/source/cdk/cdk_slurm_stack.py index 03c35b0a..e641ba5b 100644 --- a/source/cdk/cdk_slurm_stack.py +++ b/source/cdk/cdk_slurm_stack.py @@ -231,17 +231,6 @@ def override_config_with_context(self): logger.error(f"Must set --{command_line_switch} from the command line or {config_key} in the config files") exit(1) - config_key = 'SubmitterSecurityGroupIds' - context_key = config_key - submitterSecurityGroupIds_b64_string = self.node.try_get_context(context_key) - if submitterSecurityGroupIds_b64_string: - submitterSecurityGroupIds = json.loads(base64.b64decode(submitterSecurityGroupIds_b64_string).decode('utf-8')) - if config_key not in self.config['slurm']: - logger.info(f"slurm/{config_key:20} set from command line: {submitterSecurityGroupIds}") - else: - logger.info(f"slurm/{config_key:20} in config file overridden on command line from {self.config['slurm'][config_key]} to {submitterSecurityGroupIds}") - self.config['slurm'][config_key] = submitterSecurityGroupIds - def check_config(self): ''' Check config, set defaults, and sanity check the configuration. 
@@ -425,6 +414,9 @@ def update_config_for_res(self): ''' res_environment_name = self.config['RESEnvironmentName'] logger.info(f"Updating configuration for RES environment: {res_environment_name}") + + self.config['slurm']['SubmitterInstanceTags'] = {'res:EnvironmentName': [res_environment_name]} + cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region']) res_stack_name = None stack_statuses = {} @@ -481,13 +473,6 @@ def update_config_for_res(self): self.config['SubnetId'] = subnet_ids[0] logger.info(f" SubnetId: {self.config['SubnetId']}") - submitter_security_group_ids = [] - if 'SubmitterSecurityGroupIds' not in self.config['slurm']: - self.config['slurm']['SubmitterSecurityGroupIds'] = {} - else: - for security_group_name, security_group_ids in self.config['slurm']['SubmitterSecurityGroupIds'].items(): - submitter_security_group_ids.append(security_group_ids) - # Get RES VDI Security Group res_vdc_stack_name = f"{res_stack_name}-vdc" if res_vdc_stack_name not in stack_statuses: @@ -508,11 +493,6 @@ def update_config_for_res(self): if not res_dcv_security_group_id: logger.error(f"RES VDI security group not found.") exit(1) - if res_dcv_security_group_id not in submitter_security_group_ids: - res_dcv_security_group_name = f"{res_environment_name}-dcv-sg" - logger.info(f" SubmitterSecurityGroupIds['{res_dcv_security_group_name}'] = '{res_dcv_security_group_id}'") - self.config['slurm']['SubmitterSecurityGroupIds'][res_dcv_security_group_name] = res_dcv_security_group_id - submitter_security_group_ids.append(res_dcv_security_group_id) # Get cluster manager Security Group logger.debug(f"Searching for cluster manager security group id") @@ -535,11 +515,6 @@ def update_config_for_res(self): if not res_cluster_manager_security_group_id: logger.error(f"RES cluster manager security group not found.") exit(1) - if res_cluster_manager_security_group_id not in submitter_security_group_ids: - res_cluster_manager_security_group_name = f"{res_environment_name}-cluster-manager-sg" - logger.info(f" SubmitterSecurityGroupIds['{res_cluster_manager_security_group_name}'] = '{res_cluster_manager_security_group_id}'") - self.config['slurm']['SubmitterSecurityGroupIds'][res_cluster_manager_security_group_name] = res_cluster_manager_security_group_id - submitter_security_group_ids.append(res_cluster_manager_security_group_id) # Get vdc controller Security Group logger.debug(f"Searching for VDC controller security group id") @@ -564,11 +539,6 @@ def update_config_for_res(self): if not res_vdc_controller_security_group_id: logger.error(f"RES VDC controller security group not found.") exit(1) - if res_vdc_controller_security_group_id not in submitter_security_group_ids: - res_vdc_controller_security_group_name = f"{res_environment_name}-vdc-controller-sg" - logger.info(f" SubmitterSecurityGroupIds['{res_vdc_controller_security_group_name}'] = '{res_vdc_controller_security_group_id}'") - self.config['slurm']['SubmitterSecurityGroupIds'][res_vdc_controller_security_group_name] = res_vdc_controller_security_group_id - submitter_security_group_ids.append(res_vdc_controller_security_group_id) # Configure the /home mount from RES if /home not already configured home_mount_found = False @@ -1025,7 +995,7 @@ def create_parallel_cluster_lambdas(self): ], compatible_runtimes = [ aws_lambda.Runtime.PYTHON_3_9, - aws_lambda.Runtime.PYTHON_3_10, + # aws_lambda.Runtime.PYTHON_3_10, # Doesn't work: No module named 'rpds.rpds' # aws_lambda.Runtime.PYTHON_3_11, # Doesn't work: No module named 
'rpds.rpds' ], ) @@ -1694,7 +1664,7 @@ def create_callSlurmRestApiLambda(self): function_name=f"{self.stack_name}-CallSlurmRestApiLambda", description="Example showing how to call Slurm REST API", memory_size=128, - runtime=aws_lambda.Runtime.PYTHON_3_8, + runtime=aws_lambda.Runtime.PYTHON_3_9, architecture=aws_lambda.Architecture.ARM_64, timeout=Duration.minutes(1), log_retention=logs.RetentionDays.INFINITE, @@ -1842,14 +1812,6 @@ def create_security_groups(self): Tags.of(self.slurm_submitter_sg).add("Name", self.slurm_submitter_sg_name) self.suppress_cfn_nag(self.slurm_submitter_sg, 'W29', 'Egress port range used to block all egress') self.submitter_security_groups[self.slurm_submitter_sg_name] = self.slurm_submitter_sg - for slurm_submitter_sg_name, slurm_submitter_sg_id in self.config['slurm']['SubmitterSecurityGroupIds'].items(): - (allow_all_outbound, allow_all_ipv6_outbound) = self.allow_all_outbound(slurm_submitter_sg_id) - self.submitter_security_groups[slurm_submitter_sg_name] = ec2.SecurityGroup.from_security_group_id( - self, f"{slurm_submitter_sg_name}", - security_group_id = slurm_submitter_sg_id, - allow_all_outbound = allow_all_outbound, - allow_all_ipv6_outbound = allow_all_ipv6_outbound - ) self.slurm_rest_api_lambda_sg = ec2.SecurityGroup(self, "SlurmRestLambdaSG", vpc=self.vpc, allow_all_outbound=False, description="SlurmRestApiLambda to SlurmCtl Security Group") self.slurm_rest_api_lambda_sg_name = f"{self.stack_name}-SlurmRestApiLambdaSG" diff --git a/source/cdk/config_schema.py b/source/cdk/config_schema.py index a43a0b1a..f69bb32e 100644 --- a/source/cdk/config_schema.py +++ b/source/cdk/config_schema.py @@ -61,6 +61,20 @@ # 3.7.1: # * Fix pmix CVE # * Use Slurm 23.02.5 +# 3.8.0: +# * Add support for Rocky Linux 8 +# * Add support for user-provided /home directory for the cluster +# * Add support for MungeKeySecretArn to permit user-provided Munge key. +# * Add head node alarms +# * Add support for il-central-1 region +# * Upgrade Slurm from 23.02.6 to 23.02.7 +# 3.9.0: +# * Add support for RHEL9 +# * Add support for Rocky9 +# * Upgrade Slurm from 23.02.7 to 23.11.4 +# * Upgrade Pmix from 4.2.6 to 4.2.9. +# 3.9.1: +# * Bug fixes MIN_PARALLEL_CLUSTER_VERSION = parse_version('3.6.0') # Update source/resources/default_config.yml with latest version when this is updated. 
PARALLEL_CLUSTER_VERSIONS = [ @@ -70,6 +84,8 @@ '3.7.1', '3.7.2', '3.8.0', + '3.9.0', + '3.9.1', ] PARALLEL_CLUSTER_MUNGE_VERSIONS = { # This can be found on the head node at /opt/parallelcluster/sources @@ -80,6 +96,8 @@ '3.7.1': '0.5.15', # confirmed '3.7.2': '0.5.15', # confirmed '3.8.0': '0.5.15', # confirmed + '3.9.0': '0.5.15', # confirmed + '3.9.1': '0.5.15', # confirmed } PARALLEL_CLUSTER_PYTHON_VERSIONS = { # This can be found on the head node at /opt/parallelcluster/pyenv/versions @@ -89,6 +107,8 @@ '3.7.1': '3.9.16', # confirmed '3.7.2': '3.9.16', # confirmed '3.8.0': '3.9.17', # confirmed + '3.9.0': '3.9.17', # confirmed + '3.9.1': '3.9.17', # confirmed } PARALLEL_CLUSTER_SLURM_VERSIONS = { # This can be found on the head node at /etc/chef/local-mode-cache/cache/ @@ -97,7 +117,9 @@ '3.7.0': '23.02.4', # confirmed '3.7.1': '23.02.5', # confirmed '3.7.2': '23.02.6', # confirmed - '3.8.0': '23.02.6', # confirmed + '3.8.0': '23.02.7', # confirmed + '3.9.0': '23.11.4', # confirmed + '3.9.1': '23.11.4', # confirmed } PARALLEL_CLUSTER_PC_SLURM_VERSIONS = { # This can be found on the head node at /etc/chef/local-mode-cache/cache/ @@ -107,6 +129,8 @@ '3.7.1': '23-02-5-1', # confirmed '3.7.2': '23-02-6-1', # confirmed '3.8.0': '23-02-6-1', # confirmed + '3.9.0': '23-11-4-1', # confirmed + '3.9.1': '23-11-4-1', # confirmed } SLURM_REST_API_VERSIONS = { '23-02-2-1': '0.0.39', @@ -114,18 +138,26 @@ '23-02-4-1': '0.0.39', '23-02-5-1': '0.0.39', '23-02-6-1': '0.0.39', + '23-02-7-1': '0.0.39', + '23-11-4-1': '0.0.39', } PARALLEL_CLUSTER_ALLOWED_OSES = [ 'alinux2', 'centos7', 'rhel8', + 'rhel9', 'rocky8', + 'rocky9', 'ubuntu2004', 'ubuntu2204' ] def get_parallel_cluster_version(config): - return config['slurm']['ParallelClusterConfig']['Version'] + parallel_cluster_version = config['slurm']['ParallelClusterConfig']['Version'] + if parallel_cluster_version not in PARALLEL_CLUSTER_VERSIONS: + logger.error(f"Unsupported ParallelCluster version: {parallel_cluster_version}\nSupported versions are:\n{json.dumps(PARALLEL_CLUSTER_VERSIONS, indent=4)}") + raise KeyError(parallel_cluster_version) + return parallel_cluster_version def get_PARALLEL_CLUSTER_MUNGE_VERSION(config): parallel_cluster_version = get_parallel_cluster_version(config) @@ -486,13 +518,6 @@ def get_config_schema(config): Optional('Secured', default=True): bool } }, - # - # SubmitterSecurityGroupIds: - # External security groups that should be able to use the cluster - # Rules will be added to allow it to interact with Slurm. - Optional('SubmitterSecurityGroupIds', default={}): { - Optional(str): And(str, lambda s: re.match(r'sg-', s)) - }, # SubmitterInstanceTags: # Tags of instances that can be configured to submit to the cluster. # When the cluster is deleted, the tag is used to unmount the slurm filesystem from the instances using SSM. diff --git a/source/requirements.txt b/source/requirements.txt index 623a0bdf..60b01311 100644 --- a/source/requirements.txt +++ b/source/requirements.txt @@ -1,5 +1,5 @@ -e . 
-aws-cdk-lib==2.91.0
+aws-cdk-lib==2.111.0
 boto3
 colored
 constructs>=10.0.0
diff --git a/source/resources/lambdas/DeconfigureRESUsersGroupsJson/DeconfigureRESUsersGroupsJson.py b/source/resources/lambdas/DeconfigureRESUsersGroupsJson/DeconfigureRESUsersGroupsJson.py
index 40b591dd..028e5983 100644
--- a/source/resources/lambdas/DeconfigureRESUsersGroupsJson/DeconfigureRESUsersGroupsJson.py
+++ b/source/resources/lambdas/DeconfigureRESUsersGroupsJson/DeconfigureRESUsersGroupsJson.py
@@ -101,6 +101,7 @@ def lambda_handler(event, context):
 
 # Make sure that the cluster is still mounted and mount is accessible.
 # If the cluster has already been deleted then the mount will be hung and we have to do manual cleanup.
+# Another failure mechanism is if the cluster didn't deploy, in which case the mount may not even exist.
 if mount | grep " $mount_dest "; then
     echo "$mount_dest is mounted."
     if ! timeout 1s ls $mount_dest; then
@@ -135,6 +136,8 @@ def lambda_handler(event, context):
 if timeout 1s ls $mount_dest; then
     sudo rmdir $mount_dest
 fi
+
+true # Exit with success status; 'pass' is Python, not a shell command.
 """
     logger.info(f"Submitting SSM command")
     send_command_response = ssm_client.send_command(
@@ -153,6 +156,7 @@ def lambda_handler(event, context):
     MAX_WAIT_TIME = 5 * 60
     DELAY = 10
     MAX_ATTEMPTS = int(MAX_WAIT_TIME / DELAY)
+    logger.info(f"Waiting {MAX_WAIT_TIME} s for command {command_id} to complete.")
     waiter = ssm_client.get_waiter('command_executed')
     waiter.wait(
         CommandId=command_id,
diff --git a/source/resources/playbooks/inventories/group_vars/all b/source/resources/playbooks/inventories/group_vars/all
index 1f9d52d3..0431a4da 100644
--- a/source/resources/playbooks/inventories/group_vars/all
+++ b/source/resources/playbooks/inventories/group_vars/all
@@ -23,10 +23,13 @@ centos7: "{{centos and distribution_major_version == '7'}}"
 rhel: "{{distribution == 'RedHat'}}"
 rhel7: "{{rhel and distribution_major_version == '7'}}"
 rhel8: "{{rhel and distribution_major_version == '8'}}"
+rhel9: "{{rhel and distribution_major_version == '9'}}"
 rocky: "{{distribution == 'Rocky'}}"
 rocky8: "{{rocky and distribution_major_version == '8'}}"
+rocky9: "{{rocky and distribution_major_version == '9'}}"
 rhelclone: "{{alma or centos or rocky}}"
 rhel8clone: "{{rhelclone and distribution_major_version == '8'}}"
+rhel9clone: "{{rhelclone and distribution_major_version == '9'}}"
 centos7_5_to_6: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-6]')}}"
 centos7_5_to_9: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-9]')}}"
 centos7_7_to_9: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[7-9]')}}"
diff --git a/source/resources/playbooks/roles/ParallelClusterHeadNode/files/opt/slurm/config/bin/create_slurm_accounts.py b/source/resources/playbooks/roles/ParallelClusterHeadNode/files/opt/slurm/config/bin/create_slurm_accounts.py
index 3ece3d89..d87f0624 100755
--- a/source/resources/playbooks/roles/ParallelClusterHeadNode/files/opt/slurm/config/bin/create_slurm_accounts.py
+++ b/source/resources/playbooks/roles/ParallelClusterHeadNode/files/opt/slurm/config/bin/create_slurm_accounts.py
@@ -119,6 +119,8 @@ def update_slurm(self):
                 logger.info(f"    Creating account {account} with fairshare={fairshare}, parent={parent}")
                 try:
                     subprocess.check_output(cmd, encoding='UTF-8') # nosec
+                    if not parent:
+                        parent = 'root'
                     self.slurm_user_account_dict['accounts'][account] = {
                         'parent_name': parent,
                         'users': [],
@@ -141,6 +143,7 @@ def update_slurm(self):
             number_of_changes += 1
 
         # After all projects
have been created, make sure that the parent's are correct + self.slurm_user_account_dict = self.get_slurm_user_account_dict() for account in sorted(self.accounts.keys()): logger.debug(f"Checking account {account}'s parent") account_info = self.accounts[account] diff --git a/source/resources/playbooks/roles/all/tasks/main.yml b/source/resources/playbooks/roles/all/tasks/main.yml index 8f87ab7e..64c683ea 100644 --- a/source/resources/playbooks/roles/all/tasks/main.yml +++ b/source/resources/playbooks/roles/all/tasks/main.yml @@ -22,10 +22,12 @@ rhel: {{ rhel }} rhel7: {{ rhel7 }} rhel8: {{ rhel8 }} + rhel9: {{ rhel9 }} rocky: {{ rocky }} rocky8: {{ rocky8 }} rhelclone: {{ rhelclone }} rhel8clone: {{ rhel8clone }} + rhel9clone: {{ rhel9clone }} centos7_5_to_6: {{ centos7_5_to_6 }} centos7_5_to_9: {{ centos7_5_to_9 }} centos7_7_to_9: {{ centos7_7_to_9 }} @@ -73,14 +75,14 @@ # Required for the selinux module - name: Install libselinux-python - when: not(rhel8 or rhel8clone) + when: not(rhel8 or rhel8clone or rhel9 or rhel9clone) yum: state: present name: - libselinux-python - name: Install python3-libselinux - when: rhel8 or rhel8clone + when: rhel8 or rhel8clone or rhel9 or rhel9clone yum: state: present name: @@ -88,7 +90,7 @@ # Selinux breaks ssh - name: Set Selinux mode to disabled - when: not(rhel8 or rhel8clone) + when: not(rhel8 or rhel8clone or rhel9 or rhel9clone) selinux: state: disabled @@ -96,7 +98,7 @@ # Failed to import the required Python library (libselinux-python) # Can't figure out how to resolve - name: Set Selinux mode to disabled - when: rhel8 or rhel8clone + when: rhel8 or rhel8clone or rhel9 or rhel9clone shell: cmd: | set -ex diff --git a/source/resources/playbooks/roles/eda_tools/tasks/main.yml b/source/resources/playbooks/roles/eda_tools/tasks/main.yml index 57071f99..13843446 100644 --- a/source/resources/playbooks/roles/eda_tools/tasks/main.yml +++ b/source/resources/playbooks/roles/eda_tools/tasks/main.yml @@ -85,7 +85,7 @@ - python3-pip - name: Install packages required by python packages - when: (rhel8 or rhel8clone) and Architecture == 'arm64' + when: (rhel8 or rhel8clone or rhel9 or rhel9clone) and Architecture == 'arm64' tags: - python - packages @@ -97,7 +97,7 @@ - platform-python-devel - name: Install cython - when: (rhel8 or rhel8clone) and Architecture == 'arm64' + when: (rhel8 or rhel8clone or rhel9 or rhel9clone) and Architecture == 'arm64' pip: executable: /usr/bin/pip3 state: present @@ -108,7 +108,7 @@ # * RHEL 8, arm64 # * Rocky 8, arm64 - name: Install numpy - when: (rhel8 or rhel8clone) and Architecture == 'arm64' + when: (rhel8 or rhel8clone or rhel9 or rhel9clone) and Architecture == 'arm64' tags: - python - packages @@ -228,7 +228,7 @@ - libcrypt - name: Install perl-Switch on non-RedHat - when: not rhel and not rhel8clone + when: not rhel and not (rhel8clone or rhel9clone) tags: - perl-switch - packages @@ -252,7 +252,7 @@ # - perl-Switch - name: Install non-RedHat packages - when: not rhel and not rhel8clone + when: not rhel and not (rhel8clone or rhel9clone) tags: - packages yum: @@ -415,7 +415,7 @@ - ksh - name: Install gpaste - when: not(rhel8 or rhel8clone) + when: not(rhel8 or rhel8clone or rhel9 or rhel9clone) tags: - eda_packages - packages @@ -525,7 +525,7 @@ - ncurses-libs.i686 - name: Install EDA packages 3 - when: not(rhel8 or rhel8clone) + when: not(rhel8 or rhel8clone or rhel9 or rhel9clone) tags: - eda_packages - packages @@ -629,7 +629,7 @@ - tigervnc - name: Install compat-db47 - when: not(distribution == 'Amazon' and 
Architecture == 'arm64') and not(rhel8 or rhel8clone) + when: not(distribution == 'Amazon' and Architecture == 'arm64') and not(rhel8 or rhel8clone or rhel9 or rhel9clone) tags: - packages - eda_packages diff --git a/source/resources/playbooks/roles/install_slurm/tasks/main.yml b/source/resources/playbooks/roles/install_slurm/tasks/main.yml index a8db4de7..d2e1ab0e 100644 --- a/source/resources/playbooks/roles/install_slurm/tasks/main.yml +++ b/source/resources/playbooks/roles/install_slurm/tasks/main.yml @@ -18,10 +18,13 @@ rhel: {{ rhel }} rhel7: {{ rhel7 }} rhel8: {{ rhel8 }} + rhel9: {{ rhel9 }} rocky: {{ rocky }} rocky8: {{ rocky8 }} + rocky9: {{ rocky9 }} rhelclone: {{ rhelclone }} rhel8clone: {{ rhel8clone }} + rhel9clone: {{ rhel9clone }} centos7_5_to_6: {{ centos7_5_to_6 }} centos7_5_to_9: {{ centos7_5_to_9 }} centos7_7_to_9: {{ centos7_7_to_9 }} @@ -64,7 +67,7 @@ - epel-release - name: Enable PowerTools repo - when: rhel8clone + when: rhel8clone or rhel9clone shell: cmd: yum-config-manager --enable PowerTools || yum-config-manager --enable powertools @@ -82,6 +85,12 @@ cmd: | yum-config-manager --enable codeready-builder-for-rhel-8-rhui-rpms +- name: Enable codeready-builder-for-rhel-9-rhui-rpms repo + when: rhel9 + shell: + cmd: | + yum-config-manager --enable codeready-builder-for-rhel-9-rhui-rpms + - name: Install slurm packages yum: state: present @@ -126,7 +135,7 @@ - wget - name: Install hdf5-devel - when: not(distribution == 'Amazon' and Architecture == 'arm64') and not(rhel8 or rhel8clone) + when: not(distribution == 'Amazon' and Architecture == 'arm64') and not(rhel8 or rhel8clone or rhel9 or rhel9clone) yum: state: present name: diff --git a/source/slurm_installer/installer.py b/source/slurm_installer/installer.py index f869b530..4316472a 100755 --- a/source/slurm_installer/installer.py +++ b/source/slurm_installer/installer.py @@ -77,11 +77,10 @@ def main(self): parser.add_argument("--profile", "-p", type=str, help="AWS CLI profile to use.") parser.add_argument("--region", "--Region", "-r", type=str, help="AWS region where you want to deploy your SOCA environment.") parser.add_argument("--SshKeyPair", "-ssh", type=str, help="SSH key to use") - parser.add_argument("--RESEnvironmentName", type=str, default=None, help="Research and Engineering Studio (RES) environment to build the cluster in. Will automatically set VpcId, SubnetId, and SubmitterSecurityGroupIds.") + parser.add_argument("--RESEnvironmentName", type=str, default=None, help="Research and Engineering Studio (RES) environment to build the cluster in. 
Will automatically set VpcId and SubnetId.") parser.add_argument("--VpcId", type=str, help="Id of VPC to use") parser.add_argument("--SubnetId", type=str, help="SubnetId to use") parser.add_argument("--ErrorSnsTopicArn", type=str, default='', help="SNS topic for error notifications.") - parser.add_argument("--SubmitterSecurityGroupIds", type=str, default=None, help="External security groups that should be able to use the cluster.") parser.add_argument("--debug", action='store_const', const=True, default=False, help="Enable CDK debug mode") parser.add_argument("--cdk-cmd", type=str, choices=["deploy", "create", "update", "diff", "ls", "list", "synth", "synthesize", "destroy", "bootstrap"], default="synth") args = parser.parse_args() @@ -297,41 +296,6 @@ def main(self): del cmdline_args[arg_index] del cmdline_args[arg_index] - # Optional - config_key = 'SubmitterSecurityGroupIds' - if config_key not in self.config and not args.SubmitterSecurityGroupIds and not args.prompt: - pass - else: - if args.SubmitterSecurityGroupIds: - arg_json_value = args.SubmitterSecurityGroupIds - arg_SubmitterSecurityGroupIds = json.loads(args.SubmitterSecurityGroupIds) - else: - arg_json_value = '' - arg_SubmitterSecurityGroupIds = None - try: - checked_value = resource_finder.get_submitter_security_groups(self.config['VpcId'], config_key, self.config.get(config_key, None), arg_SubmitterSecurityGroupIds, args.prompt) - except ValueError as e: - logger.error(e) - sys.exit(1) - if checked_value: - checked_value_json = json.dumps(checked_value) - if args.prompt: - if args.SubmitterSecurityGroupIds: - if arg_json_value != checked_value_json: - for arg_index, arg_name in enumerate(cmdline_args): - if arg_name == f'--{config_key}': - cmdline_args[arg_index + 1] = f"'{checked_value_json}'" - else: - prompt_args += [f'--{config_key}', f"'{checked_value_json}'"] - self.config[config_key] = checked_value - self.install_parameters[config_key] = base64.b64encode(checked_value_json.encode('utf-8')).decode('utf-8') - logger.info(f"{config_key:30}: {self.config[config_key]}") - else: - while f'--{config_key}' in cmdline_args: - arg_index = cmdline_args.index(f'--{config_key}') - del cmdline_args[arg_index] - del cmdline_args[arg_index] - self.install_parameters['config_file'] = args.config_file try: