10 changes: 1 addition & 9 deletions docs/config.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,8 +76,6 @@ This project creates a ParallelCluster configuration file that is documented in
- str
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#HeadNode-v3-Imds">Imds</a>:
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Imds-Secured">Secured</a>: bool
<a href="#submittersecuritygroupids">SubmitterSecurityGroupIds</a>:
SecurityGroupName: SecurityGroupId
<a href="#submitterinstancetags">SubmitterInstanceTags</a>: str
TagName:
- TagValues
Expand Down Expand Up @@ -249,7 +247,7 @@ See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/lates

See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi) for the custom AMI documentation.

**NOTE**: A CustomAmi must be provided for Rocky8.
**NOTE**: A CustomAmi must be provided for Rocky8 or Rocky9.
All other distributions have a default AMI that is provided by ParallelCluster.

#### Architecture
Expand Down Expand Up @@ -491,12 +489,6 @@ Additional security groups that will be added to the head node instance.

List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.

### SubmitterSecurityGroupIds

External security groups that should be able to use the cluster.

Rules will be added to allow it to interact with Slurm.

### SubmitterInstanceTags

Tags of instances that can be configured to submit to the cluster.
Expand Down
74 changes: 73 additions & 1 deletion docs/deployment-prerequisites.md
Expand Up @@ -99,6 +99,76 @@ The version that has been tested is in the CDK_VERSION variable in the install s

The install script will try to install the prerequisites if they aren't already installed.

## Security Groups for Login Nodes

If you want to allow instances like remote desktops to use the cluster directly, you must define
three security groups that allow connections between the instance, the Slurm head node, and the Slurm compute nodes.
We call the instance that is connecting to the Slurm cluster a login node or a submitter instance.

This documentation uses the following names for the three security groups, but you can name them whatever you want.

* SlurmSubmitterSG
* SlurmHeadNodeSG
* SlurmComputeNodeSG

### Slurm Submitter Security Group

The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd

### Slurm Head Node Security Group

The SlurmHeadNodeSG will be specified in your configuration file for the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 2049 | SlurmSubmitterSG | SlurmSubmitter NFS
| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
| TCP | 6820-6829 | SlurmSubmitterSG | SlurmSubmitter slurmctld
| TCP | 6830 | SlurmSubmitterSG | SlurmSubmitter slurmrestd

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral

### Slurm Compute Node Security Group

The SlurmComputeNodeSG will be specified in your configuration file for the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 6818 | SlurmSubmitterSG | SlurmSubmitter slurmd

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
| TCP | 6000-7024 | SlurmSubmitterSG | SlurmSubmitter X11
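The rules in the tables above can also be expressed programmatically. The following is a minimal sketch (not part of this project) that encodes the SlurmSubmitterSG inbound rules as boto3 `IpPermissions` structures; the security group IDs are placeholders, and the actual `authorize_security_group_ingress` call is shown commented out because it requires AWS credentials.

```python
# Sketch: encode the SlurmSubmitterSG inbound rules from the table above
# as boto3 IpPermissions structures. Group IDs are placeholders.

def sg_rule(port_range, source_sg_id, description):
    """Build one TCP IpPermissions entry that references another security group."""
    from_port, to_port = port_range
    return {
        'IpProtocol': 'tcp',
        'FromPort': from_port,
        'ToPort': to_port,
        'UserIdGroupPairs': [{'GroupId': source_sg_id, 'Description': description}],
    }

# Placeholder security group ids; replace with your real group ids.
SLURM_HEAD_NODE_SG = 'sg-11111111'
SLURM_COMPUTE_NODE_SG = 'sg-22222222'

# Inbound rules for SlurmSubmitterSG, mirroring the first table above.
submitter_ingress = [
    sg_rule((1024, 65535), SLURM_HEAD_NODE_SG, 'SlurmHeadNode ephemeral'),
    sg_rule((1024, 65535), SLURM_COMPUTE_NODE_SG, 'SlurmComputeNode ephemeral'),
    sg_rule((6000, 7024), SLURM_COMPUTE_NODE_SG, 'SlurmComputeNode X11'),
]

# To apply the rules for real (requires AWS credentials):
# import boto3
# ec2 = boto3.client('ec2')
# ec2.authorize_security_group_ingress(
#     GroupId='sg-00000000',  # SlurmSubmitterSG
#     IpPermissions=submitter_ingress,
# )
```

The same helper can be reused for the head node and compute node groups by swapping the source group and port ranges from their respective tables.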

## Create Configuration File

Before you deploy a cluster you need to create a configuration file.
Expand All @@ -108,6 +178,7 @@ Ideally you should version control this file so you can keep track of changes.

The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L445).
The schema is defined in Python, but the actual config file should be in YAML format.
See [Configuration File Format](config.md) for documentation on all of the parameters.

The following are key parameters that you will need to update.
If you do not have the required parameters in your config file then the installer script will fail unless you specify the `--prompt` option.
Expand All @@ -120,7 +191,6 @@ You should save your selections in the config file.
| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)
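As an example, a minimal config file combining the key parameters above might look like the following. All of the values are placeholders that you must replace with values for your account and VPC; see config_schema.py for the full schema.

```yaml
# Placeholder values; replace with your own.
Region: us-east-1
VpcId: vpc-0123456789abcdef0
SshKeyPair: my-ec2-keypair
ErrorSnsTopicArn: arn:aws:sns:us-east-1:123456789012:slurm-errors
slurm:
  InstanceConfig:
    NodeCounts:
      DefaultMaxCount: 10
```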

Expand All @@ -137,7 +207,9 @@ all nodes must have the same architecture and Base OS.
| CentOS 7 | x86_64
| RedHat 7 | x86_64
| RedHat 8 | x86_64, arm64
| RedHat 9 | x86_64, arm64
| Rocky 8 | x86_64, arm64
| Rocky 9 | x86_64, arm64

You can exclude instance types by family or by specific instance type.
By default, the InstanceConfig excludes older-generation instance families.
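For example, an exclusion might be sketched as below. This assumes the InstanceConfig schema accepts an `Exclude` key with `InstanceFamilies` and `InstanceTypes` lists; verify the exact key names against config_schema.py before using it.

```yaml
slurm:
  InstanceConfig:
    Exclude:
      InstanceFamilies:
        - 'c4'           # exclude an older-generation family
      InstanceTypes:
        - 'm5.24xlarge'  # exclude one specific instance type
```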
Expand Down
9 changes: 7 additions & 2 deletions docs/res_integration.md
Expand Up @@ -11,11 +11,12 @@ The intention is to completely automate the deployment of ParallelCluster and se
|-----------|-------------|------
| VpcId | VPC id for the RES cluster | vpc-xxxxxx
| SubnetId | Subnet in the RES VPC. | subnet-xxxxx
| SubmitterSecurityGroupIds | The security group names and ids used by RES VDIs. The name will be something like *EnvironmentName*-vdc-dcv-host-security-group | *EnvironmentName*-*VDISG*: sg-xxxxxxxx
| SubmitterInstanceTags | The tags of VDI instances. | 'res:EnvironmentName': '*EnvironmentName*'
| ExtraMounts | The mount parameters for the /home directory. This is required for access to the home directory. |
| ExtraMountSecurityGroups | Security groups that give access to the ExtraMounts. These will be added to compute nodes so they can access the file systems.

You must also create security groups as described in [Security Groups for Login Nodes](deployment-prerequisites.md#security-groups-for-login-nodes) and specify the SlurmHeadNodeSG in the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter and the SlurmComputeNodeSG in the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter.

When you specify **RESEnvironmentName**, a Lambda function will run SSM commands to create a cron job on a RES domain-joined instance to update the users_groups.json file every hour. Another Lambda function will also automatically configure all running VDI hosts to use the cluster.

The following example shows the configuration parameters for a RES with the EnvironmentName=res-eda.
Expand Down Expand Up @@ -51,11 +52,15 @@ slurm:
Database:
DatabaseStackName: pcluster-slurm-db-res

SlurmCtl: {}
SlurmCtl:
AdditionalSecurityGroups:
- sg-12345678 # SlurmHeadNodeSG

# Configure typical EDA instance types
# A partition will be created for each combination of Base OS, Architecture, and Spot
InstanceConfig:
AdditionalSecurityGroups:
- sg-23456789 # SlurmComputeNodeSG
UseSpot: true
NodeCounts:
DefaultMaxCount: 10
Expand Down
3 changes: 2 additions & 1 deletion docs/soca_integration.md
Expand Up @@ -11,7 +11,8 @@ Set the following parameters in your config file.
| Parameter | Description | Value
|-----------|-------------|------
| VpcId | VPC id for the SOCA cluster | vpc-xxxxxx
| SubmitterSecurityGroupIds | The ComputeNode security group name and id | *cluster-id*-*ComputeNodeSG*: sg-xxxxxxxx
| slurm/SlurmCtl/AdditionalSecurityGroups | Security group ids that give desktop instances access to the head node and that give the head node access to VPC resources such as file systems.
| slurm/InstanceConfig/AdditionalSecurityGroups | Security group ids that give desktop instances access to the compute nodes and that give compute nodes access to VPC resources such as file systems.
| ExtraMounts | Add the mount parameters for the /apps and /data directories. This is required for access to the home directory. |

Deploy your slurm cluster.
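Following the pattern of the RES example, a SOCA configuration fragment for the two security group parameters might look like the following; the security group ids are placeholders for your cluster's actual groups.

```yaml
slurm:
  SlurmCtl:
    AdditionalSecurityGroups:
      - sg-12345678  # Gives desktop instances access to the head node
  InstanceConfig:
    AdditionalSecurityGroups:
      - sg-23456789  # Gives desktop instances access to the compute nodes
```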
Expand Down
11 changes: 10 additions & 1 deletion setup.sh
Expand Up @@ -41,7 +41,16 @@ fi
echo "Using python $python_version"

# Check nodejs version
# https://nodejs.org/en/about/previous-releases
required_nodejs_version=16.20.2
# required_nodejs_version=18.20.2
# On Amazon Linux 2 and nodejs 18.20.2 I get the following errors:
# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
# required_nodejs_version=20.13.1
# On Amazon Linux 2 and nodejs 20.13.1 I get the following errors:
# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
export JSII_SILENCE_WARNING_DEPRECATED_NODE_VERSION=1
if ! which node &> /dev/null; then
echo -e "\nnode not found in your path."
Expand Down Expand Up @@ -88,7 +97,7 @@ fi
echo "Using nodejs version $nodejs_version"

# Create a local installation of cdk
CDK_VERSION=2.91.0 # If you change the CDK version here, make sure to also change it in source/requirements.txt
CDK_VERSION=2.111.0 # When you change the CDK version here, make sure to also change it in source/requirements.txt
if ! cdk --version &> /dev/null; then
echo "CDK not installed. Installing global version of cdk@$CDK_VERSION."
if ! npm install -g aws-cdk@$CDK_VERSION; then
Expand Down
48 changes: 5 additions & 43 deletions source/cdk/cdk_slurm_stack.py
Expand Up @@ -231,17 +231,6 @@ def override_config_with_context(self):
logger.error(f"Must set --{command_line_switch} from the command line or {config_key} in the config files")
exit(1)

config_key = 'SubmitterSecurityGroupIds'
context_key = config_key
submitterSecurityGroupIds_b64_string = self.node.try_get_context(context_key)
if submitterSecurityGroupIds_b64_string:
submitterSecurityGroupIds = json.loads(base64.b64decode(submitterSecurityGroupIds_b64_string).decode('utf-8'))
if config_key not in self.config['slurm']:
logger.info(f"slurm/{config_key:20} set from command line: {submitterSecurityGroupIds}")
else:
logger.info(f"slurm/{config_key:20} in config file overridden on command line from {self.config['slurm'][config_key]} to {submitterSecurityGroupIds}")
self.config['slurm'][config_key] = submitterSecurityGroupIds

def check_config(self):
'''
Check config, set defaults, and sanity check the configuration.
Expand Down Expand Up @@ -425,6 +414,9 @@ def update_config_for_res(self):
'''
res_environment_name = self.config['RESEnvironmentName']
logger.info(f"Updating configuration for RES environment: {res_environment_name}")

self.config['slurm']['SubmitterInstanceTags'] = {'res:EnvironmentName': [res_environment_name]}

cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region'])
res_stack_name = None
stack_statuses = {}
Expand Down Expand Up @@ -481,13 +473,6 @@ def update_config_for_res(self):
self.config['SubnetId'] = subnet_ids[0]
logger.info(f" SubnetId: {self.config['SubnetId']}")

submitter_security_group_ids = []
if 'SubmitterSecurityGroupIds' not in self.config['slurm']:
self.config['slurm']['SubmitterSecurityGroupIds'] = {}
else:
for security_group_name, security_group_ids in self.config['slurm']['SubmitterSecurityGroupIds'].items():
submitter_security_group_ids.append(security_group_ids)

# Get RES VDI Security Group
res_vdc_stack_name = f"{res_stack_name}-vdc"
if res_vdc_stack_name not in stack_statuses:
Expand All @@ -508,11 +493,6 @@ def update_config_for_res(self):
if not res_dcv_security_group_id:
logger.error(f"RES VDI security group not found.")
exit(1)
if res_dcv_security_group_id not in submitter_security_group_ids:
res_dcv_security_group_name = f"{res_environment_name}-dcv-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_dcv_security_group_name}'] = '{res_dcv_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_dcv_security_group_name] = res_dcv_security_group_id
submitter_security_group_ids.append(res_dcv_security_group_id)

# Get cluster manager Security Group
logger.debug(f"Searching for cluster manager security group id")
Expand All @@ -535,11 +515,6 @@ def update_config_for_res(self):
if not res_cluster_manager_security_group_id:
logger.error(f"RES cluster manager security group not found.")
exit(1)
if res_cluster_manager_security_group_id not in submitter_security_group_ids:
res_cluster_manager_security_group_name = f"{res_environment_name}-cluster-manager-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_cluster_manager_security_group_name}'] = '{res_cluster_manager_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_cluster_manager_security_group_name] = res_cluster_manager_security_group_id
submitter_security_group_ids.append(res_cluster_manager_security_group_id)

# Get vdc controller Security Group
logger.debug(f"Searching for VDC controller security group id")
Expand All @@ -564,11 +539,6 @@ def update_config_for_res(self):
if not res_vdc_controller_security_group_id:
logger.error(f"RES VDC controller security group not found.")
exit(1)
if res_vdc_controller_security_group_id not in submitter_security_group_ids:
res_vdc_controller_security_group_name = f"{res_environment_name}-vdc-controller-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_vdc_controller_security_group_name}'] = '{res_vdc_controller_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_vdc_controller_security_group_name] = res_vdc_controller_security_group_id
submitter_security_group_ids.append(res_vdc_controller_security_group_id)

# Configure the /home mount from RES if /home not already configured
home_mount_found = False
Expand Down Expand Up @@ -1025,7 +995,7 @@ def create_parallel_cluster_lambdas(self):
],
compatible_runtimes = [
aws_lambda.Runtime.PYTHON_3_9,
aws_lambda.Runtime.PYTHON_3_10,
# aws_lambda.Runtime.PYTHON_3_10, # Doesn't work: No module named 'rpds.rpds'
# aws_lambda.Runtime.PYTHON_3_11, # Doesn't work: No module named 'rpds.rpds'
],
)
Expand Down Expand Up @@ -1694,7 +1664,7 @@ def create_callSlurmRestApiLambda(self):
function_name=f"{self.stack_name}-CallSlurmRestApiLambda",
description="Example showing how to call Slurm REST API",
memory_size=128,
runtime=aws_lambda.Runtime.PYTHON_3_8,
runtime=aws_lambda.Runtime.PYTHON_3_9,
architecture=aws_lambda.Architecture.ARM_64,
timeout=Duration.minutes(1),
log_retention=logs.RetentionDays.INFINITE,
Expand Down Expand Up @@ -1842,14 +1812,6 @@ def create_security_groups(self):
Tags.of(self.slurm_submitter_sg).add("Name", self.slurm_submitter_sg_name)
self.suppress_cfn_nag(self.slurm_submitter_sg, 'W29', 'Egress port range used to block all egress')
self.submitter_security_groups[self.slurm_submitter_sg_name] = self.slurm_submitter_sg
for slurm_submitter_sg_name, slurm_submitter_sg_id in self.config['slurm']['SubmitterSecurityGroupIds'].items():
(allow_all_outbound, allow_all_ipv6_outbound) = self.allow_all_outbound(slurm_submitter_sg_id)
self.submitter_security_groups[slurm_submitter_sg_name] = ec2.SecurityGroup.from_security_group_id(
self, f"{slurm_submitter_sg_name}",
security_group_id = slurm_submitter_sg_id,
allow_all_outbound = allow_all_outbound,
allow_all_ipv6_outbound = allow_all_ipv6_outbound
)

self.slurm_rest_api_lambda_sg = ec2.SecurityGroup(self, "SlurmRestLambdaSG", vpc=self.vpc, allow_all_outbound=False, description="SlurmRestApiLambda to SlurmCtl Security Group")
self.slurm_rest_api_lambda_sg_name = f"{self.stack_name}-SlurmRestApiLambdaSG"
Expand Down