Merged
38 commits
e98def2
[NVIDIA_IMEX] Add resource to install Nvidia-imex
Jul 22, 2025
5f03bbb
[NVIDIA_IMEX] Adding Unit test for IMEX installation
Jul 22, 2025
024ba4e
[NVIDIA_IMEX] Not Install Nvidia-imex for Isolated regions
Jul 22, 2025
d53bf01
[NVIDIA_IMEX] Install Nvdia-Imex as part of Build Image
Jul 22, 2025
d925ce2
[Nvidia-imex] Never Install NVIDIA Imex for AL2
Jul 23, 2025
a46954b
[Nvidia-imex] Add unit tests for NVidia Imex
Jul 23, 2025
ff89ace
[Nvidia-imex] Cookstyle changes
Jul 23, 2025
cdce37f
[NVIDIA_IMEX] Adding Kitchen test for Installation and Configuration
Jul 23, 2025
6726f3e
[FABRIC MANAGER] Using common library for getting NVSwitch count
Jul 23, 2025
368fbee
[NVIDIA-IMEX] Configure Nvidia-imex only if we use Gb200 instance
Jul 23, 2025
9f09451
[NVIDIA-IMEX] USe specific Version naming for nvidia-imex installation
Jul 24, 2025
f743ef8
[NVIDIA-IMEX] Install Nvidia-imex and flush cache before it
Jul 28, 2025
7505635
[NVIDIA-IMEX] Redirect nvidia-imex to system logs which are pushed in CW
Jul 28, 2025
4bfca5b
[NVIDIA-IMEX] Install With specific version in name
Jul 28, 2025
dc8e30a
[NVIDIA-IMEX] Removing flush cache as it does not exist for package r…
Jul 28, 2025
2190669
[NVIDIA-IMEX] Installing NVIDIA IMEx using install_packages resource
Jul 28, 2025
4c7291d
[NVIDIA-IMEX] Adding Unit tests for Configuration of nvidia-imex
Jul 28, 2025
c336054
[NVIDIA-IMEX] Configuring nvidia-imex only for gb200 and ComputeFleet…
Jul 28, 2025
dd7e0ef
[NVIDIA-IMEX] Not check installation of nvidia-imex for Alinux2
Jul 28, 2025
f9c324f
[NVIDIA-IMEX] Inspec Test
Jul 28, 2025
ee601b7
[NVIDIA-IMEX] Setting Nvidia-imex node attributes which should show t…
Jul 28, 2025
f40e936
[NVIDIA-IMEX] Test epoch version
Jul 29, 2025
876fccb
[NVIDIA-IMEX] Add Version and package name for debian installation
Jul 29, 2025
caa3c91
[NVIDIA-IMEX] Add changelog
Jul 29, 2025
c78a8d4
Add unit test for checking configuration of nvidia-imex
Jul 29, 2025
9bed1c0
[Nvidia-Imex] Use nvidia-imex shared directory for Inspec and configu…
Jul 30, 2025
8efff28
[Nvidia-Imex] Update copyright year
Jul 30, 2025
2efe7bd
[Nvidia-Imex] Adding correct comments
Jul 30, 2025
95e1228
[Nvidia-Imex] Updating function names
Jul 30, 2025
329166a
[Nvidia-Imex] Remove _nvidia_imex_version as it is not needed
Jul 30, 2025
abc3f1f
[Nvidia-Imex] Update action sequence for service
Jul 30, 2025
9f711b8
[NVIDIA-IMEX] Comment the official docs for nvidia-imex service file
Jul 31, 2025
f68542e
[NVIDIA-IMEX] Using common naming convention for package name
Jul 31, 2025
f9a9aed
[NVIDIA-IMEX] Correcting kitchen test
Jul 31, 2025
d43d640
[NVIDIA_IMEX] Install nvidia-imex from s3
Aug 1, 2025
b6544a2
[NVIDIA_IMEX] Install nvidia-imex from s3
Aug 1, 2025
1251c23
[NVIDIA_IMEX] Update unit tests
Aug 2, 2025
c9c81cd
[NVIDIA Driver] Upgrade NVIDIA driver to 570.172.08 for all except AL2
Aug 2, 2025
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
- Addressed cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
- Upgrade DCV to version 2024.0-19030.
- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
- Install nvidia-imex for all OSs except AL2.

**BUG FIXES**
- Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
@@ -38,6 +39,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
- Libfabric-aws: libfabric-aws-2.1.0-1
- Rdma-core: rdma-core-57.0-1
- Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6
- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.

**BUG FIXES**
- Fix a bug in the installation of ARM Performance Library that was causing the build image fail in isolated environments.
@@ -16,12 +16,15 @@

# NVidia
default['cluster']['nvidia']['enabled'] = 'no'
-default['cluster']['nvidia']['driver_version'] = '570.86.15'
+default['cluster']['nvidia']['driver_version'] = '570.172.08'
default['cluster']['nvidia']['dcgm_version'] = '3.3.6'
if platform?('amazon') && node['platform_version'] == "2"
default['cluster']['nvidia']['driver_version'] = '550.127.08'
end

# nvidia-imex
default['cluster']['nvidia']['imex']['shared_dir'] = "#{node['cluster']['shared_dir']}/nvidia-imex"

# DCV
default['cluster']['dcv']['authenticator']['user'] = "dcvextauth"
default['cluster']['dcv']['authenticator']['user_id'] = node['cluster']['reserved_base_uid'] + 3
16 changes: 16 additions & 0 deletions cookbooks/aws-parallelcluster-platform/libraries/nvidia.rb
@@ -18,3 +18,19 @@ def is_process_running(process_name)

!ps.stdout.strip.empty?
end

#
# Count the PCI devices that match the given device id (used to count NVSwitches)
#
def get_nvswitch_count(device_id)
  shell_out("lspci -d #{device_id} | wc -l").stdout.strip.to_i
end

def get_device_ids
  # A100 (P4), H100 (P5), B200 (P6) and GB200 (P6e) systems have NVSwitches
  # NVSwitch device id is 10de:1af1 for P4 instance
  # NVSwitch device id is 10de:22a3 for P5 instance
  # NVSwitch device id is 10de:2901 for P6 instance
  # NVSwitch device id is 10de:2941 for P6e instance
  { 'a100' => '10de:1af1', 'h100' => '10de:22a3', 'b200' => '10de:2901', 'gb200' => '10de:2941' }
end
@@ -24,3 +24,7 @@
end

include_recipe "aws-parallelcluster-platform::nvidia_uvm"

nvidia_imex 'Configure nvidia-imex' do
action :configure
end
@@ -24,3 +24,5 @@
fabric_manager 'Install Nvidia Fabric Manager'

nvidia_dcgm 'install Nvidia datacenter-gpu-manager'

nvidia_imex 'Install nvidia-imex'
@@ -54,12 +54,8 @@ def _nvidia_driver_version

# Get number of nv switches
def get_nvswitches
-# A100 (P4), H100(P5) and B200(P6) systems have NVSwitches
-# NVSwitch device id is 10de:1af1 for P4 instance
-# NVSwitch device id is 10de:22a3 for P5 instance
-# NVSwitch device id is 10de:2901 for P6 instance
# We sum the count for all these deviceIds as output of lspci command will be >0
# for only one device ID based on the instance type
-nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
-nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }
+nvswitch_device_ids = get_device_ids.values
+nvswitch_device_ids.sum { |id| get_nvswitch_count(id) }
end
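
A minimal sketch of how the refactored `get_nvswitches` sums counts over the shared device-id map. The `stub_` names and the `FAKE_PCI_COUNTS` table are assumptions made for illustration: they stand in for the `lspci` call so the snippet runs without GPU hardware, and the count of 4 is made up.

```ruby
# Device ids copied from get_device_ids in this diff.
STUB_DEVICE_IDS = {
  'a100' => '10de:1af1', 'h100' => '10de:22a3',
  'b200' => '10de:2901', 'gb200' => '10de:2941'
}.freeze

# Hypothetical stand-in for `lspci -d <id> | wc -l`: device id => device count.
FAKE_PCI_COUNTS = { '10de:22a3' => 4 }.freeze

def stub_nvswitch_count(device_id)
  FAKE_PCI_COUNTS.fetch(device_id, 0)
end

# Only one device id matches on a given instance type, so the sum equals
# the count for that single id.
def stub_nvswitches
  STUB_DEVICE_IDS.values.sum { |id| stub_nvswitch_count(id) }
end

puts stub_nvswitches # 4
```

Keeping the id map in one library helper means a new instance family only needs one new hash entry, and both fabric manager and nvidia-imex pick it up.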
@@ -0,0 +1,24 @@
# frozen_string_literal: true

# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

provides :nvidia_imex, platform: 'amazon' do |node|
node['platform_version'].to_i == 2023
end

use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def platform
"amzn#{node['platform_version'].to_i}"
end
@@ -0,0 +1,27 @@
# frozen_string_literal: true

# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

provides :nvidia_imex, platform: 'amazon', platform_version: '2'

use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def imex_installed?
Contributor

Not a blocker, but this function name is misleading.
IMEX is never installed on AL2. You need a function that determines whether or not IMEX should be installed, so I would rename this to install_imex? or skip_imex_installation?

# We do not install NVIDIA-Imex for Alinux2 due to restriction on NVIDIA driver
true
end

action :configure do
# Do nothing
end
@@ -0,0 +1,24 @@
# frozen_string_literal: true

# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

provides :nvidia_imex, platform: 'redhat' do |node|
node['platform_version'].to_i >= 8
end

use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def platform
"rhel#{node['platform_version'].to_i}"
end
@@ -0,0 +1,24 @@
# frozen_string_literal: true

# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

provides :nvidia_imex, platform: 'rocky' do |node|
node['platform_version'].to_i >= 8
end

use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def platform
"rhel#{node['platform_version'].to_i}"
end
@@ -0,0 +1,24 @@
# frozen_string_literal: true

# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

provides :nvidia_imex, platform: 'ubuntu' do |node|
node['platform_version'].to_i >= 22
end

use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_debian.rb'

def platform
"ubuntu#{node['platform_version'].delete('.')}"
end
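
For reference, a small sketch of the `platform` strings the per-OS resources above produce. The expressions are copied from the diff; the method names and the sample `platform_version` values are assumptions chosen for illustration.

```ruby
# Amazon Linux 2023: integer part of the platform version.
def amzn_platform(platform_version)
  "amzn#{platform_version.to_i}"
end

# RHEL and Rocky both map to "rhel<major>".
def rhel_platform(platform_version)
  "rhel#{platform_version.to_i}"
end

# Ubuntu keeps major and minor, with the dot removed.
def ubuntu_platform(platform_version)
  "ubuntu#{platform_version.delete('.')}"
end

puts amzn_platform('2023')    # amzn2023
puts rhel_platform('9.4')     # rhel9
puts ubuntu_platform('22.04') # ubuntu2204
```

The resulting string is used as a path segment in the Debian download URL defined in `_nvidia_imex_debian.rb`.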
@@ -0,0 +1,106 @@
# frozen_string_literal: true
#
# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

unified_mode true
default_action :install

action :install do
return unless nvidia_enabled_or_installed?
return if on_docker? || imex_installed? || aws_region.start_with?("us-iso")

directory node['cluster']['nvidia']['imex']['shared_dir']

action_install_imex
# Save Imex version in Node Attributes for InSpec Tests
node.default['cluster']['nvidia']['imex']['version'] = nvidia_imex_full_version
node.default['cluster']['nvidia']['imex']['package'] = nvidia_imex_package
node_attributes 'dump node attributes'
end
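
The guard clauses of the `:install` action can be read as a single predicate. The sketch below restates them as a pure function; `install_imex?` and its keyword arguments are hypothetical names, not part of the cookbook.

```ruby
# Returns true only when every guard in the :install action lets it proceed.
def install_imex?(nvidia_present:, on_docker:, already_installed:, region:)
  return false unless nvidia_present
  return false if on_docker || already_installed || region.start_with?('us-iso')

  true
end

puts install_imex?(nvidia_present: true, on_docker: false,
                   already_installed: false, region: 'us-east-1')     # true
puts install_imex?(nvidia_present: true, on_docker: false,
                   already_installed: false, region: 'us-iso-east-1') # false
```

On AL2 the resource hard-codes `imex_installed?` to `true`, which corresponds to `already_installed: true` here, so the installation is always skipped.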

action :configure do
return unless imex_installed? && node['cluster']['node_type'] == "ComputeFleet"
# Start nvidia-imex on p6e-gb200 and only on ComputeFleet
if get_nvswitch_count(get_device_ids['gb200']) > 1
Contributor

In general, it is safer to:

  1. get values from a dictionary providing a default value (e.g. hash.fetch("d", "default_value")); OR
  2. make the receiving function get_nvswitch_count able to handle nil values.

In this specific case (2) is preferable, because a failed lookup with hash["d"] in Ruby does not raise an exception but returns nil.
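
A minimal sketch of option (2), assuming a hypothetical `safe_nvswitch_count` helper. The `lister:` parameter is an assumption added so the snippet runs without `lspci`; it is not part of the cookbook.

```ruby
# Device ids copied from get_device_ids in this diff.
DEVICE_IDS = {
  'a100' => '10de:1af1', 'h100' => '10de:22a3',
  'b200' => '10de:2901', 'gb200' => '10de:2941'
}.freeze

# Hypothetical nil-tolerant counter. `lister` returns the lspci output
# lines for a device id; by default it shells out to lspci.
def safe_nvswitch_count(device_id, lister: ->(id) { `lspci -d #{id}`.lines })
  return 0 if device_id.nil? || device_id.to_s.strip.empty?

  lister.call(device_id).count
end

# Fake lister for illustration: pretends two gb200 devices are present.
FAKE_LISTER = ->(id) { id == '10de:2941' ? ["dev1\n", "dev2\n"] : [] }

puts safe_nvswitch_count(DEVICE_IDS['gb200'], lister: FAKE_LISTER)   # 2
puts safe_nvswitch_count(DEVICE_IDS['unknown'], lister: FAKE_LISTER) # 0
```

`DEVICE_IDS['unknown']` evaluates to `nil`, so the guard returns 0 instead of passing an empty id to `lspci -d`.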

# For each Compute Resource, we generate a unique NVIDIA IMEX configuration file,
# if one doesn't already exist in a common, shared location.
template nvidia_imex_nodes_conf_file do
Contributor

The overall design assumes that the IMEX nodes config file is shared across the cluster.
This helps centralize the orchestration from the head node and simplifies troubleshooting.
However, it would be a blocker for the per-job deployment type (https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#per-job-wide). Avoiding blockers for that deployment type would benefit not only our users but also us, because it would allow us to vend the automated configuration of IMEX following the NVIDIA example SLURM Scheduler Integration.

Contributor Author

Let's have an offline discussion, as there would be blockers with the job-wide deployment model.

Contributor Author

As discussed, we will keep the existing changes; we can later make changes to follow the naming convention, or whichever is easier for the design of the Custom Actions we recommend.

source 'nvidia-imex/nvidia-imex-nodes.erb'
owner 'root'
group 'root'
mode '0755'
action :create
not_if { file_exists_and_cluster_update?(nvidia_imex_nodes_conf_file) }
end

template nvidia_imex_main_conf_file do
source 'nvidia-imex/nvidia-imex-config.erb'
owner 'root'
group 'root'
mode '0755'
action :create
not_if { file_exists_and_cluster_update?(nvidia_imex_main_conf_file) }
variables(imex_nodes_config_file_path: nvidia_imex_nodes_conf_file)
end

template "/etc/systemd/system/#{nvidia_imex_service}.service" do
source 'nvidia-imex/nvidia-imex.service.erb'
owner 'root'
group 'root'
mode '0644'
action :create
variables(imex_main_config_file_path: nvidia_imex_main_conf_file)
end

service nvidia_imex_service do
action %i(enable start)
supports status: true
end
end
end
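
The conditions under which the `:configure` action writes the config files and starts the service can be summarized as one predicate. This is a sketch only; `configure_imex?` and its arguments are hypothetical names.

```ruby
# True when nvidia-imex is installed, the node belongs to the compute fleet,
# and more than one GB200 device id match was found (the `> 1` check above).
def configure_imex?(imex_installed:, node_type:, gb200_nvswitch_count:)
  imex_installed && node_type == 'ComputeFleet' && gb200_nvswitch_count > 1
end

puts configure_imex?(imex_installed: true, node_type: 'ComputeFleet',
                     gb200_nvswitch_count: 2) # true
puts configure_imex?(imex_installed: true, node_type: 'HeadNode',
                     gb200_nvswitch_count: 2) # false
```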

def nvidia_imex_package
"#{nvidia_imex_service}-#{nvidia_driver_major_version}"
end

def nvidia_driver_major_version
node['cluster']['nvidia']['driver_version'].split('.')[0]
end

def nvidia_imex_service
'nvidia-imex'
end

def nvidia_imex_full_version
Contributor

Misleading function name: the function is expected to return the full IMEX version, but it returns only the IMEX version suffix, since the full IMEX version is made of ${nvidia_driver_major_version}-${nvidia_driver_version}-1.

Contributor Author

@himani2411 Jul 30, 2025

The full version is not necessarily the one you refer to.

When we list the installed packages with apt/dnf, we see "#{node['cluster']['nvidia']['driver_version']}-1", which is why I set node.default['cluster']['nvidia']['imex']['version'] as a node attribute that I use in the InSpec tests.
However, the package naming convention requires me to add ${nvidia_driver_major_version}-${nvidia_driver_version}-1 during installation, depending on the platform, so I install using the full package name with the exact version so that there is no mismatch.

"#{node['cluster']['nvidia']['driver_version']}-1"
end
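
To make the naming discussion concrete, here is a sketch of the two strings the helpers above produce for the driver version set in this PR. The expressions mirror `nvidia_imex_package` and `nvidia_imex_full_version`; the standalone method names are assumptions.

```ruby
# Major component of the driver version, e.g. "570" for "570.172.08".
def driver_major_version(driver_version)
  driver_version.split('.')[0]
end

# Package name: service name plus driver major version.
def imex_package(driver_version)
  "nvidia-imex-#{driver_major_version(driver_version)}"
end

# Version string as reported for the installed package.
def imex_version(driver_version)
  "#{driver_version}-1"
end

puts imex_package('570.172.08') # nvidia-imex-570
puts imex_version('570.172.08') # 570.172.08-1
```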

def imex_installed?
::File.exist?("/usr/bin/#{nvidia_imex_service}") || ::File.exist?("/usr/bin/#{nvidia_imex_service}-ctl")
end

def nvidia_enabled_or_installed?
nvidia_enabled? || nvidia_installed?
end

def file_exists_and_cluster_update?(file_path)
::File.exist?(file_path) && !are_queues_updated?
end

def nvidia_imex_main_conf_file
"#{node['cluster']['nvidia']['imex']['shared_dir']}/config_#{node['cluster']['launch_template_id']}.cfg"
end

def nvidia_imex_nodes_conf_file
"#{node['cluster']['nvidia']['imex']['shared_dir']}/nodes_config_#{node['cluster']['launch_template_id']}.cfg"
end
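
A short sketch of the per-launch-template file naming and of the `not_if` guard used by the templates. The shared directory and launch template id below are made-up sample values, and `imex_conf_paths` and `skip_template?` are hypothetical names.

```ruby
# One main config and one nodes config per launch template, both under
# the shared nvidia-imex directory.
def imex_conf_paths(shared_dir, launch_template_id)
  {
    main: "#{shared_dir}/config_#{launch_template_id}.cfg",
    nodes: "#{shared_dir}/nodes_config_#{launch_template_id}.cfg"
  }
end

# Mirrors file_exists_and_cluster_update?: keep an existing file unless
# the queues were updated.
def skip_template?(file_exists:, queues_updated:)
  file_exists && !queues_updated
end

paths = imex_conf_paths('/shared/nvidia-imex', 'lt-0123456789abcdef0')
puts paths[:main]  # /shared/nvidia-imex/config_lt-0123456789abcdef0.cfg
puts paths[:nodes] # /shared/nvidia-imex/nodes_config_lt-0123456789abcdef0.cfg
puts skip_template?(file_exists: true, queues_updated: false) # true
```

Because the file name includes the launch template id, each compute resource gets its own pair of files in the shared location.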
@@ -0,0 +1,42 @@
# frozen_string_literal: true
#
# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied.
# See the License for the specific language governing permissions and limitations under the License.

action :install_imex do
remote_file "#{node['cluster']['sources_dir']}/#{nvidia_imex_package}-#{nvidia_imex_full_version}.deb" do
source "#{nvidia_imex_url}"
mode '0644'
retries 3
retry_delay 5
action :create_if_missing
end

bash "Install nvidia-imex" do
user 'root'
cwd node['cluster']['sources_dir']
code <<-NVIDIA_IMEX
set -e
dpkg -i #{nvidia_imex_package}-#{nvidia_imex_full_version}.deb && apt-mark hold #{nvidia_imex_package}
NVIDIA_IMEX
retries 3
retry_delay 5
end
end

def nvidia_imex_url
"#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_imex/#{platform}/#{nvidia_imex_package}_#{nvidia_imex_full_version}_#{arch_suffix}.deb"
end

def arch_suffix
arm_instance? ? 'arm64' : 'amd64'
end
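
As an illustration of `nvidia_imex_url`, the sketch below assembles the `.deb` location from its parts. The base URL is a placeholder (the real value comes from `node['cluster']['artifacts_s3_url']`), and `imex_deb_url` is a hypothetical name.

```ruby
# Builds the download URL using the same layout as nvidia_imex_url:
# <base>/dependencies/nvidia_imex/<platform>/<package>_<version>_<arch>.deb
def imex_deb_url(artifacts_s3_url:, platform:, package:, version:, arm:)
  arch = arm ? 'arm64' : 'amd64'
  "#{artifacts_s3_url}/dependencies/nvidia_imex/#{platform}/#{package}_#{version}_#{arch}.deb"
end

puts imex_deb_url(
  artifacts_s3_url: 'https://example.invalid/artifacts', # placeholder base URL
  platform: 'ubuntu2204',
  package: 'nvidia-imex-570',
  version: '570.172.08-1',
  arm: true
)
# https://example.invalid/artifacts/dependencies/nvidia_imex/ubuntu2204/nvidia-imex-570_570.172.08-1_arm64.deb
```

Note that the downloaded file is saved locally as `<package>-<version>.deb`, while the remote object name uses underscores and an architecture suffix.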