
Conversation

himani2411 (Contributor) commented Jul 29, 2025

Description of changes

  • Adds a resource that installs nvidia-imex from S3; it is installed only in commercial regions and skipped in isolated regions.
    • The nvidia-imex version must match the NVIDIA driver version.
    • The nvidia-imex configuration files are kept in a shared location, /opt/parallelcluster/shared/nvidia-imex.
    • The nvidia-imex files are configured (overwritten) only during the cluster creation phase and only on GB200 instances.
    • The NVIDIA configuration files contain default values, except that we redirect nvidia-imex logs to the system logs, which are pushed to the CloudWatch log group.
    • We keep /opt/parallelcluster/shared/nvidia-imex/nodes_config_<LaunchTemplateID>.cfg containing placeholder (fake) IP addresses so the nvidia-imex service can be enabled and started during the configuration phase of cluster creation (see the sketch after this list).
      • We use the Launch Template ID (e.g. lt-123456789012) as the file-name suffix to stay within the 256-character maximum path length.
  • Similarly, NVIDIA Fabric Manager is configured to be enabled when a p6e instance is used in the compute fleet.
  • Add Kitchen and InSpec tests
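
For illustration, the per-compute-resource nodes config path described above could be composed roughly as follows. This is a minimal sketch only: the helper nvidia_imex_nodes_conf_file is used by the recipes, but its body here and the launch_template_id helper are assumptions, not necessarily what the recipes do.

# Sketch only: build the shared nodes config path from the Launch Template ID,
# so each compute resource gets its own file while the full path stays well
# under the 256-character limit, e.g. .../nodes_config_lt-123456789012.cfg
def nvidia_imex_nodes_conf_file
  shared_dir = node['cluster']['nvidia']['imex']['shared_dir'] # /opt/parallelcluster/shared/nvidia-imex
  "#{shared_dir}/nodes_config_#{launch_template_id}.cfg"
end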

Tests

  • InSpec tests
  • Built AMIs for all supported OSes
    • Tested by hardcoding the NVIDIA driver and Fabric Manager versions when building the image, as nvidia-imex was not available for the older version.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


def fabric_manager_url
"#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
# "#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
Contributor Author:

Same as #2996 (comment). Added only as part of testing and will be removed in this PR.

'nvidia-imex'
end

def nvidia_imex_full_version
Contributor:

Misleading function name: the function is expected to return the full imex version, but it only contains the imex version suffix, since the full imex version is made of ${nvidia_driver_major_version}-${nvidia_driver_version}-1.

himani2411 (Contributor Author), Jul 30, 2025:

The full version is not necessarily the one you refer to.

When we list the installed packages with apt/dnf, we see "#{node['cluster']['nvidia']['driver_version']}-1", which is why I set node.default['cluster']['nvidia']['imex']['version'] as a node attribute, which I then use in the InSpec tests.
But the package naming convention during installation requires me to add ${nvidia_driver_major_version}-${nvidia_driver_version}-1, depending on the platform, so I install using the full package name with the exact version to avoid any mismatch.
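
For illustration, the composition described here could look roughly like this. This is a sketch with example values; the exact attribute usage and package-name format depend on the platform and may differ from the recipe.

# Sketch only: apt/dnf report the installed version as "<driver_version>-1",
# while the full package name used at install time also carries the driver
# major version, e.g. nvidia-imex-570-570.86.15-1 (example values only).
nvidia_driver_version = node['cluster']['nvidia']['driver_version']
nvidia_driver_major_version = nvidia_driver_version.split('.').first
node.default['cluster']['nvidia']['imex']['version'] = "#{nvidia_driver_version}-1"
nvidia_imex_full_package = "nvidia-imex-#{nvidia_driver_major_version}-#{nvidia_driver_version}-1"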

Type=forking
TimeoutStartSec=infinity

ExecStart=/usr/bin/nvidia-imex -c <%= node['cluster']['nvidia']['imex']['shared_dir'] %>/config.cfg
Contributor:

We reference the path to nvidia-imex binary and config file in different places. What about injecting the variables directly?

ExecStart=<%= node['cluster']['nvidia']['imex']['binary'] %> -c <%= node['cluster']['nvidia']['imex']['config'] %>

himani2411 (Contributor Author), Jul 30, 2025:

So you want a variable for the binary? Why? We don't decide this path, and I don't see the point of adding an attribute if it's not something I use anywhere else.

@@ -0,0 +1,26 @@
[Unit]
Contributor:

If this file is taken directly from the nvidia-imex documentation, then link to the official doc describing it. That way, if we face issues with the service definition at some point, we have a quick reference to the values suggested by NVIDIA and can quickly spot the changes we made on top of them.

himani2411 (Contributor Author), Jul 30, 2025:

I picked up the file from an instance on which I had downloaded it. I will add the official doc link in a comment.


LimitCORE=infinity

Restart=on-failure
Contributor:

Why restart on failure rather than restart always?
One advantage of restarting always is that we cover every termination scenario.

Contributor Author:

Not sure I understand why we would want to keep it as always, especially when this should not be enabled or running on all instance types.

Contributor:

On those instance types where it should not run, the service is disabled, so the restart policy is not even taken into account.
The advantage of always compared to on-failure is that the service gets restarted regardless of the exit code. This is just a safety net.

However, since you took the unit definition from NVIDIA's recommendation, they probably have good reasons to prefer on-failure over always.
Let's go with on-failure.

Requires=network-online.target

[Service]
Environment="KRB5_CLIENT_KTNAME=/etc/krb5.keytab"
Contributor:

Why do we need Kerberos configuration?
What if the user has configured Kerberos to point to a different keytab file, overriding default_keytab_name in /etc/krb5.conf?

himani2411 (Contributor Author), Jul 30, 2025:

I kept it as-is because it came from the installation on the instance I was working with. But I am going to remove it, as the official doc does not include it: https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/gettingstarted.html#on-linux-based-systems

@@ -0,0 +1,3 @@
## Please replace below fake IP's
172.31.51.93

himani2411 (Contributor Author), Jul 31, 2025:

I can keep it and was initially going to do so. This is what I got from the DLAMI that I used for testing.

# 3 - Set log level to WARNING and above
# 4 - Set log level to INFO and above
# Default Value: 4
LOG_LEVEL=4
Contributor:

Here and for other parameters:
Why do we need to override the parameter with its default value?
What about commenting it out the same way we did for other params?

Contributor Author:

I keep the file as it is generated by the nvidia-imex installation. Even if setting the default value is not necessary, the installation itself keeps it, so I followed the same convention.

# to console(stderr). If the specified log file can't be opened or the
# path is empty.
# Default Value: /var/log/nvidia-imex.log
# LOG_FILE_NAME=/var/log/nvidia-imex.log
gmarciani (Contributor), Jul 30, 2025:

Does it make sense to push nvidia-imex logs to CloudWatch and configure log rotation?

Contributor Author:

We can. I wanted to make minimal changes in this PR, which is why I have redirected it to syslog.

Contributor Author:

I can keep a backlog item to keep them separated and move them to CloudWatch.

# Possible Values:
# Full path/filename string (max length of 256).
# Default Value: /etc/nvidia-imex/nodes_config.cfg
IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['shared_dir'] %>/nodes_config.cfg
Contributor:

Mentioned also in another comment. What about injecting directly the attribute containing the path to the config file rather than redefining it?

I mean:

IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['config'] %>

Contributor Author:

I have kept it specific to each node and don't see the point of adding another attribute. But if we later find that it makes the implementation of the recommended custom actions easier, we can change it in the next iteration.
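
For reference, the attribute-based alternative being discussed could look roughly like this. This is a sketch only; the 'config' attribute is the reviewer's proposal and is not defined in this PR.

# Sketch only (attributes file): centralize the config path so both the systemd
# unit template and the IMEX config template can reference the same attribute.
default['cluster']['nvidia']['imex']['shared_dir'] = '/opt/parallelcluster/shared/nvidia-imex'
default['cluster']['nvidia']['imex']['config'] = "#{node['cluster']['nvidia']['imex']['shared_dir']}/config.cfg"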

# 0: Disable encryption and authentication
# 1: Enable encryption and authentication
# Default value: 0
IMEX_ENABLE_AUTH_ENCRYPTION=0
Contributor:

Why not enable encryption?

himani2411 (Contributor Author), Jul 31, 2025:

I kept the file as it is generated by the nvidia-imex installation. Also, enabling auth would require setting up other values, either through SSL/TLS or other options like SOURCE, TARGET_OVERRIDE, etc., as per https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html. I think this can be done in the next iteration; let's make minimal changes for now.

Contributor Author:

@mjkoop
Do you have any recommendations on this?


I'd recommend using encryption in the future, but I don't see it as a blocker since the initial release is a guide vs. a full implementation. More importantly, I'd recommend we let the customer know how the IMEX channels work, so they understand that in single-channel mode there is no protection across users on the same node.

Contributor Author:

Just to be sure: I wasn't planning on making changes for IMEX channels in the next iteration either, as that would be more of a customization w.r.t. DirectoryServices.

codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.47%. Comparing base (6127e18) to head (af90597).
⚠️ Report is 13 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2996      +/-   ##
===========================================
- Coverage    75.50%   75.47%   -0.04%     
===========================================
  Files           23       23              
  Lines         2356     2357       +1     
===========================================
  Hits          1779     1779              
- Misses         577      578       +1     
Flag       Coverage Δ
unittests  75.47% <ø> (-0.04%) ⬇️


if get_nvswitch_count(get_device_ids['gb200']) > 1
# For each Compute Resource, we generate a unique NVIDIA IMEX configuration file,
# if one doesn't already exist in a common, shared location.
template nvidia_imex_nodes_conf_file do
Contributor:

The overall design assumes the imex nodes config file is shared in the cluster.
This is helpful to centralize the orchestration from the head node and also to simplify troubleshooting.
However, it would be a blocker for the per-job deployment type (https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#per-job-wide). Avoiding blockers for such a deployment would be beneficial not only for our users, but also for us, because it would allow us to vend the automated configuration of imex following the NVIDIA example SLURM Scheduler Integration.

Contributor Author:

Let's have an offline discussion, as there would be blockers on the job-wide deployment model.

Contributor Author:

As discussed, we will keep the existing changes, and we can later adjust the naming convention or whatever is easier for the design of the Custom Actions we recommend.
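
For reference, a per-job nodes config (as in the per-job deployment model discussed above) could be generated roughly like this from a Slurm prolog. This is a hypothetical sketch only, not part of this PR; the exact environment variables, paths, and naming depend on the Slurm setup.

#!/usr/bin/env ruby
# Sketch only: resolve the job's node list to IP addresses and write a
# job-specific IMEX nodes config file. Paths and naming are illustrative.
require 'resolv'

hostnames = `scontrol show hostnames #{ENV['SLURM_JOB_NODELIST']}`.split
ips = hostnames.map { |host| Resolv.getaddress(host) }
File.write("/opt/parallelcluster/shared/nvidia-imex/nodes_config_job_#{ENV['SLURM_JOB_ID']}.cfg", ips.join("\n") + "\n")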

@himani2411 himani2411 force-pushed the nvidia-imex-install branch from fbee300 to c9c81cd Compare August 5, 2025 20:53
@himani2411 himani2411 enabled auto-merge (rebase) August 5, 2025 20:59
@himani2411 himani2411 disabled auto-merge August 5, 2025 20:59
@himani2411 himani2411 enabled auto-merge (squash) August 5, 2025 20:59
use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def imex_installed?
Contributor:

Not a blocker, but this function name is misleading.
IMEX is never installed on AL2. You need a function to determine whether or not imex should be installed, so I would rename this to install_imex? or skip_imex_installation?
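
A minimal sketch of the suggested rename (the predicate body is illustrative; the real guard may check other conditions as well):

# Sketch only: answer "should IMEX be installed on this platform?" rather than
# "is IMEX installed?". AL2 is excluded here, per the comment above.
def install_imex?
  !(platform?('amazon') && node['platform_version'].to_i == 2)
end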
