
Conversation

himani2411 (Contributor) commented Jul 29, 2025

Description of changes

  • Adds a resource that installs nvidia-imex from S3; it is installed only in commercial regions and skipped in isolated regions.
    • The nvidia-imex version must match the NVIDIA driver version.
    • The nvidia-imex configuration files are kept in a shared location, /opt/parallelcluster/shared/nvidia-imex.
    • The nvidia-imex files are configured (overwritten) only during the cluster creation phase and only on GB200 instances.
    • The NVIDIA configuration files contain default values, except that we redirect nvidia-imex logs to the system logs, which are pushed to the CloudWatch log group.
    • We keep /opt/parallelcluster/shared/nvidia-imex/nodes_config_<LaunchTemplateID>.cfg containing placeholder (fake) IP addresses so the nvidia-imex service can be enabled and started during the configuration phase of cluster creation (see the sketch after this list).
      • We use the Launch Template ID (e.g. lt-123456789012) as the file-name suffix to stay within the 256-character maximum path length.
  • Similarly, NVIDIA Fabric Manager is configured to be enabled when a p6e instance is used in the compute fleet.
  • Add Kitchen and InSpec tests
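
For illustration, the per-compute-resource nodes config path described above could be composed roughly as follows. This is a minimal sketch only: the helper nvidia_imex_nodes_conf_file is used by the recipes, but its body here and the launch_template_id helper are assumptions, not necessarily what the recipes do.

# Sketch only: build the shared nodes config path from the Launch Template ID,
# so each compute resource gets its own file while the full path stays well
# under the 256-character limit, e.g. .../nodes_config_lt-123456789012.cfg
def nvidia_imex_nodes_conf_file
  shared_dir = node['cluster']['nvidia']['imex']['shared_dir'] # /opt/parallelcluster/shared/nvidia-imex
  "#{shared_dir}/nodes_config_#{launch_template_id}.cfg"
end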

Tests

  • InSpec tests
  • Built AMIs for all supported OSes
    • Tested by hardcoding the NVIDIA driver and Fabric Manager versions when building the image, as nvidia-imex was not available for the older version.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


def fabric_manager_url
"#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
# "#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
Contributor Author:

Same as #2996 (comment). Added only as part of testing and will be removed in this PR.

'nvidia-imex'
end

def nvidia_imex_full_version
Contributor:

Misleading function name: the function is expected to return the full imex version, but it only contains the imex version suffix, since the full imex version is made of ${nvidia_driver_major_version}-${nvidia_driver_version}-1.

himani2411 (Contributor Author), Jul 30, 2025:

The full version is not necessarily the one you refer to.

When we list the installed packages with apt/dnf, we see "#{node['cluster']['nvidia']['driver_version']}-1", which is why I set node.default['cluster']['nvidia']['imex']['version'] as a node attribute, which I then use in the InSpec tests.
But the package naming convention during installation requires me to add ${nvidia_driver_major_version}-${nvidia_driver_version}-1, depending on the platform, so I install using the full package name with the exact version to avoid any mismatch.
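
For illustration, the composition described here could look roughly like this. This is a sketch with example values; the exact attribute usage and package-name format depend on the platform and may differ from the recipe.

# Sketch only: apt/dnf report the installed version as "<driver_version>-1",
# while the full package name used at install time also carries the driver
# major version, e.g. nvidia-imex-570-570.86.15-1 (example values only).
nvidia_driver_version = node['cluster']['nvidia']['driver_version']
nvidia_driver_major_version = nvidia_driver_version.split('.').first
node.default['cluster']['nvidia']['imex']['version'] = "#{nvidia_driver_version}-1"
nvidia_imex_full_package = "nvidia-imex-#{nvidia_driver_major_version}-#{nvidia_driver_version}-1"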

Type=forking
TimeoutStartSec=infinity

ExecStart=/usr/bin/nvidia-imex -c <%= node['cluster']['nvidia']['imex']['shared_dir'] %>/config.cfg
Contributor:

We reference the path to nvidia-imex binary and config file in different places. What about injecting the variables directly?

ExecStart=<%= node['cluster']['nvidia']['imex']['binary'] %> -c <%= node['cluster']['nvidia']['imex']['config'] %>

himani2411 (Contributor Author), Jul 30, 2025:

So you want a variable for the binary? Why? We don't decide this path, and I don't see the point of adding an attribute if it's not something I use anywhere else.

@@ -0,0 +1,26 @@
[Unit]
Contributor:

If this file is taken directly from the nvidia-imex documentation, then link to the official doc describing it. That way, if we face issues with the service definition at some point, we have a quick reference to the values suggested by NVIDIA and can quickly spot the changes we made on top of them.

himani2411 (Contributor Author), Jul 30, 2025:

I picked up the file from an instance on which I had downloaded it. I will add the official doc link in a comment.


LimitCORE=infinity

Restart=on-failure
Contributor:

Why restart on failure rather than restart always?
One advantage of restarting always is that we cover every termination scenario.

Contributor Author:

Not sure I understand why we would want to keep it as always, especially when this should not be enabled or running on all instance types.

Contributor:

On those instance types where it should not run, the service is disabled, so the restart policy is not even taken into account.
The advantage of always compared to on-failure is that the service gets restarted regardless of the exit code. This is just a safety net.

However, since you took the unit definition from NVIDIA's recommendation, they probably have good reasons to prefer on-failure over always.
Let's go with on-failure.

Requires=network-online.target

[Service]
Environment="KRB5_CLIENT_KTNAME=/etc/krb5.keytab"
Contributor:

Why do we need Kerberos configuration?
What if the user has configured Kerberos to point to a different keytab file, overriding default_keytab_name in /etc/krb5.conf?

himani2411 (Contributor Author), Jul 30, 2025:

I kept it as-is because it came from the installation on the instance I was working with. But I am going to remove it, as the official doc does not include it: https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/gettingstarted.html#on-linux-based-systems

@@ -0,0 +1,3 @@
## Please replace below fake IP's
172.31.51.93

himani2411 (Contributor Author), Jul 31, 2025:

I can keep it and was initially going to do so. This is what I got from the DLAMI that I used for testing.

# 3 - Set log level to WARNING and above
# 4 - Set log level to INFO and above
# Default Value: 4
LOG_LEVEL=4
Contributor:

Here and for other parameters:
Why do we need to override the parameter with its default value?
What about commenting it out the same way we did for other params?

Contributor Author:

I keep the file as it is generated by the nvidia-imex installation. Even if setting the default value is not necessary, the installation itself keeps it, so I followed the same convention.

# to console(stderr). If the specified log file can't be opened or the
# path is empty.
# Default Value: /var/log/nvidia-imex.log
# LOG_FILE_NAME=/var/log/nvidia-imex.log
gmarciani (Contributor), Jul 30, 2025:

Does it make sense to push nvidia-imex logs to CloudWatch and configure log rotation?

Contributor Author:

We can. I wanted to make minimal changes in this PR, which is why I have redirected it to syslog.

Contributor Author:

I can keep a backlog item to keep them separated and move them to CloudWatch.

# Possible Values:
# Full path/filename string (max length of 256).
# Default Value: /etc/nvidia-imex/nodes_config.cfg
IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['shared_dir'] %>/nodes_config.cfg
Contributor:

Mentioned also in another comment. What about injecting directly the attribute containing the path to the config file rather than redefining it?

I mean:

IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['config'] %>

Contributor Author:

I have kept it specific to each node and don't see the point of adding another attribute. But if we later find that it makes the implementation of the recommended custom actions easier, we can change it in the next iteration.
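
For reference, the attribute-based alternative being discussed could look roughly like this. This is a sketch only; the 'config' attribute is the reviewer's proposal and is not defined in this PR.

# Sketch only (attributes file): centralize the config path so both the systemd
# unit template and the IMEX config template can reference the same attribute.
default['cluster']['nvidia']['imex']['shared_dir'] = '/opt/parallelcluster/shared/nvidia-imex'
default['cluster']['nvidia']['imex']['config'] = "#{node['cluster']['nvidia']['imex']['shared_dir']}/config.cfg"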

# 0: Disable encryption and authentication
# 1: Enable encryption and authentication
# Default value: 0
IMEX_ENABLE_AUTH_ENCRYPTION=0
Contributor:

Why not enable encryption?

himani2411 (Contributor Author), Jul 31, 2025:

I kept the file as it is generated by the nvidia-imex installation. Also, enabling auth would require setting up other values, either through SSL/TLS or other options like SOURCE, TARGET_OVERRIDE, etc., as per https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html. I think this can be done in the next iteration; let's make minimal changes for now.

Contributor Author:

@mjkoop
Do you have any recommendations on this?


I'd recommend using encryption in the future, but I don't see it as a blocker since the initial release is a guide vs. a full implementation. More importantly, I'd recommend we let the customer know how the IMEX channels work, so they understand that in single-channel mode there is no protection across users on the same node.

Contributor Author:

Just to be sure: I wasn't planning on making changes for IMEX channels in the next iteration either, as that would be more of a customization w.r.t. DirectoryServices.

codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.47%. Comparing base (6127e18) to head (af90597).
⚠️ Report is 13 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2996      +/-   ##
===========================================
- Coverage    75.50%   75.47%   -0.04%     
===========================================
  Files           23       23              
  Lines         2356     2357       +1     
===========================================
  Hits          1779     1779              
- Misses         577      578       +1     
Flag       Coverage Δ
unittests  75.47% <ø> (-0.04%) ⬇️


if get_nvswitch_count(get_device_ids['gb200']) > 1
# For each Compute Resource, we generate a unique NVIDIA IMEX configuration file,
# if one doesn't already exist in a common, shared location.
template nvidia_imex_nodes_conf_file do
Contributor:

The overall design assumes the imex nodes config file is shared in the cluster.
This is helpful to centralize the orchestration from the head node and also to simplify troubleshooting.
However, it would be a blocker for the per-job deployment type (https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#per-job-wide). Avoiding blockers for such a deployment would be beneficial not only for our users, but also for us, because it would allow us to vend the automated configuration of imex following the NVIDIA example SLURM Scheduler Integration.

Contributor Author:

Let's have an offline discussion, as there would be blockers on the job-wide deployment model.

Contributor Author:

As discussed, we will keep the existing changes, and we can later adjust the naming convention or whatever is easier for the design of the Custom Actions we recommend.
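
For reference, a per-job nodes config (as in the per-job deployment model discussed above) could be generated roughly like this from a Slurm prolog. This is a hypothetical sketch only, not part of this PR; the exact environment variables, paths, and naming depend on the Slurm setup.

#!/usr/bin/env ruby
# Sketch only: resolve the job's node list to IP addresses and write a
# job-specific IMEX nodes config file. Paths and naming are illustrative.
require 'resolv'

hostnames = `scontrol show hostnames #{ENV['SLURM_JOB_NODELIST']}`.split
ips = hostnames.map { |host| Resolv.getaddress(host) }
File.write("/opt/parallelcluster/shared/nvidia-imex/nodes_config_job_#{ENV['SLURM_JOB_ID']}.cfg", ips.join("\n") + "\n")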

@himani2411 himani2411 force-pushed the nvidia-imex-install branch from fbee300 to c9c81cd Compare August 5, 2025 20:53
@himani2411 himani2411 enabled auto-merge (rebase) August 5, 2025 20:59
@himani2411 himani2411 disabled auto-merge August 5, 2025 20:59
@himani2411 himani2411 enabled auto-merge (squash) August 5, 2025 20:59
use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def imex_installed?
Contributor:

Not a blocker, but this function name is misleading.
IMEX is never installed on AL2. You need a function to determine whether or not imex should be installed, so I would rename this to install_imex? or skip_imex_installation?
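
A minimal sketch of the suggested rename (the predicate body is illustrative; the real guard may check other conditions as well):

# Sketch only: answer "should IMEX be installed on this platform?" rather than
# "is IMEX installed?". AL2 is excluded here, per the comment above.
def install_imex?
  !(platform?('amazon') && node['platform_version'].to_i == 2)
end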
