[Nvidia-Imex] Installing Nvidia-imex as part of ParallelCluster Build Image #2996
Conversation
Force-pushed from 4a625ee to 3470a9b
...-parallelcluster-platform/resources/fabric_manager/partial/_fabric_manager_install_debian.rb
def fabric_manager_url
  "#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
  # "#{node['cluster']['artifacts_s3_url']}/dependencies/nvidia_fabric/#{platform}/#{fabric_manager_package}-#{fabric_manager_version}-1.#{arch_suffix}.rpm"
Same as #2996 (comment). Added only as part of testing and will be removed in this PR.
cookbooks/aws-parallelcluster-platform/resources/nvidia_driver/partial/_nvidia_driver_common.rb
cookbooks/aws-parallelcluster-platform/recipes/install/nvidia_install.rb
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/nvidia_imex_alinux2023.rb
  'nvidia-imex'
end

def nvidia_imex_full_version
Misleading function name: the function is expected to return the full IMEX version, but it only contains the IMEX version suffix, since the full IMEX version is made of ${nvidia_driver_major_version}-${nvidia_driver_version}-1.
The full version is not necessarily the one you refer to. When we list installed packages with apt/dnf, we see "#{node['cluster']['nvidia']['driver_version']}-1", which is why I set node.default['cluster']['nvidia']['imex']['version'] as a node attribute that I use in the InSpec tests. But the package naming convention requires me to add ${nvidia_driver_major_version}-${nvidia_driver_version}-1 during installation, depending on the platform, so I install using the full package name with the exact version so there is no mismatch.
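The version-string distinction discussed above could be sketched like this (a minimal illustration; the driver version and variable names are hypothetical and not the cookbook's actual API):

```ruby
# Illustrative only: the attribute stores the short version seen in
# `dnf list installed`, while installation needs the longer string
# embedded in the package name.
driver_version = '560.35.03'                                   # hypothetical value
driver_major   = driver_version.split('.').first

imex_attribute_version = "#{driver_version}-1"                 # what apt/dnf reports
imex_full_version      = "#{driver_major}-#{driver_version}-1" # what the package name needs

puts imex_attribute_version  # => 560.35.03-1
puts imex_full_version       # => 560-560.35.03-1
```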
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/partial/_nvidia_imex_common.rb
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/partial/_nvidia_imex_common.rb
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/partial/_nvidia_imex_debian.rb
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/partial/_nvidia_imex_rhel.rb
Type=forking
TimeoutStartSec=infinity

ExecStart=/usr/bin/nvidia-imex -c <%= node['cluster']['nvidia']['imex']['shared_dir'] %>/config.cfg
We reference the path to the nvidia-imex binary and config file in different places. What about injecting the variables directly?
ExecStart=<%= node['cluster']['nvidia']['imex']['binary'] %> -c <%= node['cluster']['nvidia']['imex']['config'] %>
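The suggestion above can be sketched with Ruby's stdlib ERB; the attribute names mirror the ones proposed in the thread, and the binary/config paths are illustrative placeholders, not values the cookbook defines:

```ruby
require 'erb'

# Illustrative node attributes mirroring the ones proposed above.
node = { 'cluster' => { 'nvidia' => { 'imex' => {
  'binary' => '/usr/bin/nvidia-imex',
  'config' => '/opt/parallelcluster/shared/nvidia-imex/config.cfg',
} } } }

template = "ExecStart=<%= node['cluster']['nvidia']['imex']['binary'] %> " \
           "-c <%= node['cluster']['nvidia']['imex']['config'] %>"

# Render the ERB template against the local binding (which sees `node`).
rendered = ERB.new(template).result(binding)
puts rendered
# ExecStart=/usr/bin/nvidia-imex -c /opt/parallelcluster/shared/nvidia-imex/config.cfg
```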
So you want a variable for the binary? Why? We don't decide this path, and I don't see the point of adding an attribute if it's not something I use anywhere else.
@@ -0,0 +1,26 @@
[Unit]
If this file is taken directly from the nvidia-imex documentation, then link to the official doc describing it. That way, if we face issues with the service definition at some point, we have a quick reference to the values suggested by NVIDIA and can quickly spot the changes we made on top of them.
I picked the file up from an instance on which I had downloaded it. I will add a link to the official doc in a comment.
LimitCORE=infinity

Restart=on-failure
Why restart on failure rather than restarting always? One advantage of restarting always is that we capture every termination scenario.
Not sure I understand why we want to keep it always, especially when this should not be enabled or running on all instance types.
On those instance types where it should not run, the service is disabled, so the restart policy is not even taken into account.
The advantage of always compared to on-failure is that the service gets restarted regardless of the exit code. This is just a safety net.
However, since you took the unit definition from NVIDIA's recommendation, they probably have good reasons to prefer on-failure over always.
Let's go with on-failure.
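For reference, the semantic difference being weighed above, as a minimal unit-fragment sketch (not the PR's actual file):

```ini
[Service]
# on-failure: restart only on unclean exit (non-zero exit code, fatal
# signal, timeout, or watchdog) -- the NVIDIA-recommended setting.
Restart=on-failure
# always: would additionally restart after a clean (exit code 0) stop,
# the "safety net" behavior discussed above.
# Restart=always
```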
Requires=network-online.target

[Service]
Environment="KRB5_CLIENT_KTNAME=/etc/krb5.keytab"
Why do we need Kerberos configuration? What if the user has configured Kerberos to point to a different principal file, overriding default_keytab_name in /etc/krb5.conf?
I kept it as is because of the installation on the instance I was working with. But I am going to remove it, as the official doc does not keep it: https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/gettingstarted.html#on-linux-based-systems
@@ -0,0 +1,3 @@
## Please replace below fake IP's
172.31.51.93
How did you come up with these values? If they are placeholders just to get IMEX started, what about using 0.0.0.0? See https://www.techtarget.com/searchnetworking/answer/What-is-the-IP-address-0000-used-for
I can keep it, and was initially going to do so. This is what I got from the DLAMI that I used for testing.
cookbooks/aws-parallelcluster-platform/templates/nvidia-imex/nvidia-imex-nodes.erb
# 3 - Set log level to WARNING and above
# 4 - Set log level to INFO and above
# Default Value: 4
LOG_LEVEL=4
Here and for other parameters: why do we need to override the parameter with its default value? What about commenting it out the same way we did for the other params?
I keep the file as it is generated by the nvidia-imex installation. Even where the default value is not strictly necessary, the installation itself keeps it, so I followed the same convention.
# to console(stderr). If the specified log file can't be opened or the
# path is empty.
# Default Value: /var/log/nvidia-imex.log
# LOG_FILE_NAME=/var/log/nvidia-imex.log
Does it make sense to push nvidia-imex logs to CloudWatch and configure log rotation?
We can; I wanted to make minimal changes in this PR, which is why I have redirected it to syslog.
I will keep a backlog item to keep them separated and move them to CloudWatch.
# Possible Values:
# Full path/filename string (max length of 256).
# Default Value: /etc/nvidia-imex/nodes_config.cfg
IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['shared_dir'] %>/nodes_config.cfg
Mentioned also in another comment. What about injecting directly the attribute containing the path to the config file rather than redefining it?
I mean:
IMEX_NODE_CONFIG_FILE=<%= node['cluster']['nvidia']['imex']['config'] %>
I have kept it specific to each node and don't see the point of making another attribute. But if we later find that it makes the implementation of the recommended custom actions easier, we can change it in the next iteration.
# 0: Disable encryption and authentication
# 1: Enable encryption and authentication
# Default value: 0
IMEX_ENABLE_AUTH_ENCRYPTION=0
Why not enable encryption?
I kept the file as it is when generated by the nvidia-imex installation. Also, I would need to set up other values to enable auth, either through SSL/TLS or other options like SOURCE, TARGET_OVERRIDE, etc., as per https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html. I think this can be done in the next iteration; let's make the minimal changes for now.
@mjkoop Do you have any recommendations on this?
I'd recommend using encryption in the future, but I don't see it as a blocker, since the initial release is a guide rather than a full implementation. More importantly, I'd recommend we let the customer know how the IMEX channels work, so they understand that in single-channel mode there is no protection across users on the same node.
Just to be sure: I wasn't planning on making changes for IMEX channels in the next iteration either, as that would be more of a customization w.r.t. DirectoryServices.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           develop    #2996   +/-  ##
===========================================
- Coverage    75.50%   75.47%   -0.04%
===========================================
  Files           23       23
  Lines         2356     2357       +1
===========================================
  Hits          1779     1779
- Misses         577      578       +1
cookbooks/aws-parallelcluster-platform/resources/nvidia_imex/partial/_nvidia_imex_rhel.rb
if get_nvswitch_count(get_device_ids['gb200']) > 1
  # For each Compute Resource, we generate a unique NVIDIA IMEX configuration file,
  # if one doesn't already exist in a common, shared location.
  template nvidia_imex_nodes_conf_file do
The overall design assumes the IMEX nodes config file is shared in the cluster.
This helps centralize orchestration from the head node and also simplifies troubleshooting.
However, it would be a blocker for the per-job deployment type (https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#per-job-wide). Avoiding blockers for that deployment type would be beneficial not only for our users but also for us, because it would allow us to vend the automated configuration of IMEX following the NVIDIA example of SLURM scheduler integration.
Let's have an offline discussion, as there would be blockers on the job-wide deployment model.
As discussed, we will keep the existing changes, and we can later make changes to the naming convention or whatever is easiest for the design of the custom actions we recommend.
…he version and package name as seen on the repositories
Force-pushed from fbee300 to c9c81cd
use 'partial/_nvidia_imex_common.rb'
use 'partial/_nvidia_imex_rhel.rb'

def imex_installed?
Not a blocker, but this function name is misleading.
IMEX is never installed on AL2. You need a function to determine whether or not IMEX should be installed, so I would rename this to install_imex? or skip_imex_installation?
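The rename suggested above could look like this (an illustrative sketch; the platform check is hypothetical and not the cookbook's actual logic):

```ruby
# Illustrative predicate: IMEX is never installed on Amazon Linux 2,
# so the helper answers "should we skip installation?" rather than
# "is IMEX installed?".
def skip_imex_installation?(platform)
  platform == 'alinux2'
end

puts skip_imex_installation?('alinux2')     # => true
puts skip_imex_installation?('alinux2023')  # => false
```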
Description of changes
- Create /opt/parallelcluster/shared/nvidia-imex and /opt/parallelcluster/shared/nvidia-imex/nodes_config_<LaunchTemplateID>.cfg containing fake IP addresses, so that the nvidia-imex service can be started and enabled during the configure phase of cluster creation.
- We use the launch template ID (e.g. lt-123456789012) as the suffix to adhere to the max length of 256 characters.
Tests
- Tested by hardcoding the NVIDIA driver and Fabric Manager versions for the build image, as nvidia-imex was not available for the older version.
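The per-launch-template naming scheme described above can be sketched as follows (a minimal illustration; the shared directory matches the path mentioned in the PR, while the helper structure is hypothetical):

```ruby
# Illustrative: build the shared nodes-config path from a launch template ID,
# so each Compute Resource gets its own config file under the shared directory.
shared_dir         = '/opt/parallelcluster/shared/nvidia-imex'
launch_template_id = 'lt-123456789012'

nodes_config = File.join(shared_dir, "nodes_config_#{launch_template_id}.cfg")
puts nodes_config
# /opt/parallelcluster/shared/nvidia-imex/nodes_config_lt-123456789012.cfg
```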
References

Checklist
- If the PR targets a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
- Please review the guidelines for contributing and the Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.