[Gb200] Support IMEX configuration to be local to a node #3029
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,21 +19,51 @@ | |
return unless nvidia_enabled_or_installed? | ||
return if on_docker? || imex_installed? || aws_region.start_with?("us-iso") | ||
|
||
directory node['cluster']['nvidia']['imex']['shared_dir'] | ||
|
||
action_install_imex | ||
|
||
# Create Imex configuration files | ||
action_create_configuration_files | ||
# Save Imex version in Node Attributes for InSpec Tests | ||
node.default['cluster']['nvidia']['imex']['version'] = nvidia_imex_full_version | ||
node.default['cluster']['nvidia']['imex']['package'] = nvidia_imex_package | ||
node_attributes 'dump node attributes' | ||
end | ||
|
||
action :create_configuration_files do | ||
# We create or update IMEX configuration files if ParallelCluster is installing IMEX | ||
[minor] This comment is correct, but it makes more sense to write it where
It's "create or update" because that is how the Chef template resource defines the create action: https://docs.chef.io/resources/template/
|
||
template nvidia_imex_nodes_conf_file do | ||
source 'nvidia-imex/nvidia-imex-nodes.erb' | ||
owner 'root' | ||
group 'root' | ||
mode '0755' | ||
action :create | ||
end | ||
|
||
template nvidia_imex_main_conf_file do | ||
source 'nvidia-imex/nvidia-imex-config.erb' | ||
owner 'root' | ||
group 'root' | ||
mode '0755' | ||
action :create | ||
variables(imex_nodes_config_file_path: nvidia_imex_nodes_conf_file) | ||
end | ||
|
||
# We keep nvidia-imex.service file in this location to give precedence to pcluster configured service file. | ||
template "/etc/systemd/system/#{nvidia_imex_service}.service" do | ||
source 'nvidia-imex/nvidia-imex.service.erb' | ||
owner 'root' | ||
group 'root' | ||
mode '0644' | ||
action :create | ||
variables(imex_main_config_file_path: nvidia_imex_main_conf_file) | ||
end | ||
end | ||
|
||
action :configure do | ||
return unless imex_installed? && node['cluster']['node_type'] == "ComputeFleet" | ||
# Start nvidia-imex on p6e-gb200 and only on ComputeFleet | ||
if is_gb200_node? || enable_force_configuration? | ||
# For each Compute Resource, we generate a unique NVIDIA IMEX configuration file, | ||
# if one doesn't already exist in a common, shared location. | ||
# Create the file if this is missing otherwise Imex service will not start | ||
template nvidia_imex_nodes_conf_file do | ||
source 'nvidia-imex/nvidia-imex-nodes.erb' | ||
owner 'root' | ||
|
@@ -42,24 +72,6 @@ | |
action :create_if_missing | ||
end | ||
|
||
template nvidia_imex_main_conf_file do | ||
source 'nvidia-imex/nvidia-imex-config.erb' | ||
owner 'root' | ||
group 'root' | ||
mode '0755' | ||
action :create_if_missing | ||
variables(imex_nodes_config_file_path: nvidia_imex_nodes_conf_file) | ||
end | ||
|
||
template "/etc/systemd/system/#{nvidia_imex_service}.service" do | ||
source 'nvidia-imex/nvidia-imex.service.erb' | ||
owner 'root' | ||
group 'root' | ||
mode '0644' | ||
action :create | ||
variables(imex_main_config_file_path: nvidia_imex_main_conf_file) | ||
end | ||
|
||
service nvidia_imex_service do | ||
action %i(enable start) | ||
supports status: true | ||
|
@@ -92,11 +104,11 @@ def nvidia_enabled_or_installed? | |
end | ||
|
||
def nvidia_imex_main_conf_file | ||
"#{node['cluster']['nvidia']['imex']['shared_dir']}/config_#{node['cluster']['scheduler_queue_name']}_#{node['cluster']['scheduler_compute_resource_name']}.cfg" | ||
"/etc/nvidia-imex/config.cfg" | ||
end | ||
|
||
def nvidia_imex_nodes_conf_file | ||
"#{node['cluster']['nvidia']['imex']['shared_dir']}/nodes_config_#{node['cluster']['scheduler_queue_name']}_#{node['cluster']['scheduler_compute_resource_name']}.cfg" | ||
"/etc/nvidia-imex/nodes_config.cfg" | ||
end | ||
|
||
def enable_force_configuration? | ||
|
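The path change in the two helpers above can be sketched in plain Ruby, outside Chef. This is only an illustration: the `node` hash, its values (`/opt/shared/imex`, `queue1`, `cr1`), and the method names `shared_main_conf_file` and `local_main_conf_file` are hypothetical. Only the two path patterns come from the diff, and reading the shared-directory form as the old one and the `/etc/nvidia-imex` form as the new one is an inference from the PR title.

```ruby
# Hypothetical stand-in for Chef node attributes; all values are made up for illustration.
node = {
  'cluster' => {
    'nvidia' => { 'imex' => { 'shared_dir' => '/opt/shared/imex' } },
    'scheduler_queue_name' => 'queue1',
    'scheduler_compute_resource_name' => 'cr1',
  },
}

# Per-compute-resource path under the shared directory (pattern taken from the diff).
def shared_main_conf_file(node)
  cluster = node['cluster']
  "#{cluster['nvidia']['imex']['shared_dir']}/config_" \
    "#{cluster['scheduler_queue_name']}_#{cluster['scheduler_compute_resource_name']}.cfg"
end

# Node-local path (literal taken from the diff); it is the same on every node.
def local_main_conf_file
  '/etc/nvidia-imex/config.cfg'
end

puts shared_main_conf_file(node)
puts local_main_conf_file
```

With a node-local path, the file name no longer needs the queue and compute resource names to stay unique, since each node reads only its own copy.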
[minor] What about calling the action install_configuration_files? I think it would be good practice to name actions that are part of the install phase install_SOMETHING and actions called in the configuration phase configure_SOMETHING.
I prefer keeping the action name as creation, since the action of the template is create. Even though we use it in the install phase of our recipes, we are creating these configuration files.
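
The diff uses two template actions, `create` and `create_if_missing`. As a rough plain-Ruby approximation of how they differ (this is not Chef's implementation; the method names, the file name, and the sample contents are hypothetical), `create` writes the file when it is missing or its content differs, while `create_if_missing` leaves an existing file untouched:

```ruby
require 'tmpdir'

# Approximates the template :create action: write when missing or when content differs.
# Returns true if the file was written, false if it was already up to date.
def create(path, content)
  return false if File.exist?(path) && File.read(path) == content
  File.write(path, content)
  true
end

# Approximates :create_if_missing: write only when the file does not exist yet.
def create_if_missing(path, content)
  return false if File.exist?(path)
  File.write(path, content)
  true
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'nodes_config.cfg')
  create(path, "10.0.0.1\n")             # file is created
  create_if_missing(path, "10.0.0.2\n")  # no-op: the file already exists
  puts File.read(path)                   # still the first content
  create(path, "10.0.0.2\n")             # content differs, so the file is updated
  puts File.read(path)
end
```

This is why the comment in the action says "create or update": running it again with changed template output rewrites the file, whereas `create_if_missing` in the configure action only fills in a file that is absent.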