Add NVSwitch device ID for p5.48xlarge instance and test gpu_health_check for multi-gpu instances which require nvidia fabric manager to be enabled. (#2431)

Co-authored-by: Himani Deshpande <himanidp@amazon.com>
himani2411 and Himani Deshpande committed Aug 22, 2023
1 parent 5891b3a commit 328df5d
Showing 3 changed files with 10 additions and 4 deletions.
@@ -78,12 +78,14 @@ suites:
 - 'resource:package { "package_name": "dkms" }'
 - resource:build_tools
 - recipe:aws-parallelcluster-platform::nvidia_install
+# - resource:fabric_manager:configure # Needed for Multi-gpu instance like p5.48xlarge
 resource: gdrcopy:configure
 cluster:
 nvidia:
 enabled: true
 driver:
 instance_type: g4dn.2xlarge
+# instance_type: p5.48xlarge
 - name: intel_hpc
 run_list:
 - recipe[aws-parallelcluster-tests::setup]
@@ -63,8 +63,10 @@ def _nvidia_driver_version

 # Get number of nv switches
 def get_nvswitches
-  # NVSwitch device id is 10de:1af1
-  nvswitch_check = Mixlib::ShellOut.new("lspci -d 10de:1af1 | wc -l")
-  nvswitch_check.run_command
-  nvswitch_check.stdout.strip.to_i
+  # A100 (P4) and H100(P5) systems have NVSwitches
+  # NVSwitch device id is 10de:1af1 for P4 instance
+  # NVSwitch device id is 10de:22a3 for P5 instance
+  nvswitch_check_p4 = shell_out("lspci -d 10de:1af1 | wc -l")
+  nvswitch_check_p5 = shell_out("lspci -d 10de:22a3 | wc -l")
+  nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
 end
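
The change to `get_nvswitches` sums the `lspci` match counts for the two NVSwitch device IDs named in its comments. A minimal standalone sketch of the same counting idea follows, written to run outside Chef. It is not the cookbook's implementation: `NVSWITCH_DEVICE_IDS`, `lspci_output`, `count_nvswitches`, and the `lister:` parameter are hypothetical names introduced here, and `Open3` stands in for Chef's `shell_out` helper.

```ruby
require 'open3'

# Hypothetical lookup table; the IDs are the ones named in the diff's comments.
NVSWITCH_DEVICE_IDS = {
  'p4' => '10de:1af1',
  'p5' => '10de:22a3'
}.freeze

# Runs `lspci -d <vendor:device>` and returns its stdout.
# Returns an empty string if lspci is not installed or exits non-zero.
def lspci_output(device_id)
  stdout, _stderr, status = Open3.capture3('lspci', '-d', device_id)
  status.success? ? stdout : ''
rescue Errno::ENOENT
  ''
end

# Counts matching PCI devices across all given IDs.
# `lister` is injectable so the count can be exercised without real hardware.
def count_nvswitches(device_ids = NVSWITCH_DEVICE_IDS.values, lister: method(:lspci_output))
  device_ids.sum do |id|
    # One non-empty lspci line per matching device, mirroring `| wc -l`.
    lister.call(id).lines.count { |line| !line.strip.empty? }
  end
end
```

Calling `count_nvswitches` with no arguments queries both IDs on the local machine. Keeping the IDs in one table means a further NVSwitch generation would need only a new entry instead of another `shell_out` line, though whether that suits the cookbook is a design choice left to its maintainers.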
2 changes: 2 additions & 0 deletions cookbooks/aws-parallelcluster-slurm/kitchen.slurm-config.yml
@@ -84,10 +84,12 @@ suites:
 - /gpu_health_check_execution/
 driver:
 instance_type: g4dn.xlarge
+# instance_type: p5.48xlarge
 attributes:
 dependencies:
 - recipe:aws-parallelcluster-slurm::mock_slurm
 - resource:node_attributes
+# - resource:fabric_manager:configure # Needed for Multi-gpu instance like p5.48xlarge
 cluster:
 node_type: HeadNode
 scheduler: 'slurm'