14 changes: 8 additions & 6 deletions CHANGELOG.md
@@ -17,16 +17,18 @@ This file is used to list changes made in each version of the AWS ParallelCluste
**CHANGES**
- Assign Slurm dynamic nodes a priority (weight) of 1000 by default. This allows Slurm to prioritize idle static nodes over idle dynamic ones.
- Create a Slurm partition-nodelist mapping JSON file to be used by the node package daemons to recognize PC-managed Slurm partitions and nodelists.
- Upgrade NVIDIA driver to version 470.199.02.
- Upgrade NVIDIA driver to version 535.54.03.
- Upgrade CUDA library to version 12.2.0.
- Upgrade NVIDIA Fabric manager to `nvidia-fabricmanager-535`
- Increase EFS-utils watchdog poll interval to 10 seconds. Note: This change is meaningful only if [EncryptionInTransit](https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html#yaml-SharedStorage-EfsSettings-EncryptionInTransit) is set to `true`, because watchdog does not run otherwise.
- Upgrade EFA installer to `1.24.0`
- Efa-driver: `efa-2.4.1-1`
- Upgrade EFA installer to `1.25.0`
- Efa-driver: `efa-2.5.0-1`
- Efa-config: `efa-config-1.15-1`
- Efa-profile: `efa-profile-1.5-1`
- Libfabric-aws: `libfabric-aws-1.18.0-1`
- Libfabric-aws: `libfabric-aws-1.18.1-1`
- Rdma-core: `rdma-core-46.0-1`
- Open MPI: `openmpi40-aws-4.1.5-1`
- Upgrade Slurm to version 23.02.3.
- Open MPI: `openmpi40-aws-4.1.5-3`
- Upgrade Slurm to version 23.02.4.
- Upgrade ARM PL to version 23.04.1 for Ubuntu 22.04 only.

**BUG FIXES**
@@ -17,8 +17,8 @@
# EFA setup: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html
#

property :efa_version, String, default: '1.24.0'
property :efa_checksum, String, default: '878623f819a0d9099d76ecd41cf4f569d4c3aac0c9bb7ba9536347c50b6bf88e'
property :efa_version, String, default: '1.25.0'
property :efa_checksum, String, default: '98b7b26ce031a2d6a93de2297cc71b03af647194866369ca53b60d82d45ad342'

action :setup do
if efa_installed? && !::File.exist?(efa_tarball)
@@ -29,11 +29,11 @@
action_install_nfs
action_install_nfs4
action_disable_start_at_boot
node.default['nfs']['config']['server_template'] = '/etc/nfs.conf.d/parallelcluster-nfs.conf'
end

action_class do
def override_server_template
node.default['nfs']['config']['server_template'] = '/etc/nfs.conf.d/parallelcluster-nfs.conf'
edit_resource(:template, node['nfs']['config']['server_template']) do
source 'nfs/nfs-ubuntu22+.conf.erb'
cookbook 'aws-parallelcluster-environment'
@@ -24,6 +24,8 @@
service node['nfs']['service']['server'] do
action %i(restart enable)
supports restart: true
retries 5
retry_delay 10
end unless on_docker?
else
service node['nfs']['service']['server'] do
@@ -2,8 +2,8 @@

# parallelcluster default source dir defined in attributes
source_dir = '/opt/parallelcluster/sources'
efa_version = '1.24.0'
efa_checksum = '878623f819a0d9099d76ecd41cf4f569d4c3aac0c9bb7ba9536347c50b6bf88e'
efa_version = '1.25.0'
efa_checksum = '98b7b26ce031a2d6a93de2297cc71b03af647194866369ca53b60d82d45ad342'

class ConvergeEfa
def self.setup(chef_run)
@@ -11,7 +11,7 @@

# NVidia
default['cluster']['nvidia']['enabled'] = 'no'
default['cluster']['nvidia']['driver_version'] = '470.199.02'
default['cluster']['nvidia']['driver_version'] = '535.54.03'

# DCV
default['cluster']['dcv']['authenticator']['user'] = "dcvextauth"
@@ -112,7 +112,7 @@ main() {
os=$(< /etc/chef/dna.json jq -r .cluster.base_os)
_log "Input parameters: user: ${user}, OS: ${os}, shared_folder_path: ${shared_folder_path}."

if ! [[ "${os}" =~ ^(alinux2|ubuntu2004|centos[7-8]|rhel8)$ ]]; then
if ! [[ "${os}" =~ ^(alinux2|ubuntu2004|ubuntu2204|centos[7-8]|rhel8)$ ]]; then
_fail "OS not supported."
fi

@@ -308,4 +308,4 @@ suites:
- recipe[aws-parallelcluster-platform::users]
verifier:
controls:
- /tag:install_users/
- /tag:install_users/
@@ -19,13 +19,13 @@

# Cuda installer from https://developer.nvidia.com/cuda-toolkit-archive
# Cuda installer naming: cuda_11.8.0_520.61.05_linux
cuda_version = '11.8'
cuda_version = '12.2'
cuda_patch = '0'
cuda_complete_version = "#{cuda_version}.#{cuda_patch}"
cuda_version_suffix = '520.61.05'
cuda_version_suffix = '535.54.03'
cuda_arch = arm_instance? ? 'linux_sbsa' : 'linux'
cuda_url = "https://developer.download.nvidia.com/compute/cuda/#{cuda_complete_version}/local_installers/cuda_#{cuda_complete_version}_#{cuda_version_suffix}_#{cuda_arch}.run"
cuda_samples_version = '11.8'
cuda_samples_version = '12.2'
cuda_samples_url = "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v#{cuda_samples_version}.tar.gz"
tmp_cuda_run = '/tmp/cuda.run'
tmp_cuda_sample_archive = '/tmp/cuda-sample.tar.gz'
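
The installer URL in this hunk is assembled from the CUDA version, the patch level, the driver suffix, and the architecture. A minimal Ruby sketch of that composition, for illustration only: `cuda_installer_url` is a hypothetical helper name that does not appear in the cookbook, and its `arm` flag is an assumed stand-in for the cookbook's `arm_instance?` check.

```ruby
# Hypothetical helper (not part of the cookbook): mirrors how the recipe
# composes the CUDA runfile URL from its version components.
def cuda_installer_url(version:, patch:, suffix:, arm: false)
  complete_version = "#{version}.#{patch}"
  arch = arm ? 'linux_sbsa' : 'linux' # assumed stand-in for arm_instance?
  'https://developer.download.nvidia.com/compute/cuda/' \
    "#{complete_version}/local_installers/cuda_#{complete_version}_#{suffix}_#{arch}.run"
end

# Values taken from the new side of this diff.
puts cuda_installer_url(version: '12.2', patch: '0', suffix: '535.54.03')
puts cuda_installer_url(version: '12.2', patch: '0', suffix: '535.54.03', arm: true)
```

Because the driver suffix is part of the file name, bumping `cuda_version` without also bumping `cuda_version_suffix` would produce a URL that does not match the installer naming scheme noted in the comment above.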
@@ -39,6 +39,8 @@
supports restart: false
reload_command chrony_reload_command
action %i(enable start)
retries 5
retry_delay 10
end unless redhat_on_docker?
end
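
Several hunks in this PR add `retries 5` and `retry_delay 10` to service resources. As a rough illustration of what the two properties mean (a failed action is attempted again up to `retries` more times, waiting `retry_delay` seconds between attempts), here is a plain-Ruby sketch. `with_retries` and its `sleeper` parameter are hypothetical names written for this example; this is not Chef's implementation.

```ruby
# Hypothetical helper, not Chef's code: run the block, and on failure try
# again up to `retries` more times, waiting `retry_delay` seconds in between.
def with_retries(retries:, retry_delay:, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    raise if attempts > retries # the initial try plus all retries have failed
    sleeper.call(retry_delay)
    retry
  end
end

# Example: a fake service start that fails twice before succeeding.
# The sleeper is stubbed so the example runs without actually waiting.
waits = []
result = with_retries(retries: 5, retry_delay: 10, sleeper: ->(s) { waits << s }) do |attempt|
  raise 'service not ready' if attempt < 3
  "started on attempt #{attempt}"
end
puts result
```

Under this reading, `retries 5` with `retry_delay 10` tolerates a service that needs up to roughly 50 seconds to become startable before the Chef run fails.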

@@ -20,7 +20,7 @@
use 'partial/_fabric_manager_install_debian.rb'

def fabric_manager_package
'nvidia-fabricmanager-470'
'nvidia-fabricmanager-535'
end

def fabric_manager_version
@@ -20,3 +20,7 @@ def set_compiler?
# Amazon linux 2 with Kernel 5 need to set CC to /usr/bin/gcc10-gcc using dkms override
node['kernel']['release'].split('.')[0].to_i == 5
end

def compiler_version
'CC=/usr/bin/gcc10-gcc'
end
@@ -61,13 +61,14 @@
end

# Install driver
# TODO remove --no-cc-version-check when we can update ubuntu 22 images
bash 'nvidia.run advanced' do
user 'root'
group 'root'
cwd '/tmp'
code <<-NVIDIA
set -e
./nvidia.run --silent --dkms --disable-nouveau
#{compiler_version} ./nvidia.run --silent --dkms --disable-nouveau --no-cc-version-check
rm -f /tmp/nvidia.run
NVIDIA
creates '/usr/bin/nvidia-smi'
@@ -102,3 +103,7 @@ def rebuild_initramfs?
def set_compiler?
false
end

def compiler_version
""
end
@@ -1,10 +1,10 @@
require 'spec_helper'

describe 'aws-parallelcluster-platform::cuda' do
cached(:cuda_version) { '11.8' }
cached(:cuda_version) { '12.2' }
cached(:cuda_patch) { '0' }
cached(:cuda_complete_version) { "#{cuda_version}.#{cuda_patch}" }
cached(:cuda_version_suffix) { '520.61.05' }
cached(:cuda_version_suffix) { '535.54.03' }

context 'when nvidia not enabled' do
cached(:chef_run) do
@@ -20,7 +20,7 @@
context 'when on arm' do
cached(:cuda_arch) { 'linux_sbsa' }
cached(:cuda_url) { "https://developer.download.nvidia.com/compute/cuda/#{cuda_complete_version}/local_installers/cuda_#{cuda_complete_version}_#{cuda_version_suffix}_#{cuda_arch}.run" }
cached(:cuda_samples_version) { '11.8' }
cached(:cuda_samples_version) { '12.2' }
cached(:cuda_samples_url) { "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v#{cuda_samples_version}.tar.gz" }

cached(:chef_run) do
@@ -167,7 +167,7 @@ def self.configure(chef_run)

for_all_oses do |platform, version|
context "on #{platform}#{version}" do
cached(:fabric_manager_package) { platform == 'ubuntu' ? 'nvidia-fabricmanager-470' : 'nvidia-fabric-manager' }
cached(:fabric_manager_package) { platform == 'ubuntu' ? 'nvidia-fabricmanager-535' : 'nvidia-fabric-manager' }
cached(:fabric_manager_version) { platform == 'ubuntu' ? "#{nvidia_driver_version}*" : nvidia_driver_version }

context 'when fabric manager is to install' do
@@ -218,7 +218,7 @@ def self.configure(chef_run)

for_all_oses do |platform, version|
context "on #{platform}#{version}" do
cached(:fabric_manager_package) { platform == 'ubuntu' ? 'nvidia-fabricmanager-470' : 'nvidia-fabric-manager' }
cached(:fabric_manager_package) { platform == 'ubuntu' ? 'nvidia-fabricmanager-535' : 'nvidia-fabric-manager' }
cached(:fabric_manager_version) { platform == 'ubuntu' ? "#{nvidia_driver_version}*" : nvidia_driver_version }

context('when nvswithes are > 1') do
@@ -203,22 +203,32 @@ def self.setup(chef_run, nvidia_driver_version: nil)
mode: '0644'
)
end
it 'installs nvidia driver' do
is_expected.to run_bash('nvidia.run advanced')
.with(
user: 'root',
group: 'root',
cwd: '/tmp',
creates: '/usr/bin/nvidia-smi'
)
.with_code(%r{CC=/usr/bin/gcc10-gcc ./nvidia.run --silent --dkms --disable-nouveau --no-cc-version-check})
.with_code(%r{rm -f /tmp/nvidia.run})
end
else
it "doesn't install gcc10" do
is_expected.not_to install_package('gcc10')
end
end

it 'installs nvidia driver' do
is_expected.to run_bash('nvidia.run advanced')
.with(
user: 'root',
group: 'root',
cwd: '/tmp',
creates: '/usr/bin/nvidia-smi'
)
.with_code(%r{./nvidia.run --silent --dkms --disable-nouveau})
.with_code(%r{rm -f /tmp/nvidia.run})
it 'installs nvidia driver' do
is_expected.to run_bash('nvidia.run advanced')
.with(
user: 'root',
group: 'root',
cwd: '/tmp',
creates: '/usr/bin/nvidia-smi'
)
.with_code(%r{./nvidia.run --silent --dkms --disable-nouveau --no-cc-version-check})
.with_code(%r{rm -f /tmp/nvidia.run})
end
end

if platform == 'ubuntu'
@@ -44,5 +44,5 @@ puts stderr "At compile time add '-I<armpl_include>' and at link time"

# EULA
if [ module-info mode load ] {
puts stderr "Use of the free of charge version of Arm Performance Libraries is subject to the terms and conditions of the Arm Performance Libraries (free version) - End User License Agreement (EULA). A copy of the EULA can be found in the '$root/arm-performance-libraries_${major_minor_version}_gcc-${gcc_version}/license_terms' folder"
puts stderr "Use of the free of charge version of Arm Performance Libraries is subject to the terms and conditions of the Arm Performance Libraries (free version) - End User License Agreement (EULA). A copy of the EULA can be found in the '<%= @armpl_license_dir %>' folder"
}
@@ -37,7 +37,7 @@
control 'tag:config_ulimit_is_not_lower_than_8192' do
only_if { !instance.custom_ami? }

describe bash("ulimit -Sn") do
describe bash("sudo -u #{user} bash -c 'ulimit -Sn'") do
its('stdout') { should cmp >= '8192' }
end
end
6 changes: 3 additions & 3 deletions cookbooks/aws-parallelcluster-shared/attributes/versions.rb
@@ -2,7 +2,7 @@
default['cluster']['python-version'] = '3.9.16'

# ParallelCluster versions
default['cluster']['parallelcluster-version'] = '3.7.0'
default['cluster']['parallelcluster-cookbook-version'] = '3.7.0'
default['cluster']['parallelcluster-node-version'] = '3.7.0'
default['cluster']['parallelcluster-version'] = '3.7.0b1'
default['cluster']['parallelcluster-cookbook-version'] = '3.7.0b1'
default['cluster']['parallelcluster-node-version'] = '3.7.0b1'
default['cluster']['parallelcluster-awsbatch-cli-version'] = '1.1.0'
@@ -3,7 +3,7 @@

# Slurm attributes shared between install_slurm and configure_slurm_accounting
default['cluster']['slurm']['commit'] = ''
default['cluster']['slurm']['sha256'] = 'c41747e4484011cf376d6d4bc73b6c4696cdc0f7db4f64174f111bb9f53fb603'
default['cluster']['slurm']['sha256'] = '7290143a71ce2797d0df3423f08396fd5c0ae4504749ff372d6860b2d6a3a1b0'
default['cluster']['slurm']['install_dir'] = '/opt/slurm'

default['cluster']['dns_domain'] = nil
@@ -1,4 +1,4 @@
# Slurm
default['cluster']['slurm']['version'] = '23-02-3-1'
default['cluster']['slurm']['version'] = '23-02-4-1'
# Munge
default['cluster']['munge']['munge_version'] = '0.5.15'
4 changes: 4 additions & 0 deletions cookbooks/aws-parallelcluster-slurm/libraries/helpers.rb
@@ -60,6 +60,8 @@ def enable_munge_service
service "munge" do
supports restart: true
action %i(enable start)
retries 5
retry_delay 10
end
end

@@ -111,6 +113,8 @@ def setup_munge_compute_node
# Enforce correct permission on the key
chmod 0600 /etc/munge/munge.key
COMPUTE_MUNGE_KEY
retries 5
retry_delay 10
end

enable_munge_service
@@ -16,7 +16,7 @@
if os.redhat?
mysql_packages.concat %w(mysql-community-client-plugins mysql-community-common
mysql-community-devel mysql-community-libs mysql-community-libs-compat)
elsif os_properties.ubuntu2004?
elsif os_properties.ubuntu2004? || os_properties.ubuntu2204?
mysql_packages.concat %w(libmysqlclient-dev libmysqlclient21)
else
describe "unsupported OS" do
12 changes: 6 additions & 6 deletions kitchen.ec2.yml
@@ -1,5 +1,5 @@
<%
pcluster_version = ENV['KITCHEN_PCLUSTER_VERSION'] || '3.7.0'
pcluster_version = ENV['KITCHEN_PCLUSTER_VERSION'] || '3.7.0b1'
pcluster_prefix = "aws-parallelcluster-#{pcluster_version}"
%>
---
@@ -89,7 +89,7 @@ platforms:
block_device_mappings:
- device_name: /dev/xvda
ebs:
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 35 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 40 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_type: gp2
delete_on_termination: true
<% %w(a b c d e f g h i j k l m n o p q r s t u v w x).each_with_index do | c, i | %>
@@ -115,7 +115,7 @@ platforms:
block_device_mappings:
- device_name: /dev/sda1
ebs:
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 35 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 40 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_type: gp2
delete_on_termination: true
<% %w(a b c d e f g h i j k l m n o p q r s t u v w x).each_with_index do | c, i | %>
@@ -141,7 +141,7 @@ platforms:
block_device_mappings:
- device_name: /dev/sda1
ebs:
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 35 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 40 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_type: gp2
delete_on_termination: true
<% %w(a b c d e f g h i j k l m n o p q r s t u v w x).each_with_index do | c, i | %>
@@ -167,7 +167,7 @@ platforms:
block_device_mappings:
- device_name: /dev/sda1
ebs:
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 35 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 40 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_type: gp2
delete_on_termination: true
<% %w(a b c d e f g h i j k l m n o p q r s t u v w x).each_with_index do | c, i | %>
@@ -193,7 +193,7 @@ platforms:
block_device_mappings:
- device_name: /dev/sda1
ebs:
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 35 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_size: <% if (ENV['KITCHEN_VOLUME_SIZE'] || '') == '' %> 40 <% else %> <%= ENV['KITCHEN_VOLUME_SIZE'] %> <% end %>
volume_type: gp2
delete_on_termination: true
<% %w(a b c d e f g h i j k l m n o p q r s t u v w x).each_with_index do | c, i | %>