The accelerator role allows users to set up the AMD ROCm platform or the NVIDIA CUDA toolkit. These tools are needed to run compute workloads on the installed GPUs.
Ensure that CUDA and ROCm local repositories are configured using the local_repo.yml script.
Enter all required parameters in input/accelerator_config.yml.
| Parameters | Details |
|---|---|
| cuda_toolkit_version | Required CUDA toolkit version. By default, the latest CUDA toolkit is installed unless cuda_toolkit_path is specified. Default: latest (11.8.0). |
| cuda_toolkit_path | If the latest CUDA toolkit is not required, provide an offline copy of the toolkit installer in the path specified. (Take an RPM copy of the toolkit from here). If cuda_toolkit_version is not latest, giving cuda_toolkit_path is mandatory. |
| | A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. |
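The dependency between the two CUDA parameters can be checked before running the playbook. The following is a minimal sketch, not part of Omnia: `check_cuda_config` is a hypothetical helper that takes the two values as arguments (it does not parse the YAML file) and applies the rule stated above literally.

```shell
# Hypothetical pre-flight check; not part of Omnia.
# $1: value of cuda_toolkit_version
# $2: value of cuda_toolkit_path (may be empty)
check_cuda_config() {
    version=$1
    path=$2
    # Rule from the table: a non-latest version requires an offline installer path.
    if [ "$version" != "latest" ] && [ -z "$path" ]; then
        echo "cuda_toolkit_path is mandatory when cuda_toolkit_version is '$version'" >&2
        return 1
    fi
    return 0
}
```

For example, `check_cuda_config latest ""` succeeds, while calling it with a specific version and an empty path returns a non-zero status and prints the reason to stderr.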
Note
* Nodes provisioned using the Omnia provision tool do not require a RedHat subscription to run accelerator.yml on RHEL target nodes.
* For RHEL target nodes not provisioned by Omnia, ensure that a RedHat subscription is enabled on every target node.
* AMD ROCm driver installation is not supported by Omnia on Rocky Linux cluster nodes.
To install all the latest GPU drivers and toolkits, run:
cd accelerator
ansible-playbook accelerator.yml -i inventory
The following configurations take place when running accelerator.yml:
- Servers with AMD GPUs are identified and the latest GPU drivers and ROCm platforms are downloaded and installed.
- Servers with NVIDIA GPUs are identified and the specified CUDA toolkit is downloaded and installed.
- On the rare servers with both NVIDIA and AMD GPUs installed, the AMD GPU drivers, the ROCm platform, and the CUDA toolkit are all installed.
- Servers with neither GPU are skipped.
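The four cases above can be illustrated with a small classifier over `lspci` output. This is only a sketch: how Omnia itself detects GPU vendors is not described here, and `classify_gpus` is a hypothetical helper. It assumes that GPUs appear in `lspci` under one of the device classes named in the filter, and that vendor strings contain "NVIDIA" or "Advanced Micro Devices" / "AMD/ATI".

```shell
# Hypothetical helper; not Omnia's actual detection logic.
# Reads lspci-style text on stdin and prints: amd, nvidia, both, or none.
classify_gpus() {
    # Keep only GPU-like device classes so that AMD CPUs and chipsets
    # (host bridges, etc.) are not counted, then lower-case for matching.
    gpu_lines=$(grep -Ei 'VGA compatible controller|3D controller|Display controller|Processing accelerators' \
        | tr '[:upper:]' '[:lower:]' || true)
    has_amd=0
    has_nvidia=0
    case "$gpu_lines" in *nvidia*) has_nvidia=1 ;; esac
    case "$gpu_lines" in *"advanced micro devices"*|*"amd/ati"*) has_amd=1 ;; esac
    case "$has_amd$has_nvidia" in
        11) echo "both" ;;    # ROCm and CUDA would both apply
        10) echo "amd" ;;     # ROCm only
        01) echo "nvidia" ;;  # CUDA only
        *)  echo "none" ;;    # server is skipped
    esac
}
```

On a target node this could be used as `lspci | classify_gpus`, provided `lspci` is available there.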
To add a user to the render and video groups, use the following command:
sudo usermod -a -G render,video <user>
Note
* <user> is the system name of the end user.
* This command must be run with root permissions.
* If the root user wants to provide access to other users and their individual GPU nodes, the previous command needs to be run on all of them separately.
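Because the command has to be repeated per user and per node, it can help to verify the result afterwards. The sketch below is an illustration only: `has_gpu_groups` is a hypothetical helper that checks a space-separated group list, such as the output of `id -nG`, for both groups.

```shell
# Hypothetical helper; not part of Omnia.
# $1: space-separated list of group names (for example, the output of `id -nG <user>`).
# Succeeds only if both render and video are present as whole words.
has_gpu_groups() {
    groups_list=" $1 "
    case "$groups_list" in
        *" render "*) ;;
        *) return 1 ;;
    esac
    case "$groups_list" in
        *" video "*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Usage, after running the usermod command (<user> as above):
#   has_gpu_groups "$(id -nG <user>)" && echo "GPU groups OK"
```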
To use ROCm tools, invoke them by their full path under the ROCm installation directory:
/opt/rocm/bin/<rocm command>
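As an alternative to typing the full path each time, the directory can be added to a user's PATH. This is a sketch that assumes ROCm is installed under /opt/rocm, as in the path above; where to place it (for example, a shell profile) is left to the administrator.

```shell
# Assumption: ROCm binaries live in /opt/rocm/bin, as in the path above.
ROCM_BIN=/opt/rocm/bin

# Append the directory only if it is not already on PATH.
case ":$PATH:" in
    *":$ROCM_BIN:"*) ;;
    *) PATH="$PATH:$ROCM_BIN" ;;
esac
export PATH
```

With this in effect, a tool such as `rocm-smi` can be called by name instead of `/opt/rocm/bin/rocm-smi`.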
For any configuration changes, refer to ROCm's official documentation.