- This page: https://hackmd.io/@enccs/hip101-april2021
- Workshop page: https://enccs.se/events/2021/04/hip101-workshop/
- GitHub repository: https://github.com/csc-training/hip
- Using Puhti: https://github.com/csc-training/hip/blob/main/using-puhti.md
- SLURM reservation for this training: hip101
- AMD porting guide: https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-porting-guide.html
- AMD ROCm documentation: https://rocmdocs.amd.com/en/latest/
- Artem Zhmurov
- Mark Abraham
- Roberto Di Remigio
- Xin Li
- Qiang Li
We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone (see https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html). In short:
- Use welcoming and inclusive language
- Be respectful of different viewpoints and experiences
- Gracefully accept constructive criticism
- Focus on what is best for the community
- Show courtesy and respect towards other community members
:::danger At the bottom of this document you can ask questions about the workshop content! We use the Zoom chat only for reporting Zoom problems and such. :::
Time | Topic |
---|---|
09:00–10:00 | Introduction to AMD architecture, HIP, and Hipify tools |
10:00–10:15 | Break |
10:15–10:45 | Deep dive into Hipify tools and examples |
10:45–11:30 | Lunch |
11:30–12:10 | Hands-on sessions |
12:10–12:20 | Break |
12:20–12:40 | GROMACS and CMake |
12:40–13:00 | Hands-on sessions - test your own code if you want |
If you got access tomorrow to computational resources 10x larger than you currently have, would you be able to take advantage of them? If so, how?
docs
porting
└── codes
├── saxpy/cuda
├── saxpy/cublas
├── Discrete_Hankel_Transform
├── Heat-Equation
├── 2D Wave Propagation
├── KMeans clustering
└── Vector_addition
- trainingXXX: NAME
- training018 : Cristian-Vasile A.
- training019 : Xavier A.
- training020 : Farid A.
- training021 : Arash Alizad B.
- training022 : Abdulrahman Azab M.
- training023 : Markus B.
- training024 : Oscar Bulancea L.
- training025 : Wei Der C.
- training026 : Gibson C.
- training027 : Robert C.
- training028 : Marco Crialesi E.
- training029 : Tewodros D.
- training030 : Yaxing D.
- training031 : Pascal H.
- training032 : Johan H.
- training033 : Patric H.
- training034 : Chia-Jung H.
- training035 : Zuhair I.
- training036 : Niclas J.
- training037 : Esko J.
- training038 : Joe J.
- training039 : Bijoy J.
- training040 : Kimmo K.
- training041 : Umair K.
- training042 : Tuomas K.
- training043 : Oskar L.
- training044 : Ashenafi L.
- training045 : Qiang L.
- training046 : Floriano M.
- training047 : Talgat M.
- training048 : Anders M.
- training049 : Meghdoot M.
- training050 : Gnanavel M.
- training051 : Pedro O.
- training052 : Johannes P.
- training053 : Janne P.
- training054 : Chaitanya P.
- training055 : Jarno R.
- training056 : Zilvinas R.
- training057 : Tuomas R.
- training058 : Kari R.
- training059 : Assa Aravindh S.
- training060 : Nicolo S.
- training061 : Suraj S.
- training062 : Pedro Simoes C.
- training063 : Samuele S.
- training064 : Pascal S.
- training065 : Ronith S.
- training066 : Walter T.
- training067 : Fedor U.
- training068 : Olav V.
- training069 : Masi V.
- training070 : Henric Z.
- training071 : Hongyang Z.
ssh trainingXXX@puhti.csc.fi
- Give your password and you should be located in the directory:
/users/trainingXXX
module list
Currently Loaded Modules:
1) intel/19.0.4 2) hpcx-mpi/2.4.0 3) intel-mkl/2019.0.4 4) StdEnv
- Submit script
sub.sh
sbatch sub.sh
- Check the status of a job
squeue -u $USER
- Cancel a job
scancel JOBID
- Sample batch job script
#!/bin/bash
#SBATCH --job-name=hip_test
#SBATCH --account=project_2000745
#SBATCH --partition=gpu
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8000
#SBATCH --gres=gpu:v100:1
#SBATCH --reservation=gpu_training
module load hip
srun my_hip_program
- Load HIP module
module load hip/4.0.0c
module list
Currently Loaded Modules:
1) StdEnv 2) gcc/9.1.0 3) cuda/11.1.0 4) hip/4.0.0c 5) intel-mkl/2019.0.4 6) hpcx-mpi/2.4.0
There is also a module hip/4.0.0, but we also created hip/4.0.0c, which is an installation from the source code. In the future the module name will match the version.
- hipconfig
hipconfig
HIP version : 4.0.20496-4f163c6
== hipconfig
HIP_PATH : /appl/opt/rocm/rocm-4.0.0c/hip
ROCM_PATH : /appl/opt/rocm/rocm-4.0.0c/
HIP_COMPILER : clang
HIP_PLATFORM : nvcc
HIP_RUNTIME : ROCclr
CPP_CONFIG : -D__HIP_PLATFORM_NVCC__= -I/appl/opt/rocm/rocm-4.0.0c/hip/include -I/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include
== nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
=== Environment Variables
PATH=/appl/opt/rocm/rocm-4.0.0c/hip/bin:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/bin:/appl/opt/ucx/1.9.0-cuda/bin:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/bin:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/bin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/bin:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/appl/bin:/users/markoman/.local/bin:/users/markoman/bin
CUDA_PATH=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/
HIPFORT_ARCH=nvptx
HIP_PLATFORM=nvcc
LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hip/lib:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/lib:/appl/opt/ucx/1.9.0-cuda/lib:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/lib:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/lib:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/tbb/lib/intel64_lin/gcc4.7:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/compiler/lib/intel64_lin:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64_lin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib:/appl/opt/rocm/rocm-4.0.0/hiprand/lib:/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib
HIP_RUNTIME=ROCclr
HIPFORT_GPU=sm_70
CUDA_INSTALL_ROOT=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2
HIPFORT_HOME=/appl/opt/rocm/rocm-4.0.0c//hipfort/
HIPFORT_ARCHGPU=nvptx-sm_70
HIPCC_OPTS=--x cu
HIP_COMPILER=clang
HIP_PATH=/appl/opt/rocm/rocm-4.0.0c/hip
== Linux Kernel
Hostname : puhti-login1.bullx
Linux puhti-login1.bullx 3.10.0-1062.33.1.el7.x86_64 #1 SMP Thu Aug 13 10:55:03 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.7 (Maipo)
Release: 7.7
Codename: Maipo
- The wrapper used to compile on an NVIDIA system is called hipcc
which hipcc
/appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc
- You can read the file /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc for more information
hipcc -h
Usage : nvcc [options] <inputfile>
Options for specifying the compilation phase
============================================
More exactly, this option specifies up to which stage the input files must be compiled,
according to the following compilation trajectories for different input file types:
.c/.cc/.cpp/.cxx : preprocess, compile, link
.o : link
.i/.ii : compile, link
.cu : preprocess, cuda frontend, PTX assemble,
merge with host C code, compile, link
.gpu : cicc compile into cubin
.ptx : PTX assemble into cubin.
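As a quick sanity check of the toolchain (a sketch, not part of the course repository), a minimal HIP program can be compiled with hipcc and run on a GPU node:

```cpp
// check_hip.cpp -- compile on Puhti with:  hipcc --x cu check_hip.cpp -o check_hip
#include <cstdio>
#include <hip/hip_runtime.h>

int main()
{
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        std::printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    std::printf("Found %d HIP device(s)\n", count);
    return 0;
}
```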
- Start porting the CUDA codes on an NVIDIA system
- When it is finished, compile the code with HIP on an AMD system (no access to AMD hardware yet)
- HIP can be used on both AMD and NVIDIA GPUs
- The script hipconvertinplace-perl.sh can hipify all the files in a directory
- Some HIP libraries seem not to work on NVIDIA systems
- If you want to see the command that hipcc actually executes, set the following:
export HIPCC_VERBOSE=1
- For example, on Puhti, the command:
hipcc "--gpu-architecture=sm_70" -g -O3 -I../common -c core_cuda.cu -o core_cuda.o
would also print the command that was actually executed:
hipcc-cmd: /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//bin/nvcc -D__HIP_ROCclr__ -Wno-deprecated-gpu-targets -isystem /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include -isystem /appl/opt/rocm/rocm-4.0.0c/hip/include --gpu-architecture=sm_70 -g -O3 -I../common -c core_cuda.cu -o core_cuda.o
Add in your submission script before srun:
export AMD_LOG_LEVEL=4
For example, with no debug mode:
srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0
with debug mode:
:3:rocdevice.cpp :458 : 2193024923864 us: Initializing HSA stack.
:1:rocdevice.cpp :466 : 2193024923948 us: hsa_init failed.
:4:runtime.cpp :82 : 2193024923950 us: init
srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0
The conclusion is that the library being used requires AMD hardware, so it crashes immediately. In a real execution you will observe a lot of debug output.
From this point on, we assume that you have cloned the GitHub repository.
Clone the Git repository of the training:
$ git clone https://github.com/csc-training/hip.git
$ cd hip
$ export rootdir=$PWD
Acknowledgment: Some exercises were provided by Cristian-Vasile Achim, Jussi Enkovaara, and AMD, or were found online.
SAXPY stands for Single-Precision A*X Plus Y. It combines a scalar multiplication and a vector addition: y = a*x + y.
cd ${rootdir}/porting/codes/saxpy/cuda
#include <stdio.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
int N = 1<<30;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on the N elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
make clean
make
sbatch sub.sh
Check the files out_* and error_*
The error output includes the duration of the execution, which is close to 7.1 seconds, and the out_* file includes the max error, which should be 0.
cp Makefile saxpy.cu sub.sh ../hip/
cd ../hip
- Examine the hipify procedure
module load hip/4.0.0c
hipexamine-perl.sh saxpy.cu
info: converted 14 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:7 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:3 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:42 in 'saxpy.cu'
hipMemcpy 3
hipFree 2
hipMemcpyHostToDevice 2
hipMalloc 2
hipLaunchKernelGGL 1
hipMemcpyDeviceToHost 1
There are no warnings, so all of the code can be hipified.
hipify-perl --inplace saxpy.cu
The file saxpy.cu is hipified:
#include "hip/hip_runtime.h"
#include <stdio.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
int N = 1<<30;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
hipMalloc(&d_x, N*sizeof(float));
hipMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);
// Perform SAXPY on the N elements
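// hipify replaced the CUDA <<<grid, block>>> launch with
// hipLaunchKernelGGL(kernel, gridDim, blockDim, sharedMemBytes, stream, kernel arguments...)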
hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);
hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
hipFree(d_x);
hipFree(d_y);
free(x);
free(y);
}
14 CUDA calls were converted with no errors.
- Edit the Makefile and change the nvcc to hipcc
CC = nvcc
Modify to
CC = hipcc
- Clean and compile
make clean
make
- Submit
sbatch sub.sh
Check the files out_* and error_*
The error output includes the duration of the execution, which is close to 7.32 seconds, and the out_* file includes the max error, which should be 0. The overhead compared to the CUDA version seems to be close to 3%.
The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hip_solution
mv saxpy.cu saxpy.cpp
- Edit the Makefile
- HIP Makefile with .cu
# Compiler can be set below, or via environment variable
CC = hipcc
OPTIMIZE = yes
#
#===============================================================================
# Program name & source code list
#===============================================================================
program = saxpy
source = saxpy.cu
obj = $(source:.cu=.o)
#===============================================================================
# Sets Flags
#===============================================================================
# Standard Flags
CFLAGS := -Xcompiler -Wall
# Linker Flags
LDFLAGS =
# Optimization Flags
ifeq ($(OPTIMIZE),yes)
CFLAGS += -O3
endif
#===============================================================================
# Targets to Build
#===============================================================================
#
$(program): $(obj) Makefile
$(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS)
%.o: %.cu Makefile
$(CC) $(CFLAGS) -c $< -o $@
- HIP Makefile with .cpp
# Compiler can be set below, or via environment variable
CC = hipcc
HIP_CU = hipcc --x cu
OPTIMIZE = yes
#
#===============================================================================
# Program name & source code list
#===============================================================================
program = saxpy
source = saxpy.cpp
obj = $(source:.cpp=.o)
#===============================================================================
# Sets Flags
#===============================================================================
# Standard Flags
CFLAGS := -Xcompiler -Wall
# Linker Flags
LDFLAGS =
# Optimization Flags
ifeq ($(OPTIMIZE),yes)
CFLAGS += -O3
endif
#===============================================================================
# Targets to Build
#===============================================================================
#
$(program): $(obj) Makefile
$(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS)
%.o: %.cpp Makefile
$(HIP_CU) $(CFLAGS) -c $< -o $@
- Replace nvcc with hipcc in the Makefile
- Hipify in-place with
hipify-perl --inplace filename
- On an NVIDIA system, if the HIP code is in a file with the extension .cpp, use hipcc --x cu instead of hipcc
SAXPY but using cuBLAS
cd ${rootdir}/porting/codes/saxpy/cublas
#include <iostream>
#include "cublas_v2.h"
using namespace std;
int N = 1 << 30;
int main(){
float *a_h, *b_h;
a_h = new float[N];
b_h = new float[N];
float *a_d, *b_d;
for(int i = 0; i < N; i++){
a_h[i] = 1.0f;
b_h[i] = 2.0f ;
}
cublasHandle_t handle;
cublasCreate(&handle);
cudaMalloc((void**) &a_d, sizeof(float) * N);
cudaMalloc((void**) &b_d, sizeof(float) * N);
cublasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
cublasSetVector( N, sizeof(float), b_h, 1, b_d, 1);
const float s = 2.0f;
cublasSaxpy( handle, N, &s, a_d, 1, b_d, 1);
cublasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
cudaFree(a_d);
cudaFree(b_d);
cublasDestroy(handle);
float maxError = 0.0f;
for(int i = 0; i < N; i++)
maxError = max(maxError, abs(b_h[i]-4.0f));
cout << "Max error: " << maxError << endl;
delete[] a_h;
delete[] b_h;
return 0;
}
make clean
make
sbatch sub.sh
Check the files out_* and error_*
The error output includes the duration of the execution, which is close to 7.1 seconds, and the out_* file includes the max error, which should be 0.
- Examine the hipify procedure
make clean
cp * ../hipblas
cd ../hipblas
module load hip/4.0.0
hipexamine-perl.sh saxpy_cublas.cu
info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:6 device_library:0 device_function:0 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:39 in 'saxpy_cublas.cu'
hipFree 2
hipMalloc 2
We observe that there are 6 library calls that will be converted. Hipify the code:
hipify-perl --inplace saxpy_cublas.cu
Now the code is:
#include <iostream>
#include "hipblas.h"
using namespace std;
const int N = 1 << 30;
int main(){
float *a_h, *b_h;
a_h = new float[N];
b_h = new float[N];
float *a_d, *b_d;
for(int i = 0; i < N; i++){
a_h[i] = 1.0f;
b_h[i] = 2.0f ;
}
hipblasHandle_t handle;
hipblasCreate(&handle);
hipMalloc((void**) &a_d, sizeof(float) * N);
hipMalloc((void**) &b_d, sizeof(float) * N);
hipblasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
hipblasSetVector( N, sizeof(float), b_h, 1, b_d, 1);
const float s = 2.0f;
hipblasSaxpy( handle, N, &s, a_d, 1, b_d, 1);
hipblasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
hipFree(a_d);
hipFree(b_d);
hipblasDestroy(handle);
float maxError = 0.0f;
for(int i = 0; i < N; i++)
maxError = max(maxError, abs(b_h[i]-4.0f));
cout << "Max error: " << maxError << endl;
delete[] a_h;
delete[] b_h;
return 0;
}
- Modify the Makefile to:
...
CC = hipcc
...
CFLAGS := -Xcompiler -Wall -I/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/include
...
LDFLAGS = -L/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/ -lhipblas
...
- Define variables to find the hipBLAS header and library and compile
export LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/:$LD_LIBRARY_PATH
Load the custom installation of ROCm
module load hip/4.0.0c
make clean
make
- Submit your job script
sbatch sub.sh
- Check the out* and error* files.
The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hipblas_solution
- Always link with the appropriate library when it is available
- Do not forget to declare the LD_LIBRARY_PATH environment variable
- Adjust the Makefile
Description: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform
cd ${rootdir}/porting/codes/Discrete_Hankel_Transform/cuda
nvcc -arch=sm_70 -o code Code.cu
sbatch sub.sh
cp * ../hip/
cd ../hip/
$ hipexamine-perl.sh Code.cu
info: converted 46 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:24 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:11 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:9 define:0 extern_shared:0 kernel_launch:2 )
warn:0 LOC:220 in 'Code.cu'
hipMalloc 9
hipMemcpy 9
hipMemcpyHostToDevice 7
hipFree 3
hipLaunchKernelGGL 2
hipMemGetInfo 2
hipMemcpyDeviceToHost 2
hipMemset 1
$ hipify-perl --inplace Code.cu
$ ls
bessel_zeros.in Code.cu Code.cu.prehip README.md sub.sh
hipcc -arch=sm_70 -o code Code.cu
sbatch sub.sh
The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform/hip_solution
Example implementations of the two-dimensional heat equation.
cd ${rootdir}/porting/codes/heat-equation/cuda
- Load CUDA-aware MPI
module load openmpi
- Compile and execute
make
sbatch sub.sh
- Check the out* file, for example:
cat out_5003895
Average temperature at start: 59.763305
Iteration took 0.179 seconds.
Average temperature: 59.281239
Reference value with default arguments: 59.281239
make clean
mkdir ../hip
cp *.cpp *.h *.cu ../hip/
cd ../hip
hipexamine-perl.sh
info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:90 in './core_cuda.cu'
info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:200 in './setup.cpp'
info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:702
kernels (1 total) : evolve_kernel(1)
hipMemcpy 4
hipMemcpyHostToDevice 3
hipMalloc 2
hip_runtime_api 2
hipGetDeviceCount 1
hipDeviceSynchronize 1
hipLaunchKernelGGL 1
hipSetDevice 1
hipMemcpyDeviceToHost 1
hipconvertinplace-perl.sh .
info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:90 in './core_cuda.cu'
info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:200 in './setup.cpp'
info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:702
kernels (1 total) : evolve_kernel(1)
hipMemcpy 4
hipMemcpyHostToDevice 3
hipMalloc 2
hip_runtime_api 2
hipGetDeviceCount 1
hipDeviceSynchronize 1
hipLaunchKernelGGL 1
hipSetDevice 1
hipMemcpyDeviceToHost 1
ls
core.cpp core_cuda.cu heat.h io.cpp main.cpp Makefile setup.cpp sub2.sh utilities.cpp
core.cpp.prehip core_cuda.cu.prehip heat.h.prehip io.cpp.prehip main.cpp.prehip Makefile_orig setup.cpp.prehip sub.sh utilities.cpp.prehip
- Original Makefile
ifeq ($(COMP),)
COMP=gnu
endif
...
ifeq ($(COMP),gnu)
CXX=mpicxx
CC=gcc
NVCC=nvcc
NVCCFLAGS=-g -O3 -I$(COMMONDIR)
CCFLAGS=-g -O3 -Wall -I$(COMMONDIR)
LDFLAGS=
LIBS=-lpng -lcudart
endif
EXE=heat_cuda
OBJS=main.o core.o core_cuda.o setup.o utilities.o io.o
OBJS_PNG=$(COMMONDIR)/pngwriter.o
...
$(EXE): $(OBJS) $(OBJS_PNG)
$(CXX) $(CCFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS)
%.o: %.cpp
$(CXX) $(CCFLAGS) -c $< -o $@
%.o: %.c
$(CC) $(CCFLAGS) -c $< -o $@
%.o: %.cu
$(NVCC) $(NVCCFLAGS) -c $< -o $@
- Tips
    - Use hipcc to compile the code that contains HIP calls
    - hipcc can add the necessary options to link with the required default libraries (but not MPI etc.)
- New Makefile, taking into account that nvcc is used underneath hipcc
ifeq ($(COMP),)
COMP=hipcc
endif
...
ifeq ($(COMP),hipcc)
CXX=hipcc
CC=gcc
NVCC=hipcc --x cu
NVCCFLAGS=-g -O3 -I$(COMMONDIR)
CXXFLAGS=-g -O3 -Xcompiler -Wall -I$(COMMONDIR)
CCFLAGS=-g -O3 -Wall -I$(COMMONDIR)
LDFLAGS=
LIBS=-lpng -lmpi
endif
...
$(EXE): $(OBJS) $(OBJS_PNG)
$(CXX) $(CXXFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -c $< -o $@
%.o: %.c
$(CC) $(CCFLAGS) -c $< -o $@
%.o: %.cu
$(NVCC) $(NVCCFLAGS) -c $< -o $@
- Compile and Execute
make
sbatch sub.sh
2D Wave Propagation
The 2D Wave Propagation case was provided by Ludovic Rass
cd ${rootdir}/porting/codes/wave_2d/cuda_cpu
ls
compile.sh sub.sh vizme2D.m Wave_2D.c Wave_2D.cu
The file Wave_2D.c is for CPU and the Wave_2D.cu is for GPU.
- Compile and Submit
cat compile.sh
#!/bin/bash
g++ -O3 Wave_2D.c -o wcpu
nvcc -arch=sm_70 -O3 Wave_2D.cu -o wgpu
./compile.sh
sbatch sub.sh
- Check the out* file
cat out_5015029
Perf: 220 iterations took 7.392e-03 seconds @ 32.9915 GB/s.
Process uses GPU with id 0 .
Perf: 220 iterations took 3.312e-03 seconds @ 73.6352 GB/s.
Based on the output above, the CUDA version achieves roughly 2.2 times the bandwidth of the CPU version. Of course, this depends on the problem size, which in this case is quite small.
There is a script to compile the code in the hip folder already. Copy the CUDA file to the ../hip directory
cp *.cu sub.sh ../hip
cd ../hip
hipify-perl --print-stats --inplace Wave_2D.cu
info: converted 28 CUDA->HIP refs ( error:2 init:0 version:0 device:9 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:3 )
warn:0 LOC:126 in 'Wave_2D.cu'
hipDeviceReset 3
hipDeviceSynchronize 3
hipLaunchKernelGGL 3
hipMemcpy 2
hipGetDevice 1
hipError_t 1
hipFree 1
hipDeviceSetCacheConfig 1
hipMalloc 1
hip_runtime 1
hipSetDevice 1
hipMemcpyDeviceToHost 1
hipMemcpyHostToDevice 1
hipSuccess 1
hipGetErrorString 1
hipGetLastError 1
hipFuncCachePreferL1 1
- Compile and submit
Before you proceed with the submission, edit sub.sh and comment out the srun command that executes the CPU executable.
./compile.sh
sbatch sub.sh
- From the output file
cat out_*
Process uses GPU with id 0 .
Perf: 220 iterations took 3.385e-03 seconds @ 72.0481 GB/s.
The HIP version provides similar results to the CUDA version with a small overhead.
Parallel k-means clustering code
cd ${rootdir}/porting/codes/kmeans/cuda
ls
cuda_io.cu cuda_kmeans.cu cuda_main.cu cuda_wtime.cu Image_data kmeans.h LICENSE Makefile README sample.output sub.sh
- Compile and Execute
make cuda
sbatch sub.sh
- We can check the out* and error* files
Writing coordinates of K=128 cluster centers to file "Image_data/color17695.bin.cluster_centres"
Writing membership of N=17695 data objects to file "Image_data/color17695.bin.membership"
Performing **** Regular Kmeans (CUDA version) ****
Input file: Image_data/color17695.bin
numObjs = 17695
numCoords = 9
numClusters = 128
threshold = 0.0010
Loop iterations = 131
I/O time = 0.0529 sec
Computation timing = 0.2059 sec
- Copy the data to the ../hip directory
cp -r *.cu *.h Image_data ../hip
cd ../hip
- Hipify
hipconvertinplace-perl.sh .
info: converted 28 CUDA->HIP refs ( error:0 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:5 define:0 extern_shared:2 kernel_launch:2 )
warn:0 LOC:372 in './cuda_kmeans.cu'
info: converted 8 CUDA->HIP refs ( error:3 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:2 literal:0 numeric_literal:1 define:1 extern_shared:0 kernel_launch:0 )
warn:0 LOC:79 in './kmeans.h'
info: TOTAL-converted 36 CUDA->HIP refs ( error:3 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:1 include_cuda_main_header:0 type:3 literal:0 numeric_literal:6 define:1 extern_shared:2 kernel_launch:2 )
warn:0 LOC:843
kernels (1 total) : compute_delta(1)
hipMemcpy 5
hipFree 4
hipMalloc 4
hipMemcpyHostToDevice 3
hipError_t 2
hipDeviceSynchronize 2
hipLaunchKernelGGL 2
hipGetErrorString 2
HIP_DYNAMIC_SHARED 2
hipMemcpyDeviceToHost 2
hipGetDevice 1
hipDeviceProp_t 1
hipGetDeviceProperties 1
hipSuccess 1
hipGetLastError 1
- Compile and execute
make -f Makefile.hip
sbatch sub.sh
- The output file
Performing **** Regular Kmeans (CUDA version) ****
Input file: Image_data/color17695.bin
numObjs = 17695
numCoords = 9
numClusters = 128
threshold = 0.0010
Loop iterations = 131
I/O time = 0.0081 sec
Computation timing = 0.2000 sec
This code was developed in the context of porting the MadGraph5_aMC@NLO event generator software onto GPU hardware. MadGraph5_aMC@NLO can generate code for various physics processes in different programming languages (Fortran, C, C++).
cd ${rootdir}/porting/codes/
mkdir madgraph4gpu
cd madgraph4gpu
wget https://github.com/madgraph5/madgraph4gpu/archive/master.zip
unzip master.zip
cd madgraph4gpu-master/
ls
epoch0 epoch1 epoch2 README.md test tools
cd epoch1/
cp -r cuda hip
hipconvertinplace-perl.sh hip/
info: converted 7 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:3 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:841 in 'hip/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cu'
...
info: TOTAL-converted 294 CUDA->HIP refs ( error:11 init:0 version:0 device:3 context:0 module:0 memory:59 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:14 library:6 device_library:16 device_function:35 include:0 include_cuda_main_header:1 type:26 literal:0 numeric_literal:31 define:79 extern_shared:0 kernel_launch:13 )
warn:2 LOC:15920
warning: unconverted cudaTearDown : 2
kernels (2 total) : sigmaKin(3) gProc::sigmaKin(2)
hipMemcpy 23
hipMemcpyToSymbol 19
hipMemcpyDeviceToHost 18
hipLaunchKernelGGL 18
hipFree 14
hipMalloc 11
hipPeekAtLastError 9
hipMemcpyHostToDevice 6
hipDeviceReset 5
hipSuccess 5
HIP_SYMBOL 5
hipError_t 4
hipGetErrorString 4
hipDoubleComplex 3
hip_runtime 3
hipHostFree 3
hipHostMalloc 3
hipMemcpy3D 3
hipFloatComplex 3
hipMemcpy2D 3
HIPRAND_STATUS_SUCCESS 2
HIPRAND_RNG_PSEUDO_PHILOX4_32_10 2
HIPRAND_RNG_PSEUDO_MT19937 2
HIPRAND_RNG_PSEUDO_MTGP32 2
HIPRAND_RNG_PSEUDO_XORWOW 2
HIPRAND_RNG_PSEUDO_MRG32K3A 2
hipComplex 1
hip_complex 1
hipCsubf 1
hipCmulf 1
hipCaddf 1
hipCdiv 1
hipCrealf 1
hipCsub 1
hipCreal 1
hipCimag 1
hipCmul 1
hipCadd 1
hipCimagf 1
hipCdivf 1
- The warning is about a variable that the developers named with a cuda prefix, so it is safe to ignore. hipRAND is not working in our environment at the moment, so running this code requires actual AMD hardware; nevertheless, it is ported. It still requires tuning and checking.
- As we discussed already, there is no straightforward approach for CUDA Fortran.
- The HIP functions are callable from C, and with extern "C" they are also callable from Fortran.
- Procedure:
    - Port the CUDA Fortran kernels to HIP kernels in C++. hipfort helps to make some HIP calls from Fortran.
    - Wrap the kernel launch in a C function.
    - Call the C function from Fortran through the Fortran 2003 C binding, using C pointers etc.
We have the following example: a SAXPY code in CUDA Fortran. To hipify the code, we follow the procedure above.
$ cd ${rootdir}/porting/codes/cuda_fortran_saxpy/cuda
$ ls
main.cuf
cat main.cuf
module mathOps
contains
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps
program testSaxpy
use mathOps
use cudafor
implicit none
integer, parameter :: N = 40000
real :: x(N), y(N), a
real, device :: x_d(N), y_d(N)
type(dim3) :: grid, tBlock
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
x = 1.0; y = 2.0; a = 2.0
x_d = x
y_d = y
call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
y = y_d
write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
- Compile the code and submit
./compile.sh
sbatch sub.sh
- Check the out* and error* files
cat out_*
Max error: 0.000000
cat error*
real 0m0.404s
...
- Original kernel
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
- HIP kernel
__global__ void saxpy(float *y, float *x, float a, int n)
{
size_t i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < n) y[i] = y[i] + a*x[i];
}
- __global__ means that the function will be executed on the GPU and called from the host
- Variables such as blockDim%x in Fortran are written as blockDim.x in C/C++. This means that you have to change all these variables, but a find-and-replace with sed makes this easy
- Array indexing is also different: for example, y(i) becomes y[i], which sed can again help with
- Overall we need to be careful not to make any mistakes; always check the results
extern "C"
{
void launch(float **dout, float **da, float db, int N)
{
dim3 tBlock(256,1,1);
dim3 grid(ceil((float)N/tBlock.x),1,1);
hipLaunchKernelGGL((saxpy), grid, tBlock, 0, 0, *dout, *da, db, N);
}
}
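Putting the kernel and the launch wrapper together, the complete hipsaxpy.cpp looks roughly like the following sketch (based on the fragments above; the file in the repository may differ slightly):

```cpp
// hipsaxpy.cpp -- HIP kernel plus a C-callable launcher for the Fortran side.
#include <math.h>
#include <hip/hip_runtime.h>

__global__ void saxpy(float *y, float *x, float a, int n)
{
    size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = y[i] + a*x[i];
}

extern "C"
{
    // dout and da are the addresses of the device pointers held on the Fortran side.
    void launch(float **dout, float **da, float db, int N)
    {
        dim3 tBlock(256,1,1);
        dim3 grid(ceil((float)N/tBlock.x),1,1);
        hipLaunchKernelGGL((saxpy), grid, tBlock, 0, 0, *dout, *da, db, N);
    }
}
```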
program testSaxpy
use iso_c_binding
use hipfort
use hipfort_check
implicit none
interface
subroutine launch(y,x,b,N) bind(c)
use iso_c_binding
implicit none
type(c_ptr) :: y,x
integer, value :: N
real, value :: b
end subroutine
end interface
type(c_ptr) :: dx = c_null_ptr
type(c_ptr) :: dy = c_null_ptr
integer, parameter :: N = 40000
integer, parameter :: bytes_per_element = 4
integer(c_size_t), parameter :: Nbytes = N*bytes_per_element
real, allocatable,target,dimension(:) :: x, y
real, parameter :: a=2.0
real :: x_d(N), y_d(N)
call hipCheck(hipMalloc(dx,Nbytes))
call hipCheck(hipMalloc(dy,Nbytes))
allocate(x(N))
allocate(y(N))
x = 1.0;y = 2.0
call hipCheck(hipMemcpy(dx, c_loc(x), Nbytes, hipMemcpyHostToDevice))
call hipCheck(hipMemcpy(dy, c_loc(y), Nbytes, hipMemcpyHostToDevice))
call launch(dy, dx, a, N)
call hipCheck(hipDeviceSynchronize())
call hipCheck(hipMemcpy(c_loc(y), dy, Nbytes, hipMemcpyDeviceToHost))
write(*,*) 'Max error: ', maxval(abs(y-4.0))
call hipCheck(hipFree(dx))
call hipCheck(hipFree(dy))
deallocate(x)
deallocate(y)
end program testSaxpy
- hipfort provides a Makefile called Makefile.hipfort
export HIPFORT_HOME=${ROCM_PATH}/hipfort/
include ${HIPFORT_HOME}/bin/Makefile.hipfort
OUTPUT_DIR ?= $(PWD)
APP = $(OUTPUT_DIR)/saxpy
.DEFAULT_GOAL := all
all: $(APP)
$(APP): $(OUTPUT_DIR)/main.o $(OUTPUT_DIR)/hipsaxpy.o
$(FC) $^ $(LINKOPTS) -o $(APP)
$(OUTPUT_DIR)/main.o: main.f03
$(FC) -c $^ -o $(OUTPUT_DIR)/main.o
$(OUTPUT_DIR)/hipsaxpy.o: hipsaxpy.cpp
$(CXX) --x cu -c $^ -o $(OUTPUT_DIR)/hipsaxpy.o
clean:
rm -f $(APP) *.o *.mod *~
Tip: We are not sure how safe it is, but if all your .cpp files contain HIP calls, on an NVIDIA system you could define export HIPCC_COMPILE_FLAGS_APPEND="--x cu" instead of modifying the Makefile. Be careful, as this can break something else.
- Compile and submit
module load hip/4.0.0c
make
sbatch sub.sh
GROMACS is a molecular dynamics package mainly designed for simulations of proteins, lipids, and nucleic acids.
Do not follow these instructions during the workshop, as they could take a long time; they are documented here to help you with your own case.
wget https://ftp.gromacs.org/gromacs/gromacs-2021.tar.gz
tar zxvf gromacs-2021.tar.gz
cd gromacs-2021
ls
admin api AUTHORS build cmake CMakeLists.txt computed_checksum COPYING CPackInit.cmake CTestConfig.cmake docs INSTALL python_packaging README scripts share src tests
Let's hipify the application automatically with the hipconvertinplace-perl.sh script
cd src
hipconvertinplace-perl.sh .
info: converted 10 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:2 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:89 in './gromacs/gpu_utils/tests/devicetransfers.cu'
info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:2 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:2 literal:0 numeric_literal:8 define:1 extern_shared:0 kernel_launch:0 )
warn:0 LOC:126 in './gromacs/gpu_utils/pinning.cu'
info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:3 literal:0 numeric_literal:0 define:3 extern_shared:0 kernel_launch:0 )
warn:0 LOC:113 in './gromacs/gpu_utils/pmalloc_cuda.cu'
warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr));
warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 : isPinned = (memoryAttributes.type == cudaMemoryTypeHost);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 : * \param[in] cudaCallName name of CUDA peer access call
warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 : const char* cudaCallName)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 : gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 : gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str());
info: converted 32 CUDA->HIP refs ( error:1 init:0 version:0 device:6 context:0 module:0 memory:0 virtual_memory:0 addressing:2 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:7 literal:0 numeric_literal:7 define:0 extern_shared:0 kernel_launch:0 )
warn:9 LOC:239 in './gromacs/gpu_utils/gpu_utils.cu'
info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:5 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:4 literal:0 numeric_literal:2 define:1 extern_shared:0 kernel_launch:0 )
warn:0 LOC:103 in './gromacs/gpu_utils/device_stream.cu'
warning: ./gromacs/hardware/device_management.cu:#137 : // it is enough to check for cudaErrorDevicesUnavailable only here because
warning: ./gromacs/hardware/device_management.cu:#139 : if (cu_err == cudaErrorDevicesUnavailable)
...
info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:320 in './gromacs/ewald/pme_spread.cu'
warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync": virxx += __shfl_down_sync(activeMask, virxx, 1, width);
warning: ./gromacs/ewald/pme_solve.cu:261: unsupported device function "__shfl_up_sync": viryy += __shfl_up_sync(activeMask, viryy, 1, width);
...
info: converted 1 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:1 extern_shared:0 kernel_launch:0 )
warn:0 LOC:2709 in './external/googletest/googletest/include/gtest/internal/gtest-port.h'
info: TOTAL-converted 377 CUDA->HIP refs ( error:9 init:0 version:5 device:27 context:0 module:0 memory:29 virtual_memory:0 addressing:2 stream:18 event:27 external_resource_interop:0 stream_memory:0 execution:3 graph:0 occupancy:0 texture:3 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:18 device_library:0 device_function:38 include:1 include_cuda_main_header:5 type:81 literal:0 numeric_literal:85 define:11 extern_shared:7 kernel_launch:0 )
warn:45 LOC:869044
warning: unconverted cudaProfilerRun : 4
warning: unconverted cudaCallName : 4
warning: unconverted cudaErrorDevicesUnavailable : 2
warning: unconverted cudaMemoryTypeHost : 1
kernels (0 total) :
hipError_t 47
hipSuccess 46
hipEventDestroy 8
hipMemcpyAsync 8
hipSetDevice 8
HIP_DYNAMIC_SHARED 7
hipFuncSetCacheConfig 6
hipErrorInvalidValue 5
...
hipStream_t 4
hipHostMalloc 4
hipGetDevice 4
hipStreamQuery 4
hip_runtime 4
hipStreamSynchronize 4
...
hipHostMallocDefault 2
hipProfilerStop 2
hipPointerAttribute_t 2
hipDeviceEnablePeerAccess 2
...
hipDeviceGetStreamPriorityRange 1
hipGetErrorName 1
hipPeekAtLastError 1
hipErrorInvalidDeviceFunction 1
HIPFFT_SUCCESS 1
- Description: We should check all the warnings, since they indicate that something was not translated; this does not necessarily mean there is a mistake. If a developer declares a variable called cudaXXX, the tool will report a warning, but that is expected.
warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr));
warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 : isPinned = (memoryAttributes.type == cudaMemoryTypeHost);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 : if (cudaProfilerRun)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 : * \param[in] cudaCallName name of CUDA peer access call
warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 : const char* cudaCallName)
warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 : gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB);
warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 : gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str());
warning: ./gromacs/hardware/device_management.cu:#137 : // it is enough to check for cudaErrorDevicesUnavailable only here because
warning: ./gromacs/hardware/device_management.cu:#139 : if (cu_err == cudaErrorDevicesUnavailable)
warn:45 LOC:869044
warning: unconverted cudaProfilerRun : 4
warning: unconverted cudaCallName : 4
warning: unconverted cudaErrorDevicesUnavailable : 2
warning: unconverted cudaMemoryTypeHost : 1
Check the files with the warnings:
- For example, the warning about cudaProfilerRun is not actually a serious issue, as it is a variable declared by the developers:
static bool cudaProfilerRun
- Similarly for
const char* cudaCallName
- About cudaErrorDevicesUnavailable: we can see from the web page https://rocmdocs.amd.com/en/latest/Programming_Guides/CUDAAPIHIPTEXTURE.html that there is no corresponding HIP call.
if (cu_err == cudaErrorDevicesUnavailable)
{
return DeviceStatus::Unavailable;
}
However, if we try to track the call path from the code, we can see:
cu_err == cudaErrorDevicesUnavailable)
└── cu_err = checkCompiledTargetCompatibility(deviceInfo.id, deviceInfo.prop);
└── static cudaError_t checkCompiledTargetCompatibility(int deviceId, const cudaDeviceProp& deviceProp)
{
cudaFuncAttributes attributes;
cudaError_t stat = cudaFuncGetAttributes(&attributes, dummy_kernel);
...
return stat;
}
Thus, the returned value comes from the call to cudaFuncGetAttributes. The HIP API PDF, which is more up to date than the web site, shows for v4.0 (see https://github.com/RadeonOpenCompute/ROCm/blob/master/HIP-API_Guide_v4.0.pdf) that
4.8.2.11 hipFuncGetAttributes()
Returns
hipSuccess, hipErrorInvalidValue, hipErrorInvalidDeviceFunction
Thus we can manually replace cudaErrorDevicesUnavailable with hipErrorInvalidDeviceFunction in the file ./gromacs/hardware/device_management.cu.
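A rough sketch of how the hipified check could then look (an illustration only, not the actual GROMACS sources; the helper and kernel names follow the snippet above):

```cpp
#include <hip/hip_runtime.h>

// Dummy kernel whose attributes are queried, as in the CUDA original.
__global__ void dummy_kernel() {}

static hipError_t checkCompiledTargetCompatibility()
{
    hipFuncAttributes attributes;
    // Per the HIP API guide, this can return hipSuccess, hipErrorInvalidValue,
    // or hipErrorInvalidDeviceFunction.
    return hipFuncGetAttributes(&attributes, reinterpret_cast<const void*>(dummy_kernel));
}

// ...and at the call site the manual substitution becomes:
// if (cu_err == hipErrorInvalidDeviceFunction) { return DeviceStatus::Unavailable; }
```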
- About the warning:
warning: unconverted cudaMemoryTypeHost : 1
We can see from the HIP programming guide PDF that hipMemoryTypeHost exists, and we are not sure why it was not converted; it could be a bug, so we manually make the appropriate modification in the file ./gromacs/gpu_utils/gpu_utils.cu.
- Description: The CUDA warp-level __shfl_*_sync primitives are not supported by HIP (yet)
warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync": virxx += __shfl_down_sync(activeMask, virxx, 1, width);
Github issue: ROCm/HIP#1491
Solution:
Change the calls of __shfl_*_sync to __shfl_*, for example __shfl_down_sync to __shfl_down.
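A minimal sketch of the substitution (an illustration with made-up names, not the actual GROMACS patch):

```cpp
#include <hip/hip_runtime.h>

// Warp-level sum reduction. The CUDA original used
//   v += __shfl_down_sync(activeMask, v, offset, width);
// the *_sync variants are not available in HIP, so the mask argument is dropped.
__global__ void warp_reduce(float* data)
{
    float v = data[threadIdx.x];
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down(v, offset, warpSize);
    if (threadIdx.x == 0) data[0] = v;
}
```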
- Description: GROMACS uses FFT
Solution:
hipFFT seems not to compile on NVIDIA systems; it is possible with hip/4.0.0, but the explanation is different.
- Description:
Error: src/gromacs/utility/cuda_version_information.cu(49): error: identifier "hipDriverGetVersion" is undefined
Code:
#include "gmxpre.h"
#include "cuda_version_information.h"
#include "gromacs/utility/stringutil.h"
namespace gmx
{
std::string getCudaDriverVersionString()
{
int cuda_driver = 0;
if (hipDriverGetVersion(&cuda_driver) != hipSuccess)
{
return "N/A";
}
return formatString("%d.%d", cuda_driver / 1000, cuda_driver % 100);
}
std::string getCudaRuntimeVersionString()
{
int cuda_runtime = 0;
if (hipRuntimeGetVersion(&cuda_runtime) != hipSuccess)
{
return "N/A";
}
return formatString("%d.%d", cuda_runtime / 1000, cuda_runtime % 100);
}
} // namespace gmx
Solution:
Add #include "hip/hip_runtime.h"
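For illustration, the top of the hipified file with the manually added header could look like this (a sketch; only the new include differs from the code shown above):

```cpp
#include "gmxpre.h"
#include "hip/hip_runtime.h"  // added manually; brings in the declarations of hipDriverGetVersion and hipRuntimeGetVersion
#include "cuda_version_information.h"
#include "gromacs/utility/stringutil.h"
```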
The following build command is just an example; it does not mean that this is the best way:
CXX=/appl/opt/rocm/rocm-4.0.0/hip/bin/hipcc cmake -DGMX_GPU=CUDA ..
make
One CMake file from GROMACS, cmake/gmxManageNvccConfig.cmake, is the following:
#
# This file is part of the GROMACS molecular simulation package.
#
# Copyright (c) 2012,2013,2014,2015,2016 by the GROMACS development team.
# Copyright (c) 2017,2018,2019,2020, by the GROMACS development team, led by
# Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
# and including many others, as listed in the AUTHORS file in the
# top-level source directory and at http://www.gromacs.org.
...
# set up host compiler and its options
if(CUDA_HOST_COMPILER_CHANGED)
set(CUDA_HOST_COMPILER_OPTIONS "")
if(APPLE AND CMAKE_C_COMPILER_ID MATCHES "GNU")
# Some versions of gcc-4.8 and gcc-4.9 have produced errors
# (in particular on OS X) if we do not use
# -D__STRICT_ANSI__. It is harmless, so we might as well add
# it for all versions.
list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__")
endif()
work_around_glibc_2_23()
set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}"
CACHE STRING "Options for nvcc host compiler (do not edit!).")
mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS)
endif()
if (GMX_CUDA_TARGET_SM OR GMX_CUDA_TARGET_COMPUTE)
set(GMX_CUDA_NVCC_GENCODE_FLAGS)
set(_target_sm_list ${GMX_CUDA_TARGET_SM})
foreach(_target ${_target_sm_list})
list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=sm_${_target}")
endforeach()
set(_target_compute_list ${GMX_CUDA_TARGET_COMPUTE})
foreach(_target ${_target_compute_list})
list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=compute_${_target}")
endforeach()
else()
if(CUDA_VERSION VERSION_LESS "11.0")
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30")
endif()
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35")
...
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_70,code=compute_70")
if(NOT CUDA_VERSION VERSION_LESS "10.0")
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_75,code=compute_75")
endif()
if(NOT CUDA_VERSION VERSION_LESS "11.0")
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_80,code=compute_80")
endif()
endif()
if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32)
message(STATUS "Check for working NVCC/C++ compiler combination with nvcc '${CUDA_NVCC_EXECUTABLE}'")
execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${CUDA_NVCC_FLAGS} ${CUDA_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu
RESULT_VARIABLE _cuda_test_res
OUTPUT_VARIABLE _cuda_test_out
ERROR_VARIABLE _cuda_test_err
OUTPUT_STRIP_TRAILING_WHITESPACE)
...
endif() # GMX_CHECK_NVCC
macro(GMX_SET_CUDA_NVCC_FLAGS)
set(CUDA_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${CUDA_NVCC_FLAGS}")
endmacro()
function(gmx_cuda_add_library TARGET)
add_definitions(-DHAVE_CONFIG_H)
# Source files generated by NVCC can include gmxmpi.h, and so
# need access to thread-MPI.
include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external/thread_mpi/include)
# Source files can also contain topology related files and need access to
# the remaining external headers
include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external)
# Now add all the compilation options
gmx_cuda_target_compile_options(CUDA_${TARGET}_CXXFLAGS)
list(APPEND CMAKE_CXX_FLAGS ${CUDA_${TARGET}_CXXFLAGS})
foreach(build_type ${build_types_with_explicit_flags})
list(APPEND CMAKE_CXX_FLAGS_${build_type} ${CUDA_${TARGET}_CXXFLAGS_${build_type}})
endforeach()
cuda_add_library(${TARGET} ${ARGN})
endfunction()
hipify-cmakefile cmake/gmxManageNvccConfig.cmake > cmake/gmxManageHipConfig.cmake
warning: cmake/gmxManageNvccConfig.cmake:#38 : unsupported macro/option : # - use the CUDA_HOST_COMPILER if defined by the user, otherwise
warning: cmake/gmxManageNvccConfig.cmake:#39 : unsupported macro/option : # - check if nvcc works with CUDA_HOST_COMPILER and the generated nvcc and C++ flags
warning: cmake/gmxManageNvccConfig.cmake:#42 : unsupported macro/option : # * CUDA_HOST_COMPILER_OPTIONS - the full host-compiler related option list passed to nvcc
warning: cmake/gmxManageNvccConfig.cmake:#44 : unsupported macro/option : # Note that from CMake 2.8.10 FindCUDA defines CUDA_HOST_COMPILER internally,
warning: cmake/gmxManageNvccConfig.cmake:#59 : unsupported macro/option : list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D_FORCE_INLINES")
warning: cmake/gmxManageNvccConfig.cmake:#60 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS ${CUDA_HOST_COMPILER_OPTIONS} PARENT_SCOPE)
warning: cmake/gmxManageNvccConfig.cmake:#64 : unsupported macro/option : gmx_check_if_changed(CUDA_HOST_COMPILER_CHANGED CUDA_HOST_COMPILER)
warning: cmake/gmxManageNvccConfig.cmake:#67 : unsupported macro/option : if(CUDA_HOST_COMPILER_CHANGED)
warning: cmake/gmxManageNvccConfig.cmake:#68 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS "")
warning: cmake/gmxManageNvccConfig.cmake:#75 : unsupported macro/option : list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__")
warning: cmake/gmxManageNvccConfig.cmake:#80 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}"
warning: cmake/gmxManageNvccConfig.cmake:#83 : unsupported macro/option : mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS)
warning: cmake/gmxManageNvccConfig.cmake:#178 : unsupported macro/option : list(APPEND GMX_CUDA_NVCC_FLAGS "${CUDA_HOST_COMPILER_OPTIONS}")
warning: cmake/gmxManageNvccConfig.cmake:#210 : unsupported macro/option : if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32)
warning: cmake/gmxManageNvccConfig.cmake:#212 : unsupported macro/option : execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${HIP_NVCC_FLAGS} ${HIP_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu
The new CMake file looks like:
if(HIP_VERSION VERSION_LESS "11.0")
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30")
endif()
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35")
...
if(NOT HIP_VERSION VERSION_LESS "11.0")
# Requesting sm or compute 35, 37, or 50 triggers deprecation messages with
# nvcc 11.0, which we need to suppress for use in CI
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-Wno-deprecated-gpu-targets")
endif()
...
macro(GMX_SET_CUDA_NVCC_FLAGS)
set(HIP_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${HIP_NVCC_FLAGS}")
endmacro()
- The tool will be improved, and you can probably also add more variables using the file /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipify-cmakefile.
Also, there will be many improvements regarding CMake: ROCm/HIP#2158 (comment)
- Hipify the code in the repository: https://github.com/csc-training/hip/tree/main/porting/codes/Vector_Addition
- Some HIP libraries need dependencies that are not available on the NVIDIA platform; this needs to be investigated.
- If your CUDA kernel includes a dim3() call, then hipify will convert it wrongly (issue in GitHub). It was fixed on February 24th, but the fix is not yet installed on Puhti.
- In CUDA, __CUDACC__ is defined by nvcc, but the HIP equivalent __HIPCC__ is defined in hip_runtime.h. Thus, if code uses __CUDACC__ without #include <cuda_runtime_api.h>, one needs to manually add #include <hip_runtime.h> so that the automatically converted __HIPCC__ gets defined (issue in GitHub; see the sketch below).
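A small sketch of the last point (a hypothetical file, not from the repository):

```cpp
// This code was guarded with #ifdef __CUDACC__ and relied on nvcc defining that
// macro. The hipified guard __HIPCC__ is only defined once the HIP runtime header
// has been included, so the include has to be added by hand.
#include <hip/hip_runtime.h>

#ifdef __HIPCC__
__device__ inline float square(float x) { return x * x; }
#endif
```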
- question
- answer
:::info Always ask questions at the very bottom of this document, right above this. :::