# AthenaK scaling on ALCF Polaris

2022-05-19

## Description of datasets

- GRMHD Torus v1: `weak_scaling.csv`, `strong_scaling_11_0.csv`, `strong_scaling.csv`
  - Many small MeshBlocks, 512x 32^3 per GPU
  - All outputs disabled
  - Ran for `time/tlim=100.0`. Default limit of 1e4 would take ~6 hrs on one GPU
  - Only problem configuration for which strong scaling tests were run.
  - Strong scaling looked better than the abysmal weak scaling results, since the number of tiny MeshBlocks per GPU shrinks as GPUs are added in a strong scaling test.
 
- GRMHD Torus v2: `weak_scaling_large_mbs.csv`
  - Fewer large MeshBlocks, 16x 128^3 per GPU
    - 9.788851e+07 zone-cycles/sec/GPU
    - 1x per GPU = 5.177544e+07
    - 8x per GPU = 9.159908e+07
    - 32x per GPU = 9.920768e+07
    - Note, if you 10x tlim, 1x MeshBlock per GPU drops from 5e7 to 3.77e7 ???
  - Otherwise, same parameters as before
  
- GRMHD (no-rad) linear wave: `weak_scaling_grmhd_linwave.csv`
  - 2x 128^3 MeshBlocks per GPU; base problem scaled up to 256x128x128, 6.0,3.0,3.0
  - `nlim=4000`, `tlim=10.0`

- GRMHD radiation linear wave, level 2 spherical mesh: `weak_scaling_grrad_linwave_nlevel2.csv`
  - 2x 128^3 MeshBlocks per GPU; base problem scaled up to 256x128x128, 6.0,3.0,3.0
    - 6.475858e+06 zone-cycles/sec/GPU
    - vs. single 128x64x64 MeshBlock = 6.235982e+06
    - vs. 4x 128^3 MeshBlocks = 6.42e6 zone-cycles/sec/GPU
  - `nlim=400`, `tlim=1.0`  
  - The number of angles is 10*(level^2)+2
  - 42 angles

- GRMHD radiation linear wave, level 3 spherical mesh: `weak_scaling_grrad_linwave_nlevel3.csv`
  - Would not run the 2x 128^3 MeshBlocks per GPU configuration due to exhausting the 40.0 GiB of GPU memory 
  - Ran with 2x 64^3 MeshBlocks per GPU, base problem 128x64x64, 3.0x1.5x1.5
  - `nlim=1000`, `tlim=1.0`    
  - 92 angles
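
The angle counts listed above follow directly from the `10*(level^2)+2` formula, and the level 3 memory exhaustion is at least plausible from how per-angle storage grows. The sketch below is a rough estimate only: it assumes one double-precision value per angle per zone and ignores ghost zones, and the function names are made up for illustration. How many such arrays AthenaK actually allocates is not stated in these notes.

```python
def num_angles(level):
    """Angles on the spherical mesh, using the formula from the notes."""
    return 10 * level**2 + 2


def intensity_array_gib(level, n_blocks, block_nx, bytes_per_value=8):
    """Size in GiB of ONE array holding a value per angle per zone.

    Assumptions (not from the notes): double precision, no ghost zones.
    """
    zones = n_blocks * block_nx**3
    return num_angles(level) * zones * bytes_per_value / 2**30


for level in (2, 3, 4):
    print(f"level {level}: {num_angles(level)} angles")

# Per-GPU MeshBlock configurations mentioned in the notes
print(f"level 2, 2x 128^3: {intensity_array_gib(2, 2, 128):.3f} GiB per array")
print(f"level 3, 2x 128^3: {intensity_array_gib(3, 2, 128):.3f} GiB per array")
print(f"level 3, 2x 64^3:  {intensity_array_gib(3, 2, 64):.3f} GiB per array")
```

Under these assumptions a single per-angle array at level 3 with 2x 128^3 MeshBlocks is close to 3 GiB, so a modest number of them could approach the 40 GiB limit; dropping to 64^3 MeshBlocks cuts each array by 8x.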


### New radiation timings (Saturday 2022-05-28)

Each GPU's throughput should increase roughly 4x according to Patrick:

> For the radiating hydro linear wave test (128^3 mesh, with a single 128^3 meshblock)
>
> - level 2 (42 angles): zone-cycles/cpu_second = 2.660367e+07
> - level 3 (92 angles): zone-cycles/cpu_second = 1.331120e+07
> - level 4 (162 angles): zone-cycles/cpu_second = 7.887175e+06
>
> This makes the radiating hydro linear wave calculation with a level 2 geodesic mesh only 4.5x slower than a GRMHD linear wave calculation (it used to be 18x slower).  Another way of putting this---the new performance with 162 angles is better than our earlier performance with only 42 angles.

So I roughly 4x'd the `time/nlim`. Recall, tlim=1.0 is rescaled by the code to be equivalent to 1.0 wave periods.

- GRMHD radiation linear wave, level 2 spherical mesh: `weak_scaling_grrad_linwave_nlevel2-postfix.csv`
  - 2x 128^3 MeshBlocks per GPU; base problem scaled up to 256x128x128, 6.0,3.0,3.0
    - 2.841674e+07 zone-cycles/sec/GPU
    - 4.39x improvement
  - `nlim=2000`, `tlim=1.0`  


- GRMHD radiation linear wave, level 3 spherical mesh: `weak_scaling_grrad_linwave_nlevel3-postfix.csv`
  - 2x 64^3 MeshBlocks per GPU, base problem 128x64x64, 3.0x1.5x1.5
    - 1.352471e+07 zone-cycles/sec/GPU
    - 4.32x improvement
  - `nlim=4000`, `tlim=10.0`  
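
The level 2 improvement factor, and the quoted claim that 162 angles now outperform the earlier 42-angle runs, can be rechecked from numbers already recorded in these notes. This is a quick arithmetic check, not a new measurement; the variable names are mine. The pre-fix level 3 per-GPU throughput is not recorded here, so the 4.32x figure cannot be rechecked the same way.

```python
# Per-GPU throughputs copied from the notes (zone-cycles/sec/GPU)
old_level2 = 6.475858e06  # level 2, 2x 128^3 per GPU, earlier run
new_level2 = 2.841674e07  # level 2, 2x 128^3 per GPU, "postfix" run

# From the quoted message: level 4 (162 angles), single 128^3 MeshBlock
new_level4_single = 7.887175e06

improvement = new_level2 / old_level2
print(f"level 2 improvement: {improvement:.2f}x")

# New 162-angle throughput vs. the earlier 42-angle throughput
print("162 angles now faster than old 42 angles:",
      new_level4_single > old_level2)
```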


## Notes

- Zsh is broken on compute nodes
- For sub-node scaling results (1, 2, or 3 GPUs), need to manually 
- Decent improvement moving from CUDA Toolkit 11.0 to 11.6. Compare the two strong scaling CSV files, `strong_scaling_11_0.csv` and `strong_scaling.csv`.
- Needed commit from `compile_hotfix` branch to compile correctly with CUDA 11.6
- Seg-fault will occur at runtime unless `export MPICH_GPU_SUPPORT_ENABLED=1` is executed to enable GPUDirect/CUDA-aware Cray MPICH. Not needed at compile time 
- For all scaling tests, I would proportionally scale the `mesh/x3min, mesh/x3max, mesh/nx3` parameters so that the resolution remains fixed for all problems (and hence the timestep)
- `-d 16 --cpu-bind depth -env OMP_NUM_THREADS=16` didn't seem to change the results much, if at all, when passed to `mpiexec`
- `-ppn 4` is essential on multinode scaling tests; otherwise ranks might be spread across nodes rather than being packed? TODO: check out the `qsub -place=free ...` default and alternatives:

```
mpiexec -np 2 ./gpu_affinity.sh
2.320449e+07
mpiexec -np 2 -ppn 2 ./gpu_affinity.sh
1.015839e+08
```
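
The proportional rescaling of `mesh/x3min, mesh/x3max, mesh/nx3` mentioned in the notes above can be sketched as follows. This is a minimal illustration, not the script actually used: `scale_axis` is a hypothetical helper, and it assumes `xmin` stays fixed while `xmax` moves outward, since the notes do not say which end of the domain was extended.

```python
def scale_axis(nx, xmin, xmax, factor):
    """Scale one mesh axis by an integer factor at fixed resolution.

    Assumption: xmin stays put and xmax moves outward.
    """
    dx = (xmax - xmin) / nx
    new_nx = nx * factor
    new_xmax = xmin + new_nx * dx
    return new_nx, xmin, new_xmax


# Illustrative x3 axis: 64 zones over a length of 1.5, as in the level 3
# base problem. Placing x3min at 0.0 is an assumption.
nx3, x3min, x3max = 64, 0.0, 1.5
for factor in (1, 2, 4):
    new_nx3, new_min, new_max = scale_axis(nx3, x3min, x3max, factor)
    dx3 = (new_max - new_min) / new_nx3
    print(f"x{factor}: nx3={new_nx3}, x3min={new_min}, "
          f"x3max={new_max}, dx3={dx3}")
```

The zone width `dx3` is the same for every factor, which is what keeps the timestep unchanged as the problem grows.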

### Compilation instructions
Best to compile on compute node, so CMake can autodetect the A100s. Example for compiling and running on 1 node:
```
qsub -q run_next -I -l select=1:ncpus=64:ngpus=4,walltime=03:00:00 -j oe -S /bin/bash

module load craype-accel-nvidia80
export MPICH_GPU_SUPPORT_ENABLED=1

cd athenak
rm -rfd build
mkdir build; cd build
cmake -D Athena_ENABLE_MPI=ON -DKokkos_ENABLE_CUDA=On -DKokkos_ENABLE_CUDA_LAMBDA=On -DKokkos_ARCH_AMPERE80=On -DCMAKE_CXX_COMPILER=/home/felker/athenak/kokkos/bin/nvcc_wrapper ..
make -j 32

cd src
cp ~/athenak-scaling/*.sh ./

mpiexec -np 1 -ppn 4 -d 16 --cpu-bind depth -env OMP_NUM_THREADS=16 ./gpu_affinity.sh
```

- GR Torus requires `-DPROBLEM=gr_torus` in CMake command. Used `compile_hotfix` branch
- GRMHD (no radiation) linear wave requires **no** `-DPROBLEM` option in CMake command. Used `scaling_grmhd` branch
- GRMHD radiation linear wave requires `-DPROBLEM=rad_linear_wave` in CMake command. Used `scaling_grrad` branch


### To do
- [x] Get dedicated full machine reservation; compare scaling efficiency loss at >= 64 nodes. Most extreme in GRMHD Torus problem, practically nonexistent in GRMHD Linear Wave plot (but drop off from 1 to 2 GPUs is more extreme).
- [ ] Explain differences in GRMHD Torus and GRMHD Linear Wave problem plots. Same physics capabilities, right? Just MeshBlock setup?
- [ ] Also retest after Slingshot 11 NIC upgrade in the fall
- [x] Run 512 and/or 550 node jobs when the ~50 down nodes are brought back into service
- [ ] Explain the 7x efficiency loss from 1 to 8 GPUs when the 512x 32^3 MeshBlocks were used in the original GR torus problem setup
- [ ] Explain how CMake setup, Kokkos buildchain, `nvcc_wrapper`, etc. was able to successfully link Cray MPICH libraries, etc. despite not ever explicitly referencing the `PrgEnv-nvidia` wrapper compilers `cc`, `CC` in the scripts. 

We explicitly identify the CXX compiler as `nvcc_wrapper`, which defaults to `g++` as its host compiler. CMake automatically picks up `cc` as the C compiler; does that wrapper then bring in everything else?
```
-- The C compiler identification is NVHPC 22.3.0
-- The CXX compiler identification is GNU 7.5.0
```

# Code

In [None]:
from matplotlib import pyplot as plt
import matplotlib as mpl
import numpy as np

import pandas as pd

mpl.rcParams['figure.dpi'] = 300

from IPython.display import Image

import seaborn as sns


## Strong scaling, 1 to 16 nodes

In [None]:
data = pd.read_csv('strong_scaling.csv')
# Speedup relative to the first row of the CSV (assumed to be the smallest run)
data['Speedup'] = data['zone-cycles/cpu_second']/data['zone-cycles/cpu_second'].iloc[0]

In [None]:
data

In [None]:
data['zone-cycles/cpu_second'][0]

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['zone-cycles/cpu_second'],'-o')
#ax.set_xlabel('Number of MPI ranks = N_A100')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')

# ax.set_xscale('log', base=2)

ax.set_ylabel('Zone-cycles/second')
# Vertical line at 4 GPUs = one full node
ax.axvline(x=4, color='0.8', alpha=0.8)


In [None]:
fig.savefig("strong-scaling.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['Speedup'],'-o')
#ax.set_xlabel('Number of MPI ranks = N_A100')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Speedup')

ax.axvline(x=4, color='0.8', alpha=0.8)


In [None]:
fig.savefig("strong-scaling-speedup.pdf")
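
Speedup alone does not show how far a run is from ideal scaling. One common companion metric is parallel efficiency: the measured speedup divided by the ideal speedup, both taken relative to the first row. The sketch below uses plain lists and made-up throughputs purely to illustrate the arithmetic; the function name is hypothetical and none of the numbers come from the CSV files.

```python
def strong_scaling_efficiency(ranks, rates):
    """Parallel efficiency: measured speedup divided by ideal speedup.

    Both are taken relative to the first entry, so the first value is 1.0.
    """
    r0, rate0 = ranks[0], rates[0]
    return [(rate / rate0) / (r / r0) for r, rate in zip(ranks, rates)]


# Made-up throughputs for illustration only, NOT measurements from the CSVs
ranks = [4, 8, 16]
rates = [1.0e8, 1.8e8, 3.0e8]

efficiencies = strong_scaling_efficiency(ranks, rates)
for r, eff in zip(ranks, efficiencies):
    print(f"{r:3d} ranks: efficiency = {eff:.2f}")
```

With the DataFrame above, the equivalent would be dividing `data['Speedup']` by `data['Num MPI ranks'] / data['Num MPI ranks'].iloc[0]`.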

## Weak scaling

In [None]:
#data = pd.read_csv('weak_scaling.csv')
# data = pd.read_csv('weak_scaling_large_mbs.csv', comment='#')
# data = pd.read_csv('weak_scaling_grmhd_linwave.csv', comment='#')
# data = pd.read_csv('weak_scaling_grrad_linwave_nlevel2.csv', comment='#')
# data = pd.read_csv('weak_scaling_grrad_linwave_nlevel3.csv', comment='#')
# data = pd.read_csv('weak_scaling_grrad_linwave_nlevel2-postfix.csv', comment='#')
data = pd.read_csv('weak_scaling_grrad_linwave_nlevel3-postfix.csv', comment='#')

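# Scaled speedup: aggregate throughput relative to the first row
data['Scaled speedup'] = data['zone-cycles/cpu_second']/data['zone-cycles/cpu_second'].iloc[0]
# Weak scaling efficiency: first-row time divided by each run's time (1.0 is ideal)
data['Efficiency'] = data['cpu time used'].iloc[0]/data['cpu time used']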
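
The two derived columns above are closely related. In a weak scaling test where every run uses the same zones per GPU and the same number of cycles, the time-based efficiency equals the scaled speedup divided by the growth in rank count. The sketch below checks this with made-up timings; all values and variable names are illustrative assumptions, not data from the CSV files.

```python
# Made-up weak scaling run for illustration only, NOT data from the CSVs
zones_per_gpu = 2 * 128**3  # e.g. 2x 128^3 MeshBlocks per GPU
ncycles = 100               # same number of cycles in every run (assumed)
ranks = [1, 4, 16]
times = [10.0, 12.5, 16.0]  # seconds

# Aggregate throughput: total zones * cycles / time
rates = [n * zones_per_gpu * ncycles / t for n, t in zip(ranks, times)]

# Definition 1: ratio of run times (the 'Efficiency' column)
time_eff = [times[0] / t for t in times]

# Definition 2: scaled speedup divided by the increase in rank count
rate_eff = [(rate / rates[0]) / (n / ranks[0]) for n, rate in zip(ranks, rates)]

for n, e1, e2 in zip(ranks, time_eff, rate_eff):
    print(f"{n:3d} ranks: time-based = {e1:.3f}, throughput-based = {e2:.3f}")
```

If the two definitions disagree on real data, the runs likely differed in cycle count or in zones per GPU.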

In [None]:
data

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['cpu time used'],'-o')
#ax.set_xlabel('Number of MPI ranks = N_A100')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Wall time (s)')
ax.axvline(x=4, color='0.8', alpha=0.8)


In [None]:
fig.savefig("weak-scaling-walltime.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['cpu time used'],'-o')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Wall time (s)')
ax.axvline(x=4, color='0.8', alpha=0.8)
ax.set_xscale('log', base=2)


In [None]:
fig.savefig("weak-scaling-walltime-semilogy.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['Efficiency'],'-o')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Efficiency')
ax.axvline(x=4, color='0.8', alpha=0.8)

In [None]:
fig.savefig("weak-scaling-efficiency.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['Efficiency'],'-o')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Efficiency')
ax.axvline(x=4, color='0.8', alpha=0.8)
ax.set_xscale('log', base=2)

In [None]:
fig.savefig("weak-scaling-efficiency-semilogy.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['zone-cycles/cpu_second']/data['Num MPI ranks'],'-o')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Normalized performance [zone-cycles/second/GPU]')
ax.axvline(x=4, color='0.8', alpha=0.8)

#ax.set_xscale('log', base=2)

In [None]:
fig.savefig("weak-scaling-normalized-performance.pdf")

In [None]:
fig, ax = plt.subplots()
ax.plot(data['Num MPI ranks'], data['zone-cycles/cpu_second']/data['Num MPI ranks'],'-o')
ax.set_xlabel(r'$N_{\mathrm{MPI}} = N_{\mathrm{A100}}$')
ax.set_ylabel('Normalized performance [zone-cycles/second/GPU]')
ax.axvline(x=4, color='0.8', alpha=0.8)

ax.set_xscale('log', base=2)

In [None]:
fig.savefig("weak-scaling-normalized-performance-semilogy.pdf")

In [None]:
# fig, ax = plt.subplots()
# ax.plot(data['Num nodes'], data['zone-cycles/cpu_second']/data['Num MPI ranks'],'-o')
# ax.set_xlabel(r'$N_{\mathrm{Nodes}} = \frac{N_{\mathrm{A100}}}{4}$')
# ax.set_ylabel('Normalized performance [zone-cycles/second/GPU]')

In [None]:
# fig.savefig("weak-scaling-normalized-performance-nodes.pdf")

# Compute node environment

```
# Currently Loaded Modulefiles:
  1) craype-x86-milan         5) nvidia/22.3              9) cray-pmi/6.1.1          13) PrgEnv-nvidia/8.3.3
  2) libfabric/1.11.0.4.87    6) craype/2.7.15           10) cray-pmi-lib/6.1.1      14) cudatoolkit/11.6
  3) craype-network-ofi       7) cray-dsmml/0.2.2        11) cray-pals/1.1.6         15) craype-accel-nvidia80
  4) perftools-base/22.04.0   8) cray-mpich/8.1.15       12) cray-libpals/1.1.6
    
felker@x3006c0s13b1n0:~> echo $PATH
/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/bin:/home/felker/.local/bin:/home/felker/bin:/home/felker/mygit/bin:/home/felker/myemacs/bin:/home/felker/.local/bin:/home/felker/bin:/home/felker/mygit/bin:/home/felker/myemacs/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/compilers/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.87/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/felker/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin
                                                                                                                    
felker@x3006c0s13b1n0:~> echo $LD_LIBRARY_PATH
/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/math_libs/11.6/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/nvvm/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/compilers/lib:/opt/cray/pe/papi/6.0.0.14/lib64:/opt/cray/libfabric/1.11.0.4.87/lib64
                            
felker@x3006c0s13b1n0:~> nvidia-smi
Fri May 20 21:33:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:46:00.0 Off |                    0 |
| N/A   29C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   27C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C7:00.0 Off |                    0 |
| N/A   31C    P0    56W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

## CMake output, GR Torus example
```
felker@x3005c0s7b1n0:~/athenak/build> cmake -D Athena_ENABLE_MPI=ON -DPROBLEM=gr_torus -DKokkos_ENABLE_CUDA=On -DKokkos_ARCH_AMPERE80=On -DCMAKE_CXX_COMPILER=/home/felker/athenak/kokkos/bin/nvcc_wrapper ../
-- The C compiler identification is NVHPC 22.3.0
-- The CXX compiler identification is GNU 7.5.0
-- Cray Programming Environment 2.7.15 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.15/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/felker/athenak/kokkos/bin/nvcc_wrapper - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'Release' as none was specified.
-- Found MPI_CXX: /opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/targets/x86_64-linux/lib/stubs/libcuda.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: CXX
-- Including user-specified problem generator file: gr_torus
-- Setting default Kokkos CXX standard to 17
-- Setting policy CMP0074 to use <Package>_ROOT variables
-- The project name is: Kokkos
-- Compiler Version: 11.6.112
-- SERIAL backend is being turned on to ensure there is at least one Host space. To change this, you must enable another host execution space and configure with -DKokkos_ENABLE_SERIAL=OFF or change CMakeCache.txt
-- Using -std=c++17 for C++17 standard as feature
-- Execution Spaces:
--     Device Parallel: CUDA
--     Host Parallel: NONE
--       Host Serial: SERIAL
--
-- Architectures:
--  AMPERE80
-- Found CUDAToolkit: /opt/nvidia/hpc_sdk/Linux_x86_64/22.3/cuda/11.6/include (found version "11.6.112")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found TPLCUDA: TRUE
-- Found TPLLIBDL: /usr/lib64/libdl.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/felker/athenak/build
```