Doc: Frontier (OLCF) hipcc
On Frontier, we switch back to `hipcc` as the compiler, since we see:
- HPE/Cray compilers show the performance regression we saw for ROCm 5.3-5.4,
  even when we load the ROCm 5.2 modules in the latest CCE programming
  environment (PE)
- ROCm 5.5 still has the performance regression

This also adds the RZ+PSATD dependencies BLAS++ & LAPACK++ as well as Python
dependencies, and modernizes/streamlines the dependency install.
ax3l committed Jun 14, 2023
1 parent 0e45735 commit 26e8a12
Showing 4 changed files with 228 additions and 35 deletions.
130 changes: 109 additions & 21 deletions Docs/source/install/hpc/frontier.rst
@@ -3,8 +3,9 @@
Frontier (OLCF)
===============

The `Frontier cluster (see: Crusher) <https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html>`_ is located at OLCF.
Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs) for a total of 8 GCDs per node.
The `Frontier cluster <https://www.olcf.ornl.gov/frontier/>`_ is located at OLCF.

On Frontier, each compute node provides four AMD MI250X GPUs, each with two Graphics Compute Dies (GCDs) for a total of 8 GCDs per node.
You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).
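As a quick sanity check, ``rocm-smi`` (shipped with the ROCm modules) lists one device per GCD; the sketch below assumes it is run on a compute node, e.g., inside an interactive job:

.. code-block:: bash

   # on a Frontier compute node: should list eight GPU devices (one per GCD)
   rocm-smi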


@@ -13,59 +14,144 @@ Introduction

If you are new to this system, **please see the following resources**:

* `Crusher user guide <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html>`_
* `Frontier user guide <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html>`_
* Batch system: `Slurm <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#running-jobs>`_
* `Production directories <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage>`_:
* `Filesystems <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage>`_:

* ``$HOME``: per-user directory, use only for inputs, source and scripts; backed up; mounted as read-only on compute nodes, which means you cannot run in it (50 GB quota)
* ``$PROJWORK/$proj/``: shared with all members of a project, purged every 90 days (recommended)
* ``$MEMBERWORK/$proj/``: single user, purged every 90 days (usually smaller quota, 50TB default quota)
* ``$WORLDWORK/$proj/``: shared with all users, purged every 90 days (50TB default quota)
* Note that the ``$HOME`` directory is mounted as read-only on compute nodes.
  That means you cannot run in your ``$HOME``.
  Its default quota is 50 GB.

Note: the Orion Lustre filesystem on Frontier and the older Alpine GPFS filesystem on Summit are not mounted on each other's machines.
Use `Globus <https://www.globus.org>`__ to transfer data between them if needed.
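Since ``$HOME`` is read-only on compute nodes, a common pattern is to stage each run in the shared project work space; the directory name below is only an example and assumes the ``proj`` variable is set (see the profile further down):

.. code-block:: bash

   # stage inputs and outputs in the recommended, shared project space
   mkdir -p $PROJWORK/$proj/$USER/my_run
   cd $PROJWORK/$proj/$USER/my_run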


Installation
------------
.. _building-frontier-preparation:

Preparation
-----------

Use the following commands to download the WarpX source code and switch to the correct branch:

.. code-block:: bash

   git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx

We use the following modules and environments on the system (``$HOME/frontier_warpx.profile``).
We use the software modules and environments on the system, stored in the file ``$HOME/frontier_warpx.profile``.

.. literalinclude:: ../../../../Tools/machines/frontier-olcf/frontier_warpx.profile.example
   :language: bash
   :caption: You can copy this file from ``Tools/machines/frontier-olcf/frontier_warpx.profile.example``.
.. code-block:: bash

   cp $HOME/src/warpx/Tools/machines/frontier-olcf/frontier_warpx.profile.example $HOME/frontier_warpx.profile

We recommend storing the above lines in a file, such as ``$HOME/frontier_warpx.profile``, and loading it into your shell after each login:
Edit the 2nd line of this script, which sets the ``export proj=""`` variable.
For example, if you are a member of the project ``aph114``, then run ``vi $HOME/frontier_warpx.profile``.
Enter the edit mode by typing ``i`` and edit line 2 to read:

.. code-block:: bash

   export proj="aph114"

Exit the ``vi`` editor with ``Esc`` and then type ``:wq`` (write & quit).
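If you prefer a non-interactive edit, the same change can be made with a one-liner (a sketch, assuming the project ``aph114`` and the unmodified ``export proj=""`` line):

.. code-block:: bash

   # set the project ID in place of the empty default on line 2
   sed -i 's/^export proj=""/export proj="aph114"/' $HOME/frontier_warpx.profile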

Now, and *after every future login to Frontier*, activate the environment settings in this file:

.. code-block:: bash

   source $HOME/frontier_warpx.profile

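An optional quick check that the environment is active (exact paths and versions will vary):

.. code-block:: bash

   echo "proj=${proj}"   # should print your project ID, not an empty string
   which hipcc           # should resolve to the loaded ROCm installation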
Finally, since Frontier does not yet provide modules for some of our dependencies, install them once:

.. code-block:: bash

   bash $HOME/src/warpx/Tools/machines/frontier-olcf/install_dependencies.sh

.. _building-frontier-compilation:

Compilation
-----------

Then, ``cd`` into the directory ``$HOME/src/warpx`` and use the following commands to compile:
Change directory via ``cd`` into ``$HOME/src/warpx`` and use the following commands to compile:

.. code-block:: bash

   cd $HOME/src/warpx
   rm -rf build
   rm -rf build_frontier
   cmake -S . -B build -DWarpX_COMPUTE=HIP
   cmake --build build -j 32
   cmake -S . -B build_frontier -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON -DWarpX_DIMS="1;2;RZ;3"
   cmake --build build_frontier -j 16

The general :ref:`cmake compile-time options <building-cmake>` apply as usual.
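For example, since the profile above already loads Boost for QED lookup-table generation, that feature could be enabled at configure time (a sketch; any other option is passed the same way):

.. code-block:: bash

   cmake -S . -B build_frontier -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON -DWarpX_DIMS="1;2;RZ;3" -DWarpX_QED_TABLE_GEN=ON
   cmake --build build_frontier -j 16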

**That's it!**
A 3D WarpX executable is now in ``build/bin/`` and :ref:`can be run <running-cpp-frontier>` with a :ref:`3D example inputs file <usage-examples>`.
A 3D WarpX executable is now in ``build_frontier/bin/`` and :ref:`can be run <running-cpp-frontier>` with a :ref:`3D example inputs file <usage-examples>`.
Most people execute the binary directly or copy it out to a location in ``$PROJWORK/$proj/``.
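For instance (a sketch; the exact executable names in ``build_frontier/bin/`` depend on the chosen build options):

.. code-block:: bash

   mkdir -p $PROJWORK/$proj/$USER/my_run
   cp build_frontier/bin/warpx.3d* $PROJWORK/$proj/$USER/my_run/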

If you want to run WarpX as a Python (PICMI) script, do the :ref:`following additional installation steps <building-cmake-python>`:

.. code-block:: bash

   # PICMI build
   cd $HOME/src/warpx
   # compile parallel PICMI interfaces in 3D, 2D, 1D and RZ
   WARPX_COMPUTE=HIP WARPX_MPI=ON WARPX_PSATD=ON BUILD_PARALLEL=16 python3 -m pip install -v .

**You are all set!**
You can now :ref:`run <running-cpp-frontier>` WarpX :ref:`Python (PICMI) scripts <usage-picmi>` (see our :ref:`example PICMI input scripts <usage-examples>`).
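To confirm that the ``pywarpx`` Python package is visible to your ``python3``, a minimal check is:

.. code-block:: bash

   python3 -m pip show pywarpx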


.. _building-frontier-update:

Update WarpX & Dependencies
---------------------------

If you already installed WarpX in the past and want to update it, start by getting the latest source code:

.. code-block:: bash

   cd $HOME/src/warpx
   # read the output of this command - does it look ok?
   git status
   # get the latest WarpX source code
   git fetch
   git pull
   # read the output of this command - does it look ok?
   git status

Remove the old Python install:

.. code-block:: bash

   python3 -m pip uninstall -y pywarpx

If needed, execute the dependency install script above again.
As a last step, clean the build directory with ``rm -rf $HOME/src/warpx/build_frontier`` and rebuild WarpX.
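Put together, a typical update session is sketched below (re-running the dependency script is only needed when dependencies changed):

.. code-block:: bash

   cd $HOME/src/warpx
   git fetch && git pull
   python3 -m pip uninstall -y pywarpx
   bash Tools/machines/frontier-olcf/install_dependencies.sh
   rm -rf $HOME/src/warpx/build_frontier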


.. _building-frontier-install-dev:

Developer Workflow
------------------

Our Python bindings install all geometries of WarpX at once, which can take a while to compile.
If you are *developing*, you can do a quick PICMI install of a *single geometry* (see: :ref:`WarpX_DIMS <building-cmake-options>`) using:

.. code-block:: bash

   cd $HOME/src/warpx
   rm -rf build_frontier
   # find dependencies & configure
   cmake -S . -B build_frontier -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON -DWarpX_LIB=ON -DWarpX_DIMS=RZ
   # build and then call "python3 -m pip install ..."
   cmake --build build_frontier --target pip_install -j 16

.. _running-cpp-frontier:

@@ -131,6 +217,8 @@ Known System Issues

January, 2023 (OLCFDEV-1284, AMD Ticket: ORNLA-130):
   We discovered a regression in AMD ROCm, leading to 2x slower current deposition (and other slowdowns) in ROCm 5.3 and 5.4.
   Reported to AMD and fixed for the 5.5 release of ROCm.

   Upgrade ROCm or stay with the ROCm 5.2 module to avoid it.

June, 2023:
   Although a fix was planned for ROCm 5.5, we still see the same issue in this release and continue to exchange with AMD and HPE on the issue.

   Stay with the ROCm 5.2 module to avoid a 2x slowdown.
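To check which ROCm release the current environment points at (the path reflects the ``rocm`` module loaded in the profile):

.. code-block:: bash

   echo ${ROCM_PATH}   # e.g., ends in rocm-5.2.0 when the rocm/5.2.0 module is loaded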
6 changes: 4 additions & 2 deletions Docs/source/install/hpc/lumi.rst
Expand Up @@ -125,9 +125,11 @@ Known System Issues

January, 2023:
   We discovered a regression in AMD ROCm, leading to 2x slower current deposition (and other slowdowns) in ROCm 5.3 and 5.4.
   Reported to AMD and fixed for the 5.5 release of ROCm.

   Upgrade ROCm or stay with the ROCm 5.2 module to avoid it.

June, 2023:
   Although a fix was planned for ROCm 5.5, we still see the same issue in this release and continue to exchange with AMD and HPE on the issue.

   Stay with the ROCm 5.2 module to avoid a 2x slowdown.

.. warning::

29 changes: 17 additions & 12 deletions Tools/machines/frontier-olcf/frontier_warpx.profile.example
@@ -1,10 +1,14 @@
# please set your project account
#export proj=APH114-frontier
export proj="" # change me!

# remembers the location of this script
export WARPX_PROFILE=$(cd $(dirname $BASH_SOURCE) && pwd)"/"$(basename $BASH_SOURCE)
if [ -z "${proj}" ]; then echo "WARNING: The 'proj' variable is not yet set in your $WARPX_PROFILE file! Please edit its line 2 to continue!"; return; fi

# required dependencies
module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.2.0 # waiting for 5.5 for next bump
module load rocm/5.2.0 # waiting for 5.6 for next bump
module load cray-mpich
module load cce/15.0.0 # must be loaded after rocm

@@ -15,17 +19,18 @@ module load ninja
# optional: just an additional text editor
module load nano

# optional: for PSATD in RZ geometry support (not yet available)
#module load cray-libsci_acc/22.06.1.2
#module load blaspp
#module load lapackpp
# optional: for PSATD in RZ geometry support
export CMAKE_PREFIX_PATH=${HOME}/sw/frontier/gpu/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/frontier/gpu/lapackpp-master:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=${HOME}/sw/frontier/gpu/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/frontier/gpu/lapackpp-master/lib64:$LD_LIBRARY_PATH

# optional: for QED lookup table generation support
module load boost/1.79.0-cxx17

# optional: for openPMD support
module load adios2/2.8.3
module load cray-hdf5-parallel/1.12.2.3
module load hdf5/1.14.0

# optional: for Python bindings or libEnsemble
module load cray-python/3.9.13.1
@@ -38,10 +43,10 @@ umask 0027

# an alias to request an interactive batch node for one hour
# for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -A $proj -J warpx -t 01:00:00 -p batch -N 1 --ntasks-per-node=8 --gpus-per-task=1 --gpu-bind=closest"
alias getNode="salloc -A $proj -J warpx -t 01:00:00 -p batch -N 1"
# an alias to run a command on a batch node for up to 30min
# usage: runNode <command>
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 --ntasks-per-node=8 --gpus-per-task=1 --gpu-bind=closest"
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1"

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1
@@ -50,9 +55,9 @@ export MPICH_GPU_SUPPORT_ENABLED=1
export AMREX_AMD_ARCH=gfx90a

# compiler environment hints
export CC=$(which cc)
export CXX=$(which CC)
export CC=$(which hipcc)
export CXX=$(which hipcc)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64 ${PE_MPICH_GTL_DIR_amd_gfx90a} -lmpi_gtl_hsa"
98 changes: 98 additions & 0 deletions Tools/machines/frontier-olcf/install_dependencies.sh
@@ -0,0 +1,98 @@
#!/bin/bash
#
# Copyright 2023 The WarpX Community
#
# This file is part of WarpX.
#
# Author: Axel Huebl
# License: BSD-3-Clause-LBNL

# Exit on first error encountered #############################################
#
set -eu -o pipefail


# Check: ######################################################################
#
# Was frontier_warpx.profile sourced and configured correctly?
if [ -z "${proj}" ]; then echo "WARNING: The 'proj' variable is not yet set in your frontier_warpx.profile file! Please edit its line 2 to continue!"; return; fi


# Check $proj variable is correct and has a corresponding PROJWORK directory ###
#
if [ ! -d "${PROJWORK}/${proj}/" ]
then
echo "WARNING: The directory $PROJWORK/$proj/ does not exist!"
echo "Is the \$proj environment variable of value \"$proj\" correctly set? "
echo "Please edit line 2 of your frontier_warpx.profile file to continue!"
exit
fi


# Remove old dependencies #####################################################
#
rm -rf ${HOME}/sw/frontier/gpu

# remove common user mistakes in python, located in .local instead of a venv
python3 -m pip uninstall -qq -y pywarpx
python3 -m pip uninstall -qq -y warpx
python3 -m pip uninstall -qqq -y mpi4py 2>/dev/null || true


# General extra dependencies ##################################################
#

# BLAS++ (for PSATD+RZ)
if [ -d $HOME/src/blaspp ]
then
cd $HOME/src/blaspp
git fetch
git pull
cd -
else
git clone https://github.com/icl-utk-edu/blaspp.git $HOME/src/blaspp
fi
rm -rf $HOME/src/blaspp-frontier-gpu-build
CXX=$(which CC) cmake -S $HOME/src/blaspp -B $HOME/src/blaspp-frontier-gpu-build -Duse_openmp=OFF -Dgpu_backend=hip -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=${HOME}/sw/frontier/gpu/blaspp-master
cmake --build $HOME/src/blaspp-frontier-gpu-build --target install --parallel 16

# LAPACK++ (for PSATD+RZ)
if [ -d $HOME/src/lapackpp ]
then
cd $HOME/src/lapackpp
git fetch
git pull
cd -
else
git clone https://github.com/icl-utk-edu/lapackpp.git $HOME/src/lapackpp
fi
rm -rf $HOME/src/lapackpp-frontier-gpu-build
CXX=$(which CC) CXXFLAGS="-DLAPACK_FORTRAN_ADD_" cmake -S $HOME/src/lapackpp -B $HOME/src/lapackpp-frontier-gpu-build -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=${HOME}/sw/frontier/gpu/lapackpp-master
cmake --build $HOME/src/lapackpp-frontier-gpu-build --target install --parallel 16


# Python ######################################################################
#
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade virtualenv
python3 -m pip cache purge
rm -rf ${HOME}/sw/frontier/gpu/venvs/warpx
python3 -m venv ${HOME}/sw/frontier/gpu/venvs/warpx
source ${HOME}/sw/frontier/gpu/venvs/warpx/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade wheel
python3 -m pip install --upgrade cython
python3 -m pip install --upgrade numpy
python3 -m pip install --upgrade pandas
python3 -m pip install --upgrade scipy
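# mpi4py is built from source against Cray MPICH: the cc compiler wrapper supplies
# the MPI flags and "-shared" lets it emit the shared object the Python extension needs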
MPICC="cc -shared" python3 -m pip install --upgrade mpi4py --no-cache-dir --no-build-isolation --no-binary mpi4py
python3 -m pip install --upgrade openpmd-api
python3 -m pip install --upgrade matplotlib
python3 -m pip install --upgrade yt
# install or update WarpX dependencies such as picmistandard
python3 -m pip install --upgrade -r $HOME/src/warpx/requirements.txt
# optional: for libEnsemble
python3 -m pip install -r $HOME/src/warpx/Tools/LibEnsemble/requirements.txt
# optional: for optimas (based on libEnsemble & ax->botorch->gpytorch->pytorch)
#python3 -m pip install --upgrade torch --index-url https://download.pytorch.org/whl/rocm5.4.2
#python3 -m pip install -r $HOME/src/warpx/Tools/optimas/requirements.txt
