Executing Albany on Ride or White

Jerry Watkins edited this page Feb 8, 2017 · 1 revision

These instructions describe how to run Albany on Ride and White, the IBM Power8 GPU clusters at Sandia National Laboratories. Jobs are submitted as batch scripts to a queue manager, which runs each script once the requested resources become available.

As of February 2017, Ride and White are each split into three queues that differ in node count and GPU model:

Ride                             Queue Name  Node Names         Number of Nodes  GPU Model    Number of GPUs per Node
Firestone nodes (default queue)  rhel7F      ride7 - ride16     10               K80 (12GB)   4
Garrison nodes                   rhel7G      ride17 - ride28    12               P100 (16GB)  4
Tuleta nodes                     rhel7T      ride2 - ride5      4                K40m (12GB)  2

White                            Queue Name  Node Names         Number of Nodes  GPU Model    Number of GPUs per Node
Firestone nodes (default queue)  rhel7F      white20 - white27  8                K80 (12GB)   4
Garrison nodes                   rhel7G      white28 - white35  8                P100 (16GB)  4
Tuleta nodes                     rhel7T      white13 - white19  7                K40m (12GB)  2
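
When writing batch scripts for different queues, it can help to look up the GPU count from the queue name rather than hard-coding it. The following is a minimal sketch, not part of Albany or LSF: the function name gpus_per_node is hypothetical, and the values are copied from the February 2017 tables above, so they may be out of date. Using the result as the --kokkos-ndevices value assumes every GPU on the node should be used.

```shell
#!/bin/bash
# Hypothetical helper (not part of Albany or LSF): look up the number of
# GPUs per node for a queue, using the February 2017 values from the tables.
gpus_per_node() {
  case "$1" in
    rhel7F|rhel7G) echo 4 ;;   # Firestone (K80) and Garrison (P100) nodes
    rhel7T)        echo 2 ;;   # Tuleta (K40m) nodes
    *) echo "unknown queue: $1" >&2; return 1 ;;
  esac
}

# Example: build a --kokkos-ndevices option for the Garrison queue.
echo "--kokkos-ndevices=$(gpus_per_node rhel7G)"
```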

Ride and White use LSF as a resource manager and job scheduler. Here is a list of useful commands:

  • bsub -Is bash - Submit an interactive job to the LSF system
  • bsub < [BatchScriptFile] - Submit a batch job to the LSF system where [BatchScriptFile] refers to the batch script file being used
  • bkill [JobID] - Kill a pending or running job, where [JobID] is the job number reported by bsub and bjobs
  • bjobs - See the status of user jobs in the LSF queue
  • bjobs -u all - See the status of all jobs in the LSF queue
  • bqueues - Information about LSF batch queues
  • bqueues -l - More detailed information about the settings for each queue

A useful reference for LSF commands can be found here.
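
The commands above are often chained: submit a job, capture its job number, then query or kill it. The sketch below shows one way to capture the job number. The function name parse_job_id is hypothetical, and it assumes bsub prints a confirmation of the form "Job <12345> is submitted to queue <rhel7G>."; check the output on your system before relying on it.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from bsub's confirmation
# line. Assumes the message has the form "Job <12345> is submitted to ...".
# Prints nothing if the input does not match that form.
parse_job_id() {
  sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Demonstration on a sample message (no LSF installation needed).
echo "Job <12345> is submitted to queue <rhel7G>." | parse_job_id
```

On the cluster, this could be used as jobid=$(bsub < [BatchScriptFile] | parse_job_id), followed by bjobs "$jobid" to check the job's status or bkill "$jobid" to cancel it.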

Executing MPI+GPU jobs with Kokkos::Cuda

The following script runs Albany with 8 MPI ranks across 2 nodes (4 ranks per node). Since each pair of GPUs is attached to one socket, --map-by ppr:2:socket places 2 MPI ranks on each socket, and --kokkos-ndevices=4 sets the number of GPUs used per node.

#!/bin/bash -login

#BSUB -J MPIGPUjob          # Job Name
#BSUB -o MPIGPUjob.%J.out   # Standard output filename (%J is the job number)
#BSUB -e MPIGPUjob.%J.err   # Standard error filename
#BSUB -q rhel7G             # Queue Name
#BSUB -m "ride27 ride28"    # Node Names
#BSUB -n 8                  # Number of processors
#BSUB -R "span[ptile=4]"    # Number of processors per node
#BSUB -W 02:00              # Runtime limit [Hours]:[Minutes]
#BSUB -x                    # No other jobs can run on this node

# Disable core dumps to limit disk usage
ulimit -c 0

# Load modules
source ${HOME}/Albany/doc/ride-white/modules_cuda.sh

# Run MPIGPU job
mpirun -n 8 --map-by ppr:2:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=4
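
The rank counts in the script above follow from simple arithmetic, which the sketch below spells out for other node counts or queues. It is an illustration only: the function name mpirun_layout is hypothetical, it assumes one MPI rank per GPU, and it assumes 2 sockets per node by default (inferred from 4 GPUs per node with one GPU pair per socket). The values passed to #BSUB -n and span[ptile=...] would need to change to match.

```shell
#!/bin/bash
# Hypothetical helper: print an mpirun command line for one MPI rank per GPU.
# Arguments: number of nodes, GPUs per node, sockets per node (assumed 2).
mpirun_layout() {
  local nodes=$1 gpus=$2 sockets=${3:-2}
  local ranks=$(( nodes * gpus ))          # total MPI ranks (-n)
  local per_socket=$(( gpus / sockets ))   # ranks per socket (ppr)
  echo "mpirun -n ${ranks} --map-by ppr:${per_socket}:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=${gpus}"
}

# Two Garrison nodes with 4 GPUs each reproduce the command used above.
mpirun_layout 2 4
```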