
Important HPL GPU and CALDGEMM options


With the new dynamic configuration system, many options are set automatically, and almost everything can be configured at runtime. All options required for standard use are shown in the compile time config sample and the run time config sample. This guide goes through all the relevant options in detail:


Compile time configuration (Make.Generic.Options)

#Any custom options here
HPL_DEFS      += 

You can place any HPL compile time configuration constant here. See the HPL compile time constant readme for a full list of options.
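
For example, a hypothetical line enabling HPL's detailed timing statistics (one of the constants from that readme, also enabled by verbosity level 3 below) would look like this:

#Enable HPL's detailed timing statistics
HPL_DEFS      += -DHPL_DETAILED_TIMING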


#This option defines whether HPL-GPU is compiled with
#MPI for multi-node support
HPL_CONFIG_MPI = 0

Set to 1 to enable MPI. If set to 0, all MPI calls are replaced by a single-node MPI mockup, so you do not need to install or link an MPI library at all. HPL-GPU supports several MPI implementations; the default is OpenMPI. For other MPI backends, you have to change the low level build file Make.Generic.


#This option defines the verbosity level of HPL-GPU.
#The setting ranges from 0 (no output) to 4 (very verbose), default is 3.
#For many-node runs the suggested setting is 1.
HPL_CONFIG_VERBOSE = 3

Verbosity levels are:

  • 0: No verbosity
  • 1: Report progress in one line per HPL iteration.
  • 2: Also report DGEMM performance for each iteration and for each node (disables CALDGEMM cal_info.Quiet).
  • 3: Also report timing statistics for the CPU tasks (enables the -DHPL_DETAILED_TIMING -DCALDGEMM_TEST options).
  • 4: Also report detailed timing for all pipeline steps in Lookahead 2/3 (enables -DCALDGEMM_TEST2).

#Add git commit hashes to binary
HPL_GIT_STATUS = 1

Adds text to the standard output of HPL that shows the git commit and branch of both HPL and CALDGEMM, whether the build tree was clean, and the compile time.


#Select the backend for CALDGEMM, the default is opencl
#(possible options: cal, cuda, opencl, cpu)
HPL_CALDGEMM_BACKEND = opencl

Select the CALDGEMM backend used for HPL. The respective backend must be enabled in the CALDGEMM config file config_options.mak. Possible options are: opencl, cuda, cal, cpu. Be advised that the environment variable $AMDAPPSDKROOT or $CUDA_PATH, respectively, must be set.


HPL_USE_LTO   = 1
#HPL_AGGRESSIVE_OPTIMIZATION = 1

HPL_USE_LTO enables GCC link time optimization. If you enable this option, you might also want to enable LTO for CALDGEMM in config_options.mak, because CALDGEMM is linked statically.

HPL_AGGRESSIVE_OPTIMIZATION enables more aggressive compiler optimizations. The aggressive and non-aggressive levels are defined in Make.Generic.


#Use AVX LASWP implementation
HPL_DEFS      += -DHPL_LASWP_AVX 

There are three implementations of LASWP provided by HPL-GPU:

  • An AVX2 version that requires an Intel Haswell CPU or newer.
  • An SSE version for all other CPUs.
  • The legacy x86 version that comes with standard HPL.

You should enable HPL_LASWP_AVX on Haswell CPUs or newer. Otherwise, HPL-GPU will use the SSE version. (You can use the compile-time constant -DUSE_ORIGINAL_LASWP for the legacy x86 version. On AMD CPUs you might also want to enable -DHPL_HAVE_PREFETCHW.)
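
As a hedged example, a configuration for a pre-Haswell AMD CPU might look like the following (both constants are taken from the options above; whether -DHPL_HAVE_PREFETCHW helps depends on the CPU):

#AVX LASWP disabled on this pre-Haswell CPU; use the PREFETCHW hint instead
#HPL_DEFS     += -DHPL_LASWP_AVX
HPL_DEFS      += -DHPL_HAVE_PREFETCHW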


#This setting links HPL to libcpufreq, allowing to change CPU frequencies
#during runtime. This can be used to obtain better efficiency.
#HPL_DEFS     += -DHPL_CPUPOWER

Link HPL-GPU to libcpupower. This is required to use the setcpufreq() function in HPL_CUSTOM_PARAMETER_CHANGE or HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM. If you want to use the older libcpufreq, please set -DHPL_CPUFREQ.


#These settings can alter certain parameters during runtime. The first
#setting is inside HPL, the second for CALDGEMM parameters. The example
#alters the CPU frequency over time and disables some GPUs towards the
#end for better efficiency.
#HPL_DEFS     += -DHPL_CUSTOM_PARAMETER_CHANGE="if (j == startrow) setcpufreq(3000000, 3000000);"
#HPL_DEFS     += -DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM="cal_dgemm->SetNumberDevices(factorize_first_iteration || global_m_remain > 50000 ? 4 : (global_m_remain > 35000 ? 3 : (global_m_remain > 25000 ? 2 : 1)));if (curcpufreq >= 2200000 && global_m_remain > 100000) setcpufreq(2200000, 1200000); else if (global_m_remain < 50000 && global_m_remain >= 50000 - K) setcpufreq(2700000, 1200000); else if (global_m_remain < 70000 && global_m_remain >= 70000 - K) setcpufreq(2400000, 2400000); else if (global_m_remain < 85000 && global_m_remain >= 85000 - K) setcpufreq(2200000, 2200000); else if (global_m_remain < 95000 && global_m_remain >= 95000 - K) setcpufreq(2000000, 2000000);"
#This second example alters the AsyncDGEMMThreshold with the remaining matrix
#size, and adapts the pipeline mid marker -Aq dynamically
#HPL_DEFS      += -DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM="if (factorize_first_iteration) {cal_info.AsyncDGEMMThreshold = 256;} else if (global_m_remain < 125000) {cal_info.AsyncDGEMMThreshold = 480;} else if (global_m_remain < 155000) {cal_info.AsyncDGEMMThreshold = 1920;} else {cal_info.AsyncDGEMMThreshold = 3840;} if (factorize_first_iteration || global_m_remain > 50000) cal_info.PipelinedMidMarker = 25000; else cal_info.PipelinedMidMarker = global_m_remain / 2 + 1000;"

#As above, but this time we modify some factorization parameters over time.
#(Relevant for older AMD processors)
#HPL_DEFS     += -DHPL_CUSTOM_PARAMETER_CHANGE="ALGO->nbmin = j > 45000 ? 64 : 32;"

The HPL_CUSTOM_PARAMETER_CHANGE and HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM constants are actually C++ code that is inserted at the beginning of each HPL-GPU iteration. The HPL_CUSTOM_PARAMETER_CHANGE constant is placed in HPL's pdgesv function and can thus manipulate HPL parameters, such as the current NB. The HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM constant is placed inside the CALDGEMM call and has access to the cal_info struct, so it can manipulate CALDGEMM settings.

In both cases, you have access to the global_m_remain variable which defines the total number of columns that remain in the factorization. This should be used to change values at a particular point in time during the run. If the HPL_CALDGEMM_ASYNC_FACT_FIRST option is enabled, global_m_remain will be set to 1 during the factorization of the first iteration, which is the only factorization step that does not run in parallel to GPU DGEMM with lookahead. You can access the factorize_first_iteration variable to detect this case.
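
A minimal sketch (with hypothetical thresholds and a hypothetical 4-GPU node) that uses these variables to reduce the number of GPUs towards the end of the run could look like this; the SetNumberDevices() pattern is taken from the longer example above:

#Use all 4 GPUs during the first factorization and while more than 30000
#columns remain, then drop to 2 GPUs for the tail of the run
HPL_DEFS      += -DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM="cal_dgemm->SetNumberDevices(factorize_first_iteration || global_m_remain > 30000 ? 4 : 2);"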


#Temperature threshold when run is aborted. Requires CALDGEMM to be
#compiled with ADL support.
#HPL_DEFS     += -DHPL_GPU_TEMPERATURE_THRESHOLD=92.

This sets a threshold for the GPU temperature. As soon as the temperature of the hottest GPU in the system exceeds HPL_GPU_TEMPERATURE_THRESHOLD, the HPL process is killed. The GPU temperature is queried via the ADL library, which must be installed for this purpose; at the moment this only works with AMD GPUs.


Runtime configuration (HPL-GPU.conf)

#Custom CALDGEMM Parameters:
#HPL_PARAMDEFS:

Use HPL_PARAMDEFS to pass arbitrary CALDGEMM options using CALDGEMM's dgemm_bench [parameter syntax](CALDGEMM Command Line Options).
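
For instance, a hypothetical line passing two of the options discussed further below (restricting CALDGEMM to 2 GPUs and pinning the management thread to core 0) would be:

HPL_PARAMDEFS: -Y 2 -K 0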


#In case the selected backend is OpenCL, this setting chooses the
#3rd party library file that provides the OpenCL kernels.
HPL_PARAMDEFS: -Ol amddgemm.so

#This sets the GPU ratio, i.e. the fraction of the workload performed by the
#GPU. The following options exist:
#1.0: All work performed by GPU,
#0.0 <= GPURatio < 1.0: Fixed setting of GPU Ratio,
#-1.0: Fully automatic GPU Ratio calculation,
#-1.0 < GPURatio < 0.0: This defines the negative initial GPU Ratio; during
#runtime the ratio is then dynamically adapted. This means e.g. GPURatio =
#-0.9 will start with a ratio of 0.9, and then adapt it dynamically.
#The suggested settings are: 1.0 for GPU only and for efficiency optimized
#cases; In any other case a negative initial setting is suggested.
#Please see also the advanced GPURatio options below.
HPL_PARAMDEFS: -j 1.0

Be aware that there are more related settings (-jm, -jt, -js, -jl, -jp, -jq) below.
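
As a sketch of the dynamic mode described above, the following (hypothetical) setting starts with a GPU ratio of 0.9 and lets CALDGEMM adapt it during the run:

HPL_PARAMDEFS: -j -0.9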


#This defines the CPU core used for the management thread (-K) and for the
#broadcast thread (-Kb: -2 = MPI affinity). If unset this is autodetected.
#Optionally set -KG and -KN
HPL_PARAMDEFS: -K 0 -Kb -2
#HPL_PARAMDEFS: -KG 0 -KN 20

Please refer to Thread to core pinning in HPL and CALDGEMM for an overview of all CPU pinning options.


#Local matrix size for which alternate lookahead is activated. Usually you
#want to leave this infinitely high. For optimized efficiency, always
#activate. For optimized performance, you can lower this setting in some
#cases. Optimal value must be tuned manually.
HPL_PARAMDEFS: -Ca 10000000

See section 3b of HPL Tuning for a reference on how to obtain the optimal AlternateLookahead settings.
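
A performance-tuned configuration might therefore use a finite, system-specific threshold instead; the value below is purely a placeholder that must be tuned manually:

#Activate alternate lookahead once the local matrix size drops below 60000
HPL_PARAMDEFS: -Ca 60000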


#Use the fast random number generator (faster initialization of the program,
#should be disabled for official runs)
#0: Disabled,
#1: Only fast initialization (cannot verify),
#2: Fast Initialization and Verification (default)
#HPL_FASTRAND: 2

HPL's standard random number generator takes very long (about half an hour) for large matrices of 256 GB and more. HPL-GPU has a faster multi-threaded RNG which will, however, produce different numbers. For official runs, you should use the standard RNG and set this to 0. For testing, you should set it to 2. (There is in principle no reason to set this to 1, because in that case the HPL-GPU result cannot be verified.)


#You can set several thresholds. If the remaining global matrix dimension is
#above the n-th threshold, the current NB for the next iteration is
#multiplied by the n-th multiplier
#HPL_NB_MULTIPLIER_THRESHOLD: 20000;10000
#HPL_NB_MULTIPLIER: 3;2

Tuning of this parameter is explained in section 4 of HPL Tuning.
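
As a worked example (assuming a base NB of 480, which is only an illustrative value): with the commented settings above, the next iteration uses NB = 3 * 480 = 1440 while the remaining global matrix dimension is above 20000, NB = 2 * 480 = 960 while it is above 10000, and the plain NB = 480 below that.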


#Restrict CALDGEMM to 2 GPUs.
#HPL_PARAMDEFS: -Y 2 -Yu      //For using -Yu, either set -Ya as well, or
                              //use SetNumberDevices in
                              //HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM of
                              //Make.Generic.Options file.

Here you can set the number of devices to use; otherwise CALDGEMM will use all devices. This is helpful for tests and needed if you want to run multiple MPI processes per node as explained in Example 5 of Thread to core pinning in HPL and CALDGEMM. Towards the end of the run, you might want to reduce the number of GPUs, mostly to improve power efficiency. For this, you can use the SetNumberDevices() function in HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM, and you can use the -Ya parameter here to test this. In that case, the -Yu parameter might yield better async factorization / DTRSM performance, because it will use the GPUs not used for GPU DGEMM for these purposes. (Async factorization / DTRSM is enabled via the -Oa and -Od settings as described below.)


#Pin MPI runtime threads to these CPU core(s)
HPL_MPI_AFFINITY: 1           //3 * number of physical cores per socket
                              //(= 2nd virtual core of 1st core on socket 2)

This sets the affinity of the MPI runtime threads. Consider that there is also the CALDGEMM broadcast thread which should have a related affinity. See Thread to core pinning in HPL and CALDGEMM for details.


#Change the thread order for NUMA architectures:
#0: Default order,
#1: Optimized for CPU with Hyperthreading and interleaved virtual cores,
#2: Optimized for other NUMA systems.
#It might make sense to just try all settings
#HPL_PARAMDEFS: -: 0

Details are explained in Thread to core pinning in HPL and CALDGEMM.


#Mostly on Intel CPUs with Hyperthreading, it might make sense to limit the
#number of LASWP cores to the number of physical cores - n (with n the number
#of caldgemm threads, 1 for OpenCL)
#HPL_NUM_LASWP_CORES: 19

On a system with Hyperthreading and with -Oc 1, you will usually want to set this to the number of physical CPU cores minus 1 (one core is subtracted for the GPU-related thread). See Thread to core pinning in HPL and CALDGEMM for details.
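
As a worked example matching the hardware used elsewhere on this page (two 10-core CPUs with Hyperthreading): 2 * 10 = 20 physical cores, minus 1 core for the GPU-related thread, gives the commented sample value:

HPL_NUM_LASWP_CORES: 19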


#NUMA Memory interleaving:
#0 -> Disabled,
#1 -> Interleave all memory,
#2 -> Interleave matrix-memory (default)
#HPL_INTERLEAVE_MEMORY: 2

In general, you will want to set this to 2 on every NUMA system. The only exception is if you run multiple MPI processes per node as explained in example 5 of Thread to core pinning in HPL and CALDGEMM. The setting 1 is usually just a slightly worse alternative to 2.


#You can reorder the GPU device numbering. In general, it is good to
#interleave NUMA nodes, i.e. if you have 2 NUMA nodes, 8 GPUs, GPUs
#0 to 3 on node 0, GPUs 4 to 7 on node 1, the below setting is suggested.
#Keep in mind that the altered numbering affects other settings relative
#to GPU numbering, e.g. GPU_ALLOC_MAPPING
#HPL_PARAMDEFS: -/ 0;4;2;6;1;5;3;7 

#Define CPU cores used to allocate GPU related memory. You should choose
#the correct NUMA nodes. For the above 8-GPU example and 2 10-core CPUs
#(with Hyperthreading) the following would be good:
#HPL_PARAMDEFS: -UA0 0 -UA1 39 -UA2 0 -UA3 39 -UA4 0 -UA5 39 -UA6 0 -UA7 39

#Apply only to odd MPI ranks: Exclude some cores from CALDGEMM / HPL-GPU
#!%2,1 HPL_PARAMDEFS: -@ 8;9;18;19

This example excludes CPU cores 8, 9, 18, and 19 from HPL and CALDGEMM usage. It uses the special syntax !%2,1 to apply this setting only to odd-numbered MPI ranks of a multi-node run. See Section 1 of HPL Tuning for an explanation of the syntax.
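
Assuming the modulo syntax works the same way for other remainders (this is an assumption; see Section 1 of HPL Tuning for the authoritative description), a counterpart for even-numbered ranks excluding different, hypothetical cores would be:

#Apply only to even MPI ranks
#!%2,0 HPL_PARAMDEFS: -@ 6;7;16;17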


#SimpleGPUScheduling mode: 0 / commented out: standard round-robin use of
#command queues - 1 / use dedicated queues for send, receive, kernels.
#For AMD GPUs you want to enable this!
#HPL_PARAMDEFS: -OQ

CALDGEMM provides various GPU command queue scheduling modes:

  • Standard: (Enabled via -Oq 0): 3 queues are used in a round robin fashion, each queue performs transfers in both directions and kernel executions. Buffers are synchronized by the host.
  • SimpleQueueing: (Enabled via -Oq, default): As Standard, but the buffers are synchronized by the GPU.
  • AlternateSimpleQueueing: (Enabled via -Oq -OQ, or only by -OQ): Uses GPU synchronization as SimpleQueueing and also uses three command queues, but queue 0 is only used for transfers to the device, queue 1 only for transfers to the host, and queue 2 only for kernels. This makes sure that at every point in time only one kernel can be running, which fixes performance problems on some GPUs. This can improve results on AMD GPUs and usually yields worse results on NVIDIA GPUs. Alternate queueing has the additional benefit of better overlapping of sends to and receives from the GPU.
  • AlternateSimpleMultiQueueing: (Enabled via -Oq -OQ -OM, or only by -OM): As the alternate queueing, but it uses multiple queues for the kernel calls. It is thus similar to SimpleQueueing but has the improved transfer overlapping of AlternateSimpleQueueing. Still, it is the most complex mode with the most synchronization.

#Uncomment the following to support lookahead 3 and set -Aq appropriately.
#If you have sufficient GPU memory, enable also -Ab 1
#HPL_PARAMDEFS: -Ap -Aq 25000 -Ab 1

These options need to be enabled for Lookahead 3. -Ap enables lookahead 3 support in general. -Aq sets the mid marker where lookahead 3 starts, i.e. as soon as the previous iteration has finished its DGEMM up to the column set via -Aq, the current iteration initializes the DTRSM pipeline up to this column. Usually, this value should be at least 4-5 times the block size Nb, but it should be less than half the remaining matrix size. Ideally, the value of -Aq is changed dynamically via HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM as in the example above.


#These options can define cutoff points (remaining local matrix size)
#where the Lookahead and Lookahead 2 features are turned off. (Needed
#for older AMD processors for better performance).
#HPL_DISABLE_LOOKAHEAD: 6000
#HPL_LOOKAHEAD2_TURNOFF: 4000
#HPL_LOOKAHEAD3_TURNOFF: 0

#Tool to find the duration of the core phase of HPL, needed to measure
#power consumption and power efficiency.
#HPL_DURATION_FIND_HELPER

This option will print a timestamp directly before the start of the core phase of Linpack and directly after it. These timestamps can be used to synchronize a power meter so that power is measured during exactly the core phase. In addition, this option adds two 10 second periods of inactivity before and after the core phase. These give an obvious signature, such that one can verify whether the timestamps are synchronized correctly.


#This defines the CPU core used for the device runtime. (-2 = same
#as main thread (default), -1 = all cores available, >= 0 = select core)
#HPL_PARAMDEFS: -tr -1

See Thread to core pinning in HPL and CALDGEMM for details on thread pinning.


#In multi-node runs, the factorization causes significant CPU load on
#some but not all nodes. Caldgemm tries to take this into account for
#automatic gpu ratio calculation, but sometimes this fails.
#In this case, the following setting can define a minimum GPU ratio
#(GPURatioDuringFact, -jf) in iterations where the node performs the
#factorization. See also -jm (GPURatioMax), -jt (GPURatioMarginTime),
#-js (GPURatioMarginTimeDuringFact), -jl (GPURatioLookaheadSizeMod),
#-jp (GPURatioPenalties), -jq (GPURatioPenaltyFactor). If you do not
#want them to interfere with ratio calculation, set them all to 0!
#HPL_PARAMDEFS: -jf 0.9

See CALDGEMM Command Line Options for a detailed list of all the -j and -j* parameters related to the splitting of the matrix between CPU and GPU.
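
A hedged example following the comment above: keep a minimum GPU ratio during factorization iterations and set all margin/penalty parameters to 0 so they do not interfere with the ratio calculation (the -jf value is a placeholder to be tuned):

HPL_PARAMDEFS: -jf 0.9 -jt 0 -js 0 -jl 0 -jp 0 -jq 0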


#Offload some steps during the factorization to the GPU. Offloading
#is active as soon as local matrix size is lower than the defined
#parameter. For best efficiency, you want to enable this always,
#i.e. set the parameter very high. For best performance, parameter
#must be tuned manually.
#HPL_CALDGEMM_ASYNC_FACT_DGEMM: 10000000
#HPL_CALDGEMM_ASYNC_FACT_FIRST
#HPL_PARAMDEFS: -Oa
##HPL_PARAMDEFS: -Or 480    //Can be handled by
                            //HPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM
                            //in Make.Generic.Options file

The -Oa option creates an additional asynchronous side-queue in CALDGEMM that can run small DGEMMs asynchronously. The idea is to offload some of the large DGEMM calls during Linpack's factorization (which are still very small compared to the large update DGEMM that usually runs on the GPU) onto the GPU. This can reduce CPU load and thus improve overall performance at the end of the run, when the CPU is the bottleneck. It also increases power efficiency. There are several options to tune this in more detail (a combined sketch follows the list below):

  • HPL_CALDGEMM_ASYNC_FACT_DGEMM: defines a threshold for the maximum local matrix size for which the async queue is used. The idea is not to use it for very large matrices, but only once the matrix size has dropped below HPL_CALDGEMM_ASYNC_FACT_DGEMM.
  • -Or sets a minimum threshold for the m, n, and k parameters of a DGEMM for GPU offload. If any of these parameters is below that threshold, the GPU is not used, because the GPU is not suited for very small DGEMMs.
  • HPL_CALDGEMM_ASYNC_FACT_FIRST: in the very first factorization iteration of Linpack, there is no GPU DGEMM running in parallel. This setting offloads DGEMMs during this first factorization to the GPU regardless of the HPL_CALDGEMM_ASYNC_FACT_DGEMM setting.
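
A combined sketch with hypothetical thresholds (the option names are taken from the sample above; the values must be tuned per system):

#Offload factorization DGEMMs to the async GPU queue once the local matrix
#size is below 100000, never offload DGEMMs with m, n, or k below 480, and
#always offload during the very first factorization
HPL_CALDGEMM_ASYNC_FACT_DGEMM: 100000
HPL_CALDGEMM_ASYNC_FACT_FIRST
HPL_PARAMDEFS: -Oa -Or 480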

#Same as above, but for the large Update-DTRSM (and DTRSM inside
#the factorization below). Optimal value for performance and
#efficiency is usually identical. Value must be tuned manually.
#(You can use HPL_CALDGEMM_ASYNC_DTRSM_MIN_NB in combination
#with HPL_NB_MULTIPLIER to disable the async dtrsm below a certain NB).
#HPL_CALDGEMM_ASYNC_DTRSM: 215000
#HPL_PARAMDEFS: -Od
#HPL_CALDGEMM_ASYNC_FACT_DTRSM: 0
#HPL_CALDGEMM_ASYNC_DTRSM_MIN_NB: 480

This is very similar to the async DGEMM above:

  • -Od is needed to enable asynchronous DTRSM at all.
  • -Or sets the threshold for the minimum m and k parameters, as for the async DGEMM above.
  • HPL_CALDGEMM_ASYNC_FACT_DTRSM: sets the threshold for the trailing matrix size to run the async DTRSM during the factorization. 0 disables it.
  • HPL_CALDGEMM_ASYNC_DTRSM: sets the threshold for the trailing matrix size for the DTRSM outside the factorization.
  • HPL_CALDGEMM_ASYNC_DTRSM_MIN_NB: use in combination with HPL_NB_MULTIPLIER; sets a threshold for the minimum blocking size Nb required to run the async DTRSM.

#Preallocate some data to avoid unnecessary dynamic mallocs,
#define number of BBuffers to save memory. PreallocData should be
#larger than local matrix size / Height. max_bbuffers should be larger
#than local matrix size / (Number of GPUs * Height).
#HPL_PARAMDEFS: -Op 60 -bb 17

Within each HPL iteration, CALDGEMM has to allocate certain buffers. The -Op setting preallocates these buffers. You have to provide the maximum number of buffers required as its argument. This can be calculated as matrix size / CALDGEMM height (the -h parameter, which is usually set automatically by the DGEMM backend; normal values of -h are between 2048 and 4096).

-bb sets the number of BBuffers. There should be at least (remaining matrix size / (number of GPUs * value of -h)) BBuffers.
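
A rough worked example with made-up numbers: for a local matrix size of 230000 and -h 4096, 230000 / 4096 ≈ 57, and with 4 GPUs 230000 / (4 * 4096) ≈ 15, so the sample values above (-Op 60 and -bb 17) leave a little headroom.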


#Minimize the CPU workload, as soon as the local matrix size is below
#this threshold. Only applicable if GPURatio < 1.0. This can be
#used to help the GPU ratio estimation, and forces GPURatio to 1.0
#for small matrices.
#HPL_PARAMDEFS: -Cm 2000000

The -Cm parameter disables the CPU fraction of the update DGEMM as soon as the local matrix size is below the given threshold (and thus overrides the -j parameter and all -j* parameters). This can be used to make sure that at the end of the run, when Linpack is CPU limited, all DGEMM workload is offloaded to the GPU.


#Enable (default) / disable HPL warmup iteration
#HPL_WARMUP

This setting runs one warmup iteration before HPL-GPU starts the actual benchmark. This avoids some of the slowdown effects sometimes caused the first time a task is performed, such as the allocation of temporary buffers in the GPU driver or for MPI. It is in general good to leave this enabled.