
AutoGemm

I - Introduction to AutoGemm

AutoGemm is a Python tool for writing Gemm kernels in OpenCL and selecting the fastest available kernel for any Gemm. It supports all the major gemm parameters: precisions (single, double, single-complex, double-complex), orders (column-major, row-major) and transposes (no transpose, transpose, conjugate transpose).

Within clBLAS, CMake has been programmed to use AutoGemm to produce the kernels which clBLAS uses as its gemm backend. In this context, AutoGemm comes preconfigured to produce a variety of kernels to handle all gemm parameters and most matrix sizes. For developers wanting to support unique matrix sizes, see Section IV on how to customize AutoGemm for different applications.

II - Achieving Peak Gemm Performance

In order for Gemm to achieve peak floating-point performance on GPUs, an implementation must address the following:

  1. global memory bandwidth isn't a bottleneck
  2. global memory latency isn't a bottleneck
  3. local memory bandwidth isn't a bottleneck
  4. local memory latency isn't a bottleneck
  5. kernel launch overhead is low
  6. highest percentage of GPU cycles are spent on MADD operations
  7. special considerations for "small" matrices
  8. special considerations for "skinny" matrices

For the W9100, peak double-precision floating-point throughput is 2.62 TFlop/s, while the memory throughput is 320 GB/s. Therefore, in order to achieve peak floating-point performance, the implementation must do about 66 floating-point operations per element loaded.
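
A quick back-of-the-envelope check of that ratio, using only the two figures quoted above:

# W9100 peak numbers quoted above
flops_per_sec = 2.62e12       # 2.62 TFlop/s, double precision
bytes_per_sec = 320e9         # 320 GB/s global memory bandwidth
bytes_per_element = 8         # sizeof(double)

elements_per_sec = bytes_per_sec / bytes_per_element  # 40e9 elements/s
print(flops_per_sec / elements_per_sec)               # 65.5, rounded up to 66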

1. Global Memory Bandwidth as Bottleneck

A naive gemm implementation (real precision) performs 2*M*N*K floating-point operations and loads 2*M*N*K elements; that's a ratio of 1:1 while we need a ratio of 66:1 to achieve peak performance. To reduce the number of global memory loads we use local memory tiles.

A GxH local-memory tile reduces the number of loads from global memory to (M/G)*(N/H)*(G+H)*K: each of the (M/G)*(N/H) work-groups loads a GxK strip of A and an HxK strip of B. Tiling local memory overcomes the global memory bandwidth bottleneck as long as 2GH/(G+H) > 66. Here are some of the smallest GxH tiles which could achieve peak performance; a short sketch verifying this criterion follows the list.

  • 67x67
  • 56x80
  • 52x90
  • 50x100
  • 40x200
  • 37x300
  • 36x400
  • 35x600
  • ...
  • 33xinfinity
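
A minimal sketch verifying the listed tiles against the reuse criterion (the values land at or just above the ~65.5 flops/element actually required, which the text rounds to 66):

def reuse(g, h):
    # flops per element loaded for a GxH macrotile: 2*G*H / (G + H)
    return 2.0 * g * h / (g + h)

for g, h in [(67, 67), (56, 80), (52, 90), (50, 100),
             (40, 200), (37, 300), (36, 400), (35, 600)]:
    print(f"{g}x{h}: {reuse(g, h):.1f} flops/element")
# as H grows without bound, 2GH/(G+H) approaches 2G, hence the 33xinfinity entry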

Because a GPU also uses a cache to reuse data, peak performance may also be achieved with a smaller tile if the cache hit rate is high enough.

As a GPU's memory bandwidth increases, the minimum tile size to hide memory bandwidth decreases.

2. Global Memory Latency as Bottleneck

GPUs hide the high latency of accessing global memory by having a high occupancy of work-groups on each compute unit. The occupancy of AutoGemm's kernels is limited by register usage (which grows with the size of the microtile).

3. Local Memory Bandwidth as Bottleneck

A work-item responsible for a PxQ microtile loads P+Q elements from local memory then performs P*Q MADDs; that's a ratio of 2PQ/(P+Q) flops per element loaded. The W9100 has an LDS bandwidth of 5240 GB/s and a floating-point throughput of 2620 GFlop/s; at 8 bytes per double, that is a ratio of 4 flops per element. This would require a microtile size PxQ where 2PQ/(P+Q) >= 4, i.e., PQ/(P+Q) >= 2; the smallest such microtiles would be:

  • 4x4
  • 3x6
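
A quick check of those candidates, using the same reuse formula as the macrotile case (the 2x2 and 2x4 entries are added here only to show tiles that fail):

def lds_reuse(p, q):
    # flops per LDS element loaded for a PxQ microtile: 2*P*Q / (P + Q)
    return 2.0 * p * q / (p + q)

for p, q in [(4, 4), (3, 6), (2, 2), (2, 4)]:
    verdict = "ok" if lds_reuse(p, q) >= 4 else "too small"
    print(f"{p}x{q}: {lds_reuse(p, q):.2f} flops/element, {verdict}")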

However, the analysis above assumes that each work-item requests a different element from LDS. We organize our LDS loads such that, for an RxS work-group, all S work-items in a row load the same A value and all R work-items in a column load the same B value from LDS. This means that for each iteration over K we do G*H MADDs but we only load:

  • R*S*(P/S + Q/R) elements
  • = P*R + Q*S elements
  • = G + H elements

So our implementation does 2GH flops per G+H loads. The criterion 2GH/(G+H) > 4 is less restrictive than the criterion from (1); therefore, any tile which satisfies (1) will also ensure that LDS bandwidth won't be the bottleneck.
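
A tiny numeric check of the equalities above, with hypothetical work-group and microtile sizes (G = R*P and H = S*Q by the definition of the macrotile):

R, S = 16, 16        # work-group rows, cols (hypothetical)
P, Q = 4, 4          # microtile rows, cols (hypothetical)
G, H = R * P, S * Q  # macrotile: 64x64

per_work_group = R * S * (P / S + Q / R)  # loads per K iteration, broadcast included
assert per_work_group == P * R + Q * S == G + H
print(G + H, "LDS loads for", 2 * G * H, "flops per K iteration")  # 128 and 8192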

4. Local Memory Latency as Bottleneck

GPUs hide the latency of accessing local memory by having a high occupancy of work-groups on each compute unit. The occupancy of AutoGemm's kernels is limited by register usage (from the size of the microtile). Currently, it is believed that the peak performance of AutoGemm kernels is limited by not having high enough occupancy to hide the latency of local memory.

5. Kernel Launch Overhead is Low

Enqueueing a kernel has a relatively high time overhead. It is therefore advantageous to enqueue as few kernels as possible: ideally just one.

A Gemm can be calculated by one kernel using a GxH macrotile if M is a multiple of G and N is a multiple of H; this is ideal (assuming GxH is a high-performing tile).

When MxN is not an exact multiple of a high-performing tile, two main options exist. First, put branches in the kernel so that each work-group can compute less than a full tile; this slows performance, as discussed in (6). Second, enqueue one kernel to handle the portion of MxN that is a multiple of GxH, then enqueue 1-3 additional kernels to handle the extra rows and columns which were not a multiple of GxH. AutoGemm is architected to use the second option, sketched below.
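
A minimal sketch of that partitioning (the function and kernel names are illustrative, not AutoGemm's actual API; in clBLAS the generated gemmSelectKernel() supplies the corresponding kernels):

def plan_kernels(M, N, G, H):
    """Return the (rows, cols) region of C each enqueued kernel covers."""
    main_m, main_n = (M // G) * G, (N // H) * H  # largest multiple of GxH
    kernels = []
    if main_m and main_n:
        kernels.append(("tile", main_m, main_n))
    if main_m and N - main_n:
        kernels.append(("column", main_m, N - main_n))  # leftover columns
    if M - main_m and main_n:
        kernels.append(("row", M - main_m, main_n))     # leftover rows
    if (M - main_m) and (N - main_n):
        kernels.append(("corner", M - main_m, N - main_n))
    return kernels

print(plan_kernels(1000, 1000, 96, 96))
# [('tile', 960, 960), ('column', 960, 40), ('row', 40, 960), ('corner', 40, 40)]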

6. Highest Percentage of GPU Instructions are MADDs

The last step in reaching peak performance is to minimize the number of branch instructions (ifs and loops) and memory instructions. AutoGemm therefore pulls as many branches as possible outside of the kernel. For example, rather than having an if/else for whether matrix A is transposed, we write one kernel for the transpose-A case and another kernel for the non-transpose-A case.

The only remaining branch is the loop over K, whose overhead is reduced by unrolling (see the unroll parameter in Section III).

7. Special Considerations for "Small" Matrices

Here, a small matrix is one which, when partitioned into GxH (peak-performing) tiles, produces too few work-groups to fill all the compute units of a GPU to their maximal occupancy. In this case the isolated gemm itself is too small to achieve the peak flops of the GPU, so the goal shifts from achieving peak flops to minimizing execution time.

When a kernel uses a microtile of size PxQ, each thread is responsible for P*Q elements of the C matrix. The larger a microtile is, the more serialized the matrix calculation is, i.e., each thread does more work in serial and fewer threads are required to carry out the gemm. The smaller a microtile is, the more parallelized the matrix calculation is and the more threads are required to carry out the gemm.

When the GxH macrotile is too large for the gemm to fill the GPU, there has been too much serialization for the entire GPU to be engaged. It therefore becomes advantageous to shrink the tile size down to the point that the resulting number of work-groups fills up the GPU.
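
A small sketch of that trade-off (the 44-compute-unit count is the W9100's; the work-groups-per-compute-unit target is a hypothetical occupancy goal):

COMPUTE_UNITS = 44     # W9100
TARGET_WG_PER_CU = 4   # hypothetical occupancy target

def fills_gpu(M, N, G, H):
    work_groups = (M // G) * (N // H)
    return work_groups >= COMPUTE_UNITS * TARGET_WG_PER_CU

# a 512x512 gemm with 96x96 tiles yields only 25 work-groups: too serialized
print(fills_gpu(512, 512, 96, 96))  # False
# shrinking to 32x32 tiles yields 256 work-groups: the whole GPU is engaged
print(fills_gpu(512, 512, 32, 32))  # True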

8. Special Considerations for Skinny Matrices

A skinny matrix is one where M is smaller than G, or N is smaller than H, of a peak-performing tile size, but the other dimension is large enough to fill the GPU. In these scenarios, there is no possibility of data reuse being high enough to prevent global memory bandwidth from being the bottleneck.

The best that can be done for a skinny matrix is to match the smallest dimension of MxN with the smallest dimension of GxH and make the other dimension of GxH as large as possible (to at least reuse data well in one of the dimensions).

III - Architecture

AutoGemm uses tile parameters and non-tile parameters to describe each kernel, and size cutoffs to determine which kernels to use when.

Tile parameters:

  • workGroupNumRows,Cols - number of rows/columns in a work-group
  • microTileNumRows,Cols - number of rows/columns in each work-item's micro tile
  • macroTileNumRows,Cols - total number of rows/columns in a work-group's macro tile shared across LDS
  • unroll - the number of iterations the loop over K is unrolled
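
The tile parameters are related: each work-item owns one microtile, so a work-group's macrotile is the element-wise product of the two. A minimal illustration (the 16x16 work-group and 4x4 microtile are hypothetical values):

workGroupNumRows, workGroupNumCols = 16, 16  # hypothetical
microTileNumRows, microTileNumCols = 4, 4    # hypothetical

# each work-item computes one microtile, so the macrotile is the product
macroTileNumRows = workGroupNumRows * microTileNumRows  # 64
macroTileNumCols = workGroupNumCols * microTileNumCols  # 64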

Non-Tile Parameters:

  • precision - data is single or double precision and real or complex
  • order - matrices are stored as row-major or column-major
  • transA - matrix A is "N"ot transposed, "T"ransposed or "C"onjugate transposed
  • transB - matrix B is "N"ot transposed, "T"ransposed or "C"onjugate transposed
  • beta - 0 means the kernel will handle beta=0 in an optimized fashion; 1 means the kernel will handle non-zero betas.

AutoGemm Parameters

AutoGemm defines these parameters in AutoGemmParameters.py as input data to determine what OpenCL kernels to write and how to use them.

precisions = [ "s", "z" ]
orders = [ "col" ]
transposes = { "s":[ "N", "T"], "z":["N", "T", "C"] }
betas = [ 1 ]
unrolls = { "s":[16, 1], "z":[8, 1] }
kernelSelectionData = {
  "s":[
    [ size0, fallbackTile0, [tile0-0, tile0-1 ... tile0-N] ],
    [ size1, fallbackTile1, [tile1-0, tile1-1 ... tile1-N] ]
    ...
    [ sizeN, fallbackTileN, [tileN-0, tileN-1 ... tileN-N] ]
  ],
  "z":[
    [ size0, fallbackTile0, [tile0-0, tile0-1 ... tile0-N] ],
    [ size1, fallbackTile1, [tile1-0, tile1-1 ... tile1-N] ]
    ...
    [ sizeN, fallbackTileN, [tileN-0, tileN-1 ... tileN-N] ]
  ]
}

In the above example, AutoGemm will create kernels only for sgemm and zgemm, and only column-major kernels. For sgemm it will create kernels which handle all permutations of the real transpose cases (NN, NT, TN, TT), and for zgemm all permutations of the complex transpose cases. For sgemm it will create kernels which unroll the loop over K by 16 (this makes the loop faster but is limited to cases where K is a multiple of 16) and kernels where K is unrolled by 1, i.e., not unrolled; similarly for zgemm. It will only create kernels which can handle non-zero betas.
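
A sketch of how those parameters multiply out into a kernel set (a plain enumeration for illustration, not AutoGemm's actual generator code):

from itertools import product

precisions = ["s", "z"]
orders = ["col"]
transposes = {"s": ["N", "T"], "z": ["N", "T", "C"]}
betas = [1]
unrolls = {"s": [16, 1], "z": [8, 1]}

kernels = [
    (p, o, tA, tB, b, u)
    for p in precisions
    for o in orders
    for tA, tB in product(transposes[p], repeat=2)
    for b in betas
    for u in unrolls[p]
]
# sgemm: 1 order * 4 transpose pairs * 1 beta * 2 unrolls = 8
# zgemm: 1 order * 9 transpose pairs * 1 beta * 2 unrolls = 18
print(len(kernels))  # 26 combinations (tile sizes multiply this further)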

Kernel Selection Data

The kernelSelectionData determines when the various kernel tile sizes get used. The kernel selection logic first narrows down which kernels will be used by matching exactly the non-tile parameters. It then determines which tile size to use based on the kernelSelectionData as follows. A kernel "matches" the input matrices if M is a multiple of the macroTileNumRows, N is a multiple of the macroTileNumCols and K is a multiple of unroll.

if (M*N >= size0*size0) {
  if (tile0-0 matches M,N,K) {
    return tile0-0; // exact match found; only 1 kernel needed
  }
  if (tile0-1 matches M,N,K) {
    return tile0-1; // exact match found; only 1 kernel needed
  }
  ...
  if (tile0-N matches M,N,K) {
    return tile0-N; // exact match found; only 1 kernel needed
  }
  // no exact match found; only 2-4 kernels will be needed
  return fallbackTile0
}
if (M*N >= size1*size1) {
  if (tile1-0 matches M,N,K) {
    return tile1-0; // exact match found; only 1 kernel needed
  }
  if (tile1-1 matches M,N,K) {
    return tile1-1; // exact match found; only 1 kernel needed
  }
  ...
  if (tile1-N matches M,N,K) {
    return tile1-N; // exact match found; only 1 kernel needed
  }
  // no exact match found; 2-4 kernels will be needed
  return fallbackTile1
}
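
The same selection logic is easy to express compactly. A Python sketch (simplified; the generated gemmSelectKernel() in AutoGemmKernelSelection.cpp is the authoritative C++ version):

def matches(M, N, K, tile):
    # a single kernel suffices when the problem is an exact multiple of the tile
    return (M % tile["macroTileNumRows"] == 0
            and N % tile["macroTileNumCols"] == 0
            and K % tile["unroll"] == 0)

def select_tile(M, N, K, selection_data):
    # selection_data: list of [size, fallbackTile, [tile, ...]], largest size first
    for size, fallback, tiles in selection_data:
        if M * N >= size * size:
            for tile in tiles:
                if matches(M, N, K, tile):
                    return tile  # exact match found; only 1 kernel needed
            return fallback      # no exact match; 2-4 kernels will be needed
    return None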

Generating the Kernel Selection Data

AutoGemm comes equipped with the AutoGemmProfiler. The profiler is a host application which brute-force benchmarks all tile sizes over a range of matrix sizes to determine which tile sizes were the fastest, second fastest, and so on. It writes out kernel selection data which is consumable by AutoGemm.

AutoGemm Files Generated

Hierarchy of output:

  • AutoGemmIncludes

    • AutoGemmClKernels.cpp .h - declares and defines the cl_kernel objects
    • AutoGemmKernelBinaries.cpp .h - declares and defines the kernel binary arrays to be NULL; kernels which are pre-compiled will have a non-null binary array declared which will override (through pre-processor ifdef's) the null declarations.
    • AutoGemmKernelBuildOptionsBinary.cpp .h - declares and defines the build options to be used when building kernels from pre-compiled binaries
    • AutoGemmKernelBuildOptionsSource.cpp .h - declares and defines the build options to be used when building kernels from source
    • AutoGemmKernelEnumeration.h - declares and defines arrays which list all of the tile and non-tile parameters being used. This file is used only by AutoGemm adjunct tools for debugging etc.
    • AutoGemmKernelSelection.cpp .h - declares and defines gemmSelectKernel(), which selects the optimal gemm kernel based on matrix size. clBLAS only #includes this file which, in turn, #includes other header files as necessary.
    • AutoGemmKernelSelectionSpecific.cpp .h - declares and defines gemmSelectKernelSpecific which selects a kernel by specifying the tile parameters. This file is only used by AutoGemm adjunct tools for debugging etc.
    • AutoGemmKernelSources.cpp .h - First #includes UserGemmKernelSources, then #includes the AutoGemm kernels in AutoGemmKernelSources/; user-defined kernels override AutoGemm kernels through preprocessor ifdefs.
    • AutoGemmKernelsToPreCompile.h - Lists which kernels should be pre-compiled. This file is only used by the AutoGemmPreCompile project.
  • AutoGemmKernelBinaries (will be empty if no kernels selected to be pre-compiled)

    • AutoGemmKernelBinariesPreCompiled.cpp .h - #includes all the pre-compiled binary kernel files
    • *_bin.cpp - pre-compiled binary kernel files
  • AutoGemmKernelSources

    • *_src.cpp - kernel source files

"Building" AutoGemm with CMake for clBLAS

CMake handles the dependencies for running AutoGemm within the context of clBLAS. Within CMake, the user specifies which kernels, if any, are to be pre-compiled, choosing from a list of precision and transpose pairs. All kernels (all tiles, betas, orders) fitting that description will be pre-compiled. The user may also choose which OpenCL compiler version to use: 2.0 (recommended) if available, or 1.2.

CMake is configured as follows:

  1. Run AutoGemm.py to generate headers.
  2. (if pre-compiling) Run KernelsToPreCompile.py to generate AutoGemmKernelsToPreCompile.h which gets compiled into AutoGemm_PreCompile_Bin.
  3. (if pre-compiling) Compile AutoGemm_PreCompile_Bin.
  4. (if pre-compiling) Execute AutoGemm_PreCompile_Bin to generate AutoGemmKernelBinariesPreCompiled.cpp, .h and *_bin.cpp.
  5. Compile clBLAS now that all the dependencies have been generated.

IV - Customizing to Your Needs

AutoGemm within clBLAS comes pre-loaded with a list of kernels to use and kernel selection data, usually tuned for AMD's latest-generation flagship GPU, which should suffice for most applications. Advanced users or developers seeking a higher level of performance for unique applications (e.g., "skinny" matrices) may wish to re-parameterize AutoGemm and re-tune it to their needs.

Customize Kernel Assortment for an Application

Because data reuse is higher for square tiles (for a fixed tile area, 2GH/(G+H) is maximized when G = H), square macrotile sizes outperform their rectangular counterparts; therefore, AutoGemm usually finds the highest performance using square macrotiles. AutoGemm will, however, support most work-group dimensions and microtile dimensions, which in turn can produce a wide variety of macrotile dimensions (though these have not been fully tested/debugged).

To add your own custom tile:

  1. Add your new tile's dimensions anywhere within the kernelSelectionData (a sketch of this step follows the list).
  2. Re-run AutoGemm; it will write kernels using your new tile size and, importantly, it will add those tile sizes to AutoGemmKernelEnumeration.h.
  3. Compile and run the AutoGemm_Tools_Profile project. This will benchmark all matrix sizes using all tile sizes including the new one.
  4. The output from profiling will be new kernel selection data indicating for which matrix sizes your new kernel is the fastest and should be used.
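
A hedged illustration of step 1: the surrounding list structure follows the kernelSelectionData schema from Section III, but the tile encoding shown here is only illustrative, so match whatever encoding already appears in AutoGemmParameters.py.

# hypothetical custom tile aimed at skinny matrices: a tall 16x128 macrotile
customTile = {"workGroupNumRows": 4, "workGroupNumCols": 16,
              "microTileNumRows": 4, "microTileNumCols": 8,
              "unroll": 8}  # macrotile = (4*4) x (16*8) = 16x128

# append it to an existing size bucket's tile list:
# kernelSelectionData["s"][i] == [ size_i, fallbackTile_i, [tile_i_0, ...] ]
kernelSelectionData["s"][0][2].append(customTile)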

Caveat: If your new kernel is designed to be fast for matrices of particular dimensions (e.g., a skinny tile for skinny matrices), you will first need to modify ProfileAutoGemm.cpp so that the profiler tests your custom matrix dimensions.

Tune Kernel Selection to a New GPU

Follow steps 3-4 from "Customize Kernel Assortment for an Application" to benchmark matrices for a new GPU; the profiler outputs new data which then needs to be copied into the kernelSelectionData of AutoGemm.
