# Accelerating Reduction and Scan Using Tensor Core Units

```bibtex
@inproceedings{Dakkak:2019:ARS:3330345.3331057,
 author = {Dakkak, Abdul and Li, Cheng and Xiong, Jinjun and Gelado, Isaac and Hwu, Wen-mei},
 title = {Accelerating Reduction and Scan Using Tensor Core Units},
 booktitle = {Proceedings of the ACM International Conference on Supercomputing},
 series = {ICS '19},
 year = {2019},
 isbn = {978-1-4503-6079-1},
 location = {Phoenix, Arizona},
 pages = {46--57},
 numpages = {12},
 url = {http://doi.acm.org/10.1145/3330345.3331057},
 doi = {10.1145/3330345.3331057},
 acmid = {3331057},
 publisher = {ACM},
 address = {New York, NY, USA},
}
```

## Installing the latest CMake

CMake version >= 3.8 is required (there is a known problem with Hunter when using CMake 3.10.2).

```shell
cd /tmp
wget https://cmake.org/files/v3.10/cmake-3.10.1-Linux-x86_64.sh
sudo sh cmake-3.10.1-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir
```

You may also want to remove the default installation with `sudo apt-get remove cmake`.

On ppc64le you need to install CMake from source.

## Compile

To compile the project, run the following commands:

```shell
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```

If you get errors about nvcc not supporting your gcc compiler, you may want to point CMake at a supported host compiler:

```shell
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_HOST_COMPILER=`which gcc-6` ..
```

## Available Benchmarks

The following benchmarks are currently available.

### GEMM

| Name | Description |
|------|-------------|
| CUBLAS_WMMA_GEMM | CUBLAS GEMM w/ Tensor Cores. AType, BType = half, CType = float |
| CUBLAS_GEMM | CUBLAS GEMM w/o Tensor Cores. AType, BType = half, CType = float |
| CUTLASS_WGEMM | CUTLASS GEMM w/ Tensor Cores. AType, BType = half, CType = float |
| CUDA_WMMA_GEMM_NAIVE | Naive CUDA GEMM w/ Tensor Cores. AType, BType = half, CType = float |
| CUDA_WMMA_GEMM_SHAREDMEM | Shared memory CUDA GEMM w/ Tensor Cores. AType, BType = half, CType = float |
| CUDA_WMMA_GEMM | Optimized CUDA GEMM (from CUDA Samples). AType, BType = half, CType = float |

#### Comparison

*(figures: gemm_small, gemm_medium, gemm_large)*

### HGEMM

| Name | Description |
|------|-------------|
| CUBLAS_WMMA_HGEMM | CUBLAS HGEMM w/ Tensor Cores. AType, BType = half, CType = half |
| CUBLAS_HGEMM | CUBLAS HGEMM w/o Tensor Cores. AType, BType = half, CType = half |

#### Comparison

*(figure: hgemm)*

### GEMV

| Name | Description |
|------|-------------|
| CUBLAS_GEMV | cublasSgemv y = alpha * Ax + beta * y. AType, xType, yType = float |
| CUDA_WMMA_GEMV_CUBLAS | Use CUBLAS GEMM w/ Tensor Cores for GEMV. AType, xType = half, yType = float |
| CUDA_WMMA_GEMV_NAIVE | Use naive CUDA GEMM w/ Tensor Cores for GEMV. AType, xType = half, yType = float |
| CUDA_WMMA_GEMV_SHAREDMEM | Use shared memory CUDA GEMM w/ Tensor Cores for GEMV. AType, xType = half, yType = float |

#### Comparison

*(figure: gemv)*

### PrefixSum

inType = half, outType = half

#### Segmented PrefixSum

| Name | Description |
|------|-------------|
| CUB_BLOCK_SEGMENTED_PREFIXSUM | cub::BlockScan. segment_size = THREADS_PER_BLOCK * ITEMS_PER_THREAD |
| CUB_WARP_SEGMENTED_PREFIXSUM | cub::WarpScan. segment_size = LOGICAL_THREADS_PER_WARP |
| THRUST_SEGMENTED_PREFIXSUM | thrust::inclusive_scan_by_key. inType = half, outType = half |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_16 | Uses Tensor Cores. segment_size = 16; each warp calculates WMMA_TILES_PER_WARP * 16 segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_16N | Uses Tensor Cores. segment_size = 16n where n is a positive integer; each warp calculates 16 segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_256 | Uses Tensor Cores. segment_size = 256; each warp calculates WMMA_TILES_PER_WARP segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_256N | Uses Tensor Cores. segment_size = 256n where n is a positive integer; each warp calculates 1 segment |
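The algebraic trick underlying these WMMA scan kernels can be sketched outside CUDA: an inclusive prefix sum of a segment equals the segment (as a row vector) multiplied by an upper-triangular matrix of ones, which is exactly the shape of work a 16x16 tensor-core tile performs. The plain-Python sketch below illustrates that identity only; it is not code from this repository.

```python
# Illustrative sketch (not repository code): a segment's inclusive
# prefix sum is a vector-matrix product with upper-triangular ones.
n = 16  # WMMA tile dimension used by the kernels

# U[i][j] = 1 when i <= j, so column j accumulates elements 0..j.
U = [[1 if i <= j else 0 for j in range(n)] for i in range(n)]

segment = list(range(1, n + 1))  # one 16-element segment: 1, 2, ..., 16
scan = [sum(segment[i] * U[i][j] for i in range(n)) for j in range(n)]
print(scan[:4])  # [1, 3, 6, 10]
```

On the GPU this product is computed for many segments at once by loading the triangular matrix into a WMMA fragment, which is what lets a scan run on matrix-multiply hardware.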

WMMA_TILES_PER_WARP and WARPS_PER_BLOCK tuning for CUDA_WMMA_SEGMENTED_PREFIXSUM_16 and CUDA_WMMA_SEGMENTED_PREFIXSUM_256:

(WMMA_TILES_PER_WARP = 2, WARPS_PER_BLOCK = 4) performs best.

*(figure: tune_prefixsum)*

#### Comparison

*(figure: seg_prefix_sum)*

#### Full PrefixSum

| Name | Description |
|------|-------------|
| CUB_FULL_PREFIXSUM | cub::DeviceScan::ExclusiveSum. inType = half, outType = half |
| THRUST_FULL_PREFIXSUM | thrust::inclusive_scan. inType = half, outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_3KERS_256 | Uses CUDA_WMMA_SEGMENTED_PREFIXSUM_256 as the first kernel, cub::DeviceScan::ExclusiveSum to scan the partial sums, and a third kernel to add the partial sums. inType = half, outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_3KERS | Uses CUDA_WMMA_SEGMENTED_PREFIXSUM_256N as the first kernel, cub::DeviceScan::ExclusiveSum to scan the partial sums, and a third kernel to add the partial sums. inType = half, outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_CG | One kernel. Uses cooperative-groups grid synchronization to scan the partial sums. inType = half, outType = half |

Note: There's a bug in cub::WarpScan for LOGICAL_THREADS_PER_WARP = 16.

#### Comparison

*(figure: prefix_sum)*

### Reduction

#### Segmented Reduction

| Name | Description |
|------|-------------|
| CUB_BLOCK_SEGMENTED_REDUCTION | cub::BlockReduce. segment_size = THREADS_PER_BLOCK * ITEMS_PER_THREAD |
| CUB_DEVICE_SEGMENTED_REDUCTION | cub::DeviceSegmentedReduce::Sum |
| THRUST_SEGMENTED_REDUCTION | thrust::reduce_by_key |
| CUDA_WMMA_SEGMENTED_REDUCTION_16 | Uses Tensor Cores. segment_size = 16; each warp calculates WMMA_TILES_PER_WARP * 16 segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_16N | Uses Tensor Cores. segment_size = 16n where n is a positive integer; each warp calculates 16 segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_256 | Uses Tensor Cores. segment_size = 256; each warp calculates WMMA_TILES_PER_WARP segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_256N | Uses Tensor Cores. segment_size = 256n where n is a positive integer; each warp calculates 1 segment |
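As with the scan kernels, the reduction variants rest on a simple identity: summing a segment is a dot product with a ones vector, so packing 16 segments as rows of a 16x16 tile and multiplying by ones yields 16 segment sums in one matrix operation. A plain-Python illustration of the identity (not repository code):

```python
# Illustrative sketch (not repository code): one matrix-vector product
# with a ones vector reduces 16 segments at once -- the work a single
# 16x16 tensor-core tile performs in the WMMA reduction kernels.
n = 16
data = list(range(n * n))                           # 16 segments of 16 elements
tile = [data[r * n:(r + 1) * n] for r in range(n)]  # one segment per row
ones = [1] * n

sums = [sum(x * w for x, w in zip(row, ones)) for row in tile]
print(sums[0])  # 0 + 1 + ... + 15 = 120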

WMMA_TILES_PER_WARP and WARPS_PER_BLOCK tuning for CUDA_WMMA_SEGMENTED_REDUCTION_16 and CUDA_WMMA_SEGMENTED_REDUCTION_256:

(WMMA_TILES_PER_WARP = 1, WARPS_PER_BLOCK = 8) performs best.

#### Comparison

*(figures: seg256_reduction, seg512_reduction, seg1024_reduction)*

#### Full Reduction

| Name | Description |
|------|-------------|
| CUB_FULL_REDUCTION | cub::DeviceReduce::Sum. inType = half, outType = half |
| THRUST_FULL_REDUCTION | thrust::reduce. inType = half, outType = half |
| CUDA_WMMA_FULL_REDUCTION_2KERS | Uses CUDA_WMMA_SEGMENTED_REDUCTION_256N as the first kernel and cub::DeviceSegmentedReduce::Sum to reduce all the segments. inType = half, outType = half |
| CUDA_WMMA_FULL_REDUCTION_CG | One kernel. Uses cooperative-groups grid synchronization. inType = half, outType = half |
| CUDA_WMMA_FULL_REDUCTION_ATOMIC_W_BLOCK_SYNC | One kernel. Uses atomicAdd with block synchronization. inType = half, outType = half |
| CUDA_WMMA_FULL_REDUCTION_ATOMIC_W_ATOMIC_BALLOT | One kernel. Uses atomicAdd with ballot. inType = half, outType = half |

#### Comparison

*(figure: reduction)*

## Running Individual Benchmarks

You can benchmark each primitive individually using:

```shell
./bench --benchmark_filter=[name_of_primitive]
```

For example:

```shell
./bench --benchmark_filter=WMMA_GEMM
```

Further controls over the benchmarks are explained in the `--help` option.

## Benchmark All the Primitives

```shell
./bench
```

The above will print something like the following to stdout:

```
------------------------------------------------------------------------------
Benchmark                       Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
SGEMM/1000/1/1/-1/1             5 us          5 us     126475 K=1 M=1000 N=1 alpha=-1 beta=1
SGEMM/128/169/1728/1/0        539 us        534 us       1314 K=1.728k M=128 N=169 alpha=1 beta=0
SGEMM/128/729/1200/1/0       1042 us       1035 us        689 K=1.2k M=128 N=729 alpha=1 beta=0
SGEMM/192/169/1728/1/0        729 us        724 us        869 K=1.728k M=192 N=169 alpha=1 beta=0
SGEMM/256/169/1/1/1             9 us          9 us      75928 K=1 M=256 N=169 alpha=1 beta=1
SGEMM/256/729/1/1/1            35 us         35 us      20285 K=1 M=256 N=729 alpha=1 beta=1
SGEMM/384/169/1/1/1            18 us         18 us      45886 K=1 M=384 N=169 alpha=1 beta=1
SGEMM/384/169/2304/1/0       2475 us       2412 us        327 K=2.304k M=384 N=169 alpha=1 beta=0
SGEMM/50/1000/1/1/1            10 us         10 us      73312 K=1 M=50 N=1000 alpha=1 beta=1
SGEMM/50/1000/4096/1/0       6364 us       5803 us        100 K=4.096k M=50 N=1000 alpha=1 beta=0
SGEMM/50/4096/1/1/1            46 us         45 us      13491 K=1 M=50 N=4.096k alpha=1 beta=1
SGEMM/50/4096/4096/1/0      29223 us      26913 us         20 K=4.096k M=50 N=4.096k alpha=1 beta=0
SGEMM/50/4096/9216/1/0      55410 us      55181 us         10 K=9.216k M=50 N=4.096k alpha=1 beta=0
SGEMM/96/3025/1/1/1            55 us         51 us      14408 K=1 M=96 N=3.025k alpha=1 beta=1
SGEMM/96/3025/363/1/0        1313 us       1295 us        570 K=363 M=96 N=3.025k alpha=1 beta=0
```

Output as JSON using:

```shell
./bench --benchmark_out_format=json --benchmark_out=test.json
```

or, preferably:

```shell
./bench --benchmark_out_format=json --benchmark_out=`hostname`.json
```
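The bench binary is built on Google Benchmark, whose JSON output contains a `benchmarks` array with fields such as `name` and `real_time`. A hypothetical post-processing sketch under that assumption (the inline sample stands in for a real `hostname`.json file; it is made-up data, not repository output):

```python
import json

# Hypothetical sketch: find the slowest benchmark in Google Benchmark's
# JSON output. The sample string below is fabricated for illustration.
sample = """{
  "benchmarks": [
    {"name": "SGEMM/1000/1/1/-1/1", "real_time": 5.0, "time_unit": "us"},
    {"name": "SGEMM/128/169/1728/1/0", "real_time": 539.0, "time_unit": "us"}
  ]
}"""

results = json.loads(sample)["benchmarks"]
slowest = max(results, key=lambda b: b["real_time"])
print(slowest["name"])  # SGEMM/128/169/1728/1/0
```

To process a real run, replace `sample` with the contents of the file passed to `--benchmark_out`.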

## On Minsky With PowerAI

```shell
mkdir -p build && cd build && rm -fr * && cmake -DCMAKE_BUILD_TYPE=Release ..
```

## Disable CPU Frequency Scaling

If you see this error:

```
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
```

you might want to disable CPU frequency scaling while running the benchmark:

```shell
sudo cpupower frequency-set --governor performance
./mybench
sudo cpupower frequency-set --governor powersave
```

## Create Figures

```shell
python plot/plot.py plot/spec/full_reduction.yml
```

or generate all figures:

```shell
python plot/plot.py all
```

## Run with Docker

Install nvidia-docker, then list the available benchmarks:

```shell
nvidia-docker run --rm raiproject/tensorcore_bench:latest bench --benchmark_list_tests
```

You can run the benchmarks in the following way (probably with the `--benchmark_filter` flag):

```shell
nvidia-docker run --privileged --rm -v `readlink -f .`:/data -u `id -u`:`id -g` raiproject/tensorcore_bench:amd64-latest ./run_benchmarks.sh
```

- `--privileged` is needed to set the NUMA policy for NUMA benchmarks.
- ``-v `readlink -f .`:/data`` maps the current directory into the container as `/data`.
- ``--benchmark_out=/data/`hostname`.json`` tells the bench binary to write out to `/data`, which is mapped to the current directory.
- ``-u `id -u`:`id -g` `` tells Docker to run as the current user and group, so files produced by Docker are modifiable from the host system without root permission.