cutlass_tilesparse

CUDA templates for tile-sparse matrix multiplication, based on NVIDIA CUTLASS [1].

Introduction

Matrix multiplication accounts for the largest share of neural network computation, so optimizing matrix multiplication kernels is important for efficient neural network design.

In this project, we developed tile-sparse matrix multiplication, inspired by the tiling algorithm used to compute matrix multiplication on GPUs. To use on-chip shared memory efficiently, matrices are partitioned into tiles, and each tile is computed independently. If sparsity is assigned tile-wise, so that each tile is either entirely zero or treated as dense, the sparse matrix multiplication can be computed without noticeable overhead. We implemented the CUDA kernel for tile-sparse matrix multiplication by modifying NVIDIA CUTLASS [1].
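The idea can be illustrated on the CPU. The following is a minimal sketch, not the CUTLASS-based kernel from this project: it multiplies row-major matrices tile by tile and skips every tile of B that is entirely zero. The function names (`tile_is_zero`, `tiled_matmul_skip_zero`), the row-major layout, and the choice of B as the sparse operand are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if the tile x tile block of B whose top-left
// corner is (row0, col0) contains only zeros. ldb is the row length of B.
bool tile_is_zero(const std::vector<float>& B, std::size_t ldb,
                  std::size_t row0, std::size_t col0, std::size_t tile) {
    for (std::size_t r = row0; r < row0 + tile; ++r)
        for (std::size_t c = col0; c < col0 + tile; ++c)
            if (B[r * ldb + c] != 0.0f) return false;
    return true;
}

// Computes C (M x N) = A (M x K) * B (K x N), all row-major.
// Returns the number of tile-level products that were actually executed.
std::size_t tiled_matmul_skip_zero(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   std::vector<float>& C,
                                   std::size_t M, std::size_t N, std::size_t K,
                                   std::size_t tile) {
    assert(tile > 0 && M % tile == 0 && N % tile == 0 && K % tile == 0);
    C.assign(M * N, 0.0f);
    std::size_t executed = 0;
    for (std::size_t k0 = 0; k0 < K; k0 += tile) {
        for (std::size_t n0 = 0; n0 < N; n0 += tile) {
            // A zero tile of B contributes nothing, so skip all of its work.
            if (tile_is_zero(B, N, k0, n0, tile)) continue;
            for (std::size_t m0 = 0; m0 < M; m0 += tile) {
                ++executed;
                for (std::size_t i = m0; i < m0 + tile; ++i)
                    for (std::size_t k = k0; k < k0 + tile; ++k)
                        for (std::size_t j = n0; j < n0 + tile; ++j)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
            }
        }
    }
    return executed;
}
```

On a GPU the zero test would not run inside the multiplication loop. The set of non-zero tiles is determined ahead of time and stored alongside the matrix data.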

Tile-sparse matrix encoding

The tile-sparse matrix is encoded in a CSC (Compressed Sparse Column, also called Compressed Column Storage)-like format. The difference from conventional CSC is that the basic encoding unit in tile-sparse CSC is a tile rather than a single element. As shown in the figure above, "ptr" stores the accumulated number of non-zero tiles for each column of tiles, "indices" stores the row indices of those tiles, and "data" stores the values of the non-zero tiles.
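To make the layout concrete, here is a small sketch of an encoder that converts a dense row-major matrix into the three arrays described above. It is an illustration under stated assumptions, not the project's code: the `TileCsc` struct and `encode_tile_csc` function are hypothetical names, `ptr` is given a leading 0 as in conventional CSC, and the values inside each tile are stored row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the tile-sparse CSC arrays.
struct TileCsc {
    std::size_t tile = 0;        // tile edge length
    std::vector<int> ptr;        // running count of non-zero tiles per tile column
    std::vector<int> indices;    // tile-row index of each stored tile
    std::vector<float> data;     // stored tiles, tile*tile values each (row-major)
};

// Encodes a dense row-major rows x cols matrix. Tiles are visited one tile
// column at a time, top to bottom, and all-zero tiles are dropped.
TileCsc encode_tile_csc(const std::vector<float>& dense,
                        std::size_t rows, std::size_t cols, std::size_t tile) {
    assert(tile > 0 && rows % tile == 0 && cols % tile == 0);
    assert(dense.size() == rows * cols);
    TileCsc out;
    out.tile = tile;
    out.ptr.push_back(0);  // assumption: leading zero, as in conventional CSC
    for (std::size_t tc = 0; tc < cols / tile; ++tc) {
        for (std::size_t tr = 0; tr < rows / tile; ++tr) {
            // Check whether this tile holds any non-zero value.
            bool nonzero = false;
            for (std::size_t r = 0; r < tile && !nonzero; ++r)
                for (std::size_t c = 0; c < tile; ++c)
                    if (dense[(tr * tile + r) * cols + tc * tile + c] != 0.0f) {
                        nonzero = true;
                        break;
                    }
            if (!nonzero) continue;
            // Record the tile-row index and copy the whole tile, zeros included.
            out.indices.push_back(static_cast<int>(tr));
            for (std::size_t r = 0; r < tile; ++r)
                for (std::size_t c = 0; c < tile; ++c)
                    out.data.push_back(dense[(tr * tile + r) * cols + tc * tile + c]);
        }
        out.ptr.push_back(static_cast<int>(out.indices.size()));
    }
    return out;
}
```

For a 4x4 matrix with 2x2 tiles whose only non-zero tiles are the top-left and bottom-right ones, this sketch yields `ptr = {0, 1, 2}` and `indices = {0, 1}`, with `data` holding the eight values of those two tiles.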

Performance

Recently, OpenAI released Block-sparse GPU Kernels [2]. Like the proposed Tile-sparse kernel, they compute block-wise sparse matrix multiplication. We compared the performance of the Block-sparse and Tile-sparse kernels on a Titan Xp with CUDA 9.0. As the figure above shows, the speed of the Tile-sparse kernel is comparable to that of the Block-sparse kernel for a 4096x4096 weight matrix, a minibatch size of 32, and a block/tile size of 32x32. (Note that the relative speed may vary with the matrix size, minibatch size, and tile size.)

Unlike the Block-sparse GPU kernels, which were written at the assembly level, the Tile-sparse kernel is written in CUDA C++. We believe this makes it easier for researchers to modify the kernel for their own needs. Please contact me (yulhwa.kim@postech.ac.kr) for further information if necessary. Enjoy!

Makefile & Program Usage

Usage is the same as for NVIDIA CUTLASS [1], but the makefile options are reduced as follows.

make <sgemm|dgemm> sm=<60|61> \
  [transpose=<nn>] [verbose=<0|1>] [keep=<0|1>]

References

[1] NVIDIA. 2017. CUTLASS. Available at: https://github.com/NVIDIA/cutlass.

[2] OpenAI. 2017. Blocksparse. Available at: https://github.com/openai/blocksparse.
