Status | Proposed |
---|---|
RFC # | NNN (update when you have community PR #) |
Author(s) | Penporn Koanantakool (penporn@google.com) |
Sponsor | Rasmus Larsen (rmlarsen@google.com), Tatiana Shpeisman (shpeisman@google.com) |
Updated | 2019-05-19 |
Support the compressed sparse row (CSR) sparse matrix format in TensorFlow. Implement linear algebra and other common neural network operations on both CPU and GPU devices for a seamless user experience.
- Enable storing batched sparse matrices in CSR format in TensorFlow.
- Provide efficient linear algebra and other common neural network kernels for the CSR format on both CPU and GPU.
- Provide clean and easy-to-use Python APIs.
- Support backpropagation.
- Support for other sparse formats such as Block CSR, CSC, etc. (Block CSR is future work.)
- Implement XLA / TPU kernels.
Sparse tensor representation has a significant impact on performance. Modern architectures are best suited for non-random, bulk memory accesses, so sparse formats that optimize for data locality and reuse can achieve large performance gains. TensorFlow currently stores sparse tensors in coordinate (COO) format, which is inefficient for the sparsity levels observed in most deep learning and sparse linear algebra applications. The compressed sparse row (CSR) format is one of the most commonly used formats. It generally requires less storage and is faster than COO, sometimes by orders of magnitude.
We propose supporting the CSR format in TensorFlow to accelerate sparse linear algebra and applicable deep learning applications in TensorFlow.
- Fast sparse linear algebra routines such as matrix multiplications, tensor contractions (convolutions), Cholesky, LU, and QR factorizations on CPU and GPU.
- Users can take existing CSR kernels from other libraries and use them directly without format conversion.
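A rough intuition for the storage gap between the two formats: COO stores a (row, column) index pair per nonzero, while CSR stores one column index per nonzero plus rows + 1 row pointers, roughly halving the index storage for matrices that are not hypersparse. A quick sketch with SciPy (used here only for illustration; it is not part of this proposal):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

rng = np.random.default_rng(0)
# A 1000x1000 matrix with roughly 1% nonzeros.
dense = rng.random((1000, 1000)) * (rng.random((1000, 1000)) < 0.01)

coo = coo_matrix(dense)
csr = csr_matrix(dense)

# COO keeps a (row, col) index pair per nonzero; CSR keeps one column
# index per nonzero plus a row-pointer array of length rows + 1.
coo_index_count = coo.row.size + coo.col.size          # 2 * nnz
csr_index_count = csr.indices.size + csr.indptr.size   # nnz + rows + 1
assert csr_index_count < coo_index_count
```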
A k-dimensional sparse tensor is stored as a batch of sparse CSR matrices. The k-2 outermost dimensions are batch dimensions, and the two innermost dimensions are matrix dimensions. All batches share the same values, column indices, and row pointers arrays. The figure below shows how CSRSparseMatrix stores a 2x3x4 sparse tensor A. Values stores all nonzero values in row-major order. Column indices stores the corresponding column index of each nonzero in Values. Row pointers stores the position of the beginning of each matrix row in Values. Batch pointers stores the position of the beginning of each batch in Values.
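For concreteness, the four arrays can be derived from a small batched tensor as below. This sketch uses NumPy and SciPy only to illustrate the layout described above; the 2x3x4 example tensor is made up for illustration (it is not the figure's example):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 2x3x4 batched sparse tensor: two 3x4 matrices.
A = np.array([
    [[1, 0, 0, 2],
     [0, 0, 3, 0],
     [0, 4, 0, 0]],
    [[0, 5, 0, 0],
     [6, 0, 0, 0],
     [0, 0, 0, 7]],
], dtype=np.float32)

# Build the four shared arrays by concatenating per-batch CSR components
# (SciPy is used here only to compute each batch's CSR form).
values, col_indices, row_pointers, batch_pointers = [], [], [], [0]
for mat in A:
    m = csr_matrix(mat)
    values.extend(m.data)
    col_indices.extend(m.indices)
    row_pointers.extend(m.indptr)        # rows + 1 pointers per batch
    batch_pointers.append(len(values))   # start of next batch in values

print([float(v) for v in values])       # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print([int(c) for c in col_indices])    # [0, 3, 2, 1, 1, 0, 3]
print([int(r) for r in row_pointers])   # [0, 2, 3, 4, 0, 1, 2, 3]
print(batch_pointers)                   # [0, 4, 7]
```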
All kernel implementations are in C++, with a Python wrapper for Python APIs. During the experimental phase, CSRSparseMatrix APIs will be in the tf.linalg.experimental.sparse package.
See APIs in the Detailed Design section.
- Sparse linear algebra ops:
- Sparse matrix-vector multiplication (SpMV)
- Sparse-dense matrix multiplication (SpMM)
- Sparse-sparse matrix multiplication (SpGEMM)
- Sparse-sparse matrix addition (SpGEAM)
- Sparse matrix transpose
- Sparse Cholesky factorization
- Sparse LU factorization
- Sparse QR factorization
- Conversions ops:
- Convert to and from dense tensor
- Convert to and from SparseTensor.
General ops, borrowing APIs from the NumPy and SciPy (scipy.sparse, scipy.sparse.csr_matrix) packages.
- Construction ops: eye and rand
- Unary ops that preserve sparsity structure:
- Unary ops that change sparsity structure:
- Binary element-wise ops (may change sparsity structure):
- Binary element-wise ops that change sparsity structure:
- Reduction ops:
- Shape manipulation ops:
- getcol, getrow, reshape, slice, concatenate, stack
- Ops that return indices:
We have considered several other sparse formats in addition to CSR.
- Compressed Sparse Column (CSC): CSC is another common and performant format, similar to CSR, but CSR has more kernel implementations in existing libraries. CSC is also a column-major format, which makes it less suitable for TensorFlow's row-major storage.
- Doubly-Compressed Sparse Row (DCSR) and Doubly-Compressed Sparse Column (DCSC): DCSR and DCSC compress both dimensions of a matrix. DCSR is CSR with row pointers compressed. DCSC is CSC with column pointers compressed. These formats are more beneficial than CSR and CSC when matrices are hypersparse, i.e., have far fewer than one nonzero per row or column on average. Hypersparse matrices are not very common in TensorFlow's current workloads.
- Block Compressed Sparse Row (BSR or BCSR): BSR is the CSR format with fixed-size dense blocks for each nonzero position instead of just a scalar value.
- It is more efficient than CSR, but matrices with block sparsity occur less often in scientific computing. There aren’t as many efficient kernels provided in existing libraries yet. Our immediate goal is to enable efficient sparse linear algebra processing, so we picked CSR first.
- We plan to support BSR in a future proposal, as block sparsity is used increasingly in neural network weight pruning.
- Compressed Sparse Fiber (CSF): CSF is a generalization of CSR/CSC to multi-dimensional tensors. CSF can compress one user-specified dimension of the tensor. For example, on a matrix, CSF is CSR when compressing the row dimension, and CSC when compressing the column dimension. Even though CSF is used often in tensor factorizations in scientific computing, it does not have highly optimized kernels in vendor libraries yet.
- Tensor algebra compiler (Taco): Taco is a compiler for sparse tensor algebra. Its storage format is exponentially more generic compared to CSF.
- Taco supports any traversal order, and can compress any number of dimensions at once.
- For a rank k tensor, Taco can express k! * 2^k storage formats, even more so if the tensor is blocked. All the aforementioned formats (CSR, CSC, DCSR, DCSC, BSR, and CSF) can be represented with Taco.
- Taco has also been extended to support COO and other niche formats such as DIA, ELLPACK, etc.
- Taco can generate fast single-threaded CPU code for arbitrary kernels comprised of basic tensor algebra.
- Taco is not quite ready for production use yet. There are no vendor-supplied kernels, and Taco's efficient parallel CPU/GPU code generation is still in progress. We are keeping an eye on Taco as a candidate for TensorFlow's generic sparse tensor support in the future.
- Hierarchical formats such as Hierarchically Semi-Separable (HSS) and Hierarchical COOrdinate (HiCOO): These formats store matrix/tensor partitions recursively. They are too application-specific for our scope.
We picked the CSR format because
- It is generally suitable for the sparsity levels and patterns observed in TensorFlow workloads.
- It has a vast amount of vendor-optimized kernels readily available.
- We expect significant speed improvements from CSRSparseMatrix over SparseTensor in general. We observed from 2x to over 100x speedups in a multithreaded SpMV benchmark on CPUs. Even in the rare cases where CSRSparseMatrix is slower, there shouldn’t be regressions as CSRSparseMatrix doesn’t automatically replace SparseTensor. (Users can keep using SparseTensor in these cases.)
- There will be microbenchmarks for each sparse linear algebra op.
- We will create a few end-to-end tests and benchmarks once the library is ready.
- CSRSparseMatrix does not add any new dependencies to TensorFlow.
- It depends on two existing dependencies: Eigen and CUDA.
- It wouldn’t affect any TensorFlow components or its dependents.
- CSRSparseMatrix would increase the binary size, build and test times, but we don’t expect the increase to be significant. The library shouldn’t affect TensorFlow’s startup time.
- The library will be maintained by penporn@, with possible help from rmlarsen@, ebrevdo@, and collaborators from other teams outside of TensorFlow.
CSRSparseMatrix is compatible with all platforms supported by TensorFlow.
- Make sure all the ops that CSRSparseMatrix flows through support CSRSparseMatrix. Avoid converting back-and-forth between dense and sparse tensors.
- If a tensor is not sparser than a certain threshold (usually over 90% zeros, depending on tensor size, sparsity distribution, and hardware), keeping it as a dense tensor will achieve better performance.
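One simple way to apply this guideline is to measure a tensor's sparsity before choosing a representation. A minimal sketch (the `sparsity` helper and the 90% threshold are illustrative, not part of the proposed API):

```python
import numpy as np

def sparsity(t):
    """Fraction of zero entries in a dense array."""
    return 1.0 - np.count_nonzero(t) / t.size

t = np.diag(np.ones(100))        # 100x100 identity: 99% zeros
use_sparse = sparsity(t) > 0.9   # illustrative threshold, not a fixed rule
print(use_sparse)  # True
```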
The following snippet shows how CSRSparseMatrix can be used in Python.
import tf.linalg.experimental.sparse as csr
# CSR matrix creation. A can be a dense tensor or SparseTensor (COO format).
A_csr = csr.CSRSparseMatrix(A)
# Conversion to other formats.
A_dense = A_csr.to_dense() # To dense tensor, using a member method.
A_dense = csr.to_dense(A_csr) # To dense tensor, using API.
A_coo = A_csr.to_sparse_tensor() # To COO, using a member method.
A_coo = csr.to_sparse_tensor(A_csr) # To COO, using API.
# Generic matrix multiplication interface.
C = csr.matmul(A_csr, B) # Sparse-dense matmul.
C = csr.matmul(A, B_csr) # Dense-sparse matmul.
C_csr = csr.matmul(A_csr, B_csr) # Sparse-sparse matmul.
y = csr.matmul(A_csr, x) # Sparse matrix-vector multiplication (SpMV).
C_csr = A_csr * B_csr # Operator overloading.
# Matrix addition.
C = csr.add(A_csr, B) # A and B can be dense or sparse.
C = A + B_csr # Operator overloading.
# Transpose.
A_transpose_csr = A_csr.transpose() # Through a member method of CSRSparseMatrix.
A_transpose_csr = csr.transpose(A_csr) # Directly through API.
# Sparse Cholesky.
L_csr = csr.cholesky(A_csr)
# Matrix creation.
I = csr.eye(5) # Creates a 5x5 Identity matrix.
A_csr = csr.rand(rows, cols, density) # Generates a random matrix.
# Set diagonal.
A_csr.setdiag(diag) # Through a member method.
C_csr = csr.setdiag(A_csr, diag) # Through API.
# Element-wise processing.
T_csr = csr.threshold(A_csr, threshmin=0) # Thresholding.
C_csr = csr.maximum(A_csr, B_csr) # Element-wise max between two matrices.
nonzero_idx = csr.nonzeros(A_csr) # Returns indices of nonzeros.
# Reduction.
s = csr.sum(A_csr, axis=0) # Sums A_csr along the row dimension.
# Concatenates matrices along a specified axis.
D_csr = csr.concat([A_csr, B_csr, C_csr], axis=1)
- This design will conform to the backward and forward compatibility requirements once it is moved outside the experimental package. It only adds new functionalities without making changes to existing features.
- How this proposal interacts with other parts of the TensorFlow Ecosystem:
- TFLite: TFLite already supports the CSR format. TensorFlow should be able to pass the format to TFLite without problems.
- Distribution strategies: Don’t plan on interacting with this in this initial phase. Could overlap with DTensor.
- tf.function: Should work just like any other ops.
- GPU: We plan to make all CSRSparseMatrix operations work on GPUs.
- TPU: We don’t plan on supporting CSRSparseMatrix on TPUs yet.
- SavedModel: Should work just like any other ops/tensors.
- The library will be rolled out as an experimental package first (tf.linalg.experimental.sparse).
- There might be backward incompatible changes while the library is in the experimental phase.
- Once out of the experimental phase, the package will have official TensorFlow APIs and will conform to TensorFlow’s backward and forward compatibility requirements.
There are CSRSparseMatrix classes on both C++ and Python sides. The C++ CSRSparseMatrix object is stored as a blob in TensorFlow’s Variant tensor. The Python CSRSparseMatrix class is a wrapper of the Variant tensor with some basic matrix manipulation functions such as conversions. The figure below shows the relationship between both classes.
The C++ CSRSparseMatrix class has the following properties:
Property | Description |
---|---|
dtype | The data type of the values. |
dense_shape | The shape of the tensor. |
batch_pointers | Batch offset pointers into col_indices and values. |
row_pointers | Row offset pointers into col_indices and values. |
col_indices | Column indices of each nonzero. |
values | Value of each nonzero. |
The Python CSRSparseMatrix class is a subclass of SparseMatrix, which stores common sparse matrix properties. Other new sparse formats can be added as subclasses of SparseMatrix. CSRSparseMatrix has the following properties:
Property | Description |
---|---|
shape | The shape of the tensor. |
dtype | The data type of the content values. |
csr_matrix | A DT_VARIANT tensor storing the C++ CSRSparseMatrix object blob. Auto-generated Python APIs for CSRSparseMatrix kernels take this as input. |
Variant tensors are perceived as scalars in TensorFlow. For proper shape inference, we store CSRSparseMatrix's shape and data type in a shape inference primitive, ShapeAndType, and access them through input_handle_shapes_and_types and set_output_handle_shapes_and_types during shape inference.
CSRSparseMatrix class APIs

- `__init__(input, indices=None, name=None)`: `input` can be a SparseTensor or a dense tensor.
- `to_dense()`: Converts to a dense tensor.
- `to_sparse_tensor()`: Converts to a SparseTensor.
- `conj()`: Returns the conjugated matrix.
- `transpose()`: Returns the transposed matrix.
- `hermitian_transpose()`: Returns the Hermitian transpose of the matrix.
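For reference, the Hermitian transpose is the conjugate of the transpose, so hermitian_transpose should agree with composing conj and transpose in either order. A NumPy illustration of that identity (not the proposed API itself):

```python
import numpy as np

A = np.array([[1 + 2j, 3 - 1j],
              [0 + 1j, 4 + 0j]])

herm = A.conj().T  # Hermitian transpose: conjugate, then transpose.
# Same result when transposing first, then conjugating.
assert np.array_equal(herm, A.T.conj())
# A[1, 0] == 1j ends up conjugated at position (0, 1).
assert herm[0, 1] == -1j
```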
Generic linear algebra API

- `matmul(a, b, transpose_a, transpose_b)`: `a` and `b` can be sparse or dense. Also handles SpMV.
CSRSparseMatrix-specific APIs

- `sparse_matrix_add(a, b, alpha, beta)`
- `sparse_matrix_ordering_amd(a)`: Returns the approximate minimum degree (AMD) permutation vector of sparse matrix `a`. Used in sparse Cholesky factorization.
- `sparse_matrix_sparse_cholesky(a, ordering_amd)`: Cholesky factorization.
- `sparse_matrix_sparse_qr(a)`: Returns the QR factorization and possibly a permutation matrix `P`.
- `sparse_matrix_sparse_solve(A, b, triangular=false)`: Solves `Ax = b`.
- `sparse_matrix_softmax(logits)`
- `sparse_matrix_softmax_grad(softmax, softmax_grad)`
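For intuition on the solve op, SciPy's `scipy.sparse.linalg.spsolve` performs the analogous computation on a CSR matrix. A sketch of the equivalent calculation (SciPy only; this is not the proposed op):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

# A small symmetric positive-definite sparse system A x = b.
A = csr_matrix(np.array([[4., 0., 1.],
                         [0., 3., 0.],
                         [1., 0., 2.]]))
b = np.array([9., 6., 8.])

x = spsolve(A, b)            # solve directly on the CSR matrix
assert np.allclose(A @ x, b) # residual check
```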
Other sparse APIs follow NumPy and SciPy APIs. See links in Supported Operations.
- Any comments on the APIs?
- Are there any more operations that we should support?
- Should we add CSRSparseMatrix support to existing standard ops as well, e.g., `tf.math.{add,asin,atan,ceil}`, etc.?
- Would love to hear about more use cases.
- For neural networks, would CSR be useful for you (while Block CSR is still a future work)?
- Should we make CSRSparseMatrix a CompositeTensor? Would the effort be worth it since we will transition to the new TFRT/MLIR backend soon? How should this be prioritized?
- Should SparseMatrix replace SparseTensor?