

# **Balanced CSR Sparse Matrix-Vector Product on Graphics Processors**

Goran Flegar, Enrique S. Quintana-Ortí



- GPU-accelerated sparse linear algebra library
  - Focus: linear systems









- GPU-accelerated sparse linear algebra library
  - Focus: linear systems
  - Iterative, Krylov-subspace based linear solvers
    - SpMV
    - BLAS-1 operations









- GPU-accelerated sparse linear algebra library
  - Focus: linear systems
  - Iterative, Krylov-subspace based linear solvers
    - SpMV
    - BLAS-1 operations
  - Sparse matrix formats & SpMV
    - accelerate each iteration of the solver









- GPU-accelerated sparse linear algebra library
  - Focus: linear systems
  - Iterative, Krylov-subspace based linear solvers
    - SpMV
    - BLAS-1 operations
  - Sparse matrix formats & SpMV
    - accelerate each iteration of the solver
  - Preconditioners
    - reduce the number of iterations









- GPU-accelerated sparse linear algebra library
  - Focus: linear systems
  - Iterative, Krylov-subspace based linear solvers
    - SpMV
    - BLAS-1 operations
  - Sparse matrix formats & SpMV
    - accelerate each iteration of the solver
  - Preconditioners
    - reduce the number of iterations









 3.2
 0
 1.2
 0

 0
 0
 0.4
 0

 2.7
 1.3
 0
 4.1

 0.1
 0
 0
 2.7





| 3.2 | 0   | 1.2 | 0   | 3.2 1.2     |
|-----|-----|-----|-----|-------------|
| 0   | 0   | 0.4 | 0   | 0.4         |
| 2.7 | 1.3 | 0   | 4.1 | 2.7 1.3 4.1 |
| 0.1 | 0   | 0   | 2.7 | 0.1 2.7     |

| 3.2 0 1.2 0   | 3.2 1.2     | 0 2   |
|---------------|-------------|-------|
| 0 0.4 0       | 0.4         | 2     |
| 2.7 1.3 0 4.1 | 2.7 1.3 4.1 | 0 1 3 |
| 0.1 0 0 2.7   | 0.1 2.7     | 0 3   |









... leads to CSC



... leads to CSC



"Standard" approach

| 3.2 | 1.2 | 0.4 | 2.7 | 1.3 | 4.1 | 0.1 | 2.7 | Values (val)            |
|-----|-----|-----|-----|-----|-----|-----|-----|-------------------------|
| 0   | 2   | 2   | 0   | 1   | 3   | 0   | 3   | Column indexes (colidx) |
| 0   | 2   | 3   | 6   |     |     |     |     | Row pointers (rowptr)   |

3.2 | 1.2 | 0.4 | 2.7 | 1.3 | 4.1 | 0.1 | 2.7 | Values (val)

0 2 2 0 1 3 0 3 Column indexes (colidx)

0 2 3 6 Row pointers (rowptr)

y := Ax

```
3.2 1.2 0.4 2.7 1.3 4.1 0.1 2.7 Values (val)

0 2 2 0 1 3 0 3 Column indexes (colidx)

0 2 3 6 Row pointers (rowptr)

y := Ax
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
for (int i = 0; i < m; ++i) {
```

for (int j = rowptr[i]; j < rowptr[i+1]; ++j)</pre>

y[i] += val[j] \* x [ colidx[j] ];

```
UNIVERSITA
JAUME I
```

```
3.2 1.2 0.4 2.7 1.3 4.1 0.1 2.7 Values (val)

0 2 2 0 1 3 0 3 Column indexes (colidx)

0 2 3 6 Row pointers (rowptr)

y := Ax
void Spmv_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    for (int i = 0; i < m; ++i) {
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j) Bell & Garland '08
        y[i] += val[j] * x [ colidx[j] ]; • parallelize outer loop
```

~ cuSPARSE SpMV



for (int j = rowptr[i]; j < rowptr[i+1]; ++j)</pre>

y[i] += val[j] \* x [ colidx[j] ];

Load imbalance! Non-coalescence!

parallelize outer loop

~ cuSPARSE SpMV



$$y := Ax$$

Load imbalance! Non-coalescence!

#### Specialized formats

- HYB ELL + COO
- SELL-P good memory access, parallelizes well
- ... a few new ones every year



$$y := Ax$$

Load imbalance! Non-coalescence!

#### Specialized formats

- HYB ELL + COO
- SELL-P good memory access, parallelizes well
- ... a few new ones every year

**Library code – SpMV is not the only kernel!** 





\* GTX 1080





Can we do better than HYB using CSR?

\* GTX 1080





Can we do better than HYB using CSR?

55x speedup

YES!

\* GTX 1080





Can we do better than HYB using CSR?

55x speedup

YES!

New CSR-I algorithm (and format).

\* GTX 1080



10

nz = 961368

15

 $\times 10^4$ 



Can we do better than HYB using CSR?

55x speedup

#### YES!

New **CSR-I** algorithm (and format).

Only need to add a small amount of data to vanilla CSR. Add - but not modify the existing format!



```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
  for (int i = 0; i < m; ++i) {
    for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
       y[i] += val[j] * x [ colidx[j] ];
}
</pre>
```

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
     for (int j = rowptr[i]; j < rowptr[i+1]; ++j)</pre>
        y[i] += val[j] * x [ colidx[j] ];
                               Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int k = 0; k < T; ++k) {
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 | }}}
```

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
      for (int j = rowptr[i]; j < rowptr[i+1]; ++j)</pre>
        y[i] += val[j] * x [ colidx[j] ];
                                Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
3
    int row = -1, next_row = 0, nnz = rowptr[m];
   for (int k = 0; k < T; ++k) { | Parallelize this!}
5
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 | }}}
```

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
      for (int j = rowptr[i]; j < rowptr[i+1]; ++j)</pre>
        y[i] += val[j] * x [ colidx[j] ];
                                Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
   for (int k = 0; k < T; ++k) { | Parallelize this!}
5
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
       y[row] += val[i] * x[ colidx[i] ];
 | }}}
```

Race conditions!

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
      for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
        y[i] += val[j] * x [ colidx[j] ];
                                Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
3
    int row = -1, next_row = 0, nnz = rowptr[m];
   for (int k = 0; k < T; ++k) { | Parallelize this!}
5
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}}

    Use atomics

      Race conditions! • Accumulate partial
                         result into registers
```

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
      for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
        y[i] += val[j] * x [ colidx[j] ];
                                Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        y[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
3
    int row = -1, next_row = 0, nnz = rowptr[m];
   for (int k = 0; k < T; ++k) { | Parallelize this!}
5
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
                                                               State between outer
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
                                                               loop iterations!
        y[row] += val[i] * x[ colidx[i] ];
 | }}}

    Use atomics

      Race conditions! • Accumulate partial
                         result into registers
```

```
void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
   for (int i = 0; i < m; ++i) {
      for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
        y[i] += val[j] * x [ colidx[j] ];
                                Merge the two loops into one.
  void SpMV_CSR(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int i = 0; i < nnz; ++i) {
        while (i >= next_row) next_row = rowptr[++row+1];
        v[row] += val[i] * x[ colidx[i] ];
 }}
                               Split the loop into equal chunks.
  const int T = thread_count;
  void SpMV_CSRI(int m, int *rowptr, int *colidx, float *val, float *x, float *y) {
3
    int row = -1, next_row = 0, nnz = rowptr[m];
    for (int k = 0; k < T; ++k) { | Parallelize this!}
5
      for (int i = k*nnz / T; i < (k+1)*nnz / T; ++i) {
                                                               State between outer
6
7
        while (i >= next_row) next_row = rowptr[++row+1];
                                                               loop iterations!
        y[row] += val[i] * x[ colidx[i] ];
 | }}}

    Use atomics

                                                              Precompute starting value
      Race conditions! • Accumulate partial
                                                              of "row" for each thread.
                         result into registers
```

CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.

CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.



CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.



CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.



CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.



CUDA thread = 1 lane of a 32-wide SIMD unit (warp)

Spreading out threads causes strided memory access.



### **Performance of CSR-I**

### 100 matrices from SuiteSparse



# Speed-up / slowdown over cuSPARSE CSR



### Speed-up / slowdown over cuSPARSE CSR



#### No format conversion!

- try both, and use the fastest later on!
- sometimes 1 cuSPARSE SpMV = 100 CSR-I SpMVs

CSR-I designed for irregular patterns



CSR-I designed for irregular patterns

How to measure irregularity?

Deviation of row lengths from the mean.

CSR-I designed for irregular patterns

How to measure irregularity?

Deviation of row lengths from the mean.

Is "5" regular or irregular?

Depends on the density of the matrix (mean #rows)

CSR-I designed for irregular patterns

How to measure irregularity?

Deviation of row lengths from the mean.

Is "5" regular or irregular?

Depends on the density of the matrix (mean #rows)

Scatter plot of speedup vs normalized std. dev.



CSR-I designed for irregular patterns

How to measure irregularity?

Deviation of row lengths from the mean.

Is "5" regular or irregular?

Depends on the density of the matrix (mean #rows)

Scatter plot of speedup vs normalized std. dev.



# **Combining both approaches**



### **Conclusion**

Use atomics and warp shuffles to tackle irregular sparsity patterns.

Determine a priori when CSR-I is faster than standard algorithm.

- keep cuSPARSE performance for regular patterns
- CSR-I for irregular ones

### Thank you! Questions?

All functionalities are part of the MAGMA-sparse project.

#### **MAGMA SPARSE**

ROUTINES BiCG, BiCGSTAB, Block-Asynchronous Jacobi, CG,

CGS, GMRES, IDR, Iterative refinement, LOBPCG,

LSQR, QMR, TFQMR

PRECONDITIONERS ILU / IC, Jacobi, ParlLU, ParlLUT, Block Jacobi, ISAI

KERNELS SpMV, SpMM

DATA FORMATS CSR, ELL, SELL-P, CSR5, HYB

http://icl.cs.utk.edu/magma/



github.com/gflegar/talks/europar 2017

This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique S. Quintana-Ortí (Universidad Jaume I).



