# SpMV ...

Brian A. Page Univ. of Notre Dame 326 Cushing Hall Notre Dame, IN (U.S.) bpage1@nd.edu

#### **ABSTRACT**

TBD

### **Categories and Subject Descriptors**

[Theory of computation]: Parallel computing models; [Computing methodologies]: Parallel programming languages; [Software and its engineering]: Multithreading, Multiprocessing, Massively parallel systems, Ultra-large-scale systems, Concurrent programming structures

#### **General Terms**

Architecture

## **Keywords**

multi-threading, parallel systems, mobile threads, memory architectures, performance, PGAS

# 1. PRIOR WORK AND ANALYTIC MODELS

The HPCG benchmark [1] is dominated, in terms of execution time, by SpMV and similar kernels. Fig. 1 plots data taken from recent HPCG reports<sup>1</sup>. The x-axis is the peak flops of the reported system; the y-axis is the ratio of the sustained HPCG flops to the peak bandwidth of the system's memory (derived by identifying the processing chips used and looking up their characteristics). Color and shape distinguish different types of chips and systems, with the red squares representing systems built from server-class chips and the purple markers representing systems using GPUs.

As can be seen, this ratio is independent of the peak system flops capability. In fact, it is relatively flat at about 0.1 flops per byte of memory bandwidth for heavyweight server-class processor chips, and somewhat less than 0.1 for GPUs and other architectures. Since SpMV is the bulk of HPCG, this is an indication that SpMV is relatively independent of core floating point capability, and instead highly dependent on chip memory bandwidth.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SC '15 Austin, Texas USA

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...\$15.00.

A recent complexity analysis of the HPCG benchmark [2] examined HPCG performance as a function of system parameters on a kernel-by-kernel basis. The particular implementation of HPCG that was studied assumed that a sub-matrix of the total matrix was processed in each MPI rank, with each rank executed by a single core. The study rolled these per-kernel numbers up into a total execution time for the whole benchmark as a function of just memory bandwidth and a few network parameters. The model was extremely accurate when compared to measured HPCG data on several benchmarks.

The analysis of the SpMV kernel within HPCG focused only on the in-core time, and computed that each sub-row, as executed by a single thread on a single core, required the following net number of bytes fetched from memory<sup>2</sup>, where $nnz_{row}$ is the average number of non-zeros per row as processed by each core:

$$20 + 20 * nnz_{row} \tag{1}$$

Since each non-zero represents two flops (an add and a multiply), dividing $2*nnz_{row}$ by this byte count yields an estimate of the flops that can be executed per byte of bandwidth from memory:

$$2*nnz_{row}/(20 + 20*nnz_{row}) = 1/(10 + 10/nnz_{row}) \tag{2}$$

For an $nnz_{row}$ of 27 this is about 0.096 flops per byte of bandwidth; equivalently, approximately 10 bytes must be accessed from memory for each flop executed. This correlates well with the roughly 0.1 flops per byte observed for HPCG, as the non-SpMV parts of HPCG require slightly more bytes per flop.
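The byte count in Eq. (1) and the ratio in Eq. (2) can be made concrete with a small sketch. It assumes a CSR (compressed sparse row) storage layout, which the text does not name explicitly; the 20 bytes per non-zero (two 8-byte floating point fetches and one 4-byte index) and the 20 bytes of per-row overhead are taken from footnote 2. All function and constant names are illustrative.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A held in CSR form (assumed layout)."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # Per non-zero: values[k] (8 B), x[col_idx[k]] (8 B), col_idx[k] (4 B).
            acc += values[k] * x[col_idx[k]]  # one multiply + one add = 2 flops
        y.append(acc)
    return y


BYTES_PER_NNZ = 8 + 8 + 4  # per footnote 2
BYTES_PER_ROW = 20         # per footnote 2: cost of starting a new row


def bytes_per_row(nnz_row):
    """Eq. (1): bytes fetched from memory for one row."""
    return BYTES_PER_ROW + BYTES_PER_NNZ * nnz_row


def flops_per_byte(nnz_row):
    """Eq. (2): two flops per non-zero divided by the bytes fetched."""
    return 2 * nnz_row / bytes_per_row(nnz_row)


print(bytes_per_row(27))             # 560
print(round(flops_per_byte(27), 4))  # 0.0964
```

The same helper shows how the ratio drops for sparser rows: with the $nnz_{row}$ of 6.98 listed in Table 1, the per-row overhead is amortized over fewer non-zeros and the model gives roughly 0.087 flops per byte.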

Multiplying this by the actual sustainable memory bandwidth of a node should then estimate the sustainable flops per second for SpMV running on all the cores in that node. For its projections, [2] uses the bandwidth reported by the Triad kernel of the STREAM benchmark. The first three rows of Table 1 summarize the characteristics of the three chips used in systems modelled by [2], including the ratio of the reported STREAM bandwidth to the maximum memory bandwidth as projected from the chip's characteristics.
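As a hedged illustration of this projection, the sketch below multiplies the flops-per-byte figure from Eq. (2) by the node-level STREAM bandwidths listed in Table 1, and recomputes the STREAM-to-peak ratio. The resulting estimates are arithmetic consequences of the model only; they are not values reported in [2] or measured for this paper. The dictionary and function names are illustrative.

```python
# Node-level bandwidths in GB/s copied from Table 1: (peak, STREAM).
NODES = {
    "E5-2670": (102.4, 75.28),
    "6276": (102.4, 54.4),
    "X5560": (64.0, 27.44),
}


def flops_per_byte(nnz_row):
    """Eq. (2) in its simplified form."""
    return 1.0 / (10.0 + 10.0 / nnz_row)


def estimated_spmv_gflops(stream_gbs, nnz_row=27):
    """Model estimate: sustainable bandwidth (GB/s) times flops per byte."""
    return stream_gbs * flops_per_byte(nnz_row)


for chip, (peak, stream) in NODES.items():
    ratio = stream / peak
    est = estimated_spmv_gflops(stream)
    print(f"{chip}: STREAM/peak = {ratio:.1%}, model estimate = {est:.2f} GF/s")
```

Under these assumptions the dual-socket E5-2670 node comes out at a little over 7 GF/s, far below its peak floating point rate, which is consistent with the bandwidth-bound behavior described above.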

# 2. REFERENCES

<sup>2</sup>The paper computed a value of 27 for the average number of non-zeros per row partition, and each non-zero required two 8-byte fetches of floating point data and one 4-byte index reference, with another 20 bytes for starting the computation of a new row.

<sup>1</sup>http://www.hpcg-benchmark.org/

Figure 1: HPCG Flops per Byte of Memory Bandwidth vs. Peak flops.

| Chip Type | Chip Cores | Chip Total Memory Channels | Chip Peak B/W (GB/s) | Chip Peak Flops (GF/s) | Node Chips | Node Peak B/W (GB/s) | Node STREAM B/W (GB/s) | Node STREAM/Peak Ratio | $nnz_{row}$ | Estimated SpMV (GF/s) | Measured SpMV (GF/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chips used in Reference for HPCG Benchmark | | | | | | | | | | | |
| E5-2670 | 8 | 4 | 51.2 | 166.4 | 2 | 102.4 | 75.28 | 73.5% | 27 | | |
| 6276 | 16 | 4 | 51.2 | 147.2 | 2 | 102.4 | 54.4 | 53.1% | 27 | | |
| X5560 | 4 | 3 | 32 | 44.8 | 2 | 64 | 27.44 | 42.9% | 27 | | |
| Chips used in Reference for SpMV Benchmark | | | | | | | | | | | |
| X5650 | 6 | 3 | 32 | 63.84 | 2 | 64 | N/A | N/A | 6.98 | | 1.9 |
| E5-2660 | 8 | 4 | 51.2 | 140.8 | 2 | 102.4 | N/A | N/A | 6.98 | | 5.3 |
| Chips used in this paper | | | | | | | | | | | |
| E5-2650v2 | 8 | 4 | 59.7 | 166.4 | | | | | | | |

Table 1: SpMV Projection Based on System Parameters.

- [1] J. Dongarra and M. Heroux. Toward a new metric for ranking high performance computing systems. Sandia Report SAND2013-4744, Sandia National Labs, June 2013.
- [2] V. Marjanović, J. Gracia, and C. W. Glass. Performance modeling of the HPCG benchmark. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, pages 172–192. Springer International Publishing, 2014.