# Performance analysis of a stencil code in modern C++

Victor Eijkhout, Yojan Chitkara, Daksh Chaplot

Abstract—In this paper we evaluate multiple C++ shared memory programming models with respect to both ease of expression, and resulting performance. We do this by implementing the mathematical algorithm known as the 'power method' in a variety of ways, especially focusing on using modern C++ technniques.

### I. Introduction

In high performance scientific computing, parallelization of loop-based codes is an important step towards efficient execution, and OpenMP has proved a valuable tool for this. While for C++ programmers it is certainly possible to write OpenMP in a C-like style, recent versions of the C++ standard, and certain libraries, have made it possible to code in a more modern style. We can characterize this mode as 'separation of concerns', where the data set and the data-parallel operation applied to each element are specified separately. In this report we test various modes of expressing a common operation.

For our sample 'mini-application' we use the power method (commonly used in eigenvalue calculations, but also *mutatis mutandis* standing for such algorithms as Krylov methods) on the matrix of a Finite Difference stencil. Computational stencils on a Cartesian domain are probably the minimal realistic example of

eijkhout@tacc.utexas.edu, Texas Advanced Computing Center, The University of Texas at Austin.

the PDE code; they also stand for convolutions in graphics and Machine Learning applications.

This papers explores such different expressions of the power method in various ways. First of all, we use recent C++ mechanisms such as <code>mdspan</code> and the <code>cartesian\_product</code> of the range library to formulate the power method as a separation of the algorithm and data-space concern. Then we explore the Kokkos library and the Sycl library / compiler extension, both of which bear out our separation of concerns in that they explicitly apply a lambda expression to an index set.

For our practical tests we primarily compare the different programming models on a modern Intel processor. The results show that the modern C++ mechanisms carry a certain performance disadvantage. Interestingly, this seems to be less for the GNU compiler than for the Intel compiler. We also test the 'C-style' code on a number of different processors. Here the results show that modern processors increasingly mitigate the traditional limitations of a bandwidth-bound code.

# II. ALGORITHM ASPECTS

We briefly discuss the power method, mostly going into its computational aspects. For the mathematical side, see HPC book, section 19.3.

Computationally, the power method is an attractive paradigmatic example in that it exhibits the most common parallelism patterns:

- independent or 'convenient' parallelism;
- global reduction operations
- point-to-point communications.

As follows. The method is given by

Let A a matrix of interest Let x be a random vector For iterations until convergence compute the product  $y \leftarrow Ax$ compute the norm  $\gamma = ||y||$ normalize  $x \leftarrow y/\gamma$ 

where in the limit  $\gamma$  will be the largest eigenvalue of A, and x the corresponding eigenvector

The basis to parallelizing the power method lies in the distribution of x. Each processing element p, whether that be a core or processor, takes care (in a sense that we will later defined) of a disjoint set of indices  $I_p$ . The operations then are:

- The normalization step has independent parallelism: each component  $y_i = x_i/\gamma$  can be independently computed, so each process p can compute  $y_i$  for  $i \in I_p$  fully independently;
- Computing  $\gamma$  is a all-reduction: each processing element p first computes  $\gamma_p = \sqrt{\sum_{i \in I_p} y_i^2}$  independently, and  $\gamma = \sqrt{\sum_p \gamma_p}$  is then formed by combination, and made available to all p;
- Finally, for the operator A we assume a sparse matrix, for instance a Finite Difference (FD) stencil for a Partial Diffential Equation (PDE), or a *convolution* kernel. This is characterized by each component y<sub>i</sub> of the output needing several components x<sub>j</sub> of the input. In our case, putting a two-dimensional indexing on the vectors:

$$y_{ij} = 4x_{ij} - x_{i-1,j} - x_{i+1,j} - x_{i,j-1} - x_{i,j+1}$$

# A. Loop parallelism

In our power method application a first example of a parallel loop is scaling an array by a factor.

```
for ( idxint i=0; i<m; i++ )
  for ( idxint j=0; j<n; j++ )
   out[ IINDEX(i,j,m,n,b) ] = in[
        IINDEX(i,j,m,n,b) ] * factor;</pre>
```

where m, n are the size of the domain and b is the width of a halo region.

Some remarks:

- 1) We use template parameters *idxint* and *real* for integers and floating point quantities, respectively.
- 2) The loops range only over the  $m \times n$  interior part of the domain, which is allocated with size  $(m+2b) \times (n+2b)$ .
- 3) We are using a traditional macro to translate from two-dimensional to onedimensional indexing, skipping the border points:

```
// seq.cpp
#define IINDEX( i,j,m,n,b ) ((i)+b)
  *(n+2*b) + (j)+b
```

Later we will discuss other indexing schemes.

4) For the initial implementation, the out, in data are double\* pointers, for instance obtained as the data member of a std::vector. Later we will explore implementing the data as an mdspan object so we can use the two-dimensional [i, j] type indexing.

#### B. Reduction

The  $\ell_2$  reduction of our example application looks in code like:

```
for ( idxint i=0; i<m; i++ )
  for ( idxint j=0; j<n; j++ ) {
    auto v = out[ IINDEX(i, j, m, n, b) ];
    sum of squares += v*v;</pre>
```



Fig. 1. Structure of a reduction of four items on two threads, where u is the user-supplied initial value, and i is the natural initial value for the reduction operator.

```
}
return std::sqrt(sum of_squares);
```

OpenMP resolves the race condition on the reduction variable by giving each thread a local copy of <code>sum\_of\_squares</code>, summing into that, and adding the local copies together in the end. This is illustrated in figure 1.

# C. Five-point stencil

The evaluation of the 5-point difference operator is perfectly parallel in that there are no dependencies between the points of the output vector. However, the loop is more complicated than the scaling operation above:

While this operation is somewhat like a 'transform' range operation, the complexity of the right-hand-side indexing precludes using an actual range algorithm. Therefore, in later versions of this code we will range over the indices, rather than the actual data.

### III. PARALLELIZATION STRATEGIES

There are various ways we can parallelize our example application.

# A. OpenMP loop parallelism

Above in section II-A we already showed the simplest parallelization scheme: we loop two-dimensionally over the index space, and translate that through a C Preprocessor (CPP) macro to one-dimensional indexing in a traditional container. The loops are then parallelized by OpenMP.

In graphs to follow we indicate by 'oned' the mode where we only parallelize the outer loop. For instance, the norm calculation is:

```
#pragma omp parallel for reduction
   (+:sum_of_squares)
for ( idxint i=0; i<m; i++ )
   for ( idxint j=0; j<n; j++ ) {
      auto v = out[ IINDEX(i,j,m,n,b) ];
      sum_of_squares += v*v;
   }
return std::sqrt(sum_of_squares);
};</pre>
```

We designate by 'clps' this same code, but adding collapse(2) to each OpenMP loop nest.

### B. mdspan indexing

While the scaling and the norm calculation can be expressed as range algorithms, doing so for the five-point stencil update is trickier, so we take a two-step approach:

- We access data through an mdspan, so that have multidimensional indexing; see figure 2
- 2) We define the iteration space as a cartesian\_product range view:

```
const auto& s = data2d();
int b = this->border();
idxint
```

#### Class data:

Fig. 2. Use of mdspan for 2D data access

```
lo_m = static_cast<idxint>(b),
hi_m = static_cast<idxint>(s.
    extent(0)-b),
lo_n = static_cast<idxint>(b),
hi_n = static_cast<idxint>(s.
    extent(1)-b);
range2d = rng::views::
    cartesian_product
( rng::views::iota(lo_m,hi_m),
    rng::views::iota(lo_n,hi_n) )
;
```

This allows us to write, cleanly and compactly:

```
// span.cpp
auto out = this->data2d();
const auto in = other.data2d();
#pragma omp parallel for
for ( auto ij : this->inner() ) {
   auto [i,j] = ij;
   out[ i,j] = 4*in[ i,j ]
        - in[ i-1,j ] - in[ i+1,j ] - in[ i,j
        -1 ] - in[ i,j+1 ];
}
```

Unlike the OpenMP double loop, we now have a single loop spanning the twodimensional domain. Also, note that this domain is the interior of a larger domain.

In graphs to follow we indicate this mode by 'span'.

In the tests below it will become apparent that the two-dimensional iteration space is not without performance problems. We tackle this by having a variant, 'iota', which splits the space orthogonally:

#### Accessor:

```
//! pointer to the data as 2D array
auto@ data2d() {
  return cartesian_data; };
const auto@ data2d() const {
  return cartesian_data; };
```

```
#pragma omp parallel for reduction
   (+:sum_of_squares)
for ( auto i : this->inneri() )
   for ( auto j : this->innerj() ) {
      auto v = array[i, j];
      sum_of_squares += v*v;
   }
  return std::sqrt(sum_of_squares);
};
```

#### C. DIY cartesian product

The cartesian\_product view is quite general so one could wonder about overhead. We write a custom iterator over a contiguous domain. It maintains internally *i*, *j* coordinates which are updated by the operator++. A naive implementation uses integer division, so we also make an optimized implementation without division or branching.

#### 

# D. Kokkos and Sycl

We also use the Kokkos and Sycl libraries in their 'host' mode. In graphs to follow we

};

indicate these modes by 'kokkos' and 'sycl' respectively.

1) Sycl: Sycl is an open standard that targets heterogeneous parallelism through strict standard C++. It has mechanisms for memory management between host CPU and devices (both GPU and FPGA), and for expressing common parallel algorithms.

The preferred mechanism for handling memory coherence is through buffers. Host memory is wrapped in a buffer structure:

```
std::vector<real> Mat_A(msize*nsize);
buffer<real,2> Buf_a(Mat_A.data(),
    range<2>(msize,nsize));
```

This memory can then transparently be accessed in a kernel, whether run on host or device, where coherence is ensured by the runtime. This turns out to be as efficient as more explicit mechanisms.

```
q.submit([&] (handler &h) {
  accessor D_a(Buf_a,h,write_only);
  h.parallel_for
    (range<2>(msize-2,nsize-2),
    [=] (auto index) {
     auto row = index.get_id(0) + 1;
     auto col = index.get_id(1) + 1;
     D_a[row][col] = 1.;
    });
}).wait();
```

Note that Sycl has a true two-dimensional indexing mode.

We see that the <code>parallel\_for</code> construct resembles a C++ range algorithm: it combines a range – though of the index space, not the data space – and a lambda expression to be applied at each point.

Sycl is alone among the models studied here in that it requires index shifting in order to range over the interior of the index space.

2) Kokkos: Kokkos is the execution layer of the Trilinos project. It is, like Sycl, a data-parallel programming mode that supports host

and device execution with the same code. Unlike Sycl there is no explicit queue; instead, objects are explicitly associated with the host or device data space.

```
// diff2d.cpp
using MemSpace = Kokkos::HostSpace;
using Layout = Kokkos::LayoutRight;
using HostMatrixType = Kokkos::View<
    real**, Layout, MemSpace>;
HostMatrixType x("x", msize,nsize);
```

Like Sycl, there are explicit parallelism constructs. These closely resemble C++ range algorithms, specifying an explicit index space over which to iterate, and the lambda expression to apply at each index.

```
Kokkos::parallel_for
  ("Update x",
   Kokkos::MDRangePolicy<Kokkos::Rank
   <2>>
        ({1, 1}, {msize-1, nsize-1}),
   KOKKOS_LAMBDA(int i, int j) {
        x(i, j) = Ax(i, j) / norm;
   });
```

# IV. TIMING COMPARISON AND DISCUSSION

We perform different comparisons:

- different parallelization models on a given compiler and CPU;
- different Intel processor generations, given a model and compiler;
- different compilers, given a model and CPU.

Throughout, the compiler options used are fairly vanilla:

```
// GCC:
-O2 -g -DNDEBUG -std=gnu++23 -fopenmp
// Intel
-O2 -g -DNDEBUG -std=gnu++2b -fiopenmp
```

This was largely done to accommodate the tests on multiple processor types. Optimizing compiler options would introduce another level of complication to our tests.

# A. Comparing parallelization models

Let's compare various implementation strategies. Test given here are on an *Intel Sapphire Rapids* dual socket node with 112 cores total. We compare the Intel 2024 (figure 3) and GCC 13 (figure 4) compilers.

We make the following observations.

1) Intel compiler: First of all, we see that the naive 'oned' OpenMP implementation is clearly the fastest. All of the schemes that use two-dimensional iteration are slower. An indication of the reason for this is outlined in section IV-D: much of the computing time actually goes into incrementing the two-dimensional iterator.

If we accept a constant loss of efficiency for these schemes, we see that most schemes even have a descending efficiency with increasing core count. The reason for this is not clear, especially since the 'clps' and 'diy' implementations do not show this.

Interestingly, on low core counts the 'sycl' and 'kokkos' implementations are relatively competitive. They probably employ an efficient translation to an OpenMP backend, but this makes their gradual loss of efficiency all the more puzzling.

2) GNU compiler: For the GNU compiler the differences between the various models are much smaller. Moreover, most models have constant efficiency, kokkos being the exception.

# B. Comparing processor generations

In figure 5 we compare the following generations of Intel (and one NVidia) processors:

• *Intel Sky Lake* in the *TACC Stampede2* cluster, in a two-socket configuration with a total of 48 cores;

- Intel Cascade Lake in the TACC Frontera cluster, in a two-socket configuration with a total of 56 cores;
- *Intel Ice Lake* in the *TACC Stampede3* cluster, in a two-socket configuration with a total of 80 cores;
- Intel Sapphire Rapids in the TACC Stampede3 cluster, in a two-socket configuration with a total of 112 cores and High Bandwidth Memory;
- *Intel Granite Rapids* in the *TACC Stampede3* cluster, in a two-socket configuration with a total of 240 cores;
- NVidia Grace in the TACC Vista cluster in a two-socket configuration with 144 cores total.

We see that older processor generations stop scaling at approximately half the full core count. Newer processors do not suffer from this limitation. Also they have a higher bandwidth. Interestingly, the new NVidia 'Grace' superchip has a similar bandwidth as the high bandwidth Intel Sapphire Rapids.

- 1) Runtime: In the left graph we see that the runtimes are roughly equal for low core counts; the main difference is that Sky Lake and Cascade Lake reach their maximum performance well short of the total core count, while Ice Lake does not show this behavior. The Ice Lake and later generations keep scaling with increasing core count, the both Rapids generations even almost linearly.
- 2) Bandwidth: Our bandwidth measure is not observed directly, but rather derived by considering the amount of data loads in the algorithm:
  - The five-point stencil application loads 5 elements from the input, loads the output vector, and writes it; this would come to 7 data accesses per *i*, *j* point calculation.
  - However, subsequent points from each i



Fig. 3. Comparing implementation strategies, Intel 2024 compiler on a 112-core Sapphire Rapids node.



Fig. 4. Comparing implementation strategies, Gcc 2024 compiler on a 112-core Sapphire Rapids node.

- or i-1 or i+1 line come from the same cacheline, so effectively we 3 accesses from the input vector, for 5 total.
- Added to this, lines easily fit in L2 cache, so after the line for one i+1 value has been loaded, it will be used as the i line in the next iteration, and the i-1 line in the iteration thereafter. Thus, we really
- have only 1 DRAM access for the input vector per i, j point calculation, for a total of 3 access.
- Finally, the Ice Lake processor can, in certain circumstances, convert the calculation to a 'streaming store', so the load of the output vector doesn't need to be counted.

The right graph of figure 5 shows a a pos-



Fig. 5. Comparing different processors (OpenMP). Time (left) and bandwidth (right) as function of core count.

teriori measure for attained bandwidth.

We see that the Sky Lake and Cascade Lake processors bear out the old truism that the total bandwidth is less than each core's individual bandwidth summed over the cores.

However, newer processor generations scale further, and for the Sapphire Rapids and Granite Rapids even fairly linear. The SPR node has Intel's 'High Bandwidth Memory', given a much higher absolute number.

### C. Comparing compilers

Figures 6 and 7 show a comparison between the Intel 2024 and GCC 13 compilers. While sometimes Intel has a slight edge over GCC, it sometimes performs considerably worse. Inspection of the generated code would be needed to determine the cause of this.

# D. Analysis

One immediate conclusion from the above tests is that it's hard to beat the simple-minded OpenMP implementation with only the outer loop parallelized, although differences are less pronounced for the Gnu compiler than the Intel one.

Doing profiling gives us a clue. Here is the output of running the mdspan / cartesian\_product version (Intel 24 compiler):

```
%% make run_perf VARIANTS=span NSIZE =10000 ECHO=1
```

```
55.60% [.] std::ranges::
cartesian_product_view<std::
ranges::iota_view<long, long>, std
::ranges::iota_view<long, long>
>::_Iterator<true>::operator+=
18.73%
       [.] __divti3
       [.] linalg::
11.33%
bordered_array_span<float>::
central_difference_from
 5.37% [.] linalg::
bordered_array_span<float>::
scale_interior
5.01% [.] linalg::
bordered_array_span<float>::
12norm
2.69%
      [.] ___divti3@plt
```

We see that more than half of the time goes into index calculations, and in particular integer division. On the other hand, benchmarks of



Fig. 6. Comparing Intel to GCC on 'oned' (left) and 'clps' (right) scheme



Fig. 7. Comparing Intel to GCC on 'span' (left) and 'kokkos' (right) scheme

the *mdspan* construct by itself [1] show that it performs as well as the linearized indexing scheme.

VTune profiling tells us something similar (with some post-processing):

```
std::ranges::cartesian_product_view<
    std::ranges::iota_view<long, long
>, std::ranges::iota_view<long,
    long>>::_Iterator<(bool)1>::
```

func@0x404390 8.09852

\_\_\_divti3 8.09808

\_\_udivmodti4 4.61700 linalg::bordered\_array\_span<float >::bordered\_array\_span 0.87979 linalg:: bordered\_array\_span<float>::set 0.87910

\_\_kmp\_get\_global\_thread\_id\_reg 0.00000

cfree 0.03664

### V. CONCLUSION

We have implemented a stencil code, which stands for many scientific applications, using various modern C++ mechanisms. The tentative conclusion is that the modern mechanisms may be simpler to code, but this elegance comes with a certain perforance penalty, more so for the Intel compiler than the GCC compiler. Profiling suggests that the major obstacle is the computation of the (i,j) coordinates from the index range. Close inspection of the generated assembly code would be needed to find how the models (and compilers) differ in this respect, and how the common case of a Cartesian grid can be optimized.

Acknowledgement: This work was supported by the Intel OneAPI Center of Excellence, and the TACC STAR Scholars program, funded by generous gifts from TACC industry partners, including Intel, Shell, Exxon, and Chevron.

Open Source: The code and all build/run scripts for this benchmark can be found at https://github.com/VictorEijkhout/diff2d\_benchmark. Pull requests for other implementations are invited.

#### REFERENCES

[1] D. S. Hollman, B. A. Lelbach, H. C. Edwards, M. Hoemmen, D. Sunderland, and

C. R. Trott, "mdspan in c++: A case study in the integration of performance portable features into international language standards," in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), vol. 74. IEEE, Nov. 2019, pp. 60–70. [Online]. Available: http://dx.doi.org/10.1109/P3HPC49587.2019.00011 [Cited on page 9.]