
Horizontal add function for VectorizedArray #15240

Merged: 7 commits into dealii:master on May 25, 2023

Conversation

bergbauer (Contributor)

Add a new function horizontal_add() to VectorizedArray that uses vertical adds of the lower and upper halves of a VectorizedArray until a 128-bit VectorizedArray is reached. Then SSE2 intrinsics perform the horizontal add and return the horizontal sum.

We can use this function when vectorizing over quadrature points, as in FEPointEvaluation::finish_integrate_fast(); it is more efficient than the code generated from the for loops on current master.

@peterrum @kronbichler

@peterrum peterrum self-requested a review May 19, 2023 14:18
Comment on lines 659 to 664
/**
* Returns horizontal sum of data field.
*/
DEAL_II_ALWAYS_INLINE
Number
horizontal_add()
Member
I'm not familiar with the term "horizontal sum" or "horizontal add", and I suspect that others might not be either. Would you mind adding a definition, perhaps as a formula, of what this operation does to the elements of the array?

Member

The name comes from Intel; see for example https://www.felixcloutier.com/x86/phaddw:phaddd
Regarding the name, I agree that it is not a very intuitive one. I do not like the std::accumulate name too much either, because to me accumulate suggests working across some iterator range starting with an initial entry, whereas here it is the sum within a vector. I would prefer a more verbose name, sum_within_vector or accumulate_within_vector. Another question, without having checked myself: does SVE (the scalable vector extension of ARM) have a similar operation, and what name does it use? They tend to have good names, so it might be worth checking.

Contributor Author

As @kronbichler says, the name comes from Intel and their ...hadd... intrinsics. They call what our operator+= implements a vertical addition (each lane of a VectorizedArray performs the addition, resulting in a new VectorizedArray of the same width), and the summation over the lanes (with the underlying floating-point type as the result) a horizontal addition.

SVE has an instruction called ADDV, a tree-based floating-point addition reduction, which comes close to what happens here.

I could live with the names horizontal_add/horizontal_sum or accumulate, but we can also use a more verbose name here (and/or a comment).

Member

If going by SVE, we would pick 'addition_reduction', which is also an option. I prefer add_within_vector, but I am fine with both horizontal_add and accumulate... as well.

Member

Why not just vec.sum()?

But to be clear, I don't actually care about the name as long as the documentation has a formula that explains what the function does :-)

Contributor Author

As discussed today, let's go with sum(). That is essentially what happens. Any further suggestions for documentation? Should we mention that we use a tree-like reduction?

Member

Tree-like reduction is good to mention, yes.

Member

I'll defer to @kronbichler in this case. But as a general rule, documentation should state what a function does, not how it does it. The latter is subject to change and is not of interest to the reader of the documentation anyway. The how can be documented in comments inside the implementation, but it should generally not be part of the documentation.

Resolved (outdated) review threads on:
include/deal.II/matrix_free/fe_point_evaluation.h
include/deal.II/base/vectorization.h
include/deal.II/base/vectorization.h
include/deal.II/matrix_free/fe_point_evaluation.h
@peterrum (Member)

@bergbauer Have you adopted this code from somewhere? If yes, could you add a comment with the reference?

@kronbichler (Member)

Have you adopted this code from somewhere? If yes, could you add a comment with the reference?

Even though this code might exist in some variation somewhere else, the actual realization follows the usual x86-intrinsics API capabilities in a straightforward way. It is in fact very close to what I would have written without internet sources. The interesting part is how to reduce on the 128-bit vectors, where the code for double variables is the exchange between the lower and upper part here

out[2 * i + 1].data = _mm_unpackhi_pd(u0, u1);
or for the float variables there is the code
__m128 v0 = _mm_shuffle_ps(u0, u1, 0x44);
__m128 v1 = _mm_shuffle_ps(u0, u1, 0xee);
__m128 v2 = _mm_shuffle_ps(u2, u3, 0x44);
__m128 v3 = _mm_shuffle_ps(u2, u3, 0xee);
out[4 * i + 0].data = _mm_shuffle_ps(v0, v2, 0x88);
out[4 * i + 1].data = _mm_shuffle_ps(v0, v2, 0xdd);
(The code in the vectorized transpose is more general because it needs all lanes to be correct, whereas @bergbauer's code only needs the lowest lane to be correct in the end.)

@@ -656,6 +656,16 @@ class VectorizedArray
base_ptr[offsets[0]] = data;
}

/**
* Returns sum over entries of data field.
Member

Can you please expand the text somewhat with the underlying algorithm, i.e.,

Suggested change
- * Returns sum over entries of data field.
+ * Returns the sum over the entries of the data field, $\sum_{i=0}^{\text{size}()-1} \text{this->data}[i]$.

Alternatively, you can also use a @code section as done a few lines up.

Contributor Author

Sure

Comment on lines +664 to +668
sum()
{
return data;
}

Member

What about moving the implementations down to the other ones?

Contributor Author

Do you think it is more intuitive to put it with the other ones? I kept it here because this is the special case with width == 1, where nothing has to be done except returning data.

Member

These classes are a bit of an outlier compared to the rest of deal.II because they contain many very short implementations, all of which are done at the place of declaration rather than as out-of-line definitions further down, as in most other places in deal.II. I suggest moving the implementations in-line if possible in terms of the dependencies AVX-512 -> AVX -> SSE2; otherwise it would be good to move the implementation down. We have a bigger topic to tackle in terms of #11719.

Contributor Author

Inline implementation works if I arrange the specializations in a different order, see ba78f77.

@kronbichler (Member) left a comment

Let's go with this variant for now.

@kronbichler (Member)

/rebuild

@masterleinad masterleinad merged commit 7fd2229 into dealii:master May 25, 2023
14 checks passed