New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Horizontal add function for VectorizedArray #15240
Conversation
include/deal.II/base/vectorization.h
Outdated
/** | ||
* Returns horizontal sum of data field. | ||
*/ | ||
DEAL_II_ALWAYS_INLINE | ||
Number | ||
horizontal_add() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the term "horizontal sum" or "horizontal add", and I suspect that others might not either. Would you mind adding a definition, perhaps in a formula, what this operation does to the elements of the array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name comes from Intel, say this one: https://www.felixcloutier.com/x86/phaddw:phaddd
Regarding the name, I agree that this is not a very intuitive name. I do not like the std::accumulate
name too much, either, because to me accumulate would be across some iterator range starting with some initial entry, where it is here the sum within a vector. I would prefer more verbose name, sum_within_vector
or accumulate_within_vector
. Another question without having checked myself: Does the SVE (scalable vector extensions of ARM) have a similar operation, and what name are they using? They tend to have good names, so it might worth checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @kronbichler says, the name comes from Intel and their ...hadd...
intrinsics. They call what our operator+=
implements a vertical addition (where each lane of a VectorizedArray
does the addition which results in new VectorizedArray
with same width), and the summation over the lanes (with the underlying floating type as a result) a horizontal addition.
SVE has an instruction called ADDV: Tree-based floating-point addition reduction which comes close to what happens.
I could live with the names horizontal_add/horizontal_sum
or accumulate
but we can also use a more verbose name here. (And/or a comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If going by SVE, we would pick 'addition_reduction', which is also an option. I prefer add_within_vector
, but I am fine with both horizontal_add
and accumulate...
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just vec.sum()
?
But to be clear, I don't actually care about the name as long as the documentation has a formula that explains what the function does :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed today, let's go with sum()
. That is essentially what happens. Any further suggestions for documentation? Should we mention that we use a tree-like reduction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tree-like reduction is good to mention, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to @kronbichler in this case. But as a general rule, documentation should state what a function does, not how it does it. The latter is subject to change, and is not of interest to the reader of the documentation anyway. The how can be documented in in-comment commentary, but it should generally not be part of the documentation.
@bergbauer Have you adopted this code from somewhere. If yes, could you add a comment with the reference? |
Even though this code might exist in some variation somewhere else, the actual realization is following the usual x86-intrinsics API capabilities in a straight-forward way. It is in fact very close to what I would have written without internet sources. In fact, the interesting part is how to reduce on the 128 bit vectors, where the code for double variables is the exchange between the lower and upper part here dealii/include/deal.II/base/vectorization.h Line 3692 in 67a7e25
dealii/include/deal.II/base/vectorization.h Lines 4120 to 4125 in 67a7e25
|
include/deal.II/base/vectorization.h
Outdated
@@ -656,6 +656,16 @@ class VectorizedArray | |||
base_ptr[offsets[0]] = data; | |||
} | |||
|
|||
/** | |||
* Returns sum over entries of data field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please expand the text somewhat with the underlying algorithm, i.e.,
* Returns sum over entries of data field. | |
* Returns sum over entries of the data field, $\sum_{i=1}^{\text{size}()} this->data[i]$. |
Alternatively, you can also use a @code
section as done a few lines up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
sum() | ||
{ | ||
return data; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about moving down the implementations to the other ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is more intuitive to put it to the other ones? I kept it here as this is the special case with width == 1
where nothing has to be done except returning data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These classes are a bit of outliers compared to the rest of deal.II because they contain many very short implementations and that all implementations are done at the place of declaration, rather than out-of-line definitions further down as in most other places in deal.II. I suggest to move the implementations in-line if possible in terms of the dependencies AVX-512 -> AVX -> SSE2, otherwise it would be good to move the implementation down. We have a bigger topic to tackle in terms of #11719.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inline implementation works if i arrange the specializations in a different order, see ba78f77 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with this variant for now.
/rebuild |
Add a new function
horizontal_add()
toVectorizedArray
which uses vertical adds of the lower and higher part of a VectorizedArray until a 128 bitVectorizedArray
is reached. Then use SSE2 intrinsics to do the horizontal add and return the horizontal sum.We can use this function when we do vectorization over quadrature points like in
FEPointEvaluation::finish_integrate_fast()
which is more efficient than the code generated from the for loops on current master.@peterrum @kronbichler