
Horizontal add function for VectorizedArray #15240

Merged: 7 commits into dealii:master on May 25, 2023

Conversation

bergbauer (Contributor)

Add a new function horizontal_add() to VectorizedArray that uses vertical adds of the lower and upper halves of a VectorizedArray until a 128-bit VectorizedArray is reached. Then SSE2 intrinsics perform the horizontal add and return the horizontal sum.

We can use this function when vectorizing over quadrature points, as in FEPointEvaluation::finish_integrate_fast(); it is more efficient than the code generated from the for loops on current master.

@peterrum @kronbichler

@peterrum peterrum self-requested a review May 19, 2023 14:18
Comment on lines 659 to 664
/**
* Returns horizontal sum of data field.
*/
DEAL_II_ALWAYS_INLINE
Number
horizontal_add()
Member
I'm not familiar with the term "horizontal sum" or "horizontal add", and I suspect that others might not be either. Would you mind adding a definition, perhaps as a formula, of what this operation does to the elements of the array?

Member

The name comes from Intel; see for example https://www.felixcloutier.com/x86/phaddw:phaddd
Regarding the name, I agree that it is not a very intuitive one. I do not like the std::accumulate name too much either, because to me accumulate suggests working across some iterator range starting with an initial entry, whereas here it is the sum within a vector. I would prefer a more verbose name, sum_within_vector or accumulate_within_vector. Another question, without having checked myself: does SVE (the scalable vector extension of ARM) have a similar operation, and what name does it use? They tend to have good names, so it might be worth checking.

Contributor Author

As @kronbichler says, the name comes from Intel and their ...hadd... intrinsics. They call what our operator+= implements a vertical addition (each lane of a VectorizedArray performs the addition, resulting in a new VectorizedArray of the same width), and the summation over the lanes (with the underlying floating-point type as the result) a horizontal addition.

SVE has an instruction called ADDV, a tree-based floating-point addition reduction, which comes close to what happens here.

I could live with the names horizontal_add/horizontal_sum or accumulate, but we can also use a more verbose name here (and/or a comment).

Member

If going by SVE, we would pick 'addition_reduction', which is also an option. I prefer add_within_vector, but I am fine with both horizontal_add and accumulate... as well.

Member

Why not just vec.sum()?

But to be clear, I don't actually care about the name as long as the documentation has a formula that explains what the function does :-)

Contributor Author

As discussed today, let's go with sum(). That is essentially what happens. Any further suggestions for documentation? Should we mention that we use a tree-like reduction?

Member

Tree-like reduction is good to mention, yes.

Member

I'll defer to @kronbichler in this case. But as a general rule, documentation should state what a function does, not how it does it. The latter is subject to change and is not of interest to the reader of the documentation anyway. The how can be documented in comments inside the implementation, but it should generally not be part of the documentation.

Resolved (outdated) review threads on:
include/deal.II/matrix_free/fe_point_evaluation.h
include/deal.II/base/vectorization.h
include/deal.II/base/vectorization.h
include/deal.II/matrix_free/fe_point_evaluation.h
@peterrum (Member)

@bergbauer Have you adopted this code from somewhere? If yes, could you add a comment with the reference?

@kronbichler (Member)

Have you adopted this code from somewhere? If yes, could you add a comment with the reference?

Even though this code might exist in some variation somewhere else, the actual realization follows the usual x86-intrinsics API capabilities in a straightforward way. It is in fact very close to what I would have written without internet sources. The interesting part is how to reduce on the 128-bit vectors, where the code for double variables is the exchange between the lower and upper part here

out[2 * i + 1].data = _mm_unpackhi_pd(u0, u1);
or for the float variables there is the code
__m128 v0 = _mm_shuffle_ps(u0, u1, 0x44);
__m128 v1 = _mm_shuffle_ps(u0, u1, 0xee);
__m128 v2 = _mm_shuffle_ps(u2, u3, 0x44);
__m128 v3 = _mm_shuffle_ps(u2, u3, 0xee);
out[4 * i + 0].data = _mm_shuffle_ps(v0, v2, 0x88);
out[4 * i + 1].data = _mm_shuffle_ps(v0, v2, 0xdd);
(The code in the vectorized transpose is more general because it needs all lanes to be correct, whereas @bergbauer's code only needs the lowest lane to be correct in the end.)

@@ -656,6 +656,16 @@ class VectorizedArray
base_ptr[offsets[0]] = data;
}

/**
* Returns sum over entries of data field.
Member

Can you please expand the text somewhat with the underlying algorithm, i.e.,

Suggested change
- * Returns sum over entries of data field.
+ * Returns the sum over the entries of the data field, $\sum_{i=0}^{\text{size}()-1} \text{this->data}[i]$.

Alternatively, you can also use a @code section as done a few lines up.

Contributor Author

Sure

Comment on lines +664 to +668
sum()
{
return data;
}

Member

What about moving the implementations down to the other ones?

Contributor Author

Do you think it is more intuitive to put it with the other ones? I kept it here because this is the special case with width == 1, where nothing has to be done except returning data.

Member

These classes are a bit of an outlier compared to the rest of deal.II because they contain many very short implementations, all of which are done at the place of declaration rather than as out-of-line definitions further down, as in most other places in deal.II. I suggest moving the implementations in-line if possible in terms of the dependencies AVX-512 -> AVX -> SSE2; otherwise it would be good to move the implementation down. We have a bigger topic to tackle in terms of #11719.

Contributor Author

Inline implementation works if I arrange the specializations in a different order, see ba78f77.

@kronbichler (Member) left a comment

Let's go with this variant for now.

@kronbichler (Member)

/rebuild

@masterleinad masterleinad merged commit 7fd2229 into dealii:master May 25, 2023
14 checks passed