Use vector intrinsics for tensor operations. #16848
I didn't have time to test it in depth this week, but what I see so far is not very encouraging. Same benchmark setup as described here. Timings with f844528 (no alignment or intrinsics):
Timings with dd200c8 (with alignment and intrinsics):
What is noticeable is that both results are significantly faster than the results I posted last week, but that is independent of the tensor alignment and probably related to the improvements to MappingCartesian and other recent PRs. The particle algorithms are still slower with alignment (reasonable, if they are memory-bandwidth limited). However, what is more concerning is that the Stokes and temperature assembly is also slower (Stokes slightly, 2%; temperature significantly, 20%). I would think this could be noise, but the solver timings and matrix setup are almost identical, so maybe there is more to it?
Well, that's kind of disappointing :-( Here are some tests of my own, using a slightly modified version of step-22 as an example. For 2d computations, for which we do not use padding elements and so the memory traffic should remain the same, I get the following without this patch on the last refinement cycle (3 successive runs shown to give a measure of variability):
And with this patch:
The operations I would expect to benefit from improvements with tensors are "assemble" and "refine mesh". For these, we have best times of 2.36 and 0.492 seconds (without this patch), and 2.28 and 0.488 seconds (with this patch), respectively. These are improvements on the order of 3.5% and 1%. That's better than nothing, but also not a whole lot.
Here are the 3d results. Before the patch:
After the patch:
Again the minimum times for "assemble" and "refine mesh", respectively:
So we're getting slower by a bit (though roughly within the noise level). In other words, this is again disappointing, but comports with what @gassmoeller observes.
I'll leave this open for now, in case others want to discuss this some more. But my take is that this isn't going anywhere. I have one other idea (not using padding elements + alignment), but creating the
Yes, several of my PRs in the last week were specifically motivated by what I saw when looking at the benchmark in #16796 (comment).
OK, let's close this. It was an interesting experiment, but not worth it. |
This is the patch I've been working toward in #16465. It uses @kronbichler's idea to align the tensor class as appropriate, and then in the relevant functions does a `reinterpret_cast` to a `VectorizedArray` type on which we can call vector intrinsics to do the actual work.

There are a number of commits to this PR:
- The `Tensor<1,2,float>` case is excluded, because no vector intrinsics are available for it; see "VectorizedArray<float,2> does not exist." #16827.
- Functions can no longer be `constexpr` if we do `reinterpret_cast`s. This is a step back to what we had done before, but I think that the practical impact of no longer being able to call certain functions in `constexpr` contexts is relatively limited. I had to remove a number of checks from tests for this, but we never used this anywhere in the library itself.
- There are fixes for issues in `symmetric_tensor.h` that were previously not warned about; see "Mark DEAL_II_ALWAYS_INLINE functions as 'inline'." #16838.

There is one open issue I need to figure out how to address: We liberally use `boost::small_vector<Point<dim>,N>`, which internally makes sure that the data array is properly aligned -- or at least tries to. But there is a bug somewhere (in boost, I believe) in that it only ensures the proper alignment for the internal static storage, but forgets about it when it has to do dynamic memory allocation. We only run into this in one test, `mappings/mapping_q_manifold_02.debug`, where we use a Q5 mapping with 208 evaluation points. For the moment, I'm increasing the static size of the array in which this happens to 216 in commit 11, but that is something I need to look into more. I just wanted to post the patch to see what people have to say about it.

@gassmoeller If you're curious, run your benchmark on it!