Use vector intrinsics for tensor operations. #16848

bangerth · 2024-04-04T21:08:28Z

This is the patch I've been working toward in #16465. It uses @kronbichler's idea to align the tensor class as appropriate, and then, in the relevant functions, does a reinterpret_cast to a VectorizedArray type on which we can call vector intrinsics to do the actual work.
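As a rough illustration of the align-then-cast idea, here is a minimal standalone sketch. It is not the code of this patch: `Tensor2` and `Vec2d` are hypothetical stand-ins for `Tensor<1,2,double>` and `VectorizedArray<double,2>`, and the sketch assumes GCC or Clang because it relies on their vector extension types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for VectorizedArray<double,2>: a GCC/Clang vector
// extension type holding two doubles. 'may_alias' allows us to view other
// objects through a pointer to this type.
typedef double Vec2d __attribute__((vector_size(16), may_alias));

// Hypothetical stand-in for Tensor<1,2,double>, aligned to the vector width.
struct alignas(16) Tensor2
{
  double values[2];
};

// Add two tensors with a single vector operation. The function cannot be
// constexpr because it uses reinterpret_cast.
inline Tensor2 add(const Tensor2 &a, const Tensor2 &b)
{
  // The casts below are only valid if the inputs are suitably aligned.
  assert(reinterpret_cast<std::uintptr_t>(&a) % alignof(Vec2d) == 0);
  assert(reinterpret_cast<std::uintptr_t>(&b) % alignof(Vec2d) == 0);

  Tensor2 result;
  *reinterpret_cast<Vec2d *>(result.values) =
    *reinterpret_cast<const Vec2d *>(a.values) +
    *reinterpret_cast<const Vec2d *>(b.values);
  return result;
}
```

Calling `add(a, b)` on two `Tensor2` objects then performs both component additions in one vector operation, provided the compiler maps `Vec2d` to a hardware register.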

There are a number of commits to this PR:

The first three re-do the commits that the revert patch "Revert aligning tensors" (#16823) had undone; they were originally added in "Align Tensor<rank,dim,Number> to make vectorization possible" (#16771) and "No longer output the size of tensors into output file of a test" (#16816).
The fourth commit disables vectorization for the Tensor<1,2,float> case because no vector intrinsics are available for it; see "VectorizedArray<float,2> does not exist" (#16827).
The fifth commit adds the padding elements we need to run size-2 or size-4 vector intrinsics on data of size 2, 3, or 4.
The sixth and seventh commits deal with the issue that we can no longer mark functions as constexpr if they do reinterpret_casts. This is a step back from what we had before, but I think that the practical impact of no longer being able to call certain functions in constexpr contexts is relatively limited. I had to remove a number of checks from tests for this, but we never used this anywhere in the library itself.
Commit eight fixes some things in symmetric_tensor.h that were previously not warned about; see "Mark DEAL_II_ALWAYS_INLINE functions as 'inline'" (#16838).
Commit 9 removes duplicate code and so makes the conversion to vectorization simpler.
Commit 10 does the actual conversion to using vector intrinsics. This is the one you should be looking at if you want to see the heart of this patch.
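The padding and constexpr points in the commit list can be sketched as follows. This is an illustrative example under stated assumptions, not code from the patch: `Tensor3`, `Vec4d`, `dot_scalar`, and `dot_vectorized` are hypothetical names, and the vector type again assumes GCC or Clang. A 3-component tensor stores a fourth padding element that is kept at zero, so a 4-wide operation can be applied to it. The scalar version can still be evaluated at compile time, whereas the version using reinterpret_cast cannot be marked constexpr.

```cpp
#include <cassert>

// Hypothetical 4-wide vector type (GCC/Clang vector extension).
typedef double Vec4d __attribute__((vector_size(32), may_alias));

// Hypothetical stand-in for a padded Tensor<1,3,double>: three components
// plus one padding element that must stay zero.
struct alignas(32) Tensor3
{
  double values[4];
};

// Scalar version: usable in constant expressions.
constexpr double dot_scalar(const Tensor3 &a, const Tensor3 &b)
{
  return a.values[0] * b.values[0] + a.values[1] * b.values[1] +
         a.values[2] * b.values[2];
}

// Vector version: cannot be constexpr because reinterpret_cast is not
// allowed in constant expressions.
inline double dot_vectorized(const Tensor3 &a, const Tensor3 &b)
{
  const Vec4d product = *reinterpret_cast<const Vec4d *>(a.values) *
                        *reinterpret_cast<const Vec4d *>(b.values);
  // The zero padding element contributes nothing to the sum.
  return product[0] + product[1] + product[2] + product[3];
}

// Compile-time evaluation works only for the scalar version.
static_assert(dot_scalar(Tensor3{{1., 2., 3.}}, Tensor3{{4., 5., 6.}}) == 32.0,
              "scalar dot product is usable at compile time");
```

Replacing `dot_scalar` by `dot_vectorized` in the `static_assert` would fail to compile, which mirrors the test checks that had to be removed.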

There is one open issue I need to figure out how to address: We liberally use boost::small_vector<Point<dim>,N>, which internally makes sure that the data array is properly aligned -- or at least tries to. But there is a bug somewhere (in boost, I believe) in that it only ensures the proper alignment for the internal static storage, but forgets about it when it has to allocate memory dynamically. We only run into this in one test, mappings/mapping_q_manifold_02.debug, where we use a Q5 mapping with 208 evaluation points. For the moment, I'm increasing the static size of the array in which this happens to 216 in commit 11, but that is something I need to look into more. I just wanted to post the patch to see what people have to say about it.
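One way to detect this kind of problem is a debug check on the addresses of the stored elements. The sketch below does not reproduce the boost behavior and does not use boost at all; `PaddedPoint`, `is_aligned`, and `all_aligned` are hypothetical helpers that only show what such a check could look like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical over-aligned element type, standing in for a padded Point<3>.
struct alignas(32) PaddedPoint
{
  double values[4];
};

// Return whether the address 'p' is a multiple of 'alignment'.
inline bool is_aligned(const void *p, const std::size_t alignment)
{
  assert(alignment != 0);
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Check every element in the iterator range [begin, end).
template <typename Iterator>
bool all_aligned(Iterator begin, Iterator end, const std::size_t alignment)
{
  for (Iterator it = begin; it != end; ++it)
    if (!is_aligned(&*it, alignment))
      return false;
  return true;
}
```

Running `all_aligned(v.begin(), v.end(), alignof(PaddedPoint))` once on a container that still fits into its static storage and once after it has grown beyond it would show whether the dynamically allocated buffer loses the alignment.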

@gassmoeller If you're curious, run your benchmark on it!

gassmoeller · 2024-04-05T19:20:50Z

I didn't have time to test it in depth this week, but what I see so far is not very encouraging. Same benchmark setup as described here.

Timings with f844528 (no alignment or intrinsics):

+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      77.5s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |         3 |      9.34s |        12% |
| Assemble temperature system      |         3 |      1.53s |         2% |
| Build Stokes preconditioner      |         3 |      5.29s |       6.8% |
| Build temperature preconditioner |         3 |     0.445s |      0.57% |
| Initialization                   |         1 |    0.0822s |      0.11% |
| Particles: Advect                |         6 |      10.9s |        14% |
| Particles: Copy                  |         3 |       8.2s |        11% |
| Particles: Generate              |         1 |      10.4s |        13% |
| Particles: Initialization        |         1 |  4.46e-05s |         0% |
| Particles: Initialize properties |         1 |         4s |       5.2% |
| Particles: Sort                  |         6 |      23.2s |        30% |
| Postprocessing                   |         3 |    0.0728s |         0% |
| Setup dof systems                |         1 |    0.0587s |         0% |
| Setup initial conditions         |         1 |      14.5s |        19% |
| Setup matrices                   |         1 |     0.977s |       1.3% |
| Solve Stokes system              |         3 |      2.62s |       3.4% |
| Solve temperature system         |         3 |     0.142s |      0.18% |
+----------------------------------+-----------+------------+------------+

Timings with dd200c8 (with alignment and intrinsics):

+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      86.4s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |         3 |      9.53s |        11% |
| Assemble temperature system      |         3 |      1.86s |       2.2% |
| Build Stokes preconditioner      |         3 |      5.41s |       6.3% |
| Build temperature preconditioner |         3 |     0.455s |      0.53% |
| Initialization                   |         1 |    0.0823s |         0% |
| Particles: Advect                |         6 |      13.1s |        15% |
| Particles: Copy                  |         3 |      9.64s |        11% |
| Particles: Generate              |         1 |      10.8s |        12% |
| Particles: Initialization        |         1 |  6.39e-05s |         0% |
| Particles: Initialize properties |         1 |      4.47s |       5.2% |
| Particles: Sort                  |         6 |        27s |        31% |
| Postprocessing                   |         3 |     0.105s |      0.12% |
| Setup dof systems                |         1 |    0.0628s |         0% |
| Setup initial conditions         |         1 |      15.4s |        18% |
| Setup matrices                   |         1 |     0.972s |       1.1% |
| Solve Stokes system              |         3 |       2.6s |         3% |
| Solve temperature system         |         3 |     0.144s |      0.17% |
+----------------------------------+-----------+------------+------------+

What is noticeable is that both results are significantly faster than the results I posted last week, but that is independent of the tensor alignment and probably related to the improvements to MappingCartesian and other recent PRs. The particle algorithms are still slower with alignment (reasonable, if they are memory bandwidth limited). However, what is more concerning is that the Stokes and temperature assemblies are also slower (Stokes slightly, about 2%; temperature significantly, about 20%). I would think this could be noise, but the solver timings and matrix setup are almost identical, so maybe there is more to it?

bangerth · 2024-04-06T17:10:23Z

Well, that's kind of disappointing :-(

Here are some tests of my own, using a slightly modified version of step-22 as an example. For 2d computations, for which we do not use padding elements and so the memory traffic should remain the same, I get the following without this patch on the last refinement cycle (3 successive runs shown to give a measure of variability):

Refinement cycle 5
   Number of active cells: 128608
   Number of degrees of freedom: 1175133 (1044242+130891)
   Assembling...
   Computing preconditioner...
   Solving...  11 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      94.9s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.36s |       2.5% |
| compute preconditioner          |         1 |      68.6s |        72% |
| output results                  |         1 |      1.37s |       1.4% |
| refine mesh                     |         1 |       0.5s |      0.53% |
| setup                           |         1 |      3.16s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      95.6s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.42s |       2.5% |
| compute preconditioner          |         1 |        69s |        72% |
| output results                  |         1 |       1.4s |       1.5% |
| refine mesh                     |         1 |     0.501s |      0.52% |
| setup                           |         1 |      3.18s |       3.3% |
| solve                           |         1 |      19.1s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      95.5s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.39s |       2.5% |
| compute preconditioner          |         1 |      69.1s |        72% |
| output results                  |         1 |      1.37s |       1.4% |
| refine mesh                     |         1 |     0.492s |      0.51% |
| setup                           |         1 |      3.19s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

And with this patch:

Refinement cycle 5
   Number of active cells: 128608
   Number of degrees of freedom: 1175133 (1044242+130891)
   Assembling...
   Computing preconditioner...
   Solving...  11 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        95s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.28s |       2.4% |
| compute preconditioner          |         1 |      68.8s |        72% |
| output results                  |         1 |      1.34s |       1.4% |
| refine mesh                     |         1 |     0.493s |      0.52% |
| setup                           |         1 |      3.16s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      96.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.31s |       2.4% |
| compute preconditioner          |         1 |      69.6s |        72% |
| output results                  |         1 |      1.34s |       1.4% |
| refine mesh                     |         1 |     0.488s |      0.51% |
| setup                           |         1 |      3.27s |       3.4% |
| solve                           |         1 |      19.1s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      96.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.32s |       2.4% |
| compute preconditioner          |         1 |      69.3s |        72% |
| output results                  |         1 |      1.38s |       1.4% |
| refine mesh                     |         1 |     0.495s |      0.51% |
| setup                           |         1 |      3.18s |       3.3% |
| solve                           |         1 |      19.4s |        20% |
+---------------------------------+-----------+------------+------------+

The operations I would expect to benefit from improvements with tensors are "assemble" and "refine mesh". For these, we have best times of 2.36 and 0.492 seconds (without this patch), and 2.28 and 0.488 (with this patch), respectively. These are improvements on the order of 3.5% and 1%. That's better than nothing but also not a whole lot.
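The comparison above takes the best of three runs for each configuration and then forms the relative change. A small sketch of that arithmetic, with hypothetical helper names `best_time` and `improvement_percent`, might look like this:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Best (smallest) wall time among several runs of the same section.
inline double best_time(const std::vector<double> &runs)
{
  assert(!runs.empty());
  return *std::min_element(runs.begin(), runs.end());
}

// Relative improvement in percent of 'after' over 'before', each measured
// by its best run. Positive values mean the patched code is faster.
inline double improvement_percent(const std::vector<double> &before,
                                  const std::vector<double> &after)
{
  const double b = best_time(before);
  const double a = best_time(after);
  return 100.0 * (b - a) / b;
}
```

With the 2d "assemble" timings from the tables, `improvement_percent({2.36, 2.42, 2.39}, {2.28, 2.31, 2.32})` evaluates to roughly 3.4, and the "refine mesh" timings `{0.5, 0.501, 0.492}` versus `{0.493, 0.488, 0.495}` give roughly 0.8.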

bangerth · 2024-04-06T17:38:04Z

Here are the 3d results. Before the patch:

Refinement cycle 3
   Number of active cells: 3168
   Number of degrees of freedom: 93176 (89043+4133)
   Assembling...
   Computing preconditioner...
   Solving...  15 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        66s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.06s |       3.1% |
| compute preconditioner          |         1 |       1.2s |       1.8% |
| output results                  |         1 |    0.0892s |      0.14% |
| refine mesh                     |         1 |    0.0384s |         0% |
| setup                           |         1 |     0.688s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.2s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |       2.1s |       3.2% |
| compute preconditioner          |         1 |      1.19s |       1.8% |
| output results                  |         1 |    0.0893s |      0.13% |
| refine mesh                     |         1 |    0.0417s |         0% |
| setup                           |         1 |     0.698s |       1.1% |
| solve                           |         1 |      62.1s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        66s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |         2s |         3% |
| compute preconditioner          |         1 |      1.15s |       1.7% |
| output results                  |         1 |    0.0907s |      0.14% |
| refine mesh                     |         1 |    0.0404s |         0% |
| setup                           |         1 |     0.681s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

After the patch:

Refinement cycle 3
   Number of active cells: 3168
   Number of degrees of freedom: 93176 (89043+4133)
   Assembling...
   Computing preconditioner...
   Solving...  15 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.6s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.05s |       3.1% |
| compute preconditioner          |         1 |      1.16s |       1.7% |
| output results                  |         1 |    0.0938s |      0.14% |
| refine mesh                     |         1 |    0.0469s |         0% |
| setup                           |         1 |     0.656s |      0.98% |
| solve                           |         1 |      62.6s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.09s |       3.2% |
| compute preconditioner          |         1 |      1.21s |       1.8% |
| output results                  |         1 |    0.0926s |      0.14% |
| refine mesh                     |         1 |    0.0533s |         0% |
| setup                           |         1 |     0.679s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.08s |       3.1% |
| compute preconditioner          |         1 |      1.21s |       1.8% |
| output results                  |         1 |    0.0933s |      0.14% |
| refine mesh                     |         1 |    0.0488s |         0% |
| setup                           |         1 |      0.68s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

Again the minimum times for "assemble" and "refine mesh", respectively:

Without this patch: 2.00 and 0.0384 seconds
With this patch: 2.05 and 0.0469 seconds

So we're getting slower by a bit (though about within the noise level). In other words, this is again disappointing, but comports with what @gassmoeller observes.

bangerth · 2024-04-06T17:40:36Z

I'll leave this open for now, in case others want to discuss this some more. But my take is that this isn't going anywhere.

I have one other idea: not using padding elements and alignment at all, but instead creating the VectorizedArray objects on the fly via unaligned loads of exactly dim elements. I'll report back in a separate PR once I have that working.
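A minimal sketch of what such on-the-fly loading could look like, assuming GCC or Clang vector extensions in place of VectorizedArray: the hypothetical function `dot` below copies exactly `dim` values into a zero-filled 4-wide vector, so the stored data needs neither padding nor special alignment. How this would actually be implemented in the follow-up PR is not specified here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical 4-wide vector type (GCC/Clang vector extension).
typedef double Vec4d __attribute__((vector_size(32)));

// Dot product of two tensors stored as 'dim' contiguous doubles each,
// without any padding or alignment requirement on the storage.
template <int dim>
inline double dot(const double *a, const double *b)
{
  static_assert(dim >= 1 && dim <= 4, "dim must fit into the 4-wide vector");
  assert(a != nullptr && b != nullptr);

  // Start from zero so that unused lanes do not contribute to the result.
  Vec4d va = {0., 0., 0., 0.};
  Vec4d vb = {0., 0., 0., 0.};

  // Unaligned load of exactly 'dim' elements into the vector registers.
  std::memcpy(&va, a, dim * sizeof(double));
  std::memcpy(&vb, b, dim * sizeof(double));

  const Vec4d product = va * vb;
  return product[0] + product[1] + product[2] + product[3];
}
```

For two 3-component tensors stored back to back in one array `data` of six doubles, `dot<3>(data, data + 3)` works even though the second tensor starts at a 24-byte offset.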

kronbichler · 2024-04-06T18:12:24Z

> What is noticeable is that both results are significantly faster than the results I posted last week, but that is independent of the tensor alignment and probably related to the improvements to MappingCartesian and other recent PRs.

Yes, several of my PRs in the last week were specifically motivated by what I saw when looking at the benchmark in #16796 (comment).

bangerth · 2024-04-10T20:35:41Z

OK, let's close this. It was an interesting experiment, but not worth it.

bangerth added the ready to test label Apr 4, 2024

bangerth force-pushed the realign branch from 1a66b77 to da1bc7e Compare April 4, 2024 22:43

bangerth added 6 commits April 4, 2024 19:41

No longer output the size of tensors into output file of a test.

0b88c8d

Add a changelog entry for the alignment of tensors.

8f58607

Align Tensor<rank,dim,Number> to make vectorization possible.

7529a6e

Disable vectorization for Tensor<1,2,float>.

db15acf

Add padding elements to class Tensor to allow for vectorization.

d58b390

Adjust tests for functions that can no longer be 'constexpr'.

0fa5bed

bangerth force-pushed the realign branch 3 times, most recently from ea89155 to cf9fd37 Compare April 5, 2024 02:36

bangerth added 9 commits April 4, 2024 21:06

Remove 'constexpr' from Tensor functions that can be vectorized.

0964e28

Mark DEAL_II_ALWAYS_INLINE functions as 'inline'.

e431385

Simplify some functions by just deferring to member functions.

f5fd1a3

Vectorize some functions in class Tensor.

cf8f8f7

Work around an alignment issue by making a buffer large enough.

9cbeab9

Add a changelog entry.

2b93278

Ensure alignment appropriate for std::array, not Number.

d17204f

Use correct array size.

4677be0

Disable -Wstrict-aliasing.

dd200c8

bangerth force-pushed the realign branch from cf9fd37 to dd200c8 Compare April 5, 2024 03:06

bangerth mentioned this pull request Apr 6, 2024

Simplify some functions by just deferring to member functions. #16864

Merged

bangerth mentioned this pull request Apr 6, 2024

Vectorize class Tensor without requiring alignment or padding. #16865

Closed

bangerth added this to the Release 9.6 milestone Apr 6, 2024

bangerth closed this Apr 10, 2024

bangerth deleted the realign branch April 10, 2024 20:35

bangerth mentioned this pull request Apr 10, 2024

Clean up the Tensor class. #16465

Closed
