Use vector intrinsics for tensor operations. #16848

Closed
wants to merge 15 commits into from


bangerth
Member

@bangerth bangerth commented Apr 4, 2024

This is the patch I've been working toward in #16465. It uses @kronbichler's idea to align the tensor class as appropriate, and then, in the relevant functions, does a reinterpret_cast to a VectorizedArray type on which we can call vector intrinsics to do the actual work.
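As a rough illustration of the alignment-plus-padding idea, here is a minimal sketch in plain C++. It is not the code in this PR: `PaddedTensor3` and `add` are hypothetical names, the fixed-width loop over all four lanes is a stand-in for the VectorizedArray cast, and 256-bit registers with 8-byte doubles are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an aligned rank-1 tensor in 3d. The three
// components are padded to four doubles so that one 256-bit register
// covers the whole object.
struct alignas(32) PaddedTensor3
{
  double values[4] = {0.0, 0.0, 0.0, 0.0}; // values[3] is padding

  double &operator[](const std::size_t i)
  {
    assert(i < 3); // only the first three lanes are real components
    return values[i];
  }
};

static_assert(sizeof(PaddedTensor3) == 32,
              "assumes 8-byte doubles: 3 components + 1 padding lane");

// Add all four lanes, padding included. With aligned storage and a fixed
// trip count, compilers can typically emit a single vector add here.
inline PaddedTensor3 add(const PaddedTensor3 &a, const PaddedTensor3 &b)
{
  PaddedTensor3 result;
  for (std::size_t lane = 0; lane < 4; ++lane)
    result.values[lane] = a.values[lane] + b.values[lane];
  return result;
}

// True if the object sits on a 32-byte boundary.
inline bool is_32_byte_aligned(const PaddedTensor3 &t)
{
  return reinterpret_cast<std::uintptr_t>(&t) % 32 == 0;
}
```

The cost of this layout is visible in the `static_assert`: a 3d tensor occupies 32 bytes instead of 24, so memory traffic grows by a third even if the arithmetic gets cheaper.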

There are a number of commits to this PR:

There is one open issue I need to figure out how to address: We liberally use boost::small_vector<Point<dim>,N>, which internally makes sure that the data array is properly aligned -- or at least tries to. But there is a bug somewhere (in boost, I believe): it only ensures the proper alignment for the internal static storage, but forgets about it when it has to allocate memory dynamically. We only run into this in one test, mappings/mapping_q_manifold_02.debug, where we use a Q5 mapping with 208 evaluation points. For the moment, I'm increasing the static size of the array in which this happens to 216 in commit 11, but that is something I need to look into more. I just wanted to post the patch to see what people have to say about it.
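A misaligned heap buffer of the kind described above can be detected with a simple pointer check. The following is a minimal sketch, not part of the patch; `is_aligned`, `storage_is_aligned`, and `AlignedPoint` are hypothetical names, and `AlignedPoint` only stands in for an over-aligned Point<3>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `ptr` is a multiple of `alignment` bytes.
// `alignment` must be a nonzero power of two.
inline bool is_aligned(const void *ptr, const std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical over-aligned point type (assumption: 32-byte alignment).
struct alignas(32) AlignedPoint
{
  double coordinates[4];
};

// Checks that a contiguous buffer starts on a boundary suitable for T.
// Because sizeof(T) is a multiple of alignof(T), checking the first
// element is enough for the whole buffer.
template <typename T>
bool storage_is_aligned(const T *data)
{
  return is_aligned(data, alignof(T));
}
```

One possible use is to assert `storage_is_aligned(v.data())` on a container right after it has grown past its static capacity, which is the situation the failing test runs into.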

@gassmoeller If you're curious, run your benchmark on it!

@bangerth bangerth force-pushed the realign branch 3 times, most recently from ea89155 to cf9fd37 Compare April 5, 2024 02:36
@gassmoeller
Member

I didn't have time to test it in depth this week, but what I see so far is not very encouraging. Same benchmark setup as described here.

Timings with f844528 (no alignment or intrinsics):

+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      77.5s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |         3 |      9.34s |        12% |
| Assemble temperature system      |         3 |      1.53s |         2% |
| Build Stokes preconditioner      |         3 |      5.29s |       6.8% |
| Build temperature preconditioner |         3 |     0.445s |      0.57% |
| Initialization                   |         1 |    0.0822s |      0.11% |
| Particles: Advect                |         6 |      10.9s |        14% |
| Particles: Copy                  |         3 |       8.2s |        11% |
| Particles: Generate              |         1 |      10.4s |        13% |
| Particles: Initialization        |         1 |  4.46e-05s |         0% |
| Particles: Initialize properties |         1 |         4s |       5.2% |
| Particles: Sort                  |         6 |      23.2s |        30% |
| Postprocessing                   |         3 |    0.0728s |         0% |
| Setup dof systems                |         1 |    0.0587s |         0% |
| Setup initial conditions         |         1 |      14.5s |        19% |
| Setup matrices                   |         1 |     0.977s |       1.3% |
| Solve Stokes system              |         3 |      2.62s |       3.4% |
| Solve temperature system         |         3 |     0.142s |      0.18% |
+----------------------------------+-----------+------------+------------+

Timings with dd200c8 (with alignment and intrinsics):

+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      86.4s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |         3 |      9.53s |        11% |
| Assemble temperature system      |         3 |      1.86s |       2.2% |
| Build Stokes preconditioner      |         3 |      5.41s |       6.3% |
| Build temperature preconditioner |         3 |     0.455s |      0.53% |
| Initialization                   |         1 |    0.0823s |         0% |
| Particles: Advect                |         6 |      13.1s |        15% |
| Particles: Copy                  |         3 |      9.64s |        11% |
| Particles: Generate              |         1 |      10.8s |        12% |
| Particles: Initialization        |         1 |  6.39e-05s |         0% |
| Particles: Initialize properties |         1 |      4.47s |       5.2% |
| Particles: Sort                  |         6 |        27s |        31% |
| Postprocessing                   |         3 |     0.105s |      0.12% |
| Setup dof systems                |         1 |    0.0628s |         0% |
| Setup initial conditions         |         1 |      15.4s |        18% |
| Setup matrices                   |         1 |     0.972s |       1.1% |
| Solve Stokes system              |         3 |       2.6s |         3% |
| Solve temperature system         |         3 |     0.144s |      0.17% |
+----------------------------------+-----------+------------+------------+

What is noticeable is that both results are significantly faster than the results I posted last week, but that is independent of the tensor alignment and probably related to the improvements to MappingCartesian and other recent PRs. The particle algorithms are still slower with alignment (reasonable, if they are memory bandwidth limited). However, what is more concerning is that the Stokes and temperature assembly are also slower (Stokes slightly, 2%; temperature significantly, roughly 20%). I would think this could be noise, but the solver timings and matrix setup are almost identical, so maybe there is more to it?
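The percentages quoted above follow from the two timing tables. A minimal sketch of the arithmetic, using four rows copied from those tables (`SectionTiming`, `percent_change`, and `print_changes` are hypothetical helper names):

```cpp
#include <cassert>
#include <cstdio>

// One row of the two timing tables: wall time in seconds without
// (before) and with (after) alignment and intrinsics.
struct SectionTiming
{
  const char *name;
  double      before;
  double      after;
};

// Values copied from the tables for f844528 (before) and dd200c8 (after).
constexpr SectionTiming timings[] = {
  {"Assemble Stokes system", 9.34, 9.53},
  {"Assemble temperature system", 1.53, 1.86},
  {"Solve Stokes system", 2.62, 2.60},
  {"Setup matrices", 0.977, 0.972}};

// Relative change from `before` to `after` in percent.
// Positive values mean the section got slower.
inline double percent_change(const double before, const double after)
{
  assert(before > 0.0);
  return 100.0 * (after - before) / before;
}

// Print the relative change for every row listed above.
inline void print_changes()
{
  for (const SectionTiming &s : timings)
    std::printf("%-28s %+6.1f%%\n", s.name, percent_change(s.before, s.after));
}
```

For the two assembly rows this gives about 2.0% and 21.6%, while the solver and matrix setup rows change by less than 1%.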

@bangerth
Member Author

bangerth commented Apr 6, 2024

Well, that's kind of disappointing :-(

Here are some tests of my own, using a slightly modified version of step-22 as an example. For 2d computations, for which we do not use padding elements and so the memory traffic should remain the same, I get the following without this patch on the last refinement cycle (3 successive runs shown to give a measure of variability):

Refinement cycle 5
   Number of active cells: 128608
   Number of degrees of freedom: 1175133 (1044242+130891)
   Assembling...
   Computing preconditioner...
   Solving...  11 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      94.9s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.36s |       2.5% |
| compute preconditioner          |         1 |      68.6s |        72% |
| output results                  |         1 |      1.37s |       1.4% |
| refine mesh                     |         1 |       0.5s |      0.53% |
| setup                           |         1 |      3.16s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      95.6s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.42s |       2.5% |
| compute preconditioner          |         1 |        69s |        72% |
| output results                  |         1 |       1.4s |       1.5% |
| refine mesh                     |         1 |     0.501s |      0.52% |
| setup                           |         1 |      3.18s |       3.3% |
| solve                           |         1 |      19.1s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      95.5s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.39s |       2.5% |
| compute preconditioner          |         1 |      69.1s |        72% |
| output results                  |         1 |      1.37s |       1.4% |
| refine mesh                     |         1 |     0.492s |      0.51% |
| setup                           |         1 |      3.19s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

And with this patch:

Refinement cycle 5
   Number of active cells: 128608
   Number of degrees of freedom: 1175133 (1044242+130891)
   Assembling...
   Computing preconditioner...
   Solving...  11 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        95s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.28s |       2.4% |
| compute preconditioner          |         1 |      68.8s |        72% |
| output results                  |         1 |      1.34s |       1.4% |
| refine mesh                     |         1 |     0.493s |      0.52% |
| setup                           |         1 |      3.16s |       3.3% |
| solve                           |         1 |      18.9s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      96.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.31s |       2.4% |
| compute preconditioner          |         1 |      69.6s |        72% |
| output results                  |         1 |      1.34s |       1.4% |
| refine mesh                     |         1 |     0.488s |      0.51% |
| setup                           |         1 |      3.27s |       3.4% |
| solve                           |         1 |      19.1s |        20% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      96.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.32s |       2.4% |
| compute preconditioner          |         1 |      69.3s |        72% |
| output results                  |         1 |      1.38s |       1.4% |
| refine mesh                     |         1 |     0.495s |      0.51% |
| setup                           |         1 |      3.18s |       3.3% |
| solve                           |         1 |      19.4s |        20% |
+---------------------------------+-----------+------------+------------+

The operations I would expect to benefit from improvements with tensors are "assemble" and "refine mesh". For these, we have best times of 2.36 and 0.492 seconds (without this patch), and 2.28 and 0.488 (with this patch), respectively. These are improvements on the order of 3.5% and 1%. That's better than nothing but also not a whole lot.
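The best-of-three comparison above can be sketched as follows. This is only an illustration of the arithmetic; `best_time` and `improvement_percent` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Smallest wall time among repeated runs of the same section.
inline double best_time(const std::vector<double> &runs)
{
  assert(!runs.empty());
  return *std::min_element(runs.begin(), runs.end());
}

// Percent reduction of the best time with the patch relative to the best
// time without it. Positive values mean the patch is faster.
inline double improvement_percent(const std::vector<double> &without_patch,
                                  const std::vector<double> &with_patch)
{
  const double base = best_time(without_patch);
  return 100.0 * (base - best_time(with_patch)) / base;
}
```

With the 2d "assemble" timings, `improvement_percent({2.36, 2.42, 2.39}, {2.28, 2.31, 2.32})` evaluates to about 3.4%, and the "refine mesh" timings give about 0.8%.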

@bangerth
Member Author

bangerth commented Apr 6, 2024

Here are the 3d results. Before the patch:

Refinement cycle 3
   Number of active cells: 3168
   Number of degrees of freedom: 93176 (89043+4133)
   Assembling...
   Computing preconditioner...
   Solving...  15 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        66s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.06s |       3.1% |
| compute preconditioner          |         1 |       1.2s |       1.8% |
| output results                  |         1 |    0.0892s |      0.14% |
| refine mesh                     |         1 |    0.0384s |         0% |
| setup                           |         1 |     0.688s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.2s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |       2.1s |       3.2% |
| compute preconditioner          |         1 |      1.19s |       1.8% |
| output results                  |         1 |    0.0893s |      0.13% |
| refine mesh                     |         1 |    0.0417s |         0% |
| setup                           |         1 |     0.698s |       1.1% |
| solve                           |         1 |      62.1s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        66s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |         2s |         3% |
| compute preconditioner          |         1 |      1.15s |       1.7% |
| output results                  |         1 |    0.0907s |      0.14% |
| refine mesh                     |         1 |    0.0404s |         0% |
| setup                           |         1 |     0.681s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

After the patch:

Refinement cycle 3
   Number of active cells: 3168
   Number of degrees of freedom: 93176 (89043+4133)
   Assembling...
   Computing preconditioner...
   Solving...  15 outer CG Schur complement iterations for pressure


+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.6s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.05s |       3.1% |
| compute preconditioner          |         1 |      1.16s |       1.7% |
| output results                  |         1 |    0.0938s |      0.14% |
| refine mesh                     |         1 |    0.0469s |         0% |
| setup                           |         1 |     0.656s |      0.98% |
| solve                           |         1 |      62.6s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.09s |       3.2% |
| compute preconditioner          |         1 |      1.21s |       1.8% |
| output results                  |         1 |    0.0926s |      0.14% |
| refine mesh                     |         1 |    0.0533s |         0% |
| setup                           |         1 |     0.679s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      66.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assemble                        |         1 |      2.08s |       3.1% |
| compute preconditioner          |         1 |      1.21s |       1.8% |
| output results                  |         1 |    0.0933s |      0.14% |
| refine mesh                     |         1 |    0.0488s |         0% |
| setup                           |         1 |      0.68s |         1% |
| solve                           |         1 |        62s |        94% |
+---------------------------------+-----------+------------+------------+

Again the minimum times for "assemble" and "refine mesh", respectively:

  • Without this patch: 2.00 and 0.0384 seconds
  • With this patch: 2.05 and 0.0469 seconds

So we're getting slower by a bit (though about within the noise level). In other words, this is again disappointing, but comports with what @gassmoeller observes.
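One crude way to judge "within the noise level" is to ask whether the run-to-run ranges of the two configurations overlap. The sketch below assumes that criterion, which is an illustrative choice and not something this PR establishes; `TimeRange`, `range_of`, and `ranges_overlap` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fastest and slowest wall time among repeated runs.
struct TimeRange
{
  double lowest;
  double highest;
};

inline TimeRange range_of(const std::vector<double> &runs)
{
  assert(!runs.empty());
  const auto extremes = std::minmax_element(runs.begin(), runs.end());
  return {*extremes.first, *extremes.second};
}

// True if the two closed intervals share at least one point.
inline bool ranges_overlap(const TimeRange &a, const TimeRange &b)
{
  return a.lowest <= b.highest && b.lowest <= a.highest;
}
```

For the 3d "assemble" timings, the runs without the patch span 2.00 to 2.10 seconds and the runs with the patch span 2.05 to 2.09 seconds, so the ranges overlap. With only three runs per configuration, this is a rough indication at best.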

@bangerth
Member Author

bangerth commented Apr 6, 2024

I'll leave this open for now, in case others want to discuss this some more. But my take is that this isn't going anywhere.

I have one other idea: not using padding elements and alignment, but instead creating the VectorizedArray objects on the fly via unaligned loads of exactly dim elements. I'll report back in a separate PR once I have that working.
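The alternative could look roughly like the following sketch, which keeps tensors unpadded in memory and fills a register-sized temporary on the fly. It is written in plain C++ under stated assumptions: `Lanes4` is a hypothetical stand-in for a 4-lane VectorizedArray, and `std::memcpy` stands in for the unaligned load and store instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical 4-lane register stand-in (assumption: 256-bit registers).
struct alignas(32) Lanes4
{
  double lane[4];
};

// Copy exactly `dim` doubles from unpadded storage into the first lanes
// and zero the rest. The source needs no 32-byte alignment.
template <int dim>
Lanes4 load_unaligned(const double *src)
{
  static_assert(dim >= 1 && dim <= 4, "dim must fit into four lanes");
  Lanes4 r = {{0.0, 0.0, 0.0, 0.0}};
  std::memcpy(r.lane, src, dim * sizeof(double));
  return r;
}

// Write back only the first `dim` lanes; neighboring data stays untouched.
template <int dim>
void store_unaligned(const Lanes4 &r, double *dst)
{
  static_assert(dim >= 1 && dim <= 4, "dim must fit into four lanes");
  std::memcpy(dst, r.lane, dim * sizeof(double));
}

// Lane-wise addition over the full register width.
inline Lanes4 add(const Lanes4 &a, const Lanes4 &b)
{
  Lanes4 r;
  for (std::size_t i = 0; i < 4; ++i)
    r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

// Add two unpadded dim-component tensors stored in plain arrays.
template <int dim>
void add_tensors(const double *a, const double *b, double *out)
{
  assert(a != nullptr && b != nullptr && out != nullptr);
  store_unaligned<dim>(add(load_unaligned<dim>(a), load_unaligned<dim>(b)), out);
}
```

Compared with the padded layout, this leaves the in-memory size of a 3d tensor at three doubles, at the price of a partial load and store around each operation.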

@kronbichler
Member

> What is noticeable is that both results are significantly faster than the results I posted last week, but that is independent of the tensor alignment and probably related to the improvements to MappingCartesian and other recent PRs.

Yes, several of my PRs in the last week were specifically motivated by what I saw when looking at the benchmark in #16796 (comment).

@bangerth
Member Author

OK, let's close this. It was an interesting experiment, but not worth it.

@bangerth bangerth closed this Apr 10, 2024
@bangerth bangerth deleted the realign branch April 10, 2024 20:35