
Enforce inlining in simple tensor functions #8428

Merged
merged 3 commits into from
Aug 2, 2019

Conversation

kronbichler
Member

@kronbichler kronbichler commented Jul 29, 2019

We had previously added DEAL_II_ALWAYS_INLINE to some of the tensor functions, but not to all simple ones (by simple I mean those that can be expanded into a few lines of machine code). In #8064, some DEAL_II_ALWAYS_INLINE decorations got removed.

The reason I add them is that the compiler sometimes simply generates bad code. The interesting cases are complicated situations in which the compiler's heuristics do not make the right decisions, replacing simple calls to one-liners (e.g., the constructor of Tensor<0,dim,Number>) by a lot of function calls. Such a situation arises, e.g., when the same operations on tensors are performed by several callers, say a function templated on the polynomial degree in matrix-free computations where the quadrature point loop does the same tensor operations for all degrees. The bad part is that once the compiler's heuristics get messed up, it seems to take any opportunity to do strange things. These decisions seem to vary from one gcc major version to another, and from one vectorization width to another. Note that this is with a release build and -O3, i.e., well optimized, where I would expect the compiler to generate the best possible machine code.

So I took the chance to help the compiler by adding a few more DEAL_II_ALWAYS_INLINE decorations here and there to steer it towards the right choice. Note in particular that I had to remove the defaulted implementation of Tensor(), because even that is taken as an opportunity not to inline. I have added a comment explaining why we do not use Tensor() = default; so that we do not reintroduce it.

This was observed in some nonlinear elasticity code of @davydden when I looked at the machine code. This change improves performance by around 8% (with gcc-9 and AVX-512) and 20% (with gcc-7 and AVX2), so it really seems to be useful.

One thing I do not like about this PR is that clang-format messes up the indentation of our DEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV statements (or whatever they are called), and that we are not really consistent in the order of these specifiers. I have tried to use the same convention as before (DEAL_II_ALWAYS_INLINE before constexpr but after DEAL_II_CONSTEXPR inline), but it's a mess.

@kronbichler
Member Author

I am still waiting for the testsuite results, so I will need to give it another look tomorrow when I'm back in the office. I wanted to post it now already because it's an important topic from a performance point of view and I want to give it a bit of time here.

static constexpr DEAL_II_CUDA_HOST_DEV const T &
value(const T &t)
static DEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV const T &
value(const T &t)
Member Author

Here is an example of the weird indentation produced by clang-format. While I understand what it tries to do, it just gets nasty (and goes beyond the 80-character limit several times).

Member

Can you force a line break by inserting /* bla */ before const, for example?

Member Author

I tried a few things, and while I could sometimes improve the situation, this was not consistently the case. I therefore suggest we leave the discussion about improving indentation to a more systematic approach - I have opened #8437 for that.

@tjhei
Member

tjhei commented Jul 29, 2019

The reason why I add them is that the compiler sometimes generates just bad code.

That is terrifying. Not only can we accidentally cause a performance regression on the order of 10% by making small changes, but such regressions might also appear only with certain compiler versions? Can you think of ways to improve the situation? Is our Tensor implementation just too big and heavy?

@@ -456,7 +456,7 @@ class Tensor
* @note This function can also be used in CUDA device code.
*/
constexpr DEAL_II_CUDA_HOST_DEV
Tensor() = default;
Tensor();
Member

Can't we just have

DEAL_II_ALWAYS_INLINE Tensor() = default;

?

Member Author

I will try that tomorrow.

Member Author

It works. That's the kind of situation where I really appreciate code review, because I should have realized that on my own, but I did not.

@kronbichler
Member Author

That is terrifying. Not only can we accidentally cause a performance regression on the order of 10% by making small changes, but such regressions might also appear only with certain compiler versions? Can you think of ways to improve the situation? Is our Tensor implementation just too big and heavy?

This PR tries to reduce the chance that the compiler messes things up and introduces these slow-downs. In the current state, I do indeed observe clean code for both compilers I looked into, with minimal performance differences. I know that function inlining makes sense at this point because these functions end up on the hot path, and any possible gain from making the loop body a bit smaller, or from reducing memory consumption during compilation, certainly does not help for such small functions. And the real point of inlining is not only the function call overhead itself but also the downstream optimization opportunities, like keeping things in registers, avoiding reloads through pointers, and so on.

Our tensor implementation is certainly quite a torture for the compiler, but from a code maintainability perspective it is good to have these central pieces implemented in a modular way. So no, I cannot really think of a way to improve things other than forcing inlining of the functions we observe in the bottleneck. It's a question of heuristics: when the compiler sees that the constructor of Tensor<2,3,VectorizedArray<double>> is used several dozen times, it probably just "thinks" that keeping a single implementation must be better from a code size perspective. What it does not realize is that many duplications (at least in the context where I looked) come from templates implied by FEEvaluation, which we pre-compile for a set of degrees - and we do not run different degrees closely enough together to benefit from instruction caching.

This is necessary because compiler heuristics get bad in complicated situations, replacing simple
calls to one-liners by a lot of function-call overhead. A complicated situation is, e.g., when
the same operations on tensors are done by several callers, say a function templated on the
polynomial degree in matrix-free computations where the quadrature point loop does some
operations with tensors in the same way for all degrees.
@kronbichler
Member Author

kronbichler commented Jul 30, 2019

What it does not realize is that many duplications (at least in the context where I looked) come from templates implied by FEEvaluation, which we pre-compile for a set of degrees - we do not run different degrees closely enough together to benefit from instruction caching

One thing to add here: I try feedback-driven optimization (FDO) for this kind of thing every other year or so, but I have never really gotten a benefit from it. It seems these decisions are not even among the ones the compiler determines to be likely critical. It could also be that FDO is overwhelmed by the number of function variants I typically compile, or that I did not use it correctly.

Edit: Just for completeness, GCC calls this profile-guided optimization (PGO).

DEAL_II_CONSTEXPR inline Accessor<rank_, dim, constness, P - 1, Number>
Accessor<rank_, dim, constness, P, Number>::
operator[](const unsigned int i)
DEAL_II_CONSTEXPR inline DEAL_II_ALWAYS_INLINE
Member

Do we need inline and DEAL_II_ALWAYS_INLINE? We don't put both together elsewhere.

Member

Actually, it looks like most of the declarations after this one put both together (just not the ones before this)

Member

Since these are templates, declaring them as inline does probably not change anything.

Member Author

For templates, inline is indeed not necessary. Should we remove it in the tensor files (it would certainly not hurt readability here)? However, in general header file implementations we cannot merge DEAL_II_ALWAYS_INLINE with inline, because the former might expand to nothing.

Member

I did not realize that DEAL_II_ALWAYS_INLINE could be blank; that makes sense.

@bangerth
Member

bangerth commented Aug 1, 2019

For the formatting, does clang-format set a preprocessor #define we can test? If that were the case, we could just expand DEAL_II_ALWAYS_INLINE to nothing and let clang-format ignore it at least as far as indentation is concerned...

@masterleinad
Member

Let's merge. We already have an issue related to the clang-format problems here.

@masterleinad masterleinad merged commit 1e61b06 into dealii:master Aug 2, 2019
@kronbichler kronbichler deleted the force_inlining branch September 16, 2019 08:08
5 participants