-
Notifications
You must be signed in to change notification settings - Fork 708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce inlining in simple tensor functions #8428
Conversation
I am still waiting for the testsuite results, so I will need to give it another look tomorrow when I'm back to office. I wanted to post it already now because it's am important topic from a performance point of view and I want to give it a bit of time here. |
static constexpr DEAL_II_CUDA_HOST_DEV const T & | ||
value(const T &t) | ||
static DEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV const T & | ||
value(const T &t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example of weird indentation of clang-format
. While I understand what it tries to do, it just gets nasty (and beyond the 80 char limit several times).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you force a line break by inserting /* bla */
before const
for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a few things and while I could improve the situation sometimes, this was not consistently the case. I therefore suggest we leave the discussion about improving indentation to a more systematic approach - I have opened #8437 for that.
That is terrifying. Not only can we accidentally cause a performance regression in the order of 10% by making small changes, they also might happen with a certain compiler version? Can you think of ways to improve the situation? Is our Tensor implementation just too big and heavy? |
include/deal.II/base/tensor.h
Outdated
@@ -456,7 +456,7 @@ class Tensor | |||
* @note This function can also be used in CUDA device code. | |||
*/ | |||
constexpr DEAL_II_CUDA_HOST_DEV | |||
Tensor() = default; | |||
Tensor(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just have
DEAL_II_ALWAYS_INLINE Tensor() = default;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try that tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works - that's the situation when I really appreciate code review because "I should have realized that on my own" but I have not.
This PR tries to reduce the chance of the compiler to mess things up and introduce these slow-downs. In the current state, I do indeed observe clean code for both compilers I looked into with minimal performance differences. This is because I know that function inlining makes sense at this point because these things end up in the hot path and any possible gain from making the loop body a bit smaller or reducing the memory consumption during compilation is certainly not helping for these small functions. And the real point with inlining is not only the function call overhead on its own but also the downstream optimization possibility like keeping things in registers, avoiding to update the registers with pointers, and so on. Our tensor implementation is certainly quite a torture for the compiler, but from a code maintainability perspective is good to have those central things implemented in a modular way. So no, I cannot really think of a way to improve things other than forcing inlining of functions that we observe are in the bottleneck. It's a question of heuristics: When the compiler sees that the constructor of |
24f3a72
to
1156d35
Compare
This is necessary because compiler heuristics get bad in complicated situation, replacing simple calls to one-liners when inlined into a lot of overhead through function calls. A complicated situation is, e.g., when the same operations on tensors are done by several callers, say a function templated on the polynomial degree in matrix-free computations where the quadrature point loop does some operations with tensors in the same way for all degrees.
1156d35
to
dad9641
Compare
One thing to add here: I try with feedback-driven optimization (FDO) for these kind of things every other year or so, but I never really got a benefit from it. It seems these decisions are not even among the ones which the compiler determines to be likely critical. It could also be that FDO is overwhelmed by the number of variants of functions I typically compile, or that I did not use it correctly. Edit: Just for completeness, GCC calls this profile-guided optimization (PGO). |
DEAL_II_CONSTEXPR inline Accessor<rank_, dim, constness, P - 1, Number> | ||
Accessor<rank_, dim, constness, P, Number>:: | ||
operator[](const unsigned int i) | ||
DEAL_II_CONSTEXPR inline DEAL_II_ALWAYS_INLINE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need inline
and DEAL_II_ALWAYS_INLINE
? We don't put both together elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it looks like most of the declarations after this one put both together (just not the ones before this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are templates, declaring them as inline
does probably not change anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For templates inline
is indeed not necessary. Should we remove them in the tensor files (it would certainly not hurt readability here)? However, on general header file implementations we cannot merge DEAL_II_ALWAYS_INLINE
with inline
because the former might expand to nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not realize that DEAL_II_ALWAYS_INLINE
could be blank; this makes sense.
For the formatting, does |
Let's merge. We already have an issue related to the |
We had previously added
DEAL_II_ALWAYS_INLINE
in some of the tensor functions but not all simple ones (by simple I mean those that can be expanded into a few lines of machine code). In #8064 someDEAL_II_ALWAYS_INLINE
decorations got removed.The reason why I add them is that the compiler sometimes generates just bad code. The interesting case are some complicated situations where the heuristics of the compiler do not trigger the right decisions, replacing simple calls to one-liners (e.g. the constructor to
Tensor<0,dim,Number>
) by a lot of function calls. These complicated situations can be, e.g., when the same operations on tensors are done by several callers, say a function templated on the polynomial degree in matrix-free computations where the quadrature point loop does some operations with tensors in the same way for all degrees. The bad thing is that once the heuristics of the compiler get messed up, it seems to take any opportunity to do strange things. These decisions seem to be varying from one gcc major version to another, and from one vectorization width to another. Note that this is with release build and-O3
, so pretty optimized where I would hope the compiler to generate the best possible machine code.So I took the chance to help the compiler in adding a few more
DEAL_II_ALWAYS_INLINE
here and there to convince the compiler about the right choice.Note in particular that I had to remove thedefault
implementation forTensor()
because even that is taken as an opportunity not to inline. I have added some comment why we do not useTensor() = default;
so that we do not remove it again.This was observed in some nonlinear elasticity code of @davydden when I looked at the machine code. This change helps performance by around 8% (with gcc-9 and AVX-512) and 20% (with gcc-7 and AVX2), so it really seems to be useful.
One thing I do not like about this PR is that
clang-format
messes up the indentation in ourDEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV
statements (or whatever they are called), and that we are not really consistent in the order of these specifiers. I have tried to use the same concept as before (usingDEAL_II_ALWAYS_INLINE
beforeconstexpr
but afterDEAL_II_CONSTEXPR inline
), but it's a mess.