
Enforce inlining in simple tensor functions #8428

Merged
merged 3 commits into from
Aug 2, 2019

Conversation

kronbichler
Member

@kronbichler kronbichler commented Jul 29, 2019

We had previously added DEAL_II_ALWAYS_INLINE to some of the tensor functions, but not to all simple ones (by simple I mean those that can be expanded into a few lines of machine code). In #8064, some DEAL_II_ALWAYS_INLINE decorations got removed.

The reason I add them is that the compiler sometimes simply generates bad code. The interesting cases are complicated situations in which the compiler's heuristics do not make the right decisions, replacing simple calls to one-liners (e.g., the constructor of Tensor<0,dim,Number>) by a lot of function calls. Such a situation arises, e.g., when the same operations on tensors are performed by several callers, say a function templated on the polynomial degree in matrix-free computations where the quadrature point loop does the same tensor operations for all degrees. The bad part is that once the compiler's heuristics get messed up, it seems to take any opportunity to do strange things. These decisions seem to vary from one gcc major version to another, and from one vectorization width to another. Note that this is with a release build and -O3, i.e., well optimized, where I would expect the compiler to generate the best possible machine code.

So I took the chance to help the compiler by adding a few more DEAL_II_ALWAYS_INLINE decorations here and there to steer it towards the right choice. Note in particular that I had to remove the defaulted implementation of Tensor(), because even that is taken as an opportunity not to inline. I have added a comment explaining why we do not use Tensor() = default; so that we do not reintroduce it.

This was observed in some nonlinear elasticity code of @davydden when I looked at the machine code. This change improves performance by around 8% (with gcc-9 and AVX-512) and 20% (with gcc-7 and AVX2), so it really seems to be useful.

One thing I do not like about this PR is that clang-format messes up the indentation of our DEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV statements (or whatever they are called), and that we are not really consistent in the order of these specifiers. I have tried to use the same convention as before (DEAL_II_ALWAYS_INLINE before constexpr but after DEAL_II_CONSTEXPR inline), but it's a mess.

@kronbichler
Member Author

I am still waiting for the testsuite results, so I will need to give it another look tomorrow when I'm back in the office. I wanted to post it now already because it's an important topic from a performance point of view and I want to give it a bit of time here.

static constexpr DEAL_II_CUDA_HOST_DEV const T &
value(const T &t)
static DEAL_II_ALWAYS_INLINE constexpr DEAL_II_CUDA_HOST_DEV const T &
value(const T &t)
Member Author

Here is an example of the weird indentation produced by clang-format. While I understand what it tries to do, it just gets nasty (and goes beyond the 80-character limit several times).

Member

Can you force a line break by inserting /* bla */ before const, for example?

Member Author

I tried a few things, and while I could sometimes improve the situation, this was not consistently the case. I therefore suggest we leave the discussion about improving indentation to a more systematic approach - I have opened #8437 for that.

@tjhei
Member

tjhei commented Jul 29, 2019

The reason why I add them is that the compiler sometimes generates just bad code.

That is terrifying. Not only can we accidentally cause a performance regression on the order of 10% by making small changes, but such regressions might also appear only with certain compiler versions? Can you think of ways to improve the situation? Is our Tensor implementation just too big and heavy?

@@ -456,7 +456,7 @@ class Tensor
* @note This function can also be used in CUDA device code.
*/
constexpr DEAL_II_CUDA_HOST_DEV
Tensor() = default;
Tensor();
Member

Can't we just have

DEAL_II_ALWAYS_INLINE Tensor() = default;

?

Member Author

I will try that tomorrow.

Member Author

It works. That's the kind of situation where I really appreciate code review, because I should have realized that on my own, but I did not.

@kronbichler
Member Author

That is terrifying. Not only can we accidentally cause a performance regression on the order of 10% by making small changes, but such regressions might also appear only with certain compiler versions? Can you think of ways to improve the situation? Is our Tensor implementation just too big and heavy?

This PR tries to reduce the chance that the compiler messes things up and introduces these slow-downs. In the current state, I do indeed observe clean code for both compilers I looked into, with minimal performance differences. I know that function inlining makes sense at this point because these functions end up on the hot path, and any possible gain from making the loop body a bit smaller, or from reducing memory consumption during compilation, certainly does not help for such small functions. And the real point of inlining is not only the function call overhead itself but also the downstream optimization opportunities, like keeping things in registers, avoiding reloads through pointers, and so on.

Our tensor implementation is certainly quite a torture for the compiler, but from a code maintainability perspective it is good to have these central pieces implemented in a modular way. So no, I cannot really think of a way to improve things other than forcing inlining of the functions we observe in the bottleneck. It's a question of heuristics: when the compiler sees that the constructor of Tensor<2,3,VectorizedArray<double>> is used several dozen times, it probably just "thinks" that keeping a single implementation must be better from a code size perspective. What it does not realize is that many duplications (at least in the context where I looked) come from templates implied by FEEvaluation, which we pre-compile for a set of degrees - and we do not run different degrees closely enough together to benefit from instruction caching.

This is necessary because compiler heuristics get bad in complicated situations, replacing simple
calls to one-liners by a lot of function-call overhead. A complicated situation is, e.g., when
the same operations on tensors are done by several callers, say a function templated on the
polynomial degree in matrix-free computations where the quadrature point loop does some
operations with tensors in the same way for all degrees.
@kronbichler
Member Author

kronbichler commented Jul 30, 2019

What it does not realize is that many duplications (at least in the context where I looked) come from templates implied by FEEvaluation, which we pre-compile for a set of degrees - we do not run different degrees closely enough together to benefit from instruction caching

One thing to add here: I try feedback-driven optimization (FDO) for this kind of thing every other year or so, but I have never really gotten a benefit from it. It seems these decisions are not even among the ones the compiler determines to be likely critical. It could also be that FDO is overwhelmed by the number of function variants I typically compile, or that I did not use it correctly.

Edit: Just for completeness, GCC calls this profile-guided optimization (PGO).

DEAL_II_CONSTEXPR inline Accessor<rank_, dim, constness, P - 1, Number>
Accessor<rank_, dim, constness, P, Number>::
operator[](const unsigned int i)
DEAL_II_CONSTEXPR inline DEAL_II_ALWAYS_INLINE
Member

Do we need inline and DEAL_II_ALWAYS_INLINE? We don't put both together elsewhere.

Member

Actually, it looks like most of the declarations after this one put both together (just not the ones before this)

Member

Since these are templates, declaring them as inline does probably not change anything.

Member Author

For templates, inline is indeed not necessary. Should we remove it in the tensor files (it would certainly not hurt readability here)? However, in general header file implementations we cannot merge DEAL_II_ALWAYS_INLINE with inline, because the former might expand to nothing.

Member

I did not realize that DEAL_II_ALWAYS_INLINE could be blank; that makes sense.

@bangerth
Member

bangerth commented Aug 1, 2019

For the formatting, does clang-format set a preprocessor #define we can test? If that were the case, we could just expand DEAL_II_ALWAYS_INLINE to nothing and let clang-format ignore it at least as far as indentation is concerned...

@masterleinad
Member

Let's merge. We already have an issue related to the clang-format problems here.

@masterleinad masterleinad merged commit 1e61b06 into dealii:master Aug 2, 2019
@kronbichler kronbichler deleted the force_inlining branch September 16, 2019 08:08
5 participants