FEFaceEvaluation: efficiency of gather_evaluate + integrate_scatter #10921
Comments
I would have thought that the compiler inlines the code, since the type of the lambdas is templated.
In any case, we need to do a more comprehensive analysis of what instructions get generated and how to avoid them. I would like to finish #9794 first because that might also reveal something similar, so it is better to address things only once.
This is an important issue, but I would suggest postponing it. Additional modifications are coming to make element-centric loops work for gradients and the Hermite basis.
@kronbichler If I remember |
It definitely did help, but we need to look at the assembly code at some point before the release, so we should keep this open.
I have now done a detailed analysis comparing runs with the templated polynomial degree against `fe_degree=-1` with the pre-compiled code from the deal.II library. Overall, I think that our goal was accomplished very well by the cleanup in #13056 and related pull requests. I can summarize my findings for a slightly changed version of step-59 (polynomial degree 4, 3D problem, double + float numbers):
If we really wanted to improve this more, we could replace the stepping into algorithms by a jump table, thus cutting off maybe some 50 instructions per call to
What I also saw in my analysis is the fact that, despite our efforts last fall and previously, the boilerplate and selection code in the
dealii/include/deal.II/matrix_free/evaluation_kernels.h
Lines 4595 to 4603 in cd34b08
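To illustrate the jump-table idea mentioned above, here is a minimal sketch that contrasts stepping through the degrees one by one with a single indexed call through a table of function pointers. It is not the code in `evaluation_kernels.h`: the names `kernel`, `Stepping`, `kernel_table`, `dispatch_jump_table`, and `max_degree` are hypothetical, and the kernel body is a placeholder standing in for a degree-specialized evaluation routine.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a kernel compiled for a fixed polynomial degree.
template <int degree>
int kernel(const int n)
{
  return (degree + 1) * n;
}

constexpr int max_degree = 6; // assumed upper bound on pre-compiled degrees

// Variant 1: step through the degrees until the run-time value matches.
// This costs one comparison (and possibly one call level) per degree.
template <int degree>
struct Stepping
{
  static int run(const int runtime_degree, const int n)
  {
    if (runtime_degree == degree)
      return kernel<degree>(n);
    return Stepping<degree + 1>::run(runtime_degree, n);
  }
};

// End of the recursion: no pre-compiled kernel for this degree.
template <>
struct Stepping<max_degree + 1>
{
  static int run(const int, const int)
  {
    return -1;
  }
};

// Variant 2: a jump table with one function pointer per degree.
using KernelPtr = int (*)(int);

template <std::size_t... Is>
constexpr std::array<KernelPtr, sizeof...(Is)>
make_kernel_table(std::index_sequence<Is...>)
{
  return {{&kernel<static_cast<int>(Is)>...}};
}

constexpr auto kernel_table =
  make_kernel_table(std::make_index_sequence<max_degree + 1>{});

// One bounds check plus one indirect call, independent of the degree.
inline int dispatch_jump_table(const int runtime_degree, const int n)
{
  if (runtime_degree < 0 || runtime_degree > max_degree)
    return -1;
  return kernel_table[static_cast<std::size_t>(runtime_degree)](n);
}
```

Both variants select the same kernel. The table replaces a chain of comparisons by an indexed indirect call, at the price of preventing the selected kernel from being inlined into the caller.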
So to summarize, this issue has been resolved in a good way; nothing further is needed at this point.
After implementing #10811 and the later re-structuring like #10904, I see that the Laplacian evaluated with `gather_evaluate` and `integrate_scatter` on the faces runs more slowly with `FE_DGQHermite` than with `FE_DGQ`, at least in 2D and with degree 3. This should not be the case because the Hermite case should access only half the vector data and also do somewhat fewer operations otherwise. We need to look into the performance before the next release. At least the gcc compiler generates way too many integer instructions to pass data among the various functions; it might be that we need to force inlining for all lambdas we use.
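As a rough illustration of what forcing inlining for a lambda could look like, the sketch below marks both a templated helper and the lambda passed to it with the GCC/Clang `always_inline` attribute. The macro names `SKETCH_ALWAYS_INLINE` and `SKETCH_LAMBDA_ALWAYS_INLINE` and the functions `for_each_face_dof` and `gather_sum` are hypothetical and not part of deal.II. On other compilers the macros fall back to leaving the decision to the optimizer.

```cpp
#include <array>
#include <cstddef>

// Hypothetical macros: force inlining on GCC/Clang, plain hint elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ALWAYS_INLINE __attribute__((always_inline)) inline
#  define SKETCH_LAMBDA_ALWAYS_INLINE __attribute__((always_inline))
#else
#  define SKETCH_ALWAYS_INLINE inline
#  define SKETCH_LAMBDA_ALWAYS_INLINE
#endif

// Templated helper that calls a user-provided operation once per index,
// similar in spirit to a gather loop over the unknowns of a face.
template <typename Operation>
SKETCH_ALWAYS_INLINE void
for_each_face_dof(const std::array<unsigned int, 4> &indices, Operation &&op)
{
  for (std::size_t i = 0; i < indices.size(); ++i)
    op(i, indices[i]);
}

// The attribute follows the lambda's parameter list, so the call operator of
// the closure type is inlined into the loop above.
inline double
gather_sum(const std::array<unsigned int, 4> &indices, const double *src)
{
  double sum = 0.;
  for_each_face_dof(indices,
                    [&](const std::size_t, const unsigned int index)
                      SKETCH_LAMBDA_ALWAYS_INLINE { sum += src[index]; });
  return sum;
}
```

Whether this actually removes the extra integer instructions has to be checked in the generated assembly; the attribute only guarantees that the call itself is inlined.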