
FEFaceEvaluation: efficiency of gather_evaluate + integrate_scatter #10921

Closed
kronbichler opened this issue Sep 16, 2020 · 6 comments

@kronbichler
Member

kronbichler commented Sep 16, 2020

After implementing #10811 and the later restructuring in #10904, I see that the Laplacian evaluated with gather_evaluate and integrate_scatter on the faces runs more slowly with FE_DGQHermite than with FE_DGQ, at least in 2D and with degree 3. This should not be the case: the Hermite variant should access only half the vector data and also perform somewhat fewer operations otherwise. We need to look into the performance before the next release. At the very least, the gcc compiler generates far too many integer instructions to pass data among the various functions; it might be that we need to force inlining for all the lambdas we use.

@kronbichler kronbichler added this to the Release 9.3 milestone Sep 16, 2020
@peterrum
Member

> it might be that we need to force inlining for all lambdas we use.

I would have thought that the compiler inlines the code, since the type of the lambdas is templated.
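Templates do make the call target visible to the compiler, but inlining remains a heuristic decision; when profiling shows missed inlining, the lambda's call operator can be annotated explicitly. A minimal self-contained sketch of that technique (plain C++, not deal.II code; the attribute syntax shown is GCC/Clang-specific, and deal.II's `DEAL_II_ALWAYS_INLINE` macro serves the analogous purpose elsewhere):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: the lambda's call operator carries always_inline,
// so GCC/Clang will inline its body even when their cost model would not.
inline double sum_of_squares(const double *data, std::size_t n)
{
  // The attribute goes after the lambda's parameter list.
  auto square = [](double x) __attribute__((always_inline)) { return x * x; };

  double sum = 0.;
  for (std::size_t i = 0; i < n; ++i)
    sum += square(data[i]);
  return sum;
}
```

Whether forcing inlining helps has to be checked against the generated assembly, since over-inlining can also grow prologue/epilogue code.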

@kronbichler
Member Author

In any case, we need to do a more comprehensive analysis of what instructions get generated and how to avoid them. I would like to finish #9794 first because that might also reveal something similar, so better address things only once.

@peterrum
Member

peterrum commented May 1, 2021

This is an important issue, but I would suggest postponing it. Additional modifications are coming to make element-centric loops work for gradients and the Hermite basis.

@peterrum
Member

@kronbichler If I remember correctly, FEEvaluationData helped here, didn't it?

@kronbichler
Member Author

It definitely helped, but we need to look at the assembly code at some point before the release, so we should keep this open.

@kronbichler
Member Author

I have now done a detailed analysis of the effect, comparing runs with the templated polynomial degree against `fe_degree = -1` with the pre-compiled code from the deal.II library. Overall, I think our goal was accomplished very well by the cleanup in #13056 and related pull requests. I can summarize my findings for a slightly modified version of step-59 (polynomial degree 4, 3D problem, double + float numbers):

  • The run time of the solver for 4M DoFs on 2 MPI ranks went from 3.32 +- 0.03 seconds to 3.35 +- 0.04 seconds, a change of less than 1%. While systematic, the statistical distributions overlap, so this is now hard to discern from noise. Of course, in 2D and for lower degrees the change is more visible, but I could not see more than a 2-4% difference, which is still pretty small.
  • The executable size (release mode) is about half as big, 1.3 MB vs. 2.3 MB. This is expected, since we now rely on code inside the libdeal_II.so file.
  • When checking the instruction counts, I saw that the stepping into the different pre-compiled options is very efficient now, spending only 5-12 instructions per step, compared to several dozen at the time of the 9.3 release. Overall, the instruction count for FEFaceEvaluation::integrate_scatter on the float numbers increased from 4.123 billion to 4.243 billion, i.e., by around 2.5%. The biggest part is now not the actual stepping into the functions, but rather the compiler generating slightly bigger prologue/epilogue sections when the entry point to the function calls is not as well defined (saving more registers to the stack at entry and restoring them at exit), plus similar minor optimizations a compiler can only do when the call context is very well defined. To put it differently, we waste around 3.5% of the instructions between the place where we initiate the integrate_scatter call and the start of the actual work, compared to 1% for the "compile-everything-at-once" approach.

If we really wanted to improve this further, we could replace the step-in logic by a jump table, cutting perhaps some 50 instructions per call to gather_evaluate/integrate_scatter. In the grand scheme of things, I cannot see this making a big difference.
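The jump-table alternative can be sketched as follows (again a toy example with hypothetical names, not the library code): an indexed load plus an indirect call replaces the chain of compares.

```cpp
#include <array>
#include <cassert>

// Toy templated kernel standing in for a pre-compiled evaluation routine.
template <int degree>
double kernel(const double *coeffs)
{
  double sum = 0.;
  for (int i = 0; i <= degree; ++i)
    sum += coeffs[i];
  return sum;
}

// Jump table over the supported degrees: dispatch is a single indexed
// load and one indirect call instead of a compare-and-branch chain.
using kernel_ptr = double (*)(const double *);
constexpr std::array<kernel_ptr, 5> kernel_table{
  nullptr, &kernel<1>, &kernel<2>, &kernel<3>, &kernel<4>};

double dispatch(const int degree, const double *coeffs)
{
  return kernel_table[degree](coeffs);
}
```

The trade-off is that an indirect call is opaque to the optimizer at the call site, so it may cost elsewhere what it saves in dispatch instructions.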

What I also saw in my analysis is that, despite our efforts last fall and previously, the boilerplate and selection code in the function

```cpp
fe_face_evaluation_process_and_io(
  Processor                                                        &proc,
  const unsigned int                                                n_components,
  const EvaluationFlags::EvaluationFlags                            evaluation_flag,
  typename Processor::Number2_                                     *global_vector_ptr,
  const std::vector<ArrayView<const typename Processor::Number2_>> *sm_ptr,
  const EvaluationData                                             &fe_eval,
  typename Processor::VectorizedArrayType_                         *temp1)
```

still eats quite a few instructions. For AVX and float (8-wide), I see around 200-250 instructions spent setting up the various options before the vector access work starts. This was not the purpose of this issue, and it is likely impossible to do much about it, given the functionality that function has to provide and the implementation effort a fix would take. If anything, we should address it at a higher level, e.g. by having an "evaluate + integrate on cells and faces" super function like the low-level form used here: https://github.com/kronbichler/multigrid/blob/9e2a8781cad9da616166e6936d60915780b34ade/common/laplace_operator_dg.h#L932-L1657
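The idea behind such a fused "super function" can be illustrated with a toy 1D finite-difference analogue (entirely hypothetical code, unrelated to the linked implementation): cell and face contributions are handled in one pass over the cells, so per-call setup and selection cost is paid once per cell rather than once per separate entry point.

```cpp
#include <cassert>
#include <vector>

// Toy fused operator: one loop applies the "cell term" and both
// "face terms" of a 1D Laplacian stencil, instead of a cell loop
// plus a separate face loop that would each pay their own setup.
std::vector<double> fused_apply(const std::vector<double> &u)
{
  const int n = static_cast<int>(u.size());
  std::vector<double> v(n, 0.);
  for (int c = 0; c < n; ++c)
    {
      v[c] = 2. * u[c];       // cell contribution
      if (c > 0)
        v[c] -= u[c - 1];     // left face, handled in the same pass
      if (c + 1 < n)
        v[c] -= u[c + 1];     // right face, handled in the same pass
    }
  return v;
}
```

The real cell/face machinery is far more involved, but the structural point is the same: fusing the loops moves the dispatch overhead out of the inner face-by-face path.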

So to summarize, this issue has been resolved well; there is no further need for action at this point.
