Help compiler exploit hanging-node symmetries by reduced reg loads #13000
This is a pure performance optimization. The current vectorized code path of the hanging-node evaluation routines exploits the symmetry in the shape functions between the two subface interpolation matrices. However, we leave it up to the compiler to turn this symmetry into fewer memory-to-register moves. For high degrees the compiler does not do this, because there are too many loads between the two places where the data is re-used. At least clang-13 does perform the optimization for p=3, so I did not notice immediately. Hence, make the loop explicit, similar to what we do in dealii/include/deal.II/matrix_free/tensor_product_kernels.h, lines 1155 to 1160 in 43550bc.
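To illustrate the idea, here is a minimal scalar sketch; it is not the deal.II implementation. It assumes the symmetry between the two subface matrices has the point-reflection form M1[i][j] = M0[n-1-i][n-1-j], and the names `apply_naive` and `apply_shared_loads` are hypothetical. The first function reads a matrix entry at each of the two places it is used and relies on the compiler to merge the loads. The second reads each entry once into a local variable and uses it for both subface results.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical size: n = p + 1 points per direction, here for p = 3.
constexpr std::size_t n = 4;

using Matrix = std::array<double, n * n>; // row-major n x n matrix
using Vector = std::array<double, n>;

// Assumed symmetry (not taken from deal.II): M1[i][j] = M0[n-1-i][n-1-j].
// Naive variant: each entry of m0 is read in two separate places, and it
// is left to the compiler to notice that the loads can be shared.
void apply_naive(const Matrix &m0,
                 const Vector &in,
                 Vector       &out0,
                 Vector       &out1)
{
  for (std::size_t i = 0; i < n; ++i)
    {
      double sum0 = 0.;
      double sum1 = 0.;
      for (std::size_t j = 0; j < n; ++j)
        {
          sum0 += m0[i * n + j] * in[j];
          sum1 += m0[(n - 1 - i) * n + (n - 1 - j)] * in[j];
        }
      out0[i] = sum0;
      out1[i] = sum1;
    }
}

// Explicit variant: load each entry of m0 once into a local variable and
// use it for both subface results.
void apply_shared_loads(const Matrix &m0,
                        const Vector &in,
                        Vector       &out0,
                        Vector       &out1)
{
  out0.fill(0.);
  out1.fill(0.);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      {
        const double w = m0[i * n + j]; // single load, used twice below
        out0[i] += w * in[j];
        out1[n - 1 - i] += w * in[n - 1 - j];
      }
}
```

Both functions compute the same two matrix-vector products. The second accumulates the terms of `out1` in reverse order, so the results can differ by rounding and should be compared with a tolerance.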