Reduce overhead in calls to tensor product value function #15182
Force-pushed from 07006fd to 8cfb2f2.
  const std::size_t n_batches =
-   n_q_points_scalar / n_lanes_internal +
-   (n_q_points_scalar % n_lanes_internal > 0 ? 1 : 0);
+   (n_q_points_scalar + n_lanes_internal - 1) / n_lanes_internal;
What you're doing here is rounding up `n_q_points_scalar/n_lanes_internal`. Can you put this into a comment?
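The rounding-up idiom discussed in this comment can be checked in isolation. The following is a minimal sketch, not the code from the PR: the function names `n_batches_with_remainder`, `n_batches_rounded_up`, and `formulas_agree` are hypothetical, and only the two arithmetic expressions are taken from the diff above.

```cpp
#include <cassert>
#include <cstddef>

// Old form from the diff: quotient plus one extra batch if there is a remainder.
constexpr std::size_t
n_batches_with_remainder(const std::size_t n_points, const std::size_t n_lanes)
{
  return n_points / n_lanes + (n_points % n_lanes > 0 ? 1 : 0);
}

// New form from the diff: round up n_points / n_lanes using one division.
constexpr std::size_t
n_batches_rounded_up(const std::size_t n_points, const std::size_t n_lanes)
{
  return (n_points + n_lanes - 1) / n_lanes;
}

// Brute-force comparison of both forms for 0 <= n_points < max_points and
// 1 <= n_lanes <= max_lanes. Returns true if they agree everywhere.
inline bool
formulas_agree(const std::size_t max_points, const std::size_t max_lanes)
{
  assert(max_lanes >= 1);
  for (std::size_t lanes = 1; lanes <= max_lanes; ++lanes)
    for (std::size_t points = 0; points < max_points; ++points)
      if (n_batches_with_remainder(points, lanes) !=
          n_batches_rounded_up(points, lanes))
        return false;
  return true;
}

// Nine points in batches of four lanes need three batches.
static_assert(n_batches_rounded_up(9, 4) == 3, "rounds up");
static_assert(n_batches_rounded_up(8, 4) == 2, "exact multiple stays exact");
```

One caveat of the single-division form is that `n_points + n_lanes - 1` can wrap around for values close to the maximum of `std::size_t`; for realistic quadrature point counts this should not matter.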
  n_shapes,
  solution_renumbered);

  if (evaluation_flags & EvaluationFlags::values)
    {
-     for (unsigned int v = 0; v < stride && q + v < n_q_points_scalar; ++v)
+     for (unsigned int v = 0;
+          v < stride && (stride > 1 ? q + v < n_q_points_scalar : true);
This condition (here and below) stumped me for a bit. I think it might be easier to read as
- v < stride && (stride > 1 ? q + v < n_q_points_scalar : true);
+ v < stride && (stride == 1 || (q + v < n_q_points_scalar));
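To see that the suggested condition does not change the loop's behavior, here is a small sketch that counts the iterations of the inner loop under the original and the suggested condition. It assumes that `stride` is a compile-time constant and that the outer loop already guarantees `q < n_q_points_scalar`; both are assumptions for illustration, and the function names `count_original` and `count_short_circuit` are hypothetical.

```cpp
#include <cassert>

// Number of inner-loop iterations with the original termination criterion.
template <unsigned int stride>
unsigned int
count_original(const unsigned int q, const unsigned int n_q_points_scalar)
{
  assert(q < n_q_points_scalar); // assumed to be guaranteed by the outer loop
  unsigned int count = 0;
  for (unsigned int v = 0; v < stride && q + v < n_q_points_scalar; ++v)
    ++count;
  return count;
}

// Number of inner-loop iterations with the suggested criterion. For
// stride == 1, the bound check on q + v is skipped, since the only
// iteration has v == 0 and q < n_q_points_scalar already holds.
template <unsigned int stride>
unsigned int
count_short_circuit(const unsigned int q, const unsigned int n_q_points_scalar)
{
  assert(q < n_q_points_scalar); // assumed to be guaranteed by the outer loop
  unsigned int count = 0;
  for (unsigned int v = 0;
       v < stride && (stride == 1 || (q + v < n_q_points_scalar));
       ++v)
    ++count;
  return count;
}
```

For example, `count_original<4>(8, 10)` and `count_short_circuit<4>(8, 10)` both process the two remaining points, while with `stride == 1` both variants run exactly one iteration.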
Force-pushed from f59c98d to b5a8c60.
Thanks for the comments, I adapted according to the suggestions.
This is a follow-up to #15137, reducing the integer instruction overhead the code has to go through. There are three main changes:
- Switch from `ArrayView` to plain pointers in the internal interfaces, which reduces the setup cost by 2 instructions and one register to be passed around between function calls.
- For `stride==1`, the loop termination criterion `q + v < n_q_points_scalar;` is the same as in the outer loop and thus always true. Avoid one additional branch instruction by spelling this out as `v < stride && (stride > 1 ? q + v < n_q_points_scalar : true);`. I admit this is not super readable, and there may be better suggestions to make sure the data rearrangement, which is visible in profilers, does not incur a loop if all we add is one element.
- Mark `do_interpolate_xy` with `DEAL_II_ALWAYS_INLINE`. We had a discussion in #14972 (Template loop bounds for flexible evaluate/integrate function, comment), but it seems that at least my compiler (clang-16) does a bad job of guessing the cost of moving the variables in and out of registers; instead, the outer function `evaluate_tensor_product_value_and_gradient_shapes` should be the function that collects the code. From a code size perspective, I do not see an advantage in not inlining, as the only function calling `do_interpolate_xy` is that other function.
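The first and third change can be illustrated together with a small sketch. This is not the deal.II implementation: the macro `SKETCH_ALWAYS_INLINE` only stands in for `DEAL_II_ALWAYS_INLINE` (assumed here to map to the compiler's always-inline attribute), and the functions `interpolate_xy_sketch` and `evaluate_point_sketch`, their signatures, and the row-major coefficient layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for DEAL_II_ALWAYS_INLINE.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ALWAYS_INLINE __attribute__((always_inline))
#else
#  define SKETCH_ALWAYS_INLINE
#endif

// Inner helper with a plain-pointer interface: no view object has to be
// set up, only raw pointers and one size are passed. The helper is forced
// inline because it has a single caller.
inline SKETCH_ALWAYS_INLINE double
interpolate_xy_sketch(const double     *shape_x,
                      const double     *shape_y,
                      const double     *coefficients,
                      const std::size_t n)
{
  double result = 0.;
  for (std::size_t j = 0; j < n; ++j)
    {
      // Contract along x first, then weight the row by the y shape value.
      double row = 0.;
      for (std::size_t i = 0; i < n; ++i)
        row += shape_x[i] * coefficients[j * n + i];
      result += shape_y[j] * row;
    }
  return result;
}

// Outer function: the only caller of the helper and the place where the
// size checks are done once before handing over raw pointers.
inline double
evaluate_point_sketch(const std::vector<double> &shape_x,
                      const std::vector<double> &shape_y,
                      const std::vector<double> &coefficients)
{
  const std::size_t n = shape_x.size();
  assert(shape_y.size() == n && coefficients.size() == n * n);
  return interpolate_xy_sketch(shape_x.data(),
                               shape_y.data(),
                               coefficients.data(),
                               n);
}
```

Whether forced inlining pays off depends on the compiler and the register pressure in the caller, so a sketch like this only shows the mechanics, not the measured benefit reported in the PR.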