Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce overhead in calls to tensor product value function #15182

Merged
merged 2 commits into from May 9, 2023

Conversation

kronbichler
Copy link
Member

This is a follow-up to #15137, reducing the integer code overhead a code has to go through. There are three main changes:

  • We switch from ArrayView to plain pointers in the internal interfaces, which reduces the setup cost by 2 instructions and one register to passed around between function calls.
  • The compiler can't see that, for the case stride==1, the loop termination criterion q + v < n_q_points_scalar; is the same as in the outer loop and thus always true. Avoid one additional branch instruction by spelling out this as v < stride && (stride > 1 ? q + v < n_q_points_scalar : true); - I admit this is not super readable and there are maybe better suggestions to make sure the data rearrangement, which is visible in profilers, does not incur a loop if all we add is one element.
  • I opted to mark do_interpolate_xy with DEAL_II_ALWAYS_INLINE. We had a discussion in Template loop bounds for flexible evaluate/integrate function #14972 (comment) but it seems that at least my compiler (clang-16) does a bad job in guessing the cost of moving the variables in and out of registers; instead, the outer function evaluate_tensor_product_value_and_gradient_shapes should be the function to collect the code. From a code size perspective, I do not see an advantage of not to inline, as the only function calling do_interpolate_xy is that other function.

Comment on lines 1536 to +1539
const std::size_t n_batches =
n_q_points_scalar / n_lanes_internal +
(n_q_points_scalar % n_lanes_internal > 0 ? 1 : 0);
(n_q_points_scalar + n_lanes_internal - 1) / n_lanes_internal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're doing here is rounding up n_q_points_scalar/n_lanes_internal. Can you put this into a comment?

n_shapes,
solution_renumbered);

if (evaluation_flags & EvaluationFlags::values)
{
for (unsigned int v = 0; v < stride && q + v < n_q_points_scalar; ++v)
for (unsigned int v = 0;
v < stride && (stride > 1 ? q + v < n_q_points_scalar : true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition (here and below) stumped me for a bit. I think it might be easier to read as

Suggested change
v < stride && (stride > 1 ? q + v < n_q_points_scalar : true);
v < stride && (stride == 1 || (q + v < n_q_points_scalar));

@kronbichler
Copy link
Member Author

Thanks for the comments, I adapted according to the suggestions.

@bangerth bangerth merged commit a321577 into dealii:master May 9, 2023
14 checks passed
@kronbichler kronbichler deleted the reduce_overhead2 branch August 10, 2023 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants