
Tensor product evaluation at arbitrary points: Optimize linear case #15303

Merged: 3 commits, Jun 16, 2023

Conversation

kronbichler (Member)

This PR provides two separate commits, one for evaluate and one for integrate, to optimize the number of arithmetic operations in the FEPointEvaluation backend for linear shape functions. It is based on ideas discussed in #15273 (comment) and reduces the number of multiplications in favor of a few additions. Overall, the instruction count goes down considerably, especially in 3D, as we avoid redundant computations in some places and replace some multiplications by additions. Even though most modern hardware provides similar execution capabilities for both, multiplications involve more transistors and generate more heat than additions, so the change seems preferable even when performance is merely on par.

I initially feared that we might have a longer critical path due to data dependencies, but at least on my AMD hardware, the performance is consistently better. @bergbauer can you have a look here, too? We might want to drop one of the two commits if we discover clear slowdowns. Note that we also have the case of tensor-valued arguments, where the number of arithmetic operations is not as high to begin with.

@@ -3272,49 +3272,59 @@ namespace internal
else if (dim == 1)
{
// gradient
-              result[0] = values[1] - values[0];
+              result[0] = Number3(values[1]) - Number3(values[0]);
Contributor

Why do you choose to cast both values separately?

Suggested change
result[0] = Number3(values[1]) - Number3(values[0]);
result[0] = Number3(values[1] - values[0]);

Member Author

Good question. I chose this way because we use the cast Number3(values[0]) a line below, so the goal was to tell the compiler to cast values[0] only once. As background, the underlying type of values[0] is typically double or Point<dim>, while Number3 is most often VectorizedArray<double> or Tensor<1, dim, VectorizedArray<double>>, respectively. Hence, the cast involves a broadcast, which might come from memory or from a value already in registers.
This choice is open to debate, as my compiler still chooses to load values[1] and values[0] as scalar variables, build a scalar difference, and then broadcast it (as well as the scalar values[0]). I think either variant works, and if you prefer, we can go with the scalar difference.

Contributor

If the compiler does it like this anyway I think we should write the high-level code the same way. We need two casts either way as I see it.

Member Author

Yes, we need two casts either way. The difference is that on Intel, a broadcast from memory consumes less of the precious execution resources than a scalar load plus a broadcast from register. See https://www.agner.org/optimize/instruction_tables.pdf, page 308: VBROADCASTSD z,m64 (load from memory plus broadcast) executes at 2 instructions per cycle (reciprocal throughput 0.5) on execution ports 2 and 3, whereas VBROADCASTSD v,x (broadcast from register) executes at 1 instruction per cycle (reciprocal throughput 1) on execution port 5, which is also the port that executes FMAs. Now this is Intel specific (I don't know whether Sapphire Rapids still has this behavior), and AMD does it right, see e.g. page 134 for Zen 4 (reciprocal throughput 0.5 both from memory and from registers, and without disturbing the execution pipes doing arithmetic work), so we can stick with the better instruction. (Or let the compiler decide on the right instruction; it seems it will exchange them for whatever is deemed more beneficial anyway.)

Member Author

I now switched to the single cast as suggested.

@bergbauer (Contributor)

I will test this on my Intel and my AMD machine.

@peterrum peterrum self-requested a review June 5, 2023 20:29
@bergbauer (Contributor)

On our Intel server CPU (Cascade Lake), this branch is slightly faster (~2.5% in 3D); on my AMD workstation (Zen 2), the difference in timings is within the noise. (GCC on both systems.)

@@ -3526,28 +3536,31 @@ namespace internal
}
else if (dim == 1)
{
-          return (1. - p[0]) * values[0] + p[0] * values[1];
+          return Number3(values[0]) + p[0] * Number3(values[1] - values[0]);
Contributor

Is the different cast strategy for the values-only path versus the values-and-gradients path intentional? If yes, can you explain why?

Member Author

Good point, I forgot about it.

@kronbichler kronbichler added this to the Release 9.5 milestone Jun 12, 2023
@peterrum peterrum merged commit c03ccfa into dealii:master Jun 16, 2023
14 checks passed
@kronbichler kronbichler deleted the improve_instr_scheduling branch August 10, 2023 16:39
3 participants