Tensor product evaluation at arbitrary points: Optimize linear case #15303
Conversation
@@ -3272,49 +3272,59 @@ namespace internal
      else if (dim == 1)
        {
          // gradient
-         result[0] = values[1] - values[0];
+         result[0] = Number3(values[1]) - Number3(values[0]);
Why did you choose to cast both values separately?
Suggested change:
-    result[0] = Number3(values[1]) - Number3(values[0]);
+    result[0] = Number3(values[1] - values[0]);
Good question. I chose this way because we use the cast `Number3(values[0])` a line below, so the goal was to tell the compiler to cast `values[0]` only once. As background, the underlying type of `values[0]` is typically `double` or `Point<dim>`, while `Number3` is most often `VectorizedArray<double>` or `Tensor<1, dim, VectorizedArray<double>>`, respectively. Hence, the cast involves a broadcast, which might come from memory or from a value already in registers.

This choice is up for debate, as my compiler still chooses to load `values[1]` and `values[0]` as scalar variables, build a scalar difference, and then broadcast it (as well as the scalar `values[0]`). I think both variants are fine, and if you prefer, we can go with the scalar difference.
If the compiler does it like this anyway, I think we should write the high-level code the same way. We need two casts either way, as I see it.
Yes, we need two casts either way. The difference is that on Intel, a broadcast from memory consumes fewer precious execution resources than a scalar load plus a broadcast from a register. See https://www.agner.org/optimize/instruction_tables.pdf, page 308: `VBROADCASTSD z,m64` (load from memory + broadcast) executes at 2 instructions per cycle (RCP 0.5) on execution ports 2 and 3, whereas `VBROADCASTSD v,x` (broadcast from register) executes at 1 instruction per cycle (RCP 1) on execution port 5, which is also the port executing FMAs. Now, this is Intel-specific (I don't know whether Sapphire Rapids still shows this behavior), and AMD does it right, see e.g. page 134 for Zen 4 (RCP 0.5 both from memory and from registers, and not as disturbing for the arithmetic execution pipes), so we can stick with the better instruction. (Or let the compiler decide on the right instruction; it seems it will exchange them for whatever is deemed more beneficial anyway.)
I now switched to the single cast as suggested.
I will test this on my Intel and my AMD machines.
Force-pushed from bda77ae to 59a9361.
On our Intel server CPU (Cascade Lake) this branch is slightly faster (~2.5% in 3D); on my AMD workstation (Zen 2) the difference in timings is in the noise. (GCC on both systems.)
@@ -3526,28 +3536,31 @@ namespace internal
        }
      else if (dim == 1)
        {
-         return (1. - p[0]) * values[0] + p[0] * values[1];
+         return Number3(values[0]) + p[0] * Number3(values[1] - values[0]);
Is the different cast strategy for values-only versus values-and-gradients intentional? If yes, can you explain why?
Good point, I forgot about it.
Force-pushed from 59a9361 to 5e4a172.

Force-pushed from 5e4a172 to 11b9318.
This PR provides two separate commits, one for `evaluate` and one for `integrate`, to optimize the number of arithmetic operations in the `FEPointEvaluation` backend for linear shape functions. This is based on ideas discussed in #15273 (comment) and reduces the number of multiplications in favor of a few additions. Overall, the instruction count goes down considerably, especially in 3D, as we avoid redundant computations in some places and replace some multiplications by additions. Even though most modern hardware provides similar execution capabilities for both, multiplications involve more transistors and generate more heat than additions, so the change feels preferable even where performance is merely similar.

I initially feared that we might have a longer critical path due to dependencies, but at least on my AMD hardware the performance is consistently better. @bergbauer, can you have a look here, too? We might want to drop one of the two commits if we discover that there are clear slowdowns. Note that we also have the case where we evaluate tensor-valued arguments, where the number of arithmetic operations is not as high.
backend for linear shape functions. This is based on ideas discussed in #15273 (comment) and reduces the number of multiplications in favor of a few additions. Overall, the instruction count goes down considerably, especially in 3D, as we avoid redundant computations in some places, and replace some multiplications by additions. Even though most modern hardware provides similar execution capabilities for both, multiplications involve more transistors and generate more heat than additions, so it feels preferable also in case performance is similar.I initially feared that we might have a longer critical path due to dependencies, but at least on my AMD hardware, the performance is consistently better. @bergbauer can you have a look here, too? We might want to drop one of the two commits if we discover that there are clear slowdowns. Note that we also have the case where we evaluate tensor-valued arguments, where the number of arithmetic operations is not as high.