
Tensor product evaluation at arbitrary points: Optimize linear case #15303

Merged: 3 commits, Jun 16, 2023

Conversation

kronbichler (Member)

This PR provides two separate commits, one for evaluate and one for integrate, to optimize the number of arithmetic operations in the FEPointEvaluation backend for linear shape functions. It is based on ideas discussed in #15273 (comment) and reduces the number of multiplications in favor of a few additions. Overall, the instruction count goes down considerably, especially in 3D, as we avoid redundant computations in some places and replace some multiplications by additions. Even though most modern hardware provides similar execution capabilities for both, multiplications involve more transistors and generate more heat than additions, so the change seems preferable even when performance is merely on par.

I initially feared that we might have a longer critical path due to data dependencies, but at least on my AMD hardware, the performance is consistently better. @bergbauer can you have a look here, too? We might want to drop one of the two commits if we discover clear slowdowns. Note that we also have the case of tensor-valued arguments, where the number of arithmetic operations is not as high to begin with.

@@ -3272,49 +3272,59 @@ namespace internal
else if (dim == 1)
{
// gradient
-              result[0] = values[1] - values[0];
+              result[0] = Number3(values[1]) - Number3(values[0]);
Contributor

Why do you choose to cast both values separately?

Suggested change
result[0] = Number3(values[1]) - Number3(values[0]);
result[0] = Number3(values[1] - values[0]);

Member Author

Good question. I chose this way because we use the cast Number3(values[0]) a line below, so the goal was to tell the compiler to cast values[0] only once. As background, the underlying type of values[0] is typically double or Point<dim>, while Number3 is most often VectorizedArray<double> or Tensor<1, dim, VectorizedArray<double>>, respectively. Hence, the cast involves a broadcast, which might come from memory or from a value already in registers.
This choice is open to debate, as my compiler still chooses to load values[1] and values[0] as scalar variables, build a scalar difference, and then broadcast it (as well as the scalar values[0]). I think either variant works, and if you prefer, we can go with the scalar difference.

Contributor

If the compiler does it like this anyway I think we should write the high-level code the same way. We need two casts either way as I see it.

Member Author

Yes, we need two casts either way. The difference is that on Intel, a broadcast from memory consumes less of the precious execution resources than a scalar load plus a broadcast from register. See https://www.agner.org/optimize/instruction_tables.pdf, page 308: VBROADCASTSD z,m64 (load from memory plus broadcast) executes at 2 instructions per cycle (reciprocal throughput 0.5) on execution ports 2 and 3, whereas VBROADCASTSD v,x (broadcast from register) executes at 1 instruction per cycle (reciprocal throughput 1) on execution port 5, which is also the port that executes FMAs. Now this is Intel specific (I don't know whether Sapphire Rapids still has this behavior), and AMD does it right, see e.g. page 134 for Zen 4 (reciprocal throughput 0.5 both from memory and from registers, and without disturbing the execution pipes doing arithmetic work), so we can stick with the better instruction. (Or let the compiler decide on the right instruction; it seems it will exchange them for whatever is deemed more beneficial anyway.)

Member Author

I now switched to the single cast as suggested.

@bergbauer (Contributor)

I will test this on my Intel and my AMD machine.

@peterrum peterrum self-requested a review June 5, 2023 20:29
@bergbauer (Contributor)

On our Intel server CPU (Cascade Lake), this branch is slightly faster (~2.5% in 3D); on my AMD workstation (Zen 2), the difference in timings is within the noise. (GCC on both systems.)

@@ -3526,28 +3536,31 @@ namespace internal
}
else if (dim == 1)
{
-          return (1. - p[0]) * values[0] + p[0] * values[1];
+          return Number3(values[0]) + p[0] * Number3(values[1] - values[0]);
Contributor

Is the different cast strategy for the values-only path versus the values-and-gradients path intentional? If yes, can you explain why?

Member Author

Good point, I forgot about it.

@kronbichler kronbichler added this to the Release 9.5 milestone Jun 12, 2023
@peterrum peterrum merged commit c03ccfa into dealii:master Jun 16, 2023
14 checks passed
@kronbichler kronbichler deleted the improve_instr_scheduling branch August 10, 2023 16:39
3 participants