
Remove the cell dof indices cache #14068

Merged
merged 2 commits into dealii:master from kronbichler:eliminate_dof_handler_cache on Sep 21, 2022

Conversation

kronbichler
Member

Part of #2000.

This PR finally removes the cell_dof_indices_cache. It does not really make sense without #14066 and #14063 because the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.

The advantage of this PR is that it simplifies the data structures of DoFHandler: we keep a single copy of the unknowns, and access to cell and face DoFs now has the same performance. While this is slower (around 3x the number of instructions) than the old cache, the function to retrieve dof_indices is essentially never the real bottleneck. More importantly, the main use case for fast index access needs MPI-local indices (which cannot be part of DoFHandler, see #2000), and those are available through the MatrixFree/DoFInfo component, which can be seen as the more appropriate cache.
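
To illustrate which user-facing call is affected, here is a schematic example (my own illustration, not code from this PR); the interface and results are unchanged, only where the indices are read from differs:

```cpp
// Schematic example (not code from this PR): the typical way user code
// retrieves the global DoF indices of a cell. With this patch, the call reads
// the indices from the per-object storage of the DoFHandler instead of the
// per-cell cache.
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

#include <vector>

int main()
{
  using namespace dealii;

  Triangulation<3> tria;
  GridGenerator::hyper_cube(tria);
  tria.refine_global(2);

  const FE_Q<3> fe(1);
  DoFHandler<3> dof_handler(tria);
  dof_handler.distribute_dofs(fe);

  std::vector<types::global_dof_index> dof_indices(fe.n_dofs_per_cell());
  for (const auto &cell : dof_handler.active_cell_iterators())
    cell->get_dof_indices(dof_indices); // previously served from cell_dof_indices_cache
}
```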

Thoughts? Are there benchmark cases we should look at to assess the proposed direction? Maybe in the hp case with some non-trivial mesh?

@peterrum
Member

> It does not really make sense without #14066 and #14063 because the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.

Generally, I am in favor of removing the cache. But as far as I can see, most of your optimizations target hypercube cells. What is the result on simplex and mixed meshes: is it "much slower", or are we talking about a few percent for reading the indices, which is anyway dominated by FEValues? You could run the benchmark in https://github.com/peterrum/dealii-simplex-throughput/blob/master/throughput_01.cc and/or use it as a starting point.

@marcfehling Could you run the tests from your paper to see how fast loading DoF indices is currently?

@kronbichler
Member Author

I agree, we should run a benchmark and have a look at the proposed changes in #14066. I won't get to it this week, but this PR is not super urgent anyway.

@kronbichler
Member Author

I did have a look at the simplex-throughput benchmark. While the proposed changes are certainly visible, their contribution to the overall instruction count is less than 0.5% (18 million out of 4 billion instructions), and it is also small compared to MatrixFree::reinit (less than 10% with the current inefficient version, probably less than 3% with some optimization). The problem is that the simplex code paths have a large number of inefficiencies here and there, so this gets completely dominated by those, see also #14117. I will fix the simplex case for the line indices, but I cannot fix the other places, which are unrelated to this PR anyway.

```cpp
    this->present_level,
    this->index(),
    this->get_fe().n_dofs_per_cell());
  boost::container::small_vector<types::global_dof_index, 27> dof_indices(
```
Member

How did you come up with 27?

Member

I assume this is for quadratic hex elements (HEX27) in 3D.

Member

Could we make this value a static constant somewhere? That would make it simpler to change in the future.

Member

Sure, my question mostly is why this is a good value.

Member

Would be great to have a comment here explaining this.

Member Author

I agree, I will add a new static constexpr variable to the dof accessor class for this. The choice of 27 is indeed motivated by making the most common low-order cases fast (for larger elements, the cost of the dof index access does not matter much). I wanted to cover up to scalar Q_2 in 3D (27 DoFs), vector-valued Q_1 in 3D (24), and 2D Stokes with Taylor-Hood elements (22).
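
Roughly along these lines (illustrative names only, not necessarily what will end up in the library):

```cpp
// Hypothetical sketch of keeping the inline capacity in a single constant
// (names are illustrative, not the ones added to the library). 27 covers
// scalar Q2 in 3D (27 DoFs), vector-valued Q1 in 3D (3 * 8 = 24), and a 2D
// Taylor-Hood Stokes element (2 * 9 + 4 = 22) without a heap allocation;
// larger elements simply spill over to the heap.
#include <boost/container/small_vector.hpp>

#include <cstdint>

namespace sketch
{
  using global_dof_index = std::uint64_t; // stand-in for types::global_dof_index

  // single place to adjust the stack-allocated capacity
  constexpr unsigned int typical_dofs_per_cell = 27;

  using dof_index_vector =
    boost::container::small_vector<global_dof_index, typical_dofs_per_cell>;
} // namespace sketch

int main()
{
  // a Q1 hex cell with 8 DoFs stays within the inline storage
  sketch::dof_index_vector dof_indices(8);
  return static_cast<int>(dof_indices.size()) - 8; // returns 0
}
```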

Member Author

I have now added an alias to keep the definition in a single place, along with a comment explaining the motivation behind the value.

@marcfehling
Member

> @marcfehling Could you run the tests from your paper to see how fast loading DoF indices is currently?

Sorry, I missed your message. I will run some comparison tests and post the results here soon.

@marcfehling
Member

Again, sorry that I gave this PR such a low priority. I will run my examples this week and provide you with timing results.

@marcfehling
Member

marcfehling commented Sep 17, 2022

I've run the two scenarios from our parallel-hp paper: the Laplace problem on the L-shaped domain, and the Stokes problem on the Y-pipe. I chose problem sizes that could be solved quite quickly on Wolfgang's machines (2 sockets, each with a 48-core AMD EPYC 7552), using 1 MPI process and 64 MPI processes, respectively. The examples do not use matrix-free methods.

I didn't notice much of a change in overall performance. It stood out that dof-enumeration is faster, but operations in the estimate-mark-refine cycle are slower. You can find the results in this spreadsheet. I've colored changes larger than 5%.
compare_runtimes.ods

I also ran cachegrind on the smallest Laplace problem with 1 MPI process. I hadn't used this tool before, so I'm rather unfamiliar with its output. The only difference I noticed with this patch is that the first-level instruction cache miss rate is slightly higher (I1 miss rate: 0.06% -> 0.10%). You'll find the output attached.
nocache.tar.gz
cache.tar.gz

@kronbichler
Member Author

Thank you for the detailed experiment @marcfehling! The numbers you present make sense: the enumeration of unknowns now has to do less work (it no longer populates the cache), whereas the estimation part shows the opposite effect. There we compute quantities that are relatively cheap (the solution gradient at faces), so the additional work for extracting the indices becomes more visible. Since this type of face access retrieves the information 2 * dim times per cell, we see a relatively high cost of index manipulations, higher than in loops over cells.
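
As a schematic illustration (not the actual estimator code) of why face-based estimation is more sensitive to the cost of index access than a plain cell loop:

```cpp
// Schematic illustration (not the actual estimator code): each cell visits its
// 2 * dim faces, so DoF indices of the neighboring cells are requested several
// times per cell, while a plain cell loop reads each cell's indices only once.
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

#include <vector>

int main()
{
  using namespace dealii;

  Triangulation<3> tria;
  GridGenerator::hyper_cube(tria);
  tria.refine_global(3);

  const FE_Q<3> fe(1);
  DoFHandler<3> dof_handler(tria);
  dof_handler.distribute_dofs(fe);

  std::vector<types::global_dof_index> cell_indices(fe.n_dofs_per_cell());
  std::vector<types::global_dof_index> neighbor_indices(fe.n_dofs_per_cell());

  for (const auto &cell : dof_handler.active_cell_iterators())
    {
      // one index lookup for the cell itself ...
      cell->get_dof_indices(cell_indices);

      // ... plus up to 2 * dim lookups for the neighbors across the faces
      for (const unsigned int f : cell->face_indices())
        if (!cell->at_boundary(f))
          cell->neighbor(f)->get_dof_indices(neighbor_indices);
    }
}
```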

I also looked into the attached cachegrind profiles: Again, the observed increase in instruction cache misses is not surprising, because the code to get all the indices of a cell is no longer a short loop, but needs to step over several objects. Nonetheless, the overall number of instruction cache misses is very low; 1 out of 1000 is a small number. In reality, I believe valgrind's model might even be pessimistic, because the last-level cache can cover most of the misses, and instruction prefetching will remove most of the remaining problematic cases. In terms of performance, I would therefore see the main impact of removing the dof indices cache not in the instruction cache misses, but in the instruction count, which is a few percent higher for selected operations, because that is the actual bottleneck in most assembly/estimation loops. Especially without vectorization, these operations are core bound.

So I would summarize your experiment as overall positive: The increase in instruction counts and the slowdown are similar to what I observe for the non-hp case. There are some cases here and there with increased costs (at most 5-10%), but they occur in code that is rather cheap compared to the main computations in a finite element solver. On the plus side, we reduce the memory consumption of the DoFHandler and make the setup somewhat cheaper. I personally think that the main reason not to store this cache data structure is that, while better in some scenarios, it is not fast enough for real HPC usage (in that case, we would want to store MPI-local indices, not MPI-global indices), so one would need yet another level of storage anyway.

To summarize, I would suggest we strive to get at least three approvals on whether we want to merge this PR, because it is a tradeoff.

@tjhei
Member

tjhei commented Sep 21, 2022

I would argue that code simplification is another reason to merge this PR. 👍

@tjhei tjhei merged commit 07fc51a into dealii:master Sep 21, 2022
@kronbichler kronbichler deleted the eliminate_dof_handler_cache branch August 10, 2023 16:40