
Remove the cell dof indices cache #14068

Merged
merged 2 commits into dealii:master from kronbichler:eliminate_dof_handler_cache on Sep 21, 2022

Conversation

kronbichler
Member

Part of #2000.

This PR finally removes the cell_dof_indices_cache. It does not really make sense without #14066 and #14063 because the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.

The advantage of this PR is that it simplifies the data structures of DoFHandler: we keep a single copy of the unknowns, and access to cell and face DoFs now has the same performance. While this is slower (around 3x the number of instructions) than the old cache, the function to retrieve dof_indices is essentially never the real bottleneck. More importantly, the main use case for fast index access needs MPI-local indices (which cannot be part of DoFHandler, see #2000), and those are available through the MatrixFree/DoFInfo component, which can be seen as the more appropriate cache.
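
To illustrate which user-facing call is affected, here is a schematic example (my own illustration, not code from this PR); the interface and results are unchanged, only where the indices are read from differs:

```cpp
// Schematic example (not code from this PR): the typical way user code
// retrieves the global DoF indices of a cell. With this patch, the call reads
// the indices from the per-object storage of the DoFHandler instead of the
// per-cell cache.
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

#include <vector>

int main()
{
  using namespace dealii;

  Triangulation<3> tria;
  GridGenerator::hyper_cube(tria);
  tria.refine_global(2);

  const FE_Q<3> fe(1);
  DoFHandler<3> dof_handler(tria);
  dof_handler.distribute_dofs(fe);

  std::vector<types::global_dof_index> dof_indices(fe.n_dofs_per_cell());
  for (const auto &cell : dof_handler.active_cell_iterators())
    cell->get_dof_indices(dof_indices); // previously served from cell_dof_indices_cache
}
```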

Thoughts? Are there benchmark cases we should look at to assess the proposed direction? Maybe in the hp case with some non-trivial mesh?

@peterrum
Member

> It does not really make sense without #14066 and #14063 because the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.

Generally, I am in favor of removing the cache. But as far as I can see, most of your optimizations target hypercube cells. What is the result on simplex and mixed meshes: is it "much slower", or are we talking about a few percent for reading the indices, which is anyway dominated by FEValues? You could run the benchmark in https://github.com/peterrum/dealii-simplex-throughput/blob/master/throughput_01.cc and/or use it as a starting point.

@marcfehling Could you run the tests from your paper to see how fast loading DoF indices is currently?

@kronbichler
Member Author

I agree, we should run a benchmark and have a look at the proposed changes in #14066. I won't get to it this week, but this PR is not super urgent anyway.

@kronbichler
Member Author

I did have a look at the simplex-throughput benchmark. While the proposed changes are certainly visible, their contribution to the overall instruction count is less than 0.5% (18 million out of 4 billion instructions), and it is also small compared to MatrixFree::reinit (less than 10% with the current inefficient version, probably less than 3% with some optimization). The problem is that the simplex code paths have a large number of inefficiencies here and there, so this gets completely dominated by those, see also #14117. I will fix the simplex case for the line indices, but I cannot fix the other places, which are unrelated to this PR anyway.

```cpp
    this->present_level,
    this->index(),
    this->get_fe().n_dofs_per_cell());
  boost::container::small_vector<types::global_dof_index, 27> dof_indices(
```
Member

How did you come up with 27?

Member

I assume this is for quadratic hex elements (HEX27) in 3D.

Member

Could we make this value a static constant somewhere? That would make it simpler to change in the future.

Member

Sure, my question mostly is why this is a good value.

Member

Would be great to have a comment here explaining this.

Member Author

I agree, I will add a new static constexpr variable to the dof accessor class for this. The choice of 27 is indeed motivated by making the most common low-order cases fast (for larger elements, the cost of the dof index access does not matter much). I wanted to cover up to scalar Q_2 in 3D (27 DoFs), vector-valued Q_1 in 3D (24), and 2D Stokes with Taylor-Hood elements (22).
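
Roughly along these lines (illustrative names only, not necessarily what will end up in the library):

```cpp
// Hypothetical sketch of keeping the inline capacity in a single constant
// (names are illustrative, not the ones added to the library). 27 covers
// scalar Q2 in 3D (27 DoFs), vector-valued Q1 in 3D (3 * 8 = 24), and a 2D
// Taylor-Hood Stokes element (2 * 9 + 4 = 22) without a heap allocation;
// larger elements simply spill over to the heap.
#include <boost/container/small_vector.hpp>

#include <cstdint>

namespace sketch
{
  using global_dof_index = std::uint64_t; // stand-in for types::global_dof_index

  // single place to adjust the stack-allocated capacity
  constexpr unsigned int typical_dofs_per_cell = 27;

  using dof_index_vector =
    boost::container::small_vector<global_dof_index, typical_dofs_per_cell>;
} // namespace sketch

int main()
{
  // a Q1 hex cell with 8 DoFs stays within the inline storage
  sketch::dof_index_vector dof_indices(8);
  return static_cast<int>(dof_indices.size()) - 8; // returns 0
}
```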

Member Author

I have now added an alias to keep the definition in a single place, along with a comment explaining the motivation behind the value.

@marcfehling
Member

> @marcfehling Could you run the tests from your paper to see how fast loading DoF indices is currently?

Sorry, I missed your message. I will run some comparison tests and post the results here soon.

@marcfehling
Member

Again, sorry that I gave this PR such a low priority. I will run my examples this week and provide you with timing results.

@marcfehling
Member

marcfehling commented Sep 17, 2022

I've run the two scenarios from our parallel-hp paper: the Laplace problem on the L-shaped domain, and the Stokes problem on the Y-pipe. I chose problem sizes that could be solved quite quickly on Wolfgang's machines (2 sockets, each with a 48-core AMD EPYC 7552), using 1 MPI process and 64 MPI processes, respectively. The examples do not use matrix-free methods.

I didn't notice much of a change in overall performance. It stood out that dof-enumeration is faster, but operations in the estimate-mark-refine cycle are slower. You can find the results in this spreadsheet. I've colored changes larger than 5%.
compare_runtimes.ods

I also ran cachegrind on the smallest Laplace problem with 1 MPI process. I hadn't used this tool before, so I'm rather unfamiliar with its output. The only difference I noticed with this patch is that the first-level instruction cache miss rate is slightly higher (I1 miss rate: 0.06% -> 0.10%). You'll find the output attached.
nocache.tar.gz
cache.tar.gz

@kronbichler
Member Author

Thank you for the detailed experiment @marcfehling! The numbers you present make sense: the enumeration of unknowns now has to do less work (it no longer populates the cache), whereas the estimation part shows the opposite effect. There we compute quantities that are relatively cheap (the solution gradient at faces), so the additional work for extracting the indices becomes more visible. Since this type of face access retrieves the information 2 * dim times per cell, we see a relatively high cost of index manipulations, higher than in loops over cells.
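
As a schematic illustration (not the actual estimator code) of why face-based estimation is more sensitive to the cost of index access than a plain cell loop:

```cpp
// Schematic illustration (not the actual estimator code): each cell visits its
// 2 * dim faces, so DoF indices of the neighboring cells are requested several
// times per cell, while a plain cell loop reads each cell's indices only once.
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

#include <vector>

int main()
{
  using namespace dealii;

  Triangulation<3> tria;
  GridGenerator::hyper_cube(tria);
  tria.refine_global(3);

  const FE_Q<3> fe(1);
  DoFHandler<3> dof_handler(tria);
  dof_handler.distribute_dofs(fe);

  std::vector<types::global_dof_index> cell_indices(fe.n_dofs_per_cell());
  std::vector<types::global_dof_index> neighbor_indices(fe.n_dofs_per_cell());

  for (const auto &cell : dof_handler.active_cell_iterators())
    {
      // one index lookup for the cell itself ...
      cell->get_dof_indices(cell_indices);

      // ... plus up to 2 * dim lookups for the neighbors across the faces
      for (const unsigned int f : cell->face_indices())
        if (!cell->at_boundary(f))
          cell->neighbor(f)->get_dof_indices(neighbor_indices);
    }
}
```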

I also looked into the attached cachegrind profiles: Again, the observed increase in instruction cache misses is not surprising, because the code to get all the indices of a cell is no longer a short loop, but needs to step over several objects. Nonetheless, the overall number of instruction cache misses is very low; 1 out of 1000 is a small number. In reality, I believe valgrind's model might even be pessimistic, because the last-level cache can cover most of the misses, and instruction prefetching will remove most of the remaining problematic cases. In terms of performance, I would therefore see the main impact of removing the dof indices cache not in the instruction cache misses, but in the instruction count, which is a few percent higher for selected operations, because that is the actual bottleneck in most assembly/estimation loops. Especially without vectorization, these operations are core bound.

So I would summarize your experiment as overall positive: The increase in instruction counts and the slowdown are similar to what I observe for the non-hp case. There are some cases here and there with increased costs (at most 5-10%), but they occur in code that is rather cheap compared to the main computations in a finite element solver. On the plus side, we reduce the memory consumption of the DoFHandler and make the setup somewhat cheaper. I personally think that the main reason not to store this cache data structure is that, while better in some scenarios, it is not fast enough for real HPC usage (in that case, we would want to store MPI-local indices, not MPI-global indices), so one would need yet another level of storage anyway.

To summarize, I would suggest we strive to get at least three approvals on whether we want to merge this PR, because it is a tradeoff.

@tjhei
Member

tjhei commented Sep 21, 2022

I would argue that code simplification is another reason to merge this PR. 👍

@tjhei tjhei merged commit 07fc51a into dealii:master Sep 21, 2022
@kronbichler kronbichler deleted the eliminate_dof_handler_cache branch August 10, 2023 16:40