Remove the cell dof indices cache #14068
Conversation
It does not really make sense without #14066 and #14063, because without them the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.
Generally, I am in favor of removing the cache. But as far as I can see, most of your optimizations target hypercube cells. What is the result on simplex and mixed meshes: is it "much slower"? Are we talking about a few percent of the time for reading the indices, which is anyway dominated by FEValues? You could run the benchmark in https://github.com/peterrum/dealii-simplex-throughput/blob/master/throughput_01.cc and/or use it as a starting point.
@marcfehling Could you run the tests from your paper to see how fast loading DoF indices currently is?
I agree, we should run a benchmark and have a look at the proposed changes in #14066. I won't get to it this week, but this PR is not super urgent anyway.
I did have a look at the simplex-throughput benchmark. While the proposed changes are certainly visible, their contribution to the overall instruction count is less than 0.5% (18 million out of 4 billion instructions), and is also small compared to
    this->present_level,
    this->index(),
    this->get_fe().n_dofs_per_cell());
  boost::container::small_vector<types::global_dof_index, 27> dof_indices(
How did you come up with 27?
I assume this is for quadratic hex elements (HEX27) in 3D.
Could we make this value a static constant somewhere? That would make it simpler to change in the future.
Sure, my question is mostly why this is a good value.
Would be great to have a comment here explaining this.
I agree, I will add a new static constexpr variable to the dof accessor class for this. The choice of 27 is indeed motivated by making the most common low-order cases fast (for the larger cases, the cost of the dof index access does not matter much). I wanted to cover up to scalar Q_2 (27 dofs per cell in 3D) and vector-valued Q_1 (24 in 3D), plus the 2D Stokes problem with Taylor-Hood elements (22).
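The arithmetic behind these sizes is easy to check. The sketch below is a standalone illustration, not deal.II code; the helper name `q_dofs_per_cell` is invented for this example:

```cpp
// Number of dofs per cell of a scalar tensor-product element Q_k in `dim`
// space dimensions: (k + 1)^dim. Helper name invented for illustration.
constexpr unsigned int q_dofs_per_cell(unsigned int degree, unsigned int dim)
{
  unsigned int n = 1;
  for (unsigned int d = 0; d < dim; ++d)
    n *= degree + 1;
  return n;
}

// Scalar Q2 hex in 3D:        q_dofs_per_cell(2, 3)                          == 27
// Vector-valued Q1 in 3D:     3 * q_dofs_per_cell(1, 3)                      == 24
// 2D Taylor-Hood (Q2-Q1):     2 * q_dofs_per_cell(2, 2) + q_dofs_per_cell(1, 2) == 22
```

A stack capacity of 27 therefore covers all three of these cases without a heap allocation in `boost::container::small_vector`.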
I have now added an alias to keep the definition in a single place, and added a comment explaining the motivation.
Sorry, I did miss your message. I will run some comparison tests and post the results here soon.
Again, sorry that I gave this PR such a low priority. I will run my examples this week and provide you with timing results.
I've run the two scenarios from our parallel-hp paper: the Laplace problem on the L-shaped domain, and the Stokes problem on the Y-pipe. I chose problem sizes that could be solved quite quickly on Wolfgang's machine (2 sockets, each with a 48-core AMD EPYC 7552), using 1 MPI process and 64 MPI processes, respectively. The examples do not use matrix-free methods.

I didn't notice much of a change in overall performance. It stood out that dof enumeration is faster, but operations in the estimate-mark-refine cycle are slower. You can find the results in this spreadsheet; I've colored changes larger than 5%. I also ran cachegrind (profiles attached).
Thank you for the detailed experiment @marcfehling! The numbers you present make sense: the enumeration of unknowns now has to do less work (it no longer populates the cache), whereas the estimation part shows the opposite effect: we compute quantities that are relatively cheap (the solution gradient at faces), so the additional work for extracting the indices becomes more visible.

Since this type of face access retrieves the information on demand, I also looked into the attached cachegrind profiles. Again, the observed increase in instruction cache misses is not surprising, because the code to get all the indices for a cell is no longer a short loop, but needs to step over several objects. Nonetheless, the overall number of instruction cache misses is very low; 1 out of 1000 is a small number. In reality, I believe that valgrind's model might even be pessimistic, because the last-level cache can cover most of the misses, and instruction prefetching will remove most of the problematic cases.

In terms of performance, I would therefore see the main impact of removing the dof indices cache not in the instruction cache misses, but in the instruction count, which is a few percent higher for selected operations, because that is the actual bottleneck in most assembly/estimation loops. Especially without vectorization, these things are core bound.

So I would summarize your experiment as overall positive: the increase in instruction counts and the slowdown are similar to what I observe for the non-hp case. There are some cases here and there with increased costs (at most 5-10%), but they occur in code that is rather cheap compared to the main computations in a finite element solver. On the plus side, we reduce the memory consumption of the DoFHandler.

To summarize, I would suggest we strive to get at least three approvals on whether we want to merge this PR, because it is a tradeoff.
I would argue that code simplification is another reason to merge this PR. 👍
Part of #2000.
This PR finally removes the `cell_dof_indices_cache`. It does not really make sense without #14066 and #14063, because without them the resulting code would be much slower than necessary, but the present PR is technically independent of those PRs.

The advantage of this PR is that it simplifies the data structures of `DoFHandler` in the sense that we keep a single copy of the unknowns, and access of `cell` and `face` DoFs now has the same performance. While this is slower (around 3x the number of instructions) than the old cache, the function to retrieve `dof_indices` is essentially never the real bottleneck. More importantly, the main use of fast access to indices needs MPI-local indices (which cannot be part of `DoFHandler`, see #2000), and those are available through the `MatrixFree`/`DoFInfo` component, which can be seen as the more appropriate cache.

Thoughts? Are there benchmark cases we should look at to assess the proposed direction? Maybe in the hp case with some non-trivial mesh?
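To make the tradeoff concrete, here is a hedged toy model of the two storage schemes on a 1D mesh of Q1 line cells. All names (`ToyDoFStorage`, `dof_index`, both accessors) are invented for this sketch; it is not deal.II code, only an illustration of "single copy plus gather" versus "redundant per-cell cache":

```cpp
#include <cstdint>
#include <vector>

// Toy model: each vertex carries one dof index, stored exactly once in
// vertex_dofs (the "single copy"). The optional cell_cache duplicates the
// two indices of every cell in contiguous memory.
using dof_index = std::uint64_t;

struct ToyDoFStorage
{
  std::vector<dof_index> vertex_dofs; // the single copy: one dof per vertex
  std::vector<dof_index> cell_cache;  // redundant cache: 2 indices per cell

  explicit ToyDoFStorage(unsigned int n_cells)
    : vertex_dofs(n_cells + 1)
  {
    for (unsigned int v = 0; v <= n_cells; ++v)
      vertex_dofs[v] = v;
    // The cache is built once during "enumeration"; removing it saves this
    // memory at the cost of gathering on every access.
    for (unsigned int c = 0; c < n_cells; ++c)
      {
        cell_cache.push_back(vertex_dofs[c]);
        cell_cache.push_back(vertex_dofs[c + 1]);
      }
  }

  // Cached access: a plain contiguous copy, cheap per call.
  void get_dof_indices_cached(unsigned int cell,
                              std::vector<dof_index> &out) const
  {
    out.assign(cell_cache.begin() + 2 * cell,
               cell_cache.begin() + 2 * cell + 2);
  }

  // On-demand access: gather from the per-vertex storage. More instructions
  // per call, but no redundant storage in the handler.
  void get_dof_indices_gathered(unsigned int cell,
                                std::vector<dof_index> &out) const
  {
    out = {vertex_dofs[cell], vertex_dofs[cell + 1]};
  }
};
```

Both accessors must return identical indices; the difference is only that the cached variant keeps a second copy of every cell's indices alongside the per-vertex storage, which is the memory this PR removes.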