New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Global coarsening: compress weights #13099
Conversation
5aedbc7
to
cfa11cf
Compare
@kronbichler I am tempted to get rid of the short-cut path, since it makes the application of hanging-node constraints and the application of weights annoyingly complicated. What do you think? dealii/include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h Lines 2526 to 2571 in cfa11cf
|
I think I agree. Just so we do not miss anything, can you summarize the operations we need to go through in that case? I guess it is a single sum-factorization interpolation (at least if we take the right path), i.e., approximately |
No that would be still guarded. The only difference would be that data is copied into a buffer in a vectorized form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, apart from a few comments on the code. I agree we should remove the specialization as it does not contribute to faster execution (unless everything sits in L2 cache).
|
||
for (const auto &scheme : transfer.schemes) | ||
{ | ||
std::cout << scheme.degree_fine << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::cout << scheme.degree_fine << std::endl; |
|
||
if (!set(mask[c][shift], weights[0])) | ||
return; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9 * degree_to_3[k] + 3 * degree_to_3[j]; | ||
|
||
if (!set(mask[c][shift], weights[0])) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why you want to return here? Shouldn't we simply continue
in that case? Either way, I do not think I understand what exactly happens in this function and what is supposed to happen if set
returns false
, given the side effects inside the function on mask
and multiple checks. Also, I think that some comment on what the loop over cell
does would help.
for (unsigned int c = 0; c < n_components; ++c) | ||
for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k) | ||
for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j) | ||
{ | ||
const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j]; | ||
data[0] *= weights[shift]; | ||
// loop bound as int avoids compiler warnings in case loop_length | ||
// == 1 (polynomial degree 0) | ||
for (int i = 1; i < loop_length - 1; ++i) | ||
data[i] *= weights[shift + 1]; | ||
data[loop_length - 1] *= weights[shift + 2]; | ||
data += loop_length; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same code as in the other MG transfer function
dealii/source/multigrid/mg_transfer_matrix_free.cc
Lines 362 to 388 in cfa11cf
weight_dofs_on_child(const VectorizedArray<Number> *weights, | |
const unsigned int n_components, | |
const unsigned int fe_degree, | |
VectorizedArray<Number> * data) | |
{ | |
Assert(fe_degree > 0, ExcNotImplemented()); | |
Assert(fe_degree < 100, ExcNotImplemented()); | |
const int loop_length = degree != -1 ? 2 * degree + 1 : 2 * fe_degree + 1; | |
unsigned int degree_to_3[100]; | |
degree_to_3[0] = 0; | |
for (int i = 1; i < loop_length - 1; ++i) | |
degree_to_3[i] = 1; | |
degree_to_3[loop_length - 1] = 2; | |
for (unsigned int c = 0; c < n_components; ++c) | |
for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k) | |
for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j) | |
{ | |
const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j]; | |
data[0] *= weights[shift]; | |
// loop bound as int avoids compiler warnings in case loop_length | |
// == 1 (polynomial degree 0) | |
for (int i = 1; i < loop_length - 1; ++i) | |
data[i] *= weights[shift + 1]; | |
data[loop_length - 1] *= weights[shift + 2]; | |
data += loop_length; | |
} | |
} |
I suggest to move the code to a common location and only implement it once. Since this is a regular tensor product variant, it might fit in include/deal.II/matrix_free/tensor_product_kernels.h
.
cfa11cf
to
8cd520f
Compare
@kronbichler I have made the changes! |
@@ -2586,6 +2586,36 @@ namespace internal | |||
} | |||
|
|||
|
|||
template <int dim, typename Number> | |||
inline void | |||
weight_dofs_on_child(const VectorizedArray<Number> *weights, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this into something more generic, e.g. weight_fe_q_dofs_by_entity
? Also, I wonder if you would mind injecting the dimension loop_length
via a template argument for the case this is supported, in order to reduce loop overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I wonder if you would mind injecting the dimension loop_length via a template argument for the case this is supported, in order to reduce loop overhead?
I would have thought that inline is enough. But I can do the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but my experience is that the compiler feels tempted to not inline such functions (with more than >20 obvious instructions) because it sees it fit for multiple template arguments, reducing the machine code size a bit. Now I won't blame any compiler for doing so (not sure if this is the case here, though), because it is impossible for the compiler to know that different degrees will rarely be executed in close temporal proximity, making the code size reduction effective at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
8cd520f
to
a2962cd
Compare
/rebuild |
a2962cd
to
9f5e645
Compare
9f5e645
to
bf201fd
Compare
No description provided.