
Global coarsening: compress weights #13099

Merged: 1 commit, Dec 22, 2021

Conversation

peterrum (Member):
No description provided.

peterrum force-pushed the gc_weighting branch 2 times, most recently from 5aedbc7 to cfa11cf (December 20, 2021 19:23)
peterrum changed the title from "[WIP] Global coarsening: compress weights" to "Global coarsening: compress weights" (Dec 20, 2021)
peterrum (Member, Author):

@kronbichler I am tempted to get rid of the short-cut path, since it makes the application of hanging-node constraints and the application of weights annoyingly complicated. What do you think?

// identity -> take short cut and work directly on global vectors
if (scheme.prolongation_matrix.size() == 0 &&
    scheme.prolongation_matrix_1d.size() == 0)
  {
    for (unsigned int cell = 0; cell < scheme.n_coarse_cells;
         cell += n_lanes)
      {
        const unsigned int n_lanes_filled =
          (cell + n_lanes > scheme.n_coarse_cells) ?
            (scheme.n_coarse_cells - cell) :
            n_lanes;

        // read from source vector
        for (unsigned int v = 0; v < n_lanes_filled; ++v)
          {
            if ((scheme.n_dofs_per_cell_fine != 0) &&
                (scheme.n_dofs_per_cell_coarse != 0))
              {
                if (fine_element_is_continuous)
                  for (unsigned int i = 0; i < scheme.n_dofs_per_cell_fine;
                       ++i)
                    vec_fine_ptr->local_element(indices_fine[i]) +=
                      read_dof_values(indices_coarse[i], vec_coarse) *
                      weights[i];
                else
                  for (unsigned int i = 0; i < scheme.n_dofs_per_cell_fine;
                       ++i)
                    vec_fine_ptr->local_element(indices_fine[i]) +=
                      read_dof_values(indices_coarse[i], vec_coarse);
              }

            indices_fine += scheme.n_dofs_per_cell_fine;
            indices_coarse += scheme.n_dofs_per_cell_coarse;

            if (fine_element_is_continuous)
              weights += scheme.n_dofs_per_cell_fine;
          }

        if (fine_element_is_continuous)
          weights_compressed += 1;
      }

    continue;
  }

kronbichler (Member):

I am tempted to get rid of the short-cut path [...]

I think I agree. Just so we do not miss anything, can you summarize the operations we need to go through in that case? I guess it is a single sum-factorization interpolation (at least if we take the right path), i.e., approximately dim * (n_points_fine)^dim * n_points_coarse operations?

peterrum (Member, Author):

I think I agree. Just so we do not miss anything, can you summarize the operations we need to go through in that case? I guess it is a single sum-factorization interpolation (at least if we take the right path), i.e., approximately dim * (n_points_fine)^dim n_points_coarse?

No, that would still be guarded. The only difference would be that the data is copied into a buffer in vectorized form.
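To illustrate what "copied into a buffer in vectorized form" means, here is a minimal standalone sketch. It is not deal.II code: a plain fixed-size lane array stands in for one VectorizedArray register, and the container names (`cell_dof_indices`, `gather_cell_values`) are hypothetical. Lane v of buffer entry i receives the value of DoF i on cell v, which is the layout the sum-factorization kernels then operate on.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Number of cells processed together; stands in for the SIMD width.
constexpr unsigned int n_lanes = 4;

// Gather the coarse DoF values of up to n_lanes cells into an
// interleaved buffer: buffer[i][v] = value of DoF i on cell
// (first_cell + v). Unfilled lanes stay zero.
void gather_cell_values(
  const std::vector<double> &global_vector,
  const std::vector<std::vector<unsigned int>> &cell_dof_indices,
  unsigned int first_cell,
  unsigned int n_cells,
  unsigned int n_dofs_per_cell,
  std::vector<std::array<double, n_lanes>> &buffer)
{
  buffer.assign(n_dofs_per_cell, {});
  const unsigned int n_lanes_filled =
    std::min(n_lanes, n_cells - first_cell);
  for (unsigned int v = 0; v < n_lanes_filled; ++v)
    for (unsigned int i = 0; i < n_dofs_per_cell; ++i)
      buffer[i][v] = global_vector[cell_dof_indices[first_cell + v][i]];
}
```

With this layout, one multiply applies to all lanes at once, which is exactly what the non-short-cut path exploits.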

kronbichler (Member) left a review comment:

Looks good to me, apart from a few comments on the code. I agree we should remove the specialization as it does not contribute to faster execution (unless everything sits in L2 cache).


for (const auto &scheme : transfer.schemes)
  {
    std::cout << scheme.degree_fine << std::endl;
kronbichler (Member) suggested a change (remove the debug output):

-    std::cout << scheme.degree_fine << std::endl;


if (!set(mask[c][shift], weights[0]))
  return;

kronbichler (Member) suggested a change:

      9 * degree_to_3[k] + 3 * degree_to_3[j];

    if (!set(mask[c][shift], weights[0]))
      return;
kronbichler (Member):

Can you explain why you want to return here? Shouldn't we simply continue in that case? Either way, I do not think I understand what exactly happens in this function and what is supposed to happen if set returns false, given the side effects inside the function on mask and multiple checks. Also, I think that some comment on what the loop over cell does would help.

Comment on lines 2445 to 2457
for (unsigned int c = 0; c < n_components; ++c)
  for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k)
    for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j)
      {
        const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j];
        data[0] *= weights[shift];
        // loop bound as int avoids compiler warnings in case loop_length
        // == 1 (polynomial degree 0)
        for (int i = 1; i < loop_length - 1; ++i)
          data[i] *= weights[shift + 1];
        data[loop_length - 1] *= weights[shift + 2];
        data += loop_length;
      }
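As an aside for readers: the degree_to_3 map in the loop above classifies each 1D point index as first (0), interior (1), or last (2), so shift = 9*k + 3*j (plus 0, 1, or 2 in the innermost direction) selects one of the 3^dim = 27 weight categories, one per geometric entity (vertex, edge, face, interior) of a hex cell. A minimal standalone illustration of that classification (the helper names here are mine, not deal.II's):

```cpp
#include <cassert>

// Classify a 1D point index on a line of loop_length points:
// first point -> 0, last point -> 2, interior points -> 1.
unsigned int degree_to_3_entry(int i, int loop_length)
{
  if (i == 0)
    return 0;
  if (i == loop_length - 1)
    return 2;
  return 1;
}

// Combine the classifications of the three coordinate directions into
// one of 27 categories, matching shift + innermost index in the loop
// above (base-3 digits: 9*k + 3*j + i).
unsigned int weight_category(int i, int j, int k, int loop_length)
{
  return 9 * degree_to_3_entry(k, loop_length) +
         3 * degree_to_3_entry(j, loop_length) +
         degree_to_3_entry(i, loop_length);
}
```

For example, with loop_length = 5 (degree 2), the corner point (0,0,0) falls in category 0, the cell-interior point (2,2,2) in category 13, and the opposite corner (4,4,4) in category 26.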
kronbichler (Member):

Is this the same code as in the other MG transfer function

weight_dofs_on_child(const VectorizedArray<Number> *weights,
                     const unsigned int             n_components,
                     const unsigned int             fe_degree,
                     VectorizedArray<Number> *      data)
{
  Assert(fe_degree > 0, ExcNotImplemented());
  Assert(fe_degree < 100, ExcNotImplemented());

  const int loop_length = degree != -1 ? 2 * degree + 1 : 2 * fe_degree + 1;

  unsigned int degree_to_3[100];
  degree_to_3[0] = 0;
  for (int i = 1; i < loop_length - 1; ++i)
    degree_to_3[i] = 1;
  degree_to_3[loop_length - 1] = 2;

  for (unsigned int c = 0; c < n_components; ++c)
    for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k)
      for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j)
        {
          const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j];
          data[0] *= weights[shift];
          // loop bound as int avoids compiler warnings in case loop_length
          // == 1 (polynomial degree 0)
          for (int i = 1; i < loop_length - 1; ++i)
            data[i] *= weights[shift + 1];
          data[loop_length - 1] *= weights[shift + 2];
          data += loop_length;
        }
}

I suggest moving the code to a common location and implementing it only once. Since this is a regular tensor-product variant, it might fit in include/deal.II/matrix_free/tensor_product_kernels.h.

peterrum (Member, Author):

@kronbichler I have made the changes!

@@ -2586,6 +2586,36 @@ namespace internal
}


template <int dim, typename Number>
inline void
weight_dofs_on_child(const VectorizedArray<Number> *weights,
kronbichler (Member):

Can we rename this into something more generic, e.g. weight_fe_q_dofs_by_entity? Also, I wonder if you would mind injecting the dimension loop_length via a template argument for the case this is supported, in order to reduce loop overhead?
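A sketch of what the suggestion could look like, combining the rename with a compile-time loop length. This is a hypothetical simplification, not the actual deal.II implementation: the name follows the suggestion above, `-1` marks the run-time fallback, and a plain scalar `Number` stands in for `VectorizedArray<Number>` to keep the sketch self-contained.

```cpp
#include <cassert>

// Hypothetical sketch: when loop_length_static != -1 the compiler sees
// the loop bounds as constants and can unroll; otherwise the bounds are
// derived from fe_degree at run time. Structure mirrors the quoted
// weight_dofs_on_child.
template <int dim, int loop_length_static, typename Number>
inline void
weight_fe_q_dofs_by_entity(const Number *     weights,
                           const unsigned int n_components,
                           const int          fe_degree,
                           Number *           data)
{
  const int loop_length =
    loop_length_static != -1 ? loop_length_static : 2 * fe_degree + 1;

  // Classify each 1D index as first (0), interior (1), last (2).
  unsigned int degree_to_3[100];
  degree_to_3[0] = 0;
  for (int i = 1; i < loop_length - 1; ++i)
    degree_to_3[i] = 1;
  degree_to_3[loop_length - 1] = 2;

  for (unsigned int c = 0; c < n_components; ++c)
    for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k)
      for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j)
        {
          const unsigned int shift =
            9 * degree_to_3[k] + 3 * degree_to_3[j];
          data[0] *= weights[shift];
          for (int i = 1; i < loop_length - 1; ++i)
            data[i] *= weights[shift + 1];
          data[loop_length - 1] *= weights[shift + 2];
          data += loop_length;
        }
}
```

A caller would then dispatch over the known degrees (instantiating, say, `weight_fe_q_dofs_by_entity<dim, 3, Number>` for degree 1) and fall back to `loop_length_static = -1` for other degrees.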

peterrum (Member, Author):

Also, I wonder if you would mind injecting the dimension loop_length via a template argument for the case this is supported, in order to reduce loop overhead?

I would have thought that inline is enough. But I can do the change.

kronbichler (Member):

I am not sure, but my experience is that compilers tend not to inline such functions (with more than ~20 obvious instructions) once they are instantiated for multiple template arguments, because that reduces the machine code size a bit. I won't blame any compiler for doing so (not sure whether that is the case here, though), because the compiler cannot know that different degrees are rarely executed in close temporal proximity, which is what would make the code-size reduction pay off at all.
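For context, the usual way to override such heuristics is a compiler-specific attribute (purely illustrative here; the macro name below is mine, though deal.II provides a similar wrapper for this purpose):

```cpp
#include <cassert>

// Illustrative only: `inline` is merely a hint, and for larger templated
// functions the compiler may still emit an out-of-line instantiation.
// GCC/Clang accept an attribute that forces inlining at every call site.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

template <typename Number>
FORCE_INLINE Number
scale_by_weight(const Number value, const Number weight)
{
  return value * weight;
}
```

Injecting the loop length as a template argument attacks the same problem from the other side: with constant bounds the function body shrinks and unrolls, so the inlining heuristic is more likely to fire on its own.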

peterrum (Member, Author):

Done!

kronbichler (Member):

/rebuild

@peterrum peterrum merged commit acabf31 into dealii:master Dec 22, 2021