
Improve performance of setup of MGTwoLevelTransfer/MGTransferGlobalCoarsening #14968

Open
kronbichler opened this issue Mar 24, 2023 · 9 comments
@kronbichler
Member

In two projects already, I observed that MGTwoLevelTransfer::reinit() takes a long time for polynomial transfer. More precisely, I observed that the case

if (do_polynomial_transfer)
  internal::MGTwoLevelTransferImplementation::reinit_polynomial_transfer(
    dof_handler_fine,
    dof_handler_coarse,
    constraints_fine,
    constraints_coarse,
    mg_level_fine,
    mg_level_coarse,
    *this);
with uniform polynomial degree takes much longer to set up than the geometric transfer, which is surprising given the simplicity of the case with the same mesh. Obviously, the code is written for the general case, but there might be room for some improvements. Maybe we should also add a benchmark similar to https://github.com/dealii/dealii/blob/master/tests/performance/timing_step_37.cc, but using some combination of adaptive meshes, global coarsening, and p-multigrid to ensure that these steps are done reasonably well.

@kronbichler
Member Author

I have a benchmark for matrix-free, global-coarsening multigrid, p-multigrid, and mesh adaptivity that is 95% ready. I use an affine mesh because that has a higher arithmetic intensity and thus more easily exposes bad data structures in our code. I will post it in the coming days, unless someone else feels we should do something entirely different.

@kronbichler
Member Author

Let me give some numbers from a 2D version of my benchmark; the 3D case is planned, of course:

  • Refining the grid adaptively takes 600m instructions
  • Setting up geometric_coarsening_sequence takes 4.9b instructions
  • DoFHandler::distribute_dofs takes 400m instructions, showing the optimizations done in the past
  • Matrix-free setup + renumbering (float 2x, double) is around 3b instructions, so pretty good given the work we do there
  • MGTwoLevelTransfer::reinit_polynomial_transfer takes 3.8b instructions
  • MGTwoLevelTransfer::reinit_geometric_transfer takes 1.8b instructions
  • Solving two times takes 1.1b instructions

The cost of the setup, compared to the solve cost, shows there is something to be done.

@peterrum
Member

Setting up geometric_coarsening_sequence takes 4.9b instructions

Which version are you using?

template <int dim, int spacedim>
std::vector<std::shared_ptr<const Triangulation<dim, spacedim>>>
create_geometric_coarsening_sequence(
  const Triangulation<dim, spacedim> &fine_triangulation_in)

or

template <int dim, int spacedim>
std::vector<std::shared_ptr<const Triangulation<dim, spacedim>>>
create_geometric_coarsening_sequence(
  Triangulation<dim, spacedim>                         &fine_triangulation_in,
  const RepartitioningPolicyTools::Base<dim, spacedim> &policy,
  const bool                                            keep_fine_triangulation,
  const bool                                            repartition_fine_triangulation)

If the latter, which policy are you using?

MGTwoLevelTransfer::reinit_polynomial_transfer takes 3.8b instructions
MGTwoLevelTransfer::reinit_geometric_transfer takes 1.8b instructions

I don't see an obvious reason why it is so expensive. The element prolongation matrices are only set up once, and there are a few loops over cells to collect data like indices or constraints. I guess you use FE_Q? Which degree? Do you first coarsen in p and after that in h?

@kronbichler
Member Author

@peterrum have a look at #14981. The numbers you see here are for a 2D version of the code, but they are essentially the same otherwise. (I did in fact use the first variant of create_geometric_coarsening_sequence before and switched to the second one later, but in 3D they are not that far apart.)

@peterrum
Member

peterrum commented Aug 1, 2023

Let me post some results here:

PR #15794 allows running local smoothing with MGTransferMF (aka MGTransferGlobalCoarsening), and PR #15807 specializes the setup (for the first-child policy and for p-multigrid without repartitioning).

The left column contains the current times of timing_step_37 with MGTransferMatrixFree and with MGTransferMF, and of timing_mg_glob_coarsen. The right column contains the old times.

[Screenshot from 2023-08-01 15-35-14: timing comparison table]

Exploiting the knowledge of having the first-child policy, or of having meshes that are not repartitioned, allows reducing the setup costs by approx. 35%.

MGTransferMatrixFree and MGTransferMF have similar costs, but the latter still has higher setup costs.

@peterrum peterrum changed the title reinit_polynomial_transfer is very slow Improve performance of setup of MGTwoLevelTransfer/MGTransferGlobalCoarsening Aug 5, 2023
@peterrum
Member

peterrum commented Aug 5, 2023

When timing timing_mg_glob_coarsen with #15807, I noticed that changing

coarse_triangulations =
  MGTransferGlobalCoarseningTools::create_geometric_coarsening_sequence(
    triangulation /*,
    RepartitioningPolicyTools::MinimalGranularityPolicy<dim>(16)*/);

to RepartitioningPolicyTools::FirstChildPolicy<dim>(triangulation) allows cutting the setup costs by 50%. Shall we adopt this change in the performance test as well?

@kronbichler
Member Author

Thanks for the timings, this helps to understand the variations.

[…] RepartitioningPolicyTools::FirstChildPolicy<dim>(triangulation) allows to cut down the setup costs by 50%. Shall we adopt this change also in the performance test?

In my opinion, the performance tests do not necessarily need to cover the best possible way to run a problem, and stressing an additional component can make sense. Therefore, I do not think it is necessary to change this part of the benchmark; I would rather see repartitioning and creating a new mesh as part of it.

On a related note, I would think it should be possible to make this manipulation of the grid much cheaper than the operations working with the DoFs, since we use higher-order polynomials and the grid does not. This does not need to happen now, but I think it is good to have it on the radar.

If we have good reasons, we can of course also think of a different setup that we consider even more representative of real workloads. I think repartitioning is useful in real cases.

@peterrum
Member

peterrum commented Aug 7, 2023

On a related note, I would think that it should be possible to make this manipulation of the grid much cheaper than operations working with the dofs, since we use higher order polynomials and the grid should not. This does not need to happen now, but I think it is good to have it on the radar.

I don't understand what you mean. We work on the cell level, but need to gather the DoFs after we have figured out which cells are relevant.

@kronbichler
Copy link
Member Author

kronbichler commented Aug 7, 2023

I was not super precise, but thought it was obvious from the context: the test case uses polynomial coarsening, so the grid manipulation and repartitioning act on a low-order finite element space (degree 2) and should thus be much cheaper than the work on the active DoFs (degree 4) inside the solver.
