Distributed triangulation: assertion when distributing DoFs #16272

Closed
mathsen opened this issue Nov 16, 2023 · 7 comments · Fixed by #16287

@mathsen (Contributor) commented Nov 16, 2023:

Hello,

I have a problem with a parallel distributed triangulation in a 3d scenario that occurs when distributing the degrees of freedom, using at least three levels of refinement and at least 4 MPI processes.
The problem depends on the mesh: it occurs with the torus mesh of the GridGenerator class.
I tested this on deal.II 9.5.1 and today's master branch. Here is a minimal working example:
distributing_crash.zip

If I run it with at least -np 4, I get the following assertion:

$ mpirun -np 4 ./distributing_crash
Running on 4 MPI rank(s)...
Number of locally owned active cells: 3840
Number of globally active cells:      15360
Distributing dofs... 

--------------------------------------------------------
An error occurred in line <3556> of file <~/Downloads/dealii_dev/source/dofs/dof_handler_policy.cc> in function
    dealii::internal::DoFHandlerImplementation::Policy::{anonymous}::communicate_dof_indices_on_marked_cells<3, 3>(const dealii::DoFHandler<3, 3>&, std::vector<bool>&)::<lambda(const auto:118&, const auto:119&)>::<lambda(auto:120&, auto:121)> [with auto:120 = unsigned int; auto:121 = unsigned int*]
The violated condition was: 
    (stored_index == invalid_dof_index) || (stored_index == *received_index)
Additional information: 
    This exception -- which is used in many places in the library --
    usually indicates that some condition which the author of the code
    thought must be satisfied at a certain point in an algorithm, is not
    fulfilled. An example would be that the first part of an algorithm
    sorts elements of an array in ascending order, and a second part of
    the algorithm later encounters an element that is not larger than the
    previous one.
    
    There is usually not very much you can do if you encounter such an
    exception since it indicates an error in deal.II, not in your own
    program. Try to come up with the smallest possible program that still
    demonstrates the error and contact the deal.II mailing lists with it
    to obtain help.

Stacktrace:
-----------
#0  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#1  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#2  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#3  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#4  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#5  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#6  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: std::function<void (dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&)>::operator()(dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&) const
#7  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: void dealii::GridTools::internal::exchange_cell_data<std::vector<unsigned int, std::allocator<unsigned int> >, dealii::DoFHandler<3, 3>, dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > >(dealii::DoFHandler<3, 3> const&, std::function<std::optional<std::vector<unsigned int, std::allocator<unsigned int> > > (dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&)> const&, std::function<void (dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&)> const&, std::function<bool (dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&)> const&, std::function<void (std::function<void (dealii::TriaActiveIterator<dealii::DoFCellAccessor<3, 3, false> > const&, unsigned int)> const&)> const&, std::function<std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > (dealii::parallel::TriangulationBase<dealii::DoFHandler<3, 3>::dimension, dealii::DoFHandler<3, 3>::space_dimension> const&)> const&)
#8  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: void dealii::GridTools::exchange_cell_data_to_ghosts<std::vector<unsigned int, std::allocator<unsigned int> >, dealii::DoFHandler<3, 3> >(dealii::DoFHandler<3, 3> const&, std::function<std::optional<std::vector<unsigned int, std::allocator<unsigned int> > > (dealii::DoFHandler<3, 3>::active_cell_iterator const&)> const&, std::function<void (dealii::DoFHandler<3, 3>::active_cell_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> > const&)> const&, std::function<bool (dealii::DoFHandler<3, 3>::active_cell_iterator const&)> const&)
#9  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: 
#10  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: dealii::internal::DoFHandlerImplementation::Policy::ParallelDistributed<3, 3>::distribute_dofs() const
#11  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: dealii::DoFHandler<3, 3>::distribute_dofs(dealii::hp::FECollection<3, 3> const&)
#12  ~/Downloads/dealii_dev/build/lib/libdeal_II.g.so.9.6.0-pre: dealii::DoFHandler<3, 3>::distribute_dofs(dealii::FiniteElement<3, 3> const&)
#13  ./distributing_crash: Problem_DGP::Demonstration<3>::run()
#14  ./distributing_crash: main

Calling MPI_Abort now.
To break execution in a GDB session, execute 'break MPI_Abort' before running. You can also put the following into your ~/.gdbinit:
  set breakpoint pending on
  break MPI_Abort
  set breakpoint pending auto

I attached a debugger and inspected some variables when the assertion is thrown (it is thrown on process 0):

stored_index = 77463
*received_index = 143823
invalid_dof_index = 4294967295

cell->index() = 4461

cell->center(false, false) = {2.1532106405941569, -0.3266047428319967, 1.0183927880754229}

The minimal working example runs fine in release mode. Nevertheless, in my production code the program crashes later on when creating the sparsity pattern, which I think is a consequence of a problem when distributing the DoFs:

TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped
to avoid a possible deadlock.
---------------------------------------------------------

********************************************************************************

An EXCEPTION occured: Please READ the following output CAREFULLY!

--------------------------------------------------------
An error occurred in line <1294> of file <~/Downloads/dealii_dev/include/deal.II/lac/trilinos_sparsity_pattern.h> in function
    void dealii::TrilinosWrappers::SparsityPattern::add_entries(size_type, ForwardIterator, ForwardIterator, bool) [with ForwardIterator = const unsigned int*; size_type = unsigned int]
The violated condition was:
    ierr >= 0
Additional information:
    An error with error number -2 occurred while calling a Trilinos
    function

Stacktrace:
-----------
#0  ~/Downloads/dealii_dev/build/lib/libdeal_II.so.9.6.0-pre:
#1  ~/Downloads/dealii_dev/build/lib/libdeal_II.so.9.6.0-pre: dealii::AffineConstraints<double>::add_entries_local_to_global(std::vector<unsigned int, std::allocator<unsigned int> > const&, dealii::AffineConstraints<double> const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, dealii::SparsityPatternBase&, bool, dealii::Table<2, bool> const&) const
#2  ~/Downloads/dealii_dev/build/lib/libdeal_II.so.9.6.0-pre: dealii::AffineConstraints<double>::add_entries_local_to_global(std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, dealii::SparsityPatternBase&, bool, dealii::Table<2, bool> const&) const
#3  ~/Downloads/dealii_dev/build/lib/libdeal_II.so.9.6.0-pre: void dealii::DoFTools::make_flux_sparsity_pattern<3, 3, double>(dealii::DoFHandler<3, 3> const&, dealii::SparsityPatternBase&, dealii::AffineConstraints<double> const&, bool, unsigned int)
--------------------------------------------------------

APPLICATION TERMINATED unexpectedly due to an exception.

Other meshes (e.g. a channel with a cylinder) run fine, even with finer meshes on more processes. It is just that I need a torus-like mesh now.

Does anybody have an idea what is going wrong here, or can help me with this?
Greetings
Mathias

@drwells drwells added the Bug label Nov 17, 2023
@bangerth bangerth added this to the Release 9.6 milestone Nov 17, 2023
@peterrum (Member) commented:

Hi Mathias. This is an annoying bug and not related to the DoFHandler. The exception is triggered in the DoFHandler for the following reason: the DoFHandler enumerates DoFs in parallel on each process. When DoFs are shared between processes, only one process enumerates them (based on the rank of the (ghost) cell). However, in the given mesh certain cells are, for some reason, not identified as ghost cells although they should be, with the result that DoFs are enumerated by multiple processes, which should not happen.

I can reproduce the issue by only creating the mesh (with 4 processes) and checking, based on the vertices, whether cells should be ghost cells or not:

#include <deal.II/distributed/tria.h>

#include <deal.II/grid/grid_generator.h>

using namespace dealii;

int
main(int argc, char *argv[])
{
  Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, 1);

  // create mesh
  const int dim = 3;

  parallel::distributed::Triangulation<dim> triangulation(
    MPI_COMM_WORLD,
    typename Triangulation<dim>::MeshSmoothing(
      Triangulation<dim>::smoothing_on_refinement |
      Triangulation<dim>::smoothing_on_coarsening));

  const double       R                = 2.;
  const double       r                = 1.;
  const unsigned int n_cells_toroidal = 6;
  const double       phi              = 0.5 * dealii::numbers::PI;
  dealii::GridGenerator::torus<dim, dim>(
    triangulation, R, r, n_cells_toroidal, phi);
  triangulation.refine_global(3);

  // determine vertices connected to locally owned cells
  std::vector<bool> mask(triangulation.n_vertices(), false);

  for (const auto &cell : triangulation.active_cell_iterators())
    if (cell->is_locally_owned())
      for (const auto v : cell->vertex_indices())
        mask[cell->vertex_index(v)] = true;

  // determine problematic cells: cells that are not ghost cells but are
  // connected to vertices belonging to locally owned cells
  std::vector<typename Triangulation<dim>::cell_iterator> problematic_cells;

  for (const auto &cell : triangulation.active_cell_iterators())
    if (cell->is_locally_owned() == false)
      {
        bool is_ghost = false;
        for (const auto v : cell->vertex_indices())
          is_ghost |= mask[cell->vertex_index(v)];

        if (is_ghost != cell->is_ghost())
          problematic_cells.emplace_back(cell);
      }

  // print problematic cells
  for (const auto &cell : problematic_cells)
    {
      // check if locally owned cells are face neighbors
      bool face_neighbour = false;

      for (const auto f : cell->face_indices())
        if (cell->face(f)->at_boundary() == false)
          face_neighbour |= cell->neighbor(f)->is_locally_owned();

      std::cout << cell->id() << " " << int(face_neighbour) << std::endl;
    }

  return 0;
}

I don't know what the root cause of this issue is. @tjhei Do you have an idea? Is the problem in p:d:T or in p4est!?

For now, you could try out parallel::fullydistributed::Triangulation, which you can create from a serial triangulation. See https://www.dealii.org/developer/doxygen/deal.II/namespaceTriangulationDescription_1_1Utilities.html.
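As a minimal sketch of that workaround (my editorial addition, not from the thread; it reuses the torus parameters from the reproducer above, and the exact signatures of create_description_from_triangulation() and partition_triangulation() should be checked against the linked documentation):

```cpp
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/fully_distributed_tria.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/grid_tools.h>
#include <deal.II/grid/tria.h>
#include <deal.II/grid/tria_description.h>

using namespace dealii;

int main(int argc, char *argv[])
{
  Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, 1);
  const int dim = 3;

  // build the torus as a plain serial triangulation on every process
  Triangulation<dim> serial_tria;
  GridGenerator::torus<dim, dim>(serial_tria, 2., 1., 6, 0.5 * numbers::PI);
  serial_tria.refine_global(3);

  // assign subdomain ids so the description knows the partitioning
  GridTools::partition_triangulation(
    Utilities::MPI::n_mpi_processes(MPI_COMM_WORLD), serial_tria);

  // convert to a description and hand it to p:f:T (no p4est involved)
  const auto description =
    TriangulationDescription::Utilities::create_description_from_triangulation(
      serial_tria, MPI_COMM_WORLD);

  parallel::fullydistributed::Triangulation<dim> tria(MPI_COMM_WORLD);
  tria.create_triangulation(description);
}
```

Note that this sketch builds the full serial mesh on every rank, which is fine for a mesh of this size but does not scale to very large coarse grids.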

I find this bug somewhat worrying, since I have used the torus geometry in the past (with DG; https://github.com/hyperdeal/hyperdeal/blob/3640c397c9a2bd78c80325ee97dc526319197087/examples/vlasov_poisson/cases/torus_hyperball.h#L111).

@kronbichler (Member) commented:

I did not check closely, but for me the test case by @peterrum runs into an assertion in p4est:

#0  0x00007fffe73ed2d0 in PMPI_Abort () from /lib/x86_64-linux-gnu/libmpi.so.40
#1  0x00007fffdd30e28a in sc_abort_handler () at /home/kronbichler/sw/p4est/p4est-2.8/sc/src/sc.c:1029
#2  0x00007fffdd30e06d in sc_abort () at /home/kronbichler/sw/p4est/p4est-2.8/sc/src/sc.c:981
#3  0x00007fffdd30e32d in sc_abort_verbose (
    filename=0x7fffdab81bd0 "/home/kronbichler/sw/p4est/p4est-2.8/src/p8est_connectivity.c", lineno=1559, 
    msg=0x7fffdab85388 "Expected '(int) edge_trees - (int) ta->elem_count': 'p8est_find_edge_transform_internal (conn, itree, iedge, ei, conn->edge_to_tree + ettae, conn->edge_to_edge + ettae, edge_trees)'")
    at /home/kronbichler/sw/p4est/p4est-2.8/sc/src/sc.c:1039
#4  0x00007fffdaad8bf0 in p8est_find_edge_transform (conn=0x5555555f0810, itree=0, iedge=5, ei=0x7fffffffbfd0)
    at /home/kronbichler/sw/p4est/p4est-2.8/src/p8est_connectivity.c:1559
#5  0x00007fffdaaccaed in p8est_connectivity_is_valid (conn=0x5555555f0810)
    at /home/kronbichler/sw/p4est/p4est-2.8/src/p4est_connectivity.c:599
#6  0x00007ffff70e1315 in dealii::parallel::distributed::Triangulation<3, 3>::copy_new_triangulation_to_p4est (
    this=this@entry=0x7fffffffd720) at /home/martin/deal/deal.II/source/distributed/tria.cc:2502
#7  0x00007ffff70fa6f3 in dealii::parallel::distributed::Triangulation<3, 3>::create_triangulation (this=0x7fffffffd720, 
    vertices=..., cells=..., subcelldata=...) at /home/martin/deal/deal.II/source/distributed/tria.cc:1779
#8  0x00007ffff64ad170 in dealii::GridGenerator::torus<3, 3> (tria=..., 
    R=<error reading variable: That operation is not available on integers of more than 8 bytes.>, 
    r=<error reading variable: That operation is not available on integers of more than 8 bytes.>, 
    n_cells_toroidal=n_cells_toroidal@entry=6, phi=<optimized out>) at /home/martin/deal/deal.II/source/grid/grid_generator.cc:2090

This would indicate that we either give an invalid mesh to p4est, or that p4est fails at connecting meshes and building the connectivity. Can you confirm?

@peterrum (Member) commented:

> This would indicate that we either give an invalid mesh to p4est, or that p4est fails at connecting meshes and building the connectivity. Can you confirm?

When compiling deal.II with the debug version of p4est, the program just crashes on my side without throwing an exception (also in serial).

The stack trace indicates that the issue already occurs during the creation of the coarse grid (also in serial), which is a good step 👍

@kronbichler (Member) commented:

If I see things correctly, p4est has problems with the orientation of the grid. If I orient the 2D base grid of the torus similarly to the 2D circle, I don't see the error. Can you please try this patch:

diff --git a/source/grid/grid_generator.cc b/source/grid/grid_generator.cc
index c4c33b3a1f..999f65c127 100644
--- a/source/grid/grid_generator.cc
+++ b/source/grid/grid_generator.cc
@@ -2002,7 +2002,7 @@ namespace GridGenerator
            ExcMessage("Outer radius R must be greater than the inner "
                       "radius r."));
     Assert(r > 0.0, ExcMessage("The inner radius r must be positive."));
-    Assert(n_cells_toroidal > 2,
+    Assert(n_cells_toroidal > static_cast<unsigned int>(phi / numbers::PI),
            ExcMessage("Number of cells in toroidal direction has "
                       "to be at least 3."));
     AssertThrow(phi > 0.0 && phi < 2.0 * numbers::PI + 1.0e-15,
@@ -2063,20 +2063,20 @@ namespace GridGenerator
             cells[5 * c + 1].vertices[2 + j * 4] = offset + 4;
             cells[5 * c + 1].vertices[3 + j * 4] = offset + 5;
             // cell 2 in x-y-plane
-            cells[5 * c + 2].vertices[0 + j * 4] = offset + 4;
-            cells[5 * c + 2].vertices[1 + j * 4] = offset + 5;
-            cells[5 * c + 2].vertices[2 + j * 4] = offset + 6;
-            cells[5 * c + 2].vertices[3 + j * 4] = offset + 7;
+            cells[5 * c + 2].vertices[0 + j * 4] = offset + 6;
+            cells[5 * c + 2].vertices[1 + j * 4] = offset + 4;
+            cells[5 * c + 2].vertices[2 + j * 4] = offset + 7;
+            cells[5 * c + 2].vertices[3 + j * 4] = offset + 5;
             // cell 3 in x-y-plane
             cells[5 * c + 3].vertices[0 + j * 4] = offset + 0;
             cells[5 * c + 3].vertices[1 + j * 4] = offset + 2;
             cells[5 * c + 3].vertices[2 + j * 4] = offset + 6;
             cells[5 * c + 3].vertices[3 + j * 4] = offset + 4;
             // cell 4 in x-y-plane
-            cells[5 * c + 4].vertices[0 + j * 4] = offset + 3;
-            cells[5 * c + 4].vertices[1 + j * 4] = offset + 1;
-            cells[5 * c + 4].vertices[2 + j * 4] = offset + 5;
-            cells[5 * c + 4].vertices[3 + j * 4] = offset + 7;
+            cells[5 * c + 4].vertices[0 + j * 4] = offset + 1;
+            cells[5 * c + 4].vertices[1 + j * 4] = offset + 7;
+            cells[5 * c + 4].vertices[2 + j * 4] = offset + 3;
+            cells[5 * c + 4].vertices[3 + j * 4] = offset + 5;
           }
 
         cells[5 * c].material_id = 0;

@kronbichler (Member) commented:

Note that the vertex numbers come from here:

const int cell_vertices[5][4] = {
  {0, 1, 2, 3}, {0, 2, 6, 4}, {2, 3, 4, 5}, {1, 7, 3, 5}, {6, 4, 7, 5}};

@mathsen (Contributor, Author) commented Nov 22, 2023:

@kronbichler Great, I can confirm that the problem is fixed with your patch: no assertion is thrown when distributing DoFs (and creating the sparsity pattern later on also works).
Before applying the patch I tried a run with the master branch of p4est configured with "--enable-debug" and could also reproduce the crash when calling create_triangulation() (even with -np 1). So there really seems to be an issue within p4est?

@kronbichler (Member) commented:

I am not sure whether it is an issue with p4est, or whether p4est expects the mesh in the "right" form, which would be the version after my change. We would need to look up the p4est documentation to see what its expectations are regarding the mesh and its orientation. Either way, I will commit the change and add the test case you suggested, so that we have a more robust code base for now.
