Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MatrixFree::find_vector_in_mf for multiple DoFHandler/AffineConstrains #12025

Closed
peterrum opened this issue Apr 8, 2021 · 8 comments
Closed

Comments

@peterrum
Copy link
Member

peterrum commented Apr 8, 2021

The following lines might return the wrong partitioner if MatrixFree has been set up with multiple DoFHandler and AffineConstraint object:

if (
# ifdef DEBUG
check_global_compatibility ?
vec.get_partitioner()->is_globally_compatible(
*matrix_free.get_dof_info(c).vector_partitioner) :
# endif
vec.get_partitioner()->is_compatible(
*matrix_free.get_dof_info(c).vector_partitioner))

with the consequence that different processes use different partitioners with different communication patterns, possibly leading to MPI errors of the type (see also https://github.com/MeltPoolDG/MeltPoolDG/pull/117#issuecomment-811762113):

[] *** An error occurred in MPI_Waitall
[] *** reported by process [2149777409,1]
[] *** on communicator MPI_COMM_WORLD
[] *** MPI_ERR_TRUNCATE: message truncated
[] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[] *** and potentially your MPI job)

By checking more internal fields in Partitioner::is_compatible() this issue could be solved (without the need to perform a global communication):

diff --git a/source/base/partitioner.cc b/source/base/partitioner.cc
index b6639aa..0e58858 100644
--- a/source/base/partitioner.cc
+++ b/source/base/partitioner.cc
@@ -487,8 +487,22 @@ namespace Utilities
         }
 #endif
       return (global_size == part.global_size &&
+              locally_owned_range_data == part.locally_owned_range_data &&
               local_range_data == part.local_range_data &&
-              ghost_indices_data == part.ghost_indices_data);
+              ghost_indices_data == part.ghost_indices_data &&
+              n_ghost_indices_data == part.n_ghost_indices_data &&
+              ghost_targets_data == part.ghost_targets_data &&
+              import_indices_data == part.import_indices_data &&
+              n_import_indices_data == part.n_import_indices_data &&
+              import_targets_data == part.import_targets_data &&
+              import_indices_chunks_by_rank_data ==
+                part.import_indices_chunks_by_rank_data &&
+              n_ghost_indices_in_larger_set ==
+                part.n_ghost_indices_in_larger_set &&
+              ghost_indices_subset_chunks_by_rank_data ==
+                part.ghost_indices_subset_chunks_by_rank_data &&
+              ghost_indices_subset_data == part.ghost_indices_subset_data);
     }

If we should decide to make this change, we probably need to reduce the number of checks. What are alternative approaches to tackle the described problem?

FYI @nmuch @mschreter

@kronbichler
Copy link
Member

I agree with this direction; however, I think one could still construct scenarios where one process has all the same same data locally (ind any of those checks) and only sees a global difference. I think of the case of 3 MPI ranks where there is a difference in the ghost exchange between rank 1 and 2, invisible to rank 0. Rank 0 could then return the value true in the check above, whereas rank 1 and 2 would return false. I agree with the proposed additions (and I can help to narrow the things we need to compare down), but I guess we will need more. The simple fix I initially though of would have been to first run through all partitioners in MatrixFree and check if the shared_ptr of the vector points to the same data as one of the partitioners/DoFHandlers in MatrixFree. That would be the one to pick. Such a mechanism would cover all cases where we do MatrixFree::initialize_dof_vector without further copies of the partitioners, which maybe helps in 90+% of the cases.

The more general way to do things to also cover unusual cases might be to create some hash/sha of the ghost states among all MPI ranks that we store in the partitioner class and compute in setup. For MatrixFree, we would add a mechanism that all Partitioner objects in the various DoFHandler/Constraints objects actually get a different hash if is_global_compatible returns false. I am not sure we need the latter, but I think this could be the way to go.

@peterrum
Copy link
Member Author

Regarding #12033 (review):

That is a good step forward. We might need to eliminate the global communication again if we can, so I think it's good to keep #12025 open for now to see if we can improve the local checks.

Let me quickly explain why I chose the approach I have taken (step 1: check the pointers; step 2: check via is_globally_compatible() - latter is slow and only used to happen in debug mode). My assumption is that users should initialize their vectors via MatrixFree::initialize_dof_vector() (or by using the (shared pointer of the) partitioner returned indirectly there - via Vector::reinit()). This is the approach we encourage users.

If user insists on creating their own partitioners, they will be penalized (by a global communication).

Do I miss instances where users will be penalized unjustifiably?

@peterrum
Copy link
Member Author

peterrum commented May 1, 2021

@kronbichler Would it be tragic if the current approach becomes part of the next release?

@peterrum
Copy link
Member Author

peterrum commented May 8, 2021

@kronbichler This is in a certain sense related to #12012. Do we want to tackle the hashs before the release or is that too dangerous?

@kronbichler
Copy link
Member

I think the hashes are too dangerous, and I think we did supply a good option for the default case. Let us postpone it for now to get it out of the release countdown.

@kronbichler kronbichler modified the milestones: Release 9.3, Release 10.0 May 11, 2021
@kronbichler
Copy link
Member

I think we will not get around to this one for this release, so we need to postpone it.

@kronbichler kronbichler modified the milestones: Release 9.4, Release 9.5 May 31, 2022
@peterrum
Copy link
Member Author

@kronbichler I think we can close this issue. Using MatrixFree::intialize_dof_vector() is in my opinion the right way if someone wants to use MatrixFree; together with @mschreter, we have improved the internal asserts much to make vector accesses within matrix-free context much more safer and much more informative, so that we could actually drop the support of other vectors, since manually setting up partitioners, which are compatible with MatrixFree is many cases hard to do, since MF, e.g., resolved identity constraints, ....

@kronbichler
Copy link
Member

I agree, we have a fairly robust infrastructure in place. Let us close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants