I/O errors when MPI-IO is not available #3396

Closed

maxrudolph opened this issue Feb 27, 2020 · 9 comments

@maxrudolph
Contributor

On the UC Davis cluster 'peloton', our scratch filesystem is provided as an NFS mount. Because of potential issues with:
(1) correctness of output due to file-locking behavior over NFS, discussed here:
open-mpi/ompi#4446
(2) generally poor performance of MPI-IO over NFS,
OpenMPI has been compiled without MPI-IO support on peloton. ASPECT and its dependencies seem to rely heavily on MPI-IO for the output of visualization files and for checkpointing.

I have created a GitHub repository LINKED HERE that builds a Docker container with openmpi-4.0.2 compiled without MPI-IO and that reproduces these issues. Please note that the container takes a long time to build; you may want to build it on a machine with a couple of dozen threads available.

If the number of grouped files in the input file subsection `Postprocess/Visualization` is set to a value >0, there is an MPI error. If the number of grouped files is set to 0, it appears that vtu output can be written successfully.
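For reference, this is roughly what the relevant block of the input file looks like (parameter names quoted from memory; please check them against the ASPECT manual):

```
# Setting the number of grouped files to 0 makes each rank write its own
# .vtu file and avoids the MPI-IO code path; values > 0 trigger the error.
subsection Postprocess
  subsection Visualization
    set Number of grouped files = 0
  end
end
```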

Checkpoints cannot be written successfully, and we encounter an MPI error:


Termination requested by criterion: end step
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped to
avoid a possible deadlock.
---------------------------------------------------------
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped to
avoid a possible deadlock.
---------------------------------------------------------
ERROR: Uncaught exception in MPI_InitFinalize on proc 0. Skipping MPI_Finalize() to avoid a deadlock.


----------------------------------------------------
Exception 'dealii::ExcMPI(ierr)' on rank 0 on processing: 

--------------------------------------------------------
An error occurred in line <1805> of file </opt/dealii-toolchain/tmp/unpack/deal.II-v9.1.0/source/distributed/tria.cc> in function
    void dealii::parallel::distributed::Triangulation<dim, spacedim>::DataTransfer::save(const typename dealii::internal::p4est::types<dim>::forest*, const string&) const [with int dim = 3; int spacedim = 3; typename dealii::internal::p4est::types<dim>::forest = p8est; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_OTHER: known error not in list".
The numerical value of the original error code is 16.
--------------------------------------------------------

Aborting!
----------------------------------------------------
ERROR: Uncaught exception in MPI_InitFinalize on proc 1. Skipping MPI_Finalize() to avoid a deadlock.


----------------------------------------------------
Exception 'dealii::ExcMPI(ierr)' on rank 1 on processing: 

--------------------------------------------------------
An error occurred in line <1805> of file </opt/dealii-toolchain/tmp/unpack/deal.II-v9.1.0/source/distributed/tria.cc> in function
    void dealii::parallel::distributed::Triangulation<dim, spacedim>::DataTransfer::save(const typename dealii::internal::p4est::types<dim>::forest*, const string&) const [with int dim = 3; int spacedim = 3; typename dealii::internal::p4est::types<dim>::forest = p8est; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_OTHER: known error not in list".
The numerical value of the original error code is 16.
--------------------------------------------------------

Aborting!

@tjhei
Member

tjhei commented Feb 28, 2020

The crash is not a bug in p4est; it comes from a routine inside deal.II that requires MPI I/O:
https://github.com/dealii/dealii/blob/47870e28657efe76b1bbf1207bea5d6d634aa534/source/distributed/tria.cc#L1801-L1805

The call to MPI_File_open fails, unsurprisingly. The function is used to store solution vectors. One could work around this, but it would certainly require some extra work.
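For illustration, here is a minimal standalone program showing the kind of call that fails (just a sketch, not deal.II's code): on a build like the one described here, MPI_File_open returns an error code instead of MPI_SUCCESS, and because the default error handler for files is MPI_ERRORS_RETURN the error can be inspected rather than aborting.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // The same kind of collective file open that deal.II uses when saving
  // solution vectors; without MPI-IO support this returns an error code.
  MPI_File  fh;
  const int ierr = MPI_File_open(MPI_COMM_WORLD,
                                 "test_mpi_io.tmp",
                                 MPI_MODE_CREATE | MPI_MODE_WRONLY,
                                 MPI_INFO_NULL,
                                 &fh);
  if (ierr != MPI_SUCCESS)
    {
      char msg[MPI_MAX_ERROR_STRING];
      int  len = 0;
      MPI_Error_string(ierr, msg, &len);
      std::printf("MPI_File_open failed: %s\n", msg);
    }
  else
    MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}
```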

HDF5 serial support is something we can look into. It might be easier to do.

@tjhei
Member

tjhei commented Feb 28, 2020

Regarding HDF5 support:
We require parallel HDF5 if MPI support is enabled in deal.II: https://github.com/dealii/dealii/blob/622f4179497f339d995c33f0c20ca2fdfb52fa96/cmake/configure/configure_hdf5.cmake#L29-L30
Regarding the implementation, it looks like we do not support writing with serial HDF5 when running in parallel: https://github.com/dealii/dealii/blob/73066b8ce7cace339c2676282882ff9745c1164e/source/base/data_out_base.cc#L8052-L8053
I don't see an easy way to write a single .h5 file when running with several MPI ranks and a serial HDF5 library. The workarounds coming to mind are (1) write a single file per rank, though I wonder how useful that is, or (2) send all data to rank 0 and output it there, which won't scale well but is likely to work; see the sketch below.
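Something along these lines for option (2) — a rough sketch, not deal.II's interface; the buffer layout and function name are made up:

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Gather each rank's serialized output on rank 0 with plain MPI calls
// (no MPI-IO) and write a single file there with a C++ stream.
void write_on_rank_0(const std::vector<char> &local_chunk,
                     const std::string       &filename,
                     MPI_Comm                 comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Collect the per-rank chunk sizes so rank 0 can compute displacements.
  const int        local_size = static_cast<int>(local_chunk.size());
  std::vector<int> sizes(size), offsets(size, 0);
  MPI_Gather(&local_size, 1, MPI_INT, sizes.data(), 1, MPI_INT, 0, comm);

  std::vector<char> all_data;
  if (rank == 0)
    {
      for (int r = 1; r < size; ++r)
        offsets[r] = offsets[r - 1] + sizes[r - 1];
      all_data.resize(offsets[size - 1] + sizes[size - 1]);
    }

  // Concatenate everything on rank 0. This does not scale in memory or
  // bandwidth, but it works without MPI-IO.
  MPI_Gatherv(local_chunk.data(), local_size, MPI_CHAR,
              all_data.data(), sizes.data(), offsets.data(), MPI_CHAR,
              0, comm);

  if (rank == 0)
    {
      std::ofstream out(filename, std::ios::binary);
      out.write(all_data.data(), all_data.size());
    }
}
```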

@maxrudolph
Contributor Author

The HDF5 issues are less critical; checkpointing is really essential in order to do any kind of production run. I think it's actually fairly straightforward to implement a workaround in deal.II for the case where the MPI_File commands are unavailable. It's probably easiest to just write a separate file for each MPI process, along the lines of the sketch below. Each process could write its piece to local scratch (if available) or /tmp and then copy it to the output directory in a background thread.
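To make the per-process idea concrete, here is a rough sketch of what I have in mind (a hypothetical helper, not existing deal.II/ASPECT code; the file-name scheme is just an example):

```cpp
#include <mpi.h>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Every rank writes its own piece of the checkpoint with plain C++ streams,
// so no MPI-IO is needed.
void write_per_rank_checkpoint(const std::vector<char> &local_data,
                               const std::string       &base_name,
                               MPI_Comm                 comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  // e.g. "restart.mesh.rank0000", "restart.mesh.rank0001", ...
  char suffix[32];
  std::snprintf(suffix, sizeof(suffix), ".rank%04d", rank);

  std::ofstream out(base_name + suffix, std::ios::binary);
  out.write(local_data.data(), local_data.size());
  out.close();

  // Make sure every rank has finished before anyone declares the checkpoint
  // complete (e.g. before moving files from /tmp to the output directory).
  MPI_Barrier(comm);
}
```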

Regarding the HDF5 issue, I agree that it's not obvious how to deal with this problem, but either of your ideas could work. I don't know enough about XDMF files to know whether it's possible to provide the information to load multiple subdomains at a timestep, similar to what's done with pvtu files. If so, having each rank write a separate file and providing the necessary metadata to load it in the XDMF file seems like the right approach.

I am still not sure whether these are problems worth investing the possibly significant amount of time to solve, though it is important that they be documented in case others encounter them. I share your sentiment that having broken I/O on clusters with NFS storage would exclude a significant group of users, but these problems seem to occur selectively on our cluster because of the decision not to compile MPI-IO support into OpenMPI. I think other sysadmins must be including MPI-IO and hoping for the best, and users are not encountering incorrect output, though it's certainly possible that it could occur. The more pragmatic approach might be for our sysadmin to include MPI-IO support but enable the most conservative (and slow) file-locking behavior by default. On our similar cluster with NFS storage at Portland State, I never had problems, though there we were using the Intel MPI library.

@bangerth
Contributor

bangerth commented Feb 28, 2020 via email

@maxrudolph
Contributor Author

maxrudolph commented Feb 28, 2020 via email

@tjhei
Member

tjhei commented Feb 28, 2020

> I was wondering the same thing. Would it be difficult to write a test for the deal.II cmake step to see whether a simple test program that calls MPI_File_open fails?

This would be the last resort (a different check is likely much faster/easier). I don't know of a way to check it by inspecting mpi.h (I glanced at the spec and could not find anything).
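If we do end up probing at configure time, the test program that a cmake try_run() could compile and execute might be as small as this (illustrative only, not an existing deal.II check; exit code 0 means MPI-IO is usable):

```cpp
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // Try to open (and immediately delete) a scratch file; this is exactly
  // the operation that fails on an MPI build without MPI-IO support.
  MPI_File  fh;
  const int ierr = MPI_File_open(MPI_COMM_SELF,
                                 "mpi_io_probe.tmp",
                                 MPI_MODE_CREATE | MPI_MODE_WRONLY |
                                   MPI_MODE_DELETE_ON_CLOSE,
                                 MPI_INFO_NULL,
                                 &fh);
  if (ierr == MPI_SUCCESS)
    MPI_File_close(&fh);

  MPI_Finalize();
  return (ierr == MPI_SUCCESS) ? 0 : 1;
}
```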

@bangerth
Contributor

bangerth commented Feb 28, 2020 via email

@maxrudolph
Contributor Author

maxrudolph commented Feb 28, 2020 via email

@tjhei
Member

tjhei commented Jun 28, 2021

We currently have no plans to make checkpointing work without MPI-IO.

@tjhei tjhei closed this as completed Jun 28, 2021