I/O errors when MPI-IO is not available #3396

Closed

maxrudolph opened this issue Feb 27, 2020 · 9 comments

@maxrudolph
Contributor

On the UC Davis cluster 'peloton', our scratch filesystem is provided as an NFS mount. Because of potential issues with:
(1) correctness of output due to file-locking behavior over NFS, discussed here:
open-mpi/ompi#4446
(2) generally poor performance of MPI-IO over NFS,
OpenMPI has been compiled without MPI-IO support on peloton. ASPECT and its dependencies seem to rely heavily on MPI-IO for the output of visualization files and for checkpointing.

I have created a GitHub repository LINKED HERE that builds a Docker container with openmpi-4.0.2 compiled without MPI-IO and that reproduces these issues. Please note that the container takes a long time to build; you may want to build it on a machine with a couple of dozen threads available.

If the number of grouped files in the input file subsection `Postprocess/Visualization` is set to a value >0, there is an MPI error. If the number of grouped files is set to 0, it appears that vtu output can be written successfully.
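For reference, this is roughly what the relevant block of the input file looks like (parameter names quoted from memory; please check them against the ASPECT manual):

```
# Setting the number of grouped files to 0 makes each rank write its own
# .vtu file and avoids the MPI-IO code path; values > 0 trigger the error.
subsection Postprocess
  subsection Visualization
    set Number of grouped files = 0
  end
end
```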

Checkpoints cannot be written successfully, and we encounter an MPI error:


Termination requested by criterion: end step
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped to
avoid a possible deadlock.
---------------------------------------------------------
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped to
avoid a possible deadlock.
---------------------------------------------------------
ERROR: Uncaught exception in MPI_InitFinalize on proc 0. Skipping MPI_Finalize() to avoid a deadlock.


----------------------------------------------------
Exception 'dealii::ExcMPI(ierr)' on rank 0 on processing: 

--------------------------------------------------------
An error occurred in line <1805> of file </opt/dealii-toolchain/tmp/unpack/deal.II-v9.1.0/source/distributed/tria.cc> in function
    void dealii::parallel::distributed::Triangulation<dim, spacedim>::DataTransfer::save(const typename dealii::internal::p4est::types<dim>::forest*, const string&) const [with int dim = 3; int spacedim = 3; typename dealii::internal::p4est::types<dim>::forest = p8est; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_OTHER: known error not in list".
The numerical value of the original error code is 16.
--------------------------------------------------------

Aborting!
----------------------------------------------------
ERROR: Uncaught exception in MPI_InitFinalize on proc 1. Skipping MPI_Finalize() to avoid a deadlock.


----------------------------------------------------
Exception 'dealii::ExcMPI(ierr)' on rank 1 on processing: 

--------------------------------------------------------
An error occurred in line <1805> of file </opt/dealii-toolchain/tmp/unpack/deal.II-v9.1.0/source/distributed/tria.cc> in function
    void dealii::parallel::distributed::Triangulation<dim, spacedim>::DataTransfer::save(const typename dealii::internal::p4est::types<dim>::forest*, const string&) const [with int dim = 3; int spacedim = 3; typename dealii::internal::p4est::types<dim>::forest = p8est; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_OTHER: known error not in list".
The numerical value of the original error code is 16.
--------------------------------------------------------

Aborting!

@tjhei
Member

tjhei commented Feb 28, 2020

The crash is not a bug in p4est; it comes from a routine inside deal.II that requires MPI I/O:
https://github.com/dealii/dealii/blob/47870e28657efe76b1bbf1207bea5d6d634aa534/source/distributed/tria.cc#L1801-L1805

The call to MPI_File_open fails, unsurprisingly. The function is used to store solution vectors. One could work around this, but it would certainly require some extra work.
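For illustration, here is a minimal standalone program showing the kind of call that fails (just a sketch, not deal.II's code): on a build like the one described here, MPI_File_open returns an error code instead of MPI_SUCCESS, and because the default error handler for files is MPI_ERRORS_RETURN the error can be inspected rather than aborting.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // The same kind of collective file open that deal.II uses when saving
  // solution vectors; without MPI-IO support this returns an error code.
  MPI_File  fh;
  const int ierr = MPI_File_open(MPI_COMM_WORLD,
                                 "test_mpi_io.tmp",
                                 MPI_MODE_CREATE | MPI_MODE_WRONLY,
                                 MPI_INFO_NULL,
                                 &fh);
  if (ierr != MPI_SUCCESS)
    {
      char msg[MPI_MAX_ERROR_STRING];
      int  len = 0;
      MPI_Error_string(ierr, msg, &len);
      std::printf("MPI_File_open failed: %s\n", msg);
    }
  else
    MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}
```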

HDF5 serial support is something we can look into. It might be easier to do.

@tjhei
Member

tjhei commented Feb 28, 2020

Regarding HDF5 support:
We require parallel HDF5 if MPI support is enabled in deal.II: https://github.com/dealii/dealii/blob/622f4179497f339d995c33f0c20ca2fdfb52fa96/cmake/configure/configure_hdf5.cmake#L29-L30
Regarding the implementation, it looks like we do not support writing with serial HDF5 when running in parallel: https://github.com/dealii/dealii/blob/73066b8ce7cace339c2676282882ff9745c1164e/source/base/data_out_base.cc#L8052-L8053
I don't see an easy way to write a single .h5 file when running with several MPI ranks and a serial HDF5 library. The workarounds coming to mind are (1) write a single file per rank, though I wonder how useful that is, or (2) send all data to rank 0 and output it there, which won't scale well but is likely to work; see the sketch below.
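Something along these lines for option (2) — a rough sketch, not deal.II's interface; the buffer layout and function name are made up:

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Gather each rank's serialized output on rank 0 with plain MPI calls
// (no MPI-IO) and write a single file there with a C++ stream.
void write_on_rank_0(const std::vector<char> &local_chunk,
                     const std::string       &filename,
                     MPI_Comm                 comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Collect the per-rank chunk sizes so rank 0 can compute displacements.
  const int        local_size = static_cast<int>(local_chunk.size());
  std::vector<int> sizes(size), offsets(size, 0);
  MPI_Gather(&local_size, 1, MPI_INT, sizes.data(), 1, MPI_INT, 0, comm);

  std::vector<char> all_data;
  if (rank == 0)
    {
      for (int r = 1; r < size; ++r)
        offsets[r] = offsets[r - 1] + sizes[r - 1];
      all_data.resize(offsets[size - 1] + sizes[size - 1]);
    }

  // Concatenate everything on rank 0. This does not scale in memory or
  // bandwidth, but it works without MPI-IO.
  MPI_Gatherv(local_chunk.data(), local_size, MPI_CHAR,
              all_data.data(), sizes.data(), offsets.data(), MPI_CHAR,
              0, comm);

  if (rank == 0)
    {
      std::ofstream out(filename, std::ios::binary);
      out.write(all_data.data(), all_data.size());
    }
}
```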

@maxrudolph
Contributor Author

The HDF5 issues are less critical; checkpointing is really essential in order to do any kind of production run. I think it's actually fairly straightforward to implement a workaround in deal.II for the case where the MPI_File commands are unavailable. It's probably easiest to just write a separate file for each MPI process, along the lines of the sketch below. Each process could write its piece to local scratch (if available) or /tmp and then copy it to the output directory in a background thread.
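To make the per-process idea concrete, here is a rough sketch of what I have in mind (a hypothetical helper, not existing deal.II/ASPECT code; the file-name scheme is just an example):

```cpp
#include <mpi.h>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Every rank writes its own piece of the checkpoint with plain C++ streams,
// so no MPI-IO is needed.
void write_per_rank_checkpoint(const std::vector<char> &local_data,
                               const std::string       &base_name,
                               MPI_Comm                 comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  // e.g. "restart.mesh.rank0000", "restart.mesh.rank0001", ...
  char suffix[32];
  std::snprintf(suffix, sizeof(suffix), ".rank%04d", rank);

  std::ofstream out(base_name + suffix, std::ios::binary);
  out.write(local_data.data(), local_data.size());
  out.close();

  // Make sure every rank has finished before anyone declares the checkpoint
  // complete (e.g. before moving files from /tmp to the output directory).
  MPI_Barrier(comm);
}
```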

Regarding the HDF5 issue, I agree that it's not obvious how to deal with this problem, but either of your ideas could work. I don't know enough about XDMF files to know whether it's possible to provide the information to load multiple subdomains at a timestep, similar to what's done with pvtu files. If so, having each rank write a separate file and providing the necessary metadata to load it in the XDMF file seems like the right approach.

I am still not sure whether these are problems worth investing the possibly significant amount of time to solve, though it is important that they be documented in case others encounter them. I share your sentiment that having broken I/O on clusters with NFS storage would exclude a significant group of users, but these problems seem to occur selectively on our cluster because of the decision not to compile MPI-IO support into OpenMPI. I think other sysadmins must be including MPI-IO and hoping for the best, and users are not encountering incorrect output, though it's certainly possible that it could occur. The more pragmatic approach might be for our sysadmin to include MPI-IO support but enable the most conservative (and slow) file-locking behavior by default. On our similar cluster with NFS storage at Portland State, I never had problems, though there we were using the Intel MPI library.

@bangerth
Contributor

bangerth commented Feb 28, 2020 via email

@maxrudolph
Contributor Author

maxrudolph commented Feb 28, 2020 via email

@tjhei
Member

tjhei commented Feb 28, 2020

> I was wondering the same thing. Would it be difficult to write a test for the deal.II cmake step to see whether a simple test program that calls MPI_File_open fails?

This would be the last resort (a different check is likely much faster/easier). I don't know of a way to check it by inspecting mpi.h (I glanced at the spec and could not find anything).
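If we do end up probing at configure time, the test program that a cmake try_run() could compile and execute might be as small as this (illustrative only, not an existing deal.II check; exit code 0 means MPI-IO is usable):

```cpp
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // Try to open (and immediately delete) a scratch file; this is exactly
  // the operation that fails on an MPI build without MPI-IO support.
  MPI_File  fh;
  const int ierr = MPI_File_open(MPI_COMM_SELF,
                                 "mpi_io_probe.tmp",
                                 MPI_MODE_CREATE | MPI_MODE_WRONLY |
                                   MPI_MODE_DELETE_ON_CLOSE,
                                 MPI_INFO_NULL,
                                 &fh);
  if (ierr == MPI_SUCCESS)
    MPI_File_close(&fh);

  MPI_Finalize();
  return (ierr == MPI_SUCCESS) ? 0 : 1;
}
```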

@bangerth
Contributor

bangerth commented Feb 28, 2020 via email

@maxrudolph
Contributor Author

maxrudolph commented Feb 28, 2020 via email

@tjhei
Member

tjhei commented Jun 28, 2021

We currently have no plans to make checkpointing work without MPI-IO.

@tjhei tjhei closed this as completed Jun 28, 2021