Bug: Triangulation::load()/save() with large output files #12752
Comments
The code snippets you mentioned were already there in version 9.2.0, but at a different spot. Using the [...]
dealii/source/distributed/tria.cc
Lines 1989 to 1990 in b42193f
The entire [...] Anyway, I agree that we should use a different type here.
At least the first of these code snippets appears correct to me since it pertains to per-cell data. I agree that everything that invokes [...]
I disagree. This is an offset into the data file and is used in the MPI IO call where the type is [...]
We incorrectly compute the MPI_Offset for MPI IO when checkpointing with SolutionTransfer: the offset is computed with 32-bit indices, so files larger than 4 GB end up corrupted. This manifests in errors like:
An error occurred in line <749> of file <../source/distributed/tria_base.cc> in function
void dealii::parallel::DistributedTriangulationBase<dim, spacedim>::load_attached_data(unsigned int, unsigned int, unsigned int, const string&, unsigned int, unsigned int) [with int dim = 3; int spacedim = 3; std::string = std::__cxx11::basic_string<char>]
The violated condition was:
(cell_rel.second == parallel::DistributedTriangulationBase<dim, spacedim>::CELL_PERSIST)
Part of dealii#12752
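To illustrate the kind of overflow described here, the following is a minimal sketch and not the actual deal.II code. The names `file_offset`, `offset_32bit`, and `offset_64bit` are hypothetical, and `std::int64_t` is only a stand-in for `MPI_Offset`, assuming a platform where `MPI_Offset` is 64 bits wide. It shows that a byte offset computed as "cell index times bytes per cell" silently wraps when the product is formed in 32-bit arithmetic, even if the result is later stored in a wider type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MPI_Offset (assumed to be a 64-bit signed integer).
using file_offset = std::int64_t;

// Buggy pattern: the product is reduced to 32 bits before it is widened,
// so the value wraps modulo 2^32 and the widening cast comes too late.
inline file_offset offset_32bit(std::uint32_t first_cell,
                                std::uint32_t bytes_per_cell)
{
  assert(bytes_per_cell > 0);
  const std::uint32_t wrapped =
    static_cast<std::uint32_t>(first_cell * bytes_per_cell);
  return static_cast<file_offset>(wrapped);
}

// Fixed pattern: widen both operands first, then multiply in 64 bits.
inline file_offset offset_64bit(std::uint64_t first_cell,
                                std::uint32_t bytes_per_cell)
{
  assert(bytes_per_cell > 0);
  return static_cast<file_offset>(first_cell) *
         static_cast<file_offset>(bytes_per_cell);
}
```

With 1,000,000 cells of 5,000 bytes each, the correct offset is 5,000,000,000 bytes, while the 32-bit variant yields 705,032,704. A rank seeking to the wrapped offset would then read or write in the wrong part of the file.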
@marcfehling I have the fix for the variable transfer ready to go as well. Do I see this right that we do not test this anywhere?
This fixes save/load of fixed and variable checkpointing where individual ranks write more than 2 GB of data. Part of dealii#12873 and dealii#12752
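The 2 GB per-rank limit comes from the element count in the classic MPI IO calls being a plain `int`. One common workaround, which is not necessarily what the fix referenced here does, is to split a large write into pieces whose counts each fit into an `int`. The sketch below only computes the (offset, count) pairs; the function name `split_into_int_sized_chunks` and its `max_chunk` parameter are hypothetical, and no MPI calls are made.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdint>
#include <utility>
#include <vector>

// Split total_bytes into consecutive pieces of at most max_chunk bytes.
// Each pair holds (byte offset relative to the start of this rank's data,
// byte count), and every count is representable as an int.
inline std::vector<std::pair<std::uint64_t, int>>
split_into_int_sized_chunks(std::uint64_t total_bytes,
                            std::uint64_t max_chunk = INT_MAX)
{
  assert(max_chunk > 0 && max_chunk <= static_cast<std::uint64_t>(INT_MAX));

  std::vector<std::pair<std::uint64_t, int>> chunks;
  std::uint64_t done = 0;
  while (done < total_bytes)
    {
      const std::uint64_t n = std::min(max_chunk, total_bytes - done);
      chunks.emplace_back(done, static_cast<int>(n));
      done += n;
    }
  return chunks;
}
```

Each pair could then be passed to one write call, with the relative offset added to the rank's 64-bit start offset in the file. An alternative design is to describe the whole buffer with a single derived datatype so that the count stays small.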
We have a bug reported for ASPECT related to Triangulation::load/save that got introduced between 9.2 and 9.3 (see 1). After a quick look into tria_base.cc, I am sure that we will run into issues when the number of cells and/or the total file size no longer fits into 32 bits. I have not tested this, but this is suspicious (we need to use types::global_cell_index or MPI_Offset):
dealii/source/distributed/tria_base.cc
Lines 1606 to 1607 in c090a8c
dealii/source/distributed/tria_base.cc
Line 1656 in c090a8c
dealii/source/distributed/tria_base.cc
Line 1383 in c090a8c
dealii/source/distributed/tria_base.cc
Line 1250 in c090a8c
I see the following issues: