
Bug: Triangulation::load()/save() with large output files #12752

Open
5 of 7 tasks
tjhei opened this issue Sep 10, 2021 · 4 comments

@tjhei
Member

tjhei commented Sep 10, 2021

We have a bug report for ASPECT related to Triangulation::load()/save() that was introduced between 9.2 and 9.3 (see 1). After a quick look into tria_base.cc, I am sure that we will run into issues once the number of cells and/or the total file size exceeds what fits into 32 bits. I have not tested this, but the following is suspicious (we need to use types::global_cell_index or MPI_Offset):

const unsigned int offset =
sizes_fixed_cumulative.size() * sizeof(unsigned int);

const unsigned int offset = global_num_cells * sizeof(unsigned int);

const unsigned int global_num_cells,

unsigned int offset = numbers::invalid_unsigned_int;
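
To illustrate the suspected overflow (the numbers and variable names below are made up, this is not the actual tria_base.cc code): once global_num_cells times the per-cell size exceeds 2^32 bytes, an unsigned int can no longer hold the offset and silently wraps around, while a 64-bit type such as MPI_Offset keeps the correct value.

#include <cstdint>
#include <iostream>

int main()
{
  // Hypothetical figure: ~1.3 billion cells with 4 bytes of header data
  // per cell, so the true byte offset is about 5.2 GB and no longer fits
  // into 32 bits.
  const std::uint64_t global_num_cells = 1300000000ULL;

  // Storing the result in an unsigned int truncates it modulo 2^32,
  // which is the suspected failure mode.
  const unsigned int offset_32 = global_num_cells * sizeof(unsigned int);

  // Keeping the result in a 64-bit type (std::uint64_t here; MPI_Offset or
  // types::global_cell_index in deal.II) preserves the correct value.
  const std::uint64_t offset_64 = global_num_cells * sizeof(unsigned int);

  std::cout << "32-bit offset: " << offset_32 << '\n'  // wrapped around
            << "64-bit offset: " << offset_64 << '\n'; // correct
}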

I see the following issues:

@marcfehling
Member

marcfehling commented Sep 10, 2021

The code snippets you mentioned were already there in version 9.2.0, but at a different spot. Using the v9.2.0 tag, for example:

const unsigned int offset =
sizes_fixed_cumulative.size() * sizeof(unsigned int);

The entire DataTransfer class has been moved from p::d::Tria to p::DistributedTriangulationBase to be used in combination with p::fd::Tria. DataTransfer has not been changed between releases AFAIK. I would bet that the error might be somewhere else, but I don't have an idea right now :(


Anyways, I agree that we should use a different type here.

@bangerth
Member

At least the first of these code snippets appears correct to me since it pertains to per-cell data. I agree that everything that involves global_num_cells looks suspicious and should probably be types::global_cell_index.

@tjhei
Member Author

tjhei commented Sep 17, 2021

At least the first of these code snippets appears correct to me since it pertains to per-cell data. [...]

I disagree. This is an offset into the data file and is used in the MPI IO call where the type is MPI_Offset (64 bit). If your file is bigger than 4 GB, this will fail.
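
For context, a minimal sketch of the pattern being discussed (not the actual deal.II code; the file name and data layout are made up): the offset argument of MPI_File_write_at is an MPI_Offset, so the write lands in the right place only if the value handed to it was computed in a 64-bit type rather than first truncated to an unsigned int.

#include <mpi.h>

#include <cstdint>
#include <vector>

// A minimal sketch, not the deal.II implementation: each rank writes its
// chunk of per-cell data at a byte offset that can exceed 4 GB.
void write_chunk(const MPI_Comm           comm,
                 const std::uint64_t      my_first_cell, // global index of first local cell
                 const std::uint64_t      bytes_per_cell,
                 const std::vector<char> &local_data)
{
  MPI_File fh;
  MPI_File_open(comm,
                "checkpoint.data", // hypothetical file name
                MPI_MODE_CREATE | MPI_MODE_WRONLY,
                MPI_INFO_NULL,
                &fh);

  // Compute the offset in 64 bits. MPI_Offset can address positions
  // beyond 4 GB; squeezing the value through an 'unsigned int' first
  // would wrap it around and corrupt the file.
  const MPI_Offset offset =
    static_cast<MPI_Offset>(my_first_cell * bytes_per_cell);

  MPI_File_write_at(fh,
                    offset,
                    local_data.data(),
                    static_cast<int>(local_data.size()),
                    MPI_CHAR,
                    MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
}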

tjhei added a commit to tjhei/dealii that referenced this issue Oct 15, 2021
We incorrectly compute MPI_Offset for MPI IO for checkpointing using
SolutionTransfer using 32 bit indices, which means that files larger
than 4GB end up being corrupted.
This manifests in errors like

An error occurred in line <749> of file
<../source/distributed/tria_base.cc> in function
void dealii::parallel::DistributedTriangulationBase<dim,
spacedim>::load_attached_data(unsigned int, unsigned int, unsigned int,
const string&, unsigned int, unsigned int) [with int dim = 3; int
spacedim = 3; std::string = std::__cxx11::basic_string<char>]
The violated condition was:
(cell_rel.second == parallel::DistributedTriangulationBase<dim,
spacedim>::CELL_PERSIST)

part of dealii#12752
@tjhei
Member Author

tjhei commented Oct 15, 2021

@marcfehling I have the fix for the variable transfer ready to go as well. Do I see this right, that we do not test this anywhere?

tamiko pushed a commit to tamiko/dealii that referenced this issue Oct 25, 2021
tjhei added a commit to tjhei/dealii that referenced this issue Nov 17, 2021
tjhei added a commit to tjhei/dealii that referenced this issue Nov 20, 2021
This fixes save/load of fixed and variable checkpointing where
individual ranks write more than 2GBs of data.
Part of dealii#12873 and dealii#12752
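
For background on the "more than 2 GB per rank" part: the count argument of MPI IO routines such as MPI_File_write_at is a plain int, so a single call cannot transfer more than roughly 2 GB of MPI_CHAR data. A minimal sketch of the usual chunking workaround, under that assumption and not necessarily what the commits above implement:

#include <mpi.h>

#include <algorithm>
#include <cstdint>
#include <limits>

// One common workaround for the int-sized 'count' argument of MPI IO
// calls (an illustration only, not necessarily how the deal.II patch does
// it): split the buffer into chunks of at most INT_MAX bytes and advance
// the file offset between writes.
void write_large_buffer(const MPI_File fh,
                        MPI_Offset     offset,
                        const char *   data,
                        std::uint64_t  n_bytes)
{
  const std::uint64_t max_chunk = std::numeric_limits<int>::max();

  while (n_bytes > 0)
    {
      const int this_chunk = static_cast<int>(std::min(n_bytes, max_chunk));

      MPI_File_write_at(fh, offset, data, this_chunk, MPI_CHAR,
                        MPI_STATUS_IGNORE);

      data    += this_chunk;
      offset  += this_chunk;
      n_bytes -= this_chunk;
    }
}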