
MPI IO file write with >2GB individual writes #12873

Closed · Tracked by #12752

tjhei opened this issue Oct 23, 2021 · 8 comments

tjhei commented Oct 23, 2021

We are using MPI_File_write_at with data type CHAR to write large blobs of data in several places, like

ierr = MPI_File_write_at(fh,
                         offset_variable + prefix_sum, // global position in file
                         DEAL_II_MPI_CONST_CAST(data),
                         src_data_variable.size(),     // local buffer size
                         MPI_CHAR,
                         MPI_STATUS_IGNORE);

Note that the count parameter (with MPI_CHAR, the number of bytes to write) is a signed int. It overflows if we try to write more than 2^31-1 bytes, i.e. 2 GB, from a single rank. This call site (and many others like it) needs to be fixed.
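
For illustration only (not existing deal.II code; the function name is made up), a guard that makes the limit explicit. With MPI_CHAR, one element is one byte, so the count is the byte count:

#include <mpi.h>

#include <cassert>
#include <climits>
#include <cstddef>

// Hedged sketch: fail loudly instead of silently overflowing the
// signed-int count argument of MPI_File_write_at.
void checked_write_at(MPI_File    fh,
                      MPI_Offset  offset,
                      const char *data,
                      std::size_t n_bytes)
{
  assert(n_bytes <= static_cast<std::size_t>(INT_MAX) &&
         "MPI_File_write_at takes a signed int count: writes of more "
         "than 2^31-1 bytes overflow it");
  MPI_File_write_at(fh,
                    offset,
                    data,
                    static_cast<int>(n_bytes),
                    MPI_CHAR,
                    MPI_STATUS_IGNORE);
}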


tjhei commented Oct 25, 2021

My thoughts on solving this so far:
A) Break large writes into individual reads/writes of <2GB each (a sketch follows at the end of this comment).
B) Add some padding to the buffers to write and switch to a bigger datatype, thereby increasing the maximum representable buffer size by that factor (for example 4x with int).
C) Create a custom datatype that covers the total size and set count==1. This is what BigMPI does (see above).

While A) seems to be the simplest and best option, it doesn't work for *_write_ordered, which we use at least for DataOut. I think the ordered write gives better performance, so checkpointing should use it as well.

I find C) to be quite complicated and I don't want to include a library just for that purpose.

That leaves B), which is also a bit annoying as it requires changes to the file layout / offset computation.

Am I missing something?
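
A minimal sketch of option A), just to make the idea concrete (the function name and signature are made up for illustration, not proposed API):

#include <mpi.h>

#include <algorithm>
#include <climits>
#include <cstdint>

// Hedged sketch of option A): split one logical write into pieces whose
// counts each fit into the signed int that MPI_File_write_at expects.
int chunked_write_at(MPI_File      fh,
                     MPI_Offset    offset,
                     const char *  data,
                     std::uint64_t n_bytes)
{
  const std::uint64_t max_chunk = static_cast<std::uint64_t>(INT_MAX);

  std::uint64_t written = 0;
  while (written < n_bytes)
    {
      const int this_chunk =
        static_cast<int>(std::min(max_chunk, n_bytes - written));
      const int ierr =
        MPI_File_write_at(fh,
                          offset + static_cast<MPI_Offset>(written),
                          data + written,
                          this_chunk,
                          MPI_CHAR,
                          MPI_STATUS_IGNORE);
      if (ierr != MPI_SUCCESS)
        return ierr;
      written += this_chunk;
    }
  return MPI_SUCCESS;
}

This works for the independent *_at routines, but, as said above, not for the collective *_write_ordered path.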


tjhei commented Oct 25, 2021

Another option is to use MPI 4.0, which defines large-count I/O operations, for example:

[image: signature of an MPI 4.0 large-count I/O routine]

That doesn't help us for MPI < 4.0, of course.
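
For reference, this is roughly what the large-count variant would look like for the snippet at the top of the issue, if I read the MPI 4.0 standard correctly (the _c routines take an MPI_Count instead of an int):

#if MPI_VERSION >= 4
  // Sketch only: MPI_File_write_at_c accepts an MPI_Count, so a single
  // call can write more than 2 GB. Requires an MPI 4.0 library.
  const MPI_Count count = src_data_variable.size();
  ierr = MPI_File_write_at_c(fh,
                             offset_variable + prefix_sum,
                             data,
                             count,
                             MPI_CHAR,
                             MPI_STATUS_IGNORE);
#endif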

bangerth commented

Requiring MPI 4 is not great. I don't know of an installation that has it -- in fact, I didn't know it was even out.

I don't understand the comment about padding. Let's say you have a buffer of size 678 bytes to write: couldn't you just create an MPI data type of 678 chars and use that to write with count==1? I get that it is a nuisance to create this kind of data type, but we could write a function in namespace Utilities::MPI that does that for you, returns a std::unique_ptr to it, and attaches a custom deleter that frees the data type again (roughly along the lines sketched at the end of this comment). That encapsulates nearly everything you want -- you could even call it in place of the second-to-last argument and not even give the object a name.

(That only leaves the issue of the bizarre bug @tamiko discovered a while ago whereby his MPI implementation did not release the memory associated with the data structure objects if I recall correctly. It may also have been about custom MPI operators.)
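
Something along those lines could look as follows. This is only a sketch with a made-up name, and as written it only handles sizes up to INT_MAX (for larger sizes the type would itself have to be built from chunks, as discussed below):

#include <mpi.h>

#include <cstddef>
#include <memory>

// Hedged sketch of the suggested helper: build an MPI datatype covering
// n_bytes chars and return it as a unique_ptr whose deleter frees the
// type again. Name and signature are illustrative only.
std::unique_ptr<MPI_Datatype, void (*)(MPI_Datatype *)>
make_n_byte_type(const std::size_t n_bytes)
{
  auto deleter = [](MPI_Datatype *t) {
    MPI_Type_free(t);
    delete t;
  };

  std::unique_ptr<MPI_Datatype, void (*)(MPI_Datatype *)> type(
    new MPI_Datatype, deleter);
  MPI_Type_contiguous(static_cast<int>(n_bytes), MPI_CHAR, type.get());
  MPI_Type_commit(type.get());
  return type;
}

One would then pass the dereferenced result as the datatype argument and 1 as the count; since the write call is blocking, the type can safely be freed once the call returns.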


tjhei commented Oct 26, 2021

> I don't understand the comment about padding. Let's say you have a buffer of size 678 bytes to write,

You have two options:
A) increase the datatype size to the total size, set count=1. This is what BigMPI does (see my link above). Creating a datatype of >2GB size seems to be a bit complicated.
B) increase the datatype size to something bigger than char but smaller than the total size, for example CHAR -> INT. Now you can represent bigger total sizes, but the data to write needs to be padded so that its length is evenly divisible by 4.

You seem to suggest A), while I was thinking B) might be a lot easier to pull off. I have to admit that I don't understand exactly what https://github.com/jeffhammond/BigMPI/blob/5300b18cc8ec1b2431bf269ee494054ee7bd9f72/src/type_contiguous_x.c#L74 does to get the right datatype.

Edit: The code is much simpler for char, actually. I think we can go with that.


tjhei commented Oct 26, 2021

void make_large_MPI_type(MPI_Count size, MPI_Datatype *destination)
{
  // Largest count that fits into a signed int:
  const MPI_Count max_signed_int = (1U << 31) - 1;

  const MPI_Count n_chunks          = size / max_signed_int;
  const MPI_Count n_bytes_remainder = size % max_signed_int;

  // n_chunks blocks of max_signed_int bytes each:
  MPI_Datatype chunks;
  MPI_Type_vector(n_chunks, max_signed_int, max_signed_int, MPI_BYTE, &chunks);

  // ...followed by the remaining bytes:
  MPI_Datatype remainder;
  MPI_Type_contiguous(n_bytes_remainder, MPI_BYTE, &remainder);

  // Glue the two parts together. The caller still has to call
  // MPI_Type_commit() on the resulting type before using it.
  int          blocklengths[2]  = {1, 1};
  MPI_Aint     displacements[2] = {0,
                                   static_cast<MPI_Aint>(n_chunks) * max_signed_int};
  MPI_Datatype types[2]         = {chunks, remainder};
  MPI_Type_create_struct(2, blocklengths, displacements, types, destination);

  MPI_Type_free(&chunks);
  MPI_Type_free(&remainder);
}


bangerth commented Oct 26, 2021

Yes, this code looks correct. It creates the MPI equivalent of

  struct X {
    using Chunk = char[max_signed_int];
    Chunk    chunks[n_chunks];
    char     remainder[n_bytes_remainder];
  };

which has exactly the right size and, if you map it directly onto your buffer, covers it in its entirety. You'd then call the MPI write function with (X*)data as the address and 1 as the count.
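
For completeness, a sketch of that usage, reusing the variable names from the snippet at the top of the issue (note that the type built by make_large_MPI_type above still has to be committed before use):

  MPI_Datatype large_type;
  make_large_MPI_type(src_data_variable.size(), &large_type);
  MPI_Type_commit(&large_type);

  // One object of the custom type covers the whole buffer:
  ierr = MPI_File_write_at(fh,
                           offset_variable + prefix_sum,
                           data,
                           1,
                           large_type,
                           MPI_STATUS_IGNORE);

  MPI_Type_free(&large_type);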

tjhei added a commit to tjhei/dealii that referenced this issue Nov 20, 2021:
This fixes save/load of fixed and variable checkpointing where
individual ranks write more than 2 GB of data.
Part of dealii#12873 and dealii#12752

tjhei commented Jun 1, 2022

#13611 provides the facility to use for this.

tjhei closed this as completed Jun 1, 2022