
MPI IO file write with >2GB individual writes #12873

Closed · Tracked by #12752

tjhei opened this issue Oct 23, 2021 · 8 comments

tjhei commented Oct 23, 2021

We are using MPI_File_write_at with data type CHAR to write large blobs of data in several places, like

ierr = MPI_File_write_at(fh,
                         offset_variable + prefix_sum, // global position in file
                         DEAL_II_MPI_CONST_CAST(data),
                         src_data_variable.size(),     // local buffer size
                         MPI_CHAR,
                         MPI_STATUS_IGNORE);

Note that the count parameter (with MPI_CHAR, the number of bytes to write) is a signed int. It overflows if we try to write more than 2^31-1 bytes, i.e. 2 GB, from a single rank. This call site (and many others like it) needs to be fixed.
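
For illustration only (not existing deal.II code; the function name is made up), a guard that makes the limit explicit. With MPI_CHAR, one element is one byte, so the count is the byte count:

#include <mpi.h>

#include <cassert>
#include <climits>
#include <cstddef>

// Hedged sketch: fail loudly instead of silently overflowing the
// signed-int count argument of MPI_File_write_at.
void checked_write_at(MPI_File    fh,
                      MPI_Offset  offset,
                      const char *data,
                      std::size_t n_bytes)
{
  assert(n_bytes <= static_cast<std::size_t>(INT_MAX) &&
         "MPI_File_write_at takes a signed int count: writes of more "
         "than 2^31-1 bytes overflow it");
  MPI_File_write_at(fh,
                    offset,
                    data,
                    static_cast<int>(n_bytes),
                    MPI_CHAR,
                    MPI_STATUS_IGNORE);
}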


tjhei commented Oct 25, 2021

My thoughts on solving this so far:
A) Break large writes into individual reads/writes of <2GB each (a sketch follows at the end of this comment).
B) Add some padding to the buffers to write and switch to a bigger datatype, thereby increasing the maximum representable buffer size by that factor (for example 4x with int).
C) Create a custom datatype that covers the total size and set count==1. This is what BigMPI does (see above).

While A) seems to be the simplest and best option, it doesn't work for *_write_ordered, which we use at least for DataOut. I think the ordered write gives better performance, so checkpointing should use it as well.

I find C) to be quite complicated and I don't want to include a library just for that purpose.

That leaves B), which is also a bit annoying as it requires changes to the file layout / offset computation.

Am I missing something?
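
A minimal sketch of option A), just to make the idea concrete (the function name and signature are made up for illustration, not proposed API):

#include <mpi.h>

#include <algorithm>
#include <climits>
#include <cstdint>

// Hedged sketch of option A): split one logical write into pieces whose
// counts each fit into the signed int that MPI_File_write_at expects.
int chunked_write_at(MPI_File      fh,
                     MPI_Offset    offset,
                     const char *  data,
                     std::uint64_t n_bytes)
{
  const std::uint64_t max_chunk = static_cast<std::uint64_t>(INT_MAX);

  std::uint64_t written = 0;
  while (written < n_bytes)
    {
      const int this_chunk =
        static_cast<int>(std::min(max_chunk, n_bytes - written));
      const int ierr =
        MPI_File_write_at(fh,
                          offset + static_cast<MPI_Offset>(written),
                          data + written,
                          this_chunk,
                          MPI_CHAR,
                          MPI_STATUS_IGNORE);
      if (ierr != MPI_SUCCESS)
        return ierr;
      written += this_chunk;
    }
  return MPI_SUCCESS;
}

This works for the independent *_at routines, but, as said above, not for the collective *_write_ordered path.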


tjhei commented Oct 25, 2021

Another option is to use MPI 4.0, which defines large-count I/O operations, for example:

[image: signature of an MPI 4.0 large-count I/O routine]

That doesn't help us for MPI < 4.0, of course.
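
For reference, this is roughly what the large-count variant would look like for the snippet at the top of the issue, if I read the MPI 4.0 standard correctly (the _c routines take an MPI_Count instead of an int):

#if MPI_VERSION >= 4
  // Sketch only: MPI_File_write_at_c accepts an MPI_Count, so a single
  // call can write more than 2 GB. Requires an MPI 4.0 library.
  const MPI_Count count = src_data_variable.size();
  ierr = MPI_File_write_at_c(fh,
                             offset_variable + prefix_sum,
                             data,
                             count,
                             MPI_CHAR,
                             MPI_STATUS_IGNORE);
#endif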

bangerth commented

Requiring MPI 4 is not great. I don't know of an installation that has it -- in fact, I didn't know it was even out.

I don't understand the comment about padding. Let's say you have a buffer of size 678 bytes to write: couldn't you just create an MPI data type of 678 chars and use that to write with count==1? I get that it is a nuisance to create this kind of data type, but we could write a function in namespace Utilities::MPI that does that for you, returns a std::unique_ptr to it, and attaches a custom deleter that frees the data type again (roughly along the lines sketched at the end of this comment). That encapsulates nearly everything you want -- you could even call it in place of the second-to-last argument and not even give the object a name.

(That only leaves the issue of the bizarre bug @tamiko discovered a while ago whereby his MPI implementation did not release the memory associated with the data structure objects if I recall correctly. It may also have been about custom MPI operators.)
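
Something along those lines could look as follows. This is only a sketch with a made-up name, and as written it only handles sizes up to INT_MAX (for larger sizes the type would itself have to be built from chunks, as discussed below):

#include <mpi.h>

#include <cstddef>
#include <memory>

// Hedged sketch of the suggested helper: build an MPI datatype covering
// n_bytes chars and return it as a unique_ptr whose deleter frees the
// type again. Name and signature are illustrative only.
std::unique_ptr<MPI_Datatype, void (*)(MPI_Datatype *)>
make_n_byte_type(const std::size_t n_bytes)
{
  auto deleter = [](MPI_Datatype *t) {
    MPI_Type_free(t);
    delete t;
  };

  std::unique_ptr<MPI_Datatype, void (*)(MPI_Datatype *)> type(
    new MPI_Datatype, deleter);
  MPI_Type_contiguous(static_cast<int>(n_bytes), MPI_CHAR, type.get());
  MPI_Type_commit(type.get());
  return type;
}

One would then pass the dereferenced result as the datatype argument and 1 as the count; since the write call is blocking, the type can safely be freed once the call returns.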


tjhei commented Oct 26, 2021

> I don't understand the comment about padding. Let's say you have a buffer of size 678 bytes to write,

You have two options:
A) increase the datatype size to the total size, set count=1. This is what BigMPI does (see my link above). Creating a datatype of >2GB size seems to be a bit complicated.
B) increase the datatype size to something bigger than char but smaller than the total size, for example CHAR -> INT. Now you can represent bigger total sizes, but the data to write needs to be padded so that its length is evenly divisible by 4.

You seem to suggest A), while I was thinking B) might be a lot easier to pull off. I have to admit that I don't understand exactly what https://github.com/jeffhammond/BigMPI/blob/5300b18cc8ec1b2431bf269ee494054ee7bd9f72/src/type_contiguous_x.c#L74 does to get the right datatype.

Edit: The code is much simpler for char, actually. I think we can go with that.


tjhei commented Oct 26, 2021

void make_large_MPI_type(MPI_Count size, MPI_Datatype *destination)
{
  // Largest count that fits into a signed int:
  const MPI_Count max_signed_int = (1U << 31) - 1;

  const MPI_Count n_chunks          = size / max_signed_int;
  const MPI_Count n_bytes_remainder = size % max_signed_int;

  // n_chunks blocks of max_signed_int bytes each:
  MPI_Datatype chunks;
  MPI_Type_vector(n_chunks, max_signed_int, max_signed_int, MPI_BYTE, &chunks);

  // ...followed by the remaining bytes:
  MPI_Datatype remainder;
  MPI_Type_contiguous(n_bytes_remainder, MPI_BYTE, &remainder);

  // Glue the two parts together. The caller still has to call
  // MPI_Type_commit() on the resulting type before using it.
  int          blocklengths[2]  = {1, 1};
  MPI_Aint     displacements[2] = {0,
                                   static_cast<MPI_Aint>(n_chunks) * max_signed_int};
  MPI_Datatype types[2]         = {chunks, remainder};
  MPI_Type_create_struct(2, blocklengths, displacements, types, destination);

  MPI_Type_free(&chunks);
  MPI_Type_free(&remainder);
}


bangerth commented Oct 26, 2021

Yes, this code looks correct. It creates the MPI equivalent of

  struct X {
    using Chunk = char[max_signed_int];
    Chunk    chunks[n_chunks];
    char     remainder[n_bytes_remainder];
  };

which has exactly the right size and, if you map it directly onto your buffer, covers it in its entirety. You'd then call the MPI write function with (X*)data as the address and 1 as the count.
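
For completeness, a sketch of that usage, reusing the variable names from the snippet at the top of the issue (note that the type built by make_large_MPI_type above still has to be committed before use):

  MPI_Datatype large_type;
  make_large_MPI_type(src_data_variable.size(), &large_type);
  MPI_Type_commit(&large_type);

  // One object of the custom type covers the whole buffer:
  ierr = MPI_File_write_at(fh,
                           offset_variable + prefix_sum,
                           data,
                           1,
                           large_type,
                           MPI_STATUS_IGNORE);

  MPI_Type_free(&large_type);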

tjhei added a commit to tjhei/dealii that referenced this issue Nov 20, 2021:
This fixes save/load of fixed and variable checkpointing where
individual ranks write more than 2 GB of data.
Part of dealii#12873 and dealii#12752

tjhei commented Jun 1, 2022

#13611 provides the facility to use for this.

tjhei closed this as completed Jun 1, 2022