
Parallel IO in CICE #81

Open · anton-seaice opened this issue Oct 27, 2023 · 18 comments

Labels: blocked (For issues waiting resolution of issues outside this repository), cice6 (Related to CICE6)

Comments

@anton-seaice
Contributor

In OM2, a fair bit of work was done to add parallel writing of netCDF output, to get around delays writing daily output from CICE:

COSIMA/cice5#34

COSIMA/cice5@e9575cd

The ice_history code between CICE5 and 6 looks largely unchanged, so we will probably need to make similar changes to CICE6?

@anton-seaice changed the title from Parrallel IO to Parrallel IO in CICE on Oct 27, 2023
@micaeljtoliveira
Contributor

CICE6 has the option to perform IO using parallelio. This is implemented here:

https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/infrastructure/io/io_pio2

My understanding is that, when using it, it replaces the serial IO entirely, which is probably why this is not obvious in ice_history.F90.

Note that, currently, the default build option in OM3 is to use PIO (see here).

@anton-seaice
Contributor Author

Thanks Micael

Maybe I misunderstood the changes made to CICE5, and COSIMA/cice5@e9575cd is just about adding the chunking features and some other improvements? But the parallel IO was already working?

@aekiss - Can you confirm?

@micaeljtoliveira
Contributor

@anton-seaice I think the PIO support in the COSIMA fork of CICE5 and in CICE6 was developed independently, so they might not provide exactly the same features. Still, the existing PIO support in CICE6 is very likely good enough for our needs, although that needs to be tested.

@anton-seaice
Contributor Author

anton-seaice commented Nov 3, 2023

Using the config from ACCESS-NRI/access-om3-configs#17, ice.log gives these times:

Timer   1:     Total     173.07 seconds
Timer  13:   History      43.67 seconds

It's not clear to me if that is a problem (times are not mutually exclusive), and we might not know until we try the higher resolutions.

There are a couple of other issues though:

- Monthly output in OM2 was ~17 MB:

  -rw-r-----+ 1 rmh561 ik11 7.6M May 11 2022 /g/data/ik11/outputs/access-om2/1deg_era5_ryf/output000/ice/OUTPUT/iceh.1900-01.nc

  but the OM3 output is ~69 MB:

  -rwxrwx--x 1 as2285 tm70 69M Nov 3 14:22 GMOM_JRA.cice.h.0001-01.nc

- The history output is not chunked.
- @dougie pointed out the history output is being written in "64-bit offset" format, which is a very dated way to write output that predates NetCDF-4 (a quick check for both of these is sketched below).
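
The format and chunking can be confirmed with the stock netCDF command-line tools (the flags below are standard ncdump options; the filename is the OM3 output above):

```sh
# Report the on-disk format: "64-bit offset", "netCDF-4", etc.
ncdump -k GMOM_JRA.cice.h.0001-01.nc

# Dump the header with the special virtual attributes (_Format, _Storage, _ChunkSizes, _DeflateLevel)
ncdump -h -s GMOM_JRA.cice.h.0001-01.nc
```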

@anton-seaice
Contributor Author

anton-seaice commented Nov 10, 2023

It looks like we need to set pio_typename = netcdf4p in nuopc.runconfig to turn this on (per med_io_mod).
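
For context, that is a one-line change in the component modelio block of nuopc.runconfig; a sketch is below (the block name, e.g. ICE_modelio vs MED_modelio, and the layout follow the usual CMEPS conventions and should be checked against our generated config):

```
ICE_modelio::
     pio_typename = netcdf4p
::
```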

But when I do this, I get this error in access-om3.err:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Obtained 10 stack frames.
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(print_trace+0x29) [0x147f3a88eff9]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(piodie+0x42) [0x147f3a88d082]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(check_netcdf2+0x1b9) [0x147f3a88d019]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile_retry+0x855) [0x147f3a88d9f5]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile+0x16) [0x147f3a8887e6]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpiof.so(piolib_mod_mp_pio_openfile_+0x21f) [0x147f3a61dacf]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x4082508]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x408b56f]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x42544bd]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x40589e5]

The "No data available" is curious. I think it's trying to open the restart file (which works fine if pio_typename = netcdf). This implies it could be missing dependencies: are we including both the HDF5 and PnetCDF libraries? More importantly, where would I find out?
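
One way to check what the netCDF build in our environment actually supports is via nc-config from that build (these query flags are part of stock netCDF-C, but worth double-checking on Gadi):

```sh
# Parallel netCDF-4 (i.e. HDF5 built with MPI) support?
nc-config --has-parallel4

# Built against PnetCDF for parallel classic-format IO?
nc-config --has-pnetcdf
```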

@micaeljtoliveira
Contributor

The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.

HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.

@aekiss
Contributor

aekiss commented Nov 14, 2023

Possibly relevant:
COSIMA/access-om2#166

@anton-seaice
Contributor Author

> The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.
>
> HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.

Thanks, this sounds ok. HDF5 is the one we want, and the ParallelIO library should be backwards compatible without pnetcdf.

I am still getting the "NetCDF: Error initializing for parallel access" error when reading files (although I can generate netCDF-4 files ok). The error text comes from the netCDF library, but it looks like it could be an error from the HDF5 library. I can't see any error logs from the HDF5 library though? I wonder if building HDF5 in Build Mode: 'Debug' rather than 'Release' would generate error messages (or at least line numbers in the stack trace)?

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1

@anton-seaice
Contributor Author

anton-seaice commented Nov 21, 2023

> This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
>
> https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1

I was way off on a tangent. The ParallelIO library doesn't like using a symlink to the initial conditions file, and this gives the get_stripe failed error.
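
The workaround is to make the initial conditions path a real file rather than a symlink; a minimal sketch (hypothetical filename):

```sh
# Replace the payu-generated symlink with the file it points to
target=$(readlink -f iced.1900-01-01.nc)
rm iced.1900-01-01.nc
cp "$target" iced.1900-01-01.nc   # a hard link would also work if it's on the same filesystem
```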

@anton-seaice
Contributor Author

I raised an issue for the code changes needed for chunking and compression:
CICE-Consortium/CICE#914

@anton-seaice changed the title from Parrallel IO in CICE to Parallel IO in CICE on Nov 21, 2023
@anton-seaice
Contributor Author

anton-seaice commented Dec 12, 2023

For anyone reading later, Dale Roberts and OpenMPI both suggested setting the MPI-IO library to romio321 instead of ompio (the default).

(i.e. mpirun --mca io romio321 ./cice)

This works and opens files through the symlink, but there is a significant performance hit. Monthly runs (with some daily output) have history timers in the ice.log of approximately double (99 seconds vs 54 seconds; 48 cores, 12 PIO tasks, pio_type=netcdf4p).

It looks like ompio was deliberately chosen in OM2 (see https://cosima.org.au/index.php/category/minutes/ and COSIMA/cice5#34 (comment)), but the details are pretty minimal, so this doesn't seem like a good fix.

There is an open issue with OpenMPI still: open-mpi/ompi#12141

@dsroberts

Hi @anton-seaice. I was going to email the following to you, but thought I'd put it here:

In my experience ROMIO is very sensitive to tuning parameters. If your Lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns.

For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400 MB/s to 2.5-3 GB/s sustained to a single file. If your PIO tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer.

There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.
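
As a concrete illustration of this kind of tuning, ROMIO hints can be passed via a hints file named in the ROMIO_HINTS environment variable (the hint names are standard ROMIO ones; the values here are placeholders to experiment with, not recommendations):

```sh
# One "key value" pair per line
cat > romio_hints.txt <<'EOF'
romio_cb_write enable
cb_nodes 12
cb_buffer_size 16777216
striping_factor 1
striping_unit 1048576
EOF

export ROMIO_HINTS=$PWD/romio_hints.txt
mpirun --mca io romio321 ./cice
```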

@anton-seaice
Contributor Author

anton-seaice commented Dec 12, 2023

> Hi @anton-seaice. I was going to email the following to you, but thought I'd put it here:
>
> In my experience ROMIO is very sensitive to tuning parameters. If your Lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns.
>
> For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400 MB/s to 2.5-3 GB/s sustained to a single file. If your PIO tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer.
>
> There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.

Thanks Dale.

The other big caveat here is that we only have a 1 degree configuration at this point, and in OM2 performance was worse with parallel IO (than without) at 1 degree but better at 0.25 degree. So it may be hard to really get into the details at this point.

Lustre stripe count is 1 (files are <100MB), but I couldn't figure out an easy way to check cb_nodes?
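
For checking these, something like the sketch below should work; lfs getstripe is standard Lustre, while ROMIO_PRINT_HINTS is from memory as ROMIO's hint-printing switch and is worth verifying against the ROMIO docs:

```sh
# Stripe count of an existing history file
lfs getstripe -c GMOM_JRA.cice.h.0001-01.nc

# Ask ROMIO to print the hints (including cb_nodes) it uses when opening files
export ROMIO_PRINT_HINTS=1
mpirun --mca io romio321 ./cice
```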

CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

That said, it looks like using 1 PIO iotask (there are 48 PEs) and the box rearranger is fastest. With 1 PIO task, the box rearranger and ompio, the reported history time is ~12 seconds (vs about 15 seconds with romio321).

(For reference: config tested)
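
For later comparison, that setup corresponds roughly to the following PIO entries in nuopc.runconfig (sketched from the CMEPS modelio conventions; the exact names and defaults should be checked against the generated file):

```
ICE_modelio::
     pio_typename = netcdf4p
     pio_numiotasks = 1
     pio_stride = 48
     pio_rearranger = 1
::
```

(In ParallelIO, pio_rearranger = 1 selects the box rearranger and 2 the subset rearranger.)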

@anton-seaice
Contributor Author

anton-seaice commented Dec 19, 2023

OpenMPI will fix the bug, so the plan of action is:

@aekiss
Contributor

aekiss commented Dec 20, 2023

Could also be worth discussing with Rui Yang (NCI) - he has a lot of experience with parallel IO.

@aekiss
Contributor

aekiss commented Dec 21, 2023

> CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

Would efficient parallel IO also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Also (as in OM2) we'll probably use a different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - I guess this makes the role of the rearranger more important?
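
For reference, these are set in the domain_nml namelist in ice_in; a sketch of the kind of higher-resolution layout being described (values purely illustrative):

```
&domain_nml
    distribution_type = 'sectrobin'
    distribution_wght = 'block'
    processor_shape   = 'slenderX2'
/
```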

@anton-seaice
Contributor Author

anton-seaice commented Jan 7, 2024

> Would efficient parallel IO also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Possibly, we will have to revisit this when the chunking is working, although with neatly organised data (i.e. at 1 degree, where blocks are adjacent) it might not matter. If we stick with the box rearranger, then 1 chunk per iotask is worth trying. Of course, we need to be mindful of read patterns just as much as write speed though.

> Also (as in OM2) we'll probably use a different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - I guess this makes the role of the rearranger more important?

Using the box rearranger, all the data from one compute task would go to one IO task, but then the data blocks would be non-contiguous in the output and need multiple calls to the netCDF library. (Presumably we would set the netCDF chunk size equal to the block size.)

Using the subset rearranger, the data from each compute task would be spread among multiple IO tasks, but then the data blocks would be contiguous for each IO task and require only one call to the netCDF library. (Presumably we would set one netCDF chunk per IO task.)

Box would have more IO operations and subset would have more network operations. I don't know how they would balance out (and I would also guess the results differ depending on whether the tasks are spread across multiple NUMA nodes / real nodes etc.).

NB: The TWG minutes talk about this a lot. The suggestion there is actually that one chunk per node will be best!
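
Until chunking/compression can be set at write time (the CICE-Consortium/CICE#914 changes), one way to experiment with chunk layouts offline is nccopy (standard netCDF utility; dimension names and chunk sizes below are just illustrative):

```sh
# Rewrite as netCDF-4 with deflate level 1, chunked as one time record
# and a 300x360 spatial tile per chunk
nccopy -k nc4 -d 1 -c time/1,nj/300,ni/360 GMOM_JRA.cice.h.0001-01.nc rechunked.nc
```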

@anton-seaice self-assigned this on Feb 1, 2024
@anton-seaice added the cice6 (Related to CICE6) label on Feb 1, 2024
@dougiesquire added the blocked (For issues waiting resolution of issues outside this repository) label on May 2, 2024
anton-seaice added a commit to COSIMA/om3-scripts that referenced this issue Aug 2, 2024
Creation of user scripts for OM3, see COSIMA/access-om3#182

There are three processes here:

1. Per COSIMA/access-om3#81, the initial conditions file for CICE is converted from a symlink to a file/hardlink
2. CICE daily output is concatenated into one file per month
3. An intake-esm datastore for the run is generated

---------

Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>