ENH: Add parallel NF90 single-file output #16
Conversation
It seems a few things got lost in the port from MITgcm66h. Please add changes to diagnostics_out.F
that call the nf90io bits, and see other comments in the code.
Is the `.oldcode` directory there on purpose?
With the requested changes I was able to compile and run the test on our cluster!
#else
      CLOSE(ku,STATUS='DELETE')
#endif /* SINGLE_DISK_IO */

      CLOSE (ku, STATUS = 'DELETE')
Looks like a regression; please restore the SINGLE_DISK_IO changes.
pkg/nf90io/nf90io_utils.F
#include "SIZE.h"
#include "EEPARAMS.h"
#include "EESUPPORT.h"
I had to `#include "NF90IO.h"` here and in `NF90IO_OPEN` for this to compile. I think generally we declare functions and their type explicitly in every subroutine that uses them (i.e., a `FUNCTION` and a type statement), so there may not be a need for this header. But maybe we should discuss this convention @jm-c, @christophernhill?
Oh, hmmm. Maybe you use stricter compile options than I do?
Hmm, I am using stock linux_amd64_gfortran...
     &    SQUEEZE_RIGHT , 1)

C     Close the open data file
      CLOSE(iUnit)
This should be updated for SINGLE_DISK_IO (see diagnostics_readparms.F)
tools/genmake2
#! /usr/bin/env bash
#
# $Header$
# $Name$
Regression: CVS keywords have been removed.
tools/genmake2
    echo "Somehow h5pcc not compiled w/ parallel" >> $LOGFILE
    return
fi
echo " ...returns yes" >> $LOGFILE
Do we really need these new tests? I would just test compilation of f90tst_parallel.f90 and maybe, on failure, print some hints to help the user figure out whether she has parallel netcdf support. netcdf 4.5, for instance, has `nc-config --has-parallel`. I don't like the use of `$NETCDF` and `$HDF5` too much, as most optfiles do not define or use these right now.
Ah, I remember now. The issue is that these tests don't fail if parallel support is not enabled, and if I remember correctly the error was a bit mysterious to the end user.
OTOH, I agree that there doesn't seem to be an easily robust way to check for parallel netcdf. You can always call `nc-config`, but it is hard to tell if it is the right `nc-config` on a supercomputer with many possible modules. That's why I was suggesting the environment variables. Some supercomputers seem to set these, and if they don't, I don't think it's a heavy ask for the user to set them, rather than just providing bare `$LIB` and `$INC` lists. But I don't think sorting all that out should hold this up; just something to consider.
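To illustrate the kind of probe being discussed, here is a minimal sketch of querying `nc-config --has-parallel` (in Python rather than genmake2's shell, purely for illustration; the helper name and the None-on-failure fallback are my own, not part of the PR):

```python
import shutil
import subprocess

def has_parallel_netcdf(nc_config="nc-config"):
    """Best-effort probe for parallel netCDF support.

    Returns True/False from `nc-config --has-parallel` (available in
    netCDF >= 4.5), or None when nc-config cannot be found or queried,
    e.g. on a supercomputer where the right module is not loaded.
    """
    if shutil.which(nc_config) is None:
        return None
    try:
        out = subprocess.run([nc_config, "--has-parallel"],
                             capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.stdout.strip().lower() == "yes"
```

Returning None (rather than failing) when the tool is absent mirrors the concern above: a missing or wrong `nc-config` should produce a hint, not a mysterious error.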
tools/genmake2
echo "<<< f90tst_parallel.f90 ===" >> f90tst_parallel.log
echo "$FC $FFLAGS $FOPTIM -c f90tst_parallel.f90 \ " >> f90tst_parallel.log
echo " && $LINK $FFLAGS $FOPTIM -o f90tst_parallel.o $LIBS" >> f90tst_parallel.log
$FC $FFLAGS $FOPTIM $INCLUDES -c ${TOOLSDIR}/maketests/f90tst_parallel.f90 >> f90tst_parallel.log 2>&1 \
Please change to F90C here and below. Many optfiles assume that FC only sees fixed-form files and will not compile this.
OK, my mind is hazy on this, though we had a discussion in altMITgcm/MITgcm66h#15.
I don't actually ever compile the rest of the code with F90C, including the new stuff; I think I just assume that FC will compile F90 code. I tried specifically using F90C for the new files, but the compile options don't carry over, and I wasn't sure how to proceed (I don't really have access to, or experience with, compiler options).
So... I can do this, but I'm just flagging that there is still some inconsistency, and maybe someone else needs to a) force-push onto this PR (which anyone with the commit bit is welcome to do), or b) merge and fix later.
I think we effectively use FC for fixed-form source files (with file extension .F) and F90C for free-form ones (with extension .F90). There can be Fortran-90 features in .F files and these may then just not work with some compilers (like g77). @jm-c correct me if I'm wrong.
Thanks @jahn, that's very helpful.
Thanks! It compiles and runs now. There are still things to think about, such as the tests for parallel netcdf (as you pointed out) and naming of dimensions, etc., so still WIP!
Thanks @jahn. Removing the WIP label, because I think this could be merged w/o hurting anything else. Of the two remaining issues:

genmake2 checks: For now I think this just needs to be a documentation issue. It'll probably evolve as more people use the package and let us know what issues they have on various machines.

Naming conventions: Totally open to working on naming of dimensions with folks. I do think this is somewhat urgent, because we don't want people to start writing downstream code and then have the rug yanked out from under them in three or four months. OTOH, I will agitate to merge this sooner rather than later so I don't have to keep rebasing/merging. When the documentation is merged we can stipulate that the naming convention of the dimensions is subject to change for a release cycle or two, and further PRs to get that all straight could be made after this is in place.
Note that this is still not the final git repository but frozen at CVS from October. So we'll have to rebase it at least once more...
How is that going? Moving the canonical repository could theoretically happen before the docs are finished (so long as there are enough docs that folks can download the code, but that is surely straightforward). Happy to help where I can...
It's great to have a test included, but I couldn't run it with ...
The readme looks like a great base for the documentation for this package.
OK, it seems the `verification` experiments all want an `output.txt`. I've not supplied one. Maybe that is the wrong place for `testNF90io`? I'm not really up on what the verification expts are supposed to check or how `testreport` is meant to behave, so any guidance is welcome.
It does seem that just running the job and looking at `output.txt` will be a weak indication that things are working. The only strong indication that things are working is using a multi-core machine with parallel netcdf installed.
We do run tests with mpi. We could make sure that some of these have parallel netcdf available. We could also think about supporting a check script like the one you provide. This would be useful for checking other aspects of binary output too.
I wrote that script ( |
I'm a big fan of testing, but not a fan of including binaries in the repo. I'll let someone else (@jm-c or @christophernhill perhaps?) jump in about the philosophy behind ...
`doc/tag-index` needs to be updated. (The docs don't currently mention this, but it's in the pipeline.)
Please fix endianness and axes order of the binary input files.
x = x - x[int(nx/2)]

with open(outdir+"/delXvar.bin", "wb") as f:
    dx.tofile(f)
This writes native-endian. Can you change to big-endian here and below, e.g., `dx.astype('>f8').tofile(f)`?
Yeah, I'll do it, but reluctantly... At some point it'd be nice to phase out big-endian, unless VAX makes a comeback someday 😉
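For reference, a small self-contained sketch of the suggested fix, writing the array as big-endian 64-bit floats and checking the round trip (the file name and grid values here are illustrative, not the PR's actual setup):

```python
import os
import tempfile

import numpy as np

nx = 8
dx = np.full(nx, 1000.0)  # illustrative uniform spacing

# astype('>f8') forces big-endian doubles regardless of the host's
# native byte order, which is what the review asks for.
path = os.path.join(tempfile.mkdtemp(), "delXvar.bin")
with open(path, "wb") as f:
    dx.astype(">f8").tofile(f)

# Round-trip check: read back with the matching explicit dtype.
back = np.fromfile(path, dtype=">f8")
assert np.array_equal(back, dx)
```

Reading the file back without the explicit `'>f8'` dtype on a little-endian machine would yield garbage, which is exactly the failure mode native-endian output causes for the model.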
f.close()

# save T0 over whole domain
TT0 = matlib.repmat(T0, nx, ny).T
This does not give the correct order of axes. Can you fix?
I'll get rid of `matlib` altogether because it's disappearing from Matplotlib...
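A possible `matlib`-free replacement using plain NumPy broadcasting (the array sizes and profile values are made up for illustration; the (z, y, x) target shape follows the axis-order point raised in this review):

```python
import numpy as np

nx, ny, nz = 4, 3, 5
T0 = np.linspace(20.0, 4.0, nz)  # illustrative vertical profile

# Replicate the 1-D profile over the horizontal plane without
# numpy.matlib: broadcast_to expands the (nz, 1, 1) view to
# (nz, ny, nx), i.e. z slowest, x fastest; .copy() makes it writable.
TT0 = np.broadcast_to(T0[:, None, None], (nz, ny, nx)).copy()

assert TT0.shape == (nz, ny, nx)
```

Unlike `matlib.repmat(...).T`, the broadcast version states the output axis order explicitly, so transposition mistakes are harder to make.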
If we keep this test here in verification, we should add the binary input files and reference output. I think this test is useful, even without a way to automatically check the resulting netCDF file (which would also be useful, of course).
Looks like this works only with ...
wrt `nSx`, ummm... hmmm. I've not tested for that, but I do iterate over `bi` and `bj`. Do you think it's incorrect?
Well, the file comes out all shuffled around. I can have a look at what might be the problem.
      err = nf90_put_var(ncid, varid,
     &    dat(1:sNx, 1:sNy, 1:Nr, bi, bj),
     &    start = start, count = count)
      CALL nf90ERR(err, "Putting data into file", myThid)
@jahn I think that the tiling error would be in here (and similar). I don't see anything obvious, but maybe I did something dumb....
My guess would be that in collective mode each processor has to make exactly one call to nf90_put_var for each record. If that's true, one would have to combine the tiles one processor controls into one array that can be passed to nf90_put_var. I couldn't verify this in the netCDF docs, though. Maybe you know where to look?
Oh, I have no idea - not claiming to be a netcdf expert, it just seemed easy to put this together so I did.
I think we could proceed for now by putting a `STOP` in somewhere.
If what you say above is correct, I'd need guidance on how frequently multi-threading is used by folks, as to whether it is worth the extra code to gather each processor's output and save it. Certainly that's substantially more complicated and memory-intensive than what is done now. If it's worth it (i.e., if many people actually multithread) then we should do it as a future refinement. Myself, I'm not even sure how to run a multithreaded job, though no doubt I could figure it out; maybe as easy as just setting `nSx`/`nSy`?
This is not just for multi-threading: one can run with multiple tiles per processor in a single thread (that's what these loops are for). Sometimes it is convenient to run on fewer processors without having to change the tiling. And testreport uses this to run parallel tests on as many processors as are available.
Confirmed on my machine. Advice appreciated. We could gather a processor-global temp variable and `nf90_put_var` that variable instead. I think that seems reasonable, and not too much work.
I think I found my bug and/or implemented gathering. Not sure gathering is necessary. Basically, I needed to trim the arrays `qdiag` to `Nr` because they are `NrMax` in the k-direction, and that caused bad data to be in the routines if `sNx>1`...
Yeah, didn't need to gather, just needed to pass the properly trimmed data to the writing functions. @jahn, this should work now.
Working on making the example `testreport`-ready.
yes it does!
How does the ...
Yes, `testreport -mpi` (or `-MPI <N>`) uses `SIZE.h_mpi`. The latter also modifies it to match the actual number of processors, changing `nSx` or `nSy`.
There is some information on verification experiment setup and
how testreport works here, section 4.2:
http://mitgcm.org/public/devel_HOWTO/devel_HOWTO_onepage/
For the rest to get ...
# save T0 over whole domain
slope = np.arange(nx) * 0.2
TT0 = T0[None, None, :] + 0 * y[None, :, None] + slope[:, None, None]
sorry for being a pain, but the axes order is still backwards.
🐑
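A sketch of the transposed construction, assuming the intended on-disk order is (z, y, x) with x varying fastest; that axis order is my reading of the review comments, not something stated explicitly in the thread, and the sizes and values are illustrative:

```python
import numpy as np

nx, ny, nz = 6, 4, 3
T0 = np.linspace(18.0, 2.0, nz)  # illustrative vertical profile
slope = np.arange(nx) * 0.2      # illustrative x-dependent perturbation
y = np.arange(ny)                # kept only to make the shapes explicit

# Same three terms as the snippet under review, but with the axes
# swapped so the result has shape (nz, ny, nx): z slowest, x fastest.
TT0 = T0[:, None, None] + 0 * y[None, :, None] + slope[None, None, :]

assert TT0.shape == (nz, ny, nx)
```

Writing this array with `TT0.astype('>f8').tofile(...)` would then put the x dimension contiguous on disk, which is what the earlier endianness and axis-order comments are driving at.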
For the testreport run to be meaningful, `SIZE.h` has to be set up for a one-processor run. To test with mpi, with a specific number of processors, use something like `testreport -MPI <N>`. This should work if your optfile supports mpi. If you cannot get ...
Thanks, still a bit confused: what do I use for ...? I assume when I run ...
Closing in favour of the proper repository. MITgcm/MITgcm#31 |
Preface
Ported over from the experimental altMITgcm/MITgcm66h#15
Description
This is an implementation of the netcdf4 parallel writing, so that each tile in an mpi process writes to the same file, as described: https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-f90/
Currently it only supports writes using `pkg/diagnostics`. Set `diag_nf90io=.TRUE.` in `data.diagnostics`.

Modifies `genmake2` to test for NF90. Also changes the `.f.o:` Makefile directive to have `INCLUDES` as a flag for the compiler, to accommodate `use netcdf` commands in the `*.f` files.

See `verification/testNF90io/` for a basic example.

See the README.rst in `pkg/nf90io` for a possible manual entry.

Todo:
- `genmake2`: a test that checks if hdf5 and netcdf have been compiled with parallel support. This may be beyond my expertise. I note that `autoconf` has such a test, and maybe their macro could be used.
- `verification/testNF90io` ...
- `mnc` ... do the time alignment between the different packages.
- `diagnostics` ... `print` statements ... `.FALSE.`