Lemon pb on BG/Q #230

Closed
MarianeBrinet opened this Issue Feb 14, 2013 · 15 comments


I have a problem with Lemon on BG/Q; the error message I get when
running the inverter with Lemon is:

[LEMON] Node 5771 reports in lemonWriteLatticeParallel:
Could not write the required amount of data.
[LEMON] Node 1568 reports in lemonWriteLatticeParallel:
Could not write the required amount of data.

The "Lemon configure version" is 0.99 (written in the config.log file, but does it give the
Lemon version...?). I configured using

./configure --prefix=/workgpfs/rech/xgm/rxgm005/binaires/lemon CC=mpixlcxx CFLAGS=-I/bgsys/drivers/cfloor/comm/xl.ndebug/include -qstrict -O3

(removing -qstrict and/or -O3 does not solve the problem).

The error appears at run time, not when compiling.

Please let me know if you need more information, and thanks in advance for your
help!

Mariane

@ghost assigned deuzeman and urbach Feb 14, 2013

Contributor

deuzeman commented Feb 14, 2013

Does this problem appear for every write? Or intermittently?

Since it's only two nodes out of many reporting a writing issue, this could be the old problem where blocks of zeros appear in the configurations. If so, then this is ironically an improvement, in that this implementation of the MPI I/O standard at least seems to be aware that something bad is happening. In that case, maybe we can implement some graceful recovery: either retrying the write, or falling back to c-lime...
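
Just to sketch the kind of graceful recovery I have in mind, something along these lines (a minimal sketch in plain C; write_gauge_field_lemon and write_gauge_field_lime are hypothetical wrappers standing in for the parallel and serial write paths, not actual Lemon or c-lime API calls):

    #include <stdio.h>

    /* Hypothetical stand-ins for the parallel (Lemon) and serial (c-lime) write
     * paths in tmLQCD; both return 0 on success and nonzero on failure. In the
     * real code these would wrap the actual Lemon and c-lime writers, whose
     * signatures are not reproduced here. */
    static int write_gauge_field_lemon(const char *filename) {
      (void)filename;
      return 1;  /* pretend the parallel write reported a short write */
    }

    static int write_gauge_field_lime(const char *filename) {
      (void)filename;
      return 0;  /* pretend the serial write succeeded */
    }

    /* Fallback pattern: try the fast parallel write first and only drop down to
     * the slower serial path when the MPI I/O layer reports a failure. */
    static int write_gauge_field_with_fallback(const char *filename) {
      if (write_gauge_field_lemon(filename) == 0)
        return 0;
      fprintf(stderr, "[IO] parallel write of %s failed, falling back to c-lime\n",
              filename);
      return write_gauge_field_lime(filename);
    }

    int main(void) {
      return write_gauge_field_with_fallback(".conf.tmp");
    }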

I did not copy the full output, but there is a whole bunch of such error messages and they seem to appear on every node. In addition, the error appears in the inverter, so in principle it is not linked to configuration writing... Lime does work, but with 48^3 it takes more time to write the propagators than to actually compute them.

Contributor

deuzeman commented Feb 18, 2013

From your report, it appears that the amount of data actually written does not match the requested amount. This may have several causes, and I'll need some additional information to determine what is going on. So first of all, I'd like to know a little more about the following...

  • Does this problem occur immediately, or after one or more successful writes?
  • Has a file actually been written when the problem occurs? If so, what does 'lemon_contents' (or 'lime_contents') show?
  • Does this problem occur when running on smaller lattices, or at least smaller local volumes?
  • If you have not seen any successful parallel writing operation, could you try to run the 'parallel' executable from the check directory? What does 'lemon_benchmark 48 5' produce?

In the meantime, I will prepare a version of the library with some debugging output in a dedicated branch. With some luck, running with that modified library will tell us what's going on.

Contributor

deuzeman commented Apr 8, 2013

Revision 1.1 of Lemon is now available from the main repository (etmc/lemon@95726a0). If there are integer overflow issues, this should help. There seem to be general I/O issues on the BG/Q (see #243), so your mileage may vary, but this code has been tested and can be considered stable.

Hi Albert,

I took your latest lemon version (1.1) and I am trying to get it running on the French BG/Q. There is an error when running hmc_tm (compiled with 4D parallelization):

Version 5.2.0, commit c1ea739105c28cef32efd49acc5e925005fc8f55
# The code is compiled with QPX intrinsics for Blue Gene/Q
# Compiled with BG/Q SPI communication
# The code is compiled with -D_GAUGE_COPY
# The code is compiled with -D_USE_HALFSPINOR
# the code is compiled for non-blocking MPI calls (spinor and gauge)
# the code is compiled with MPI IO / Lemon
# the code is compiled with openMP support
# Periodic boundary conditions are used
# The lattice size is 48 x 24 x 24 x 24
# The local lattice size is 12 x 6 x 12 x 12
# Even/odd preconditioning is used
# beta = 1.850000 , kappa= 0.139897
# boundary conditions for fermion fields (t,x,y,z) * pi: 1.000000 0.000000 0.000000 0.000000
# mu = 0.006000
# g_rgi_C0 = 3.648000, g_rgi_C1 = -0.331000
# Using relative precision for the inversions!
# Initialising rectangular gauge action stuff
# The lattice is correctly mapped by the index arrays

For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
#   Calculated            : A = 0x749b8361 B = 0x83121925.
Error -1 while reading gauge field from conf.0016
Aborting...
#   Read from LIME headers: A = 0x7ff6d7a6 B = 0xe2f48b53.
For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
Error -1 while reading gauge field from conf.0016
Aborting...

I think David got the same kind of error at some point (although with a previous lemon version) but I do not know
what the solution is. Shall I try compiling tmLQCD with 3D mapping instead?

Best,

Mariane

Oops... it seems copying/pasting the error messages on Firefox is not that great... At least with this font size, no problem reading the message...

Contributor

deuzeman commented Apr 10, 2013

it seems copying/pasting the error messages on Firefox is not that great...

The problem is the formatting that GitHub uses. If you precede a line with a hash, it formats that line as a level-1 header. If you need to quote blocks of output, you can format them verbatim by preceding each line with at least four spaces. It takes some getting used to. To preserve sanity on the page, I edited your message.

Contributor

deuzeman commented Apr 10, 2013

On the matter of the I/O failure, was this from a fresh start? I mean, did this run write out some configurations properly before? If so, it seems like the same problem we've been seeing in Juelich. That appears to be a hardware/firmware issue, rather than something we've messed up.

The problem with this version of the code is that we made any I/O failure fatal when using Lemon. But that doesn't necessarily help if there's nothing you can really do about the problem.

Would it be possible for you to update to the latest version of tmLQCD (c9c0fb4)? That version contains some logic (see pull request #249) to hopefully make these errors less disruptive. When it encounters an I/O failure, it puts the program to sleep for a few seconds to give the congestion (or whatever it is that is causing trouble) time to clear up. It then tries to write the configuration again, repeating this process up to five times. Hopefully that will be enough to keep running.
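
For illustration, the logic is roughly the following (a simplified sketch, not the actual tmLQCD code from #249; write_gauge_field is a hypothetical stand-in for the Lemon-based writer):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the Lemon-based gauge field writer; returns 0 on
     * success and nonzero when the I/O layer reports a failed or short write. */
    static int write_gauge_field(const char *filename) {
      (void)filename;
      return rand() % 2;  /* simulate an intermittently failing write */
    }

    /* Sketch of the retry logic described above: on failure, sleep for a few
     * seconds to give the congestion time to clear, then try again, giving up
     * after five attempts. */
    static int write_gauge_field_with_retries(const char *filename) {
      const int max_attempts = 5;
      const unsigned int delay_seconds = 5;
      for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (write_gauge_field(filename) == 0)
          return 0;
        fprintf(stderr, "[IO] writing %s failed (attempt %d of %d)\n",
                filename, attempt, max_attempts);
        if (attempt < max_attempts)
          sleep(delay_seconds);
      }
      return -1;  /* all attempts failed; let the caller decide what to do */
    }

    int main(void) {
      return write_gauge_field_with_retries(".conf.tmp");
    }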

Two things that might be important to note. First, these I/O issues are actually not exclusive to Lemon. We've also seen some failures with c-lime, but older versions of the code didn't do any checks on the I/O when c-lime was used. So switching back to serial I/O could appear to fix the issues, while in reality it was potentially just hiding them.

Second, for some reason the reading of the file seems to fail at least as often as the writing. I still find that a little weird, but it's what the tests show. It does mean, though, that what appear to be failed writes might actually be perfectly fine files.

Thanks a lot for your prompt answer. I think I have the latest version of tmLQCD (version 5.2.0, downloaded last
Friday and updated with Carsten's latest branch, via git checkout -b InterleavedNDTwistedClover carsten/InterleavedNDTwistedClover). I was actually starting configuration generation from a configuration given by Carsten. Trying from a fresh start solves the problem (well, at least partially, since in fact I wanted to take over a run already started by Carsten...). For the moment we can probably live with the problem, but if it appears systematically when reading on one BG/Q a file generated on another BG/Q, this might be more annoying.
I will contact David to see if I can make a test from a configuration generated in Cineca.

Contributor

deuzeman commented Apr 10, 2013

carsten/InterleavedNDTwistedClover

Ah, you're not running the master branch? Having had a look, that branch is not completely up-to-date, actually. The retry on the I/O has not been merged in there yet -- it's only very recently been added to the master. Are you using this branch specifically for the work Carsten did on it? In that case, either Carsten (or you, locally) will have to merge the latest version of etmc/master in again, or it should become part of etmc/master itself. What do you think is best at this moment, @urbach?

Trying from a fresh start solves the problem

Really? If I/O is successful at least part of the time, then having the retry available might solve most issues. We may have to tweak the delay a little, though. The five seconds it currently uses are rather arbitrary. But in this case, if the configuration that you now have systematically fails, perhaps it is in fact corrupt. We could calculate the checksum outside of tmLQCD and the BG/Q to make sure, but either way, maybe Carsten can provide you with an earlier one to resume the run from?

if it appears systematically when reading on one BG/Q a file generated on another BG/Q

If this is the case, then there is something more going on than intermittent hardware failure. That would be quite serious; we'd have to fix it ASAP, and you would be rightly annoyed! Let's hope it's not. :)

I will contact David to see if I can make a test from a configuration generated in Cineca.

That would be rather helpful, actually.

I came back to the master branch of tmLQCD but it does not help. The "error -1 while reading gauge field from conf.0016" still persists. It is weird since lemon still manages to read the XML headers, and running lime_contents on the configuration works (though I am not sure that guarantees lime can properly read all the binary data...), so maybe reading the input configuration with lime and then switching to lemon would be a way to avoid the problem (although that is not very satisfactory). BTW, when not using lemon at all but only lime, hmc_tm writes one configuration and then gets stuck.
I will let you know as soon as I have tried with a configuration generated on Fermi.

Contributor

deuzeman commented Apr 11, 2013

The "error -1 while reading gauge field from conf.0016" still persists.

If this is actually your starting configuration, the changes in the master branch wouldn't help; those are intended to keep the run alive when it's saving a new configuration. Although I suppose we could introduce multiple read attempts for the initial reading in, too. But you've already manually done that, so it's probably not the solution to this problem.

It is weird since lemon still manages to read the XML headers

The error messages actually tell us that it managed to read all of the file without problems. But the code subsequently calculates the SciDAC checksum on the data it just loaded into memory and compares it to the one stored in the XML record. Apparently those two don't match, so one has to assume that either the file has been corrupted, or the process of reading it in went wrong. If you still have the different error messages, could you check whether the calculated checksum it reports is the same in all cases?
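
In case it helps to check the file outside of tmLQCD and the BG/Q: below is a minimal sketch of how such a checksum can be recomputed, assuming the usual SciDAC/QIO convention (a per-site CRC32 folded into two accumulators rotated by the site index mod 29 and mod 31). The helper names are mine, the toy data in main is just there to show the call, and a real check would of course have to feed in the sites in the file's global ordering and byte order:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>  /* crc32(); link with -lz */

    /* Rotate a 32-bit word left by n bits. */
    static uint32_t rotl32(uint32_t x, unsigned int n) {
      n &= 31u;
      return n ? (uint32_t)((x << n) | (x >> (32u - n))) : x;
    }

    /* SciDAC-style checksum (as I understand the QIO convention): for each site,
     * take the CRC32 of that site's binary data and XOR it into two accumulators,
     * rotated by (site index mod 29) and (site index mod 31) respectively.
     * 'data' is assumed to hold the sites contiguously in global lexicographic
     * order, exactly as stored in the ildg-binary-data record. */
    static void scidac_checksum(const unsigned char *data, uint64_t nsites,
                                size_t bytes_per_site,
                                uint32_t *sum_a, uint32_t *sum_b) {
      *sum_a = 0u;
      *sum_b = 0u;
      for (uint64_t r = 0; r < nsites; ++r) {
        uint32_t crc = (uint32_t)crc32(0L, data + r * bytes_per_site,
                                       (uInt)bytes_per_site);
        *sum_a ^= rotl32(crc, (unsigned int)(r % 29u));
        *sum_b ^= rotl32(crc, (unsigned int)(r % 31u));
      }
    }

    int main(void) {
      /* Toy example with two dummy "sites", just to show the call. */
      unsigned char data[2][8];
      memset(data, 0xab, sizeof data);
      uint32_t a, b;
      scidac_checksum(&data[0][0], 2, 8, &a, &b);
      printf("Calculated: A = 0x%08x B = 0x%08x\n", a, b);
      /* A mismatch with the values in the scidac-checksum LIME record means the
       * data on disk is corrupt or was read back incorrectly. */
      return 0;
    }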

Also, could I somehow obtain this particular configuration?

and running lime_contents on the configuration works

lime_contents just lists the records, but doesn't perform any checks on their content. It does mean that the file is well-formed from the LIME file format point of view; it doesn't mean that lime would be able to read the binary data correctly.

so maybe reading the input configuration with lime and then switching to lemon would be a way to avoid the problem

The slower reading speed when using c-lime might mean that the system has an easier time reading the file. So if the error comes from reading failures due to the I/O system being overloaded, it might help. But if the reading failures are due to network interruptions or power fluctuations or anything of the sort, the longer time spent in I/O might even make runs with c-lime more likely to fail. It seems we're having problems using both in Juelich, so it may be a matter of picking your poison... :(

BTW, when not using lemon at all but only lime, hmc_tm writes one configuration and then gets stuck.

I think Bartek saw something similar, but it turned out hmc_tm wasn't really stuck. It was just that the I/O was that slow.

Hi Albert,
The last month was a bit hectic and I could only get back to the I/O problem yesterday; sorry for the delay...

I did several tests with lemon/lime and two versions of tmLQCD: tmLQCD from last December (tmLQCD_december) and the tmLQCD master version (from last month actually... I did not check whether there is a newer one). It turns out that only tmLQCD_december with lime works; all other tmLQCD/lime/lemon combinations give errors. I always took the same starting configuration, i.e. one generated on JuQueen and provided by Carsten (conf.0016). I used two versions of lemon: a version from last month (which I think is the most recent, hereafter denoted lemon_1) and the previous version (lemon_0). In more detail, I get:

  • with tmLQCD_december and lime: reading/writing OK, no problem
    the checksums computed and read from the header are the same:

     #   Calculated            : A = 0x7ff6d7a6 B = 0xe2f48b53. 
     #   Read from LIME headers: A = 0x7ff6d7a6 B = 0xe2f48b53. 
    
  • with tmLQCD_december and lemon_0:

     # Constructing LEMON reader for file conf.0016 ...
     found header xlf-info, will now read the message
     found header ildg-format, will now read the message
     found header ildg-binary-data, will now read the message
     [LEMON] Node 0 reports in lemonReadLatticeParallelMapped:
     Could not read the required amount of data.
    

with the last message appearing on all nodes.

  • with tmLQCD_december and lemon_1:

The input configuration is not read properly:

 # Trying to read gauge field from file conf.0016 in double precision.
 # Constructing LEMON reader for file conf.0016 ...
 found header xlf-info, will now read the message
 found header ildg-format, will now read the message
 found header ildg-binary-data, will now read the message
 # Time spent reading 382 Mb was 314 ms.
 # Reading speed: 1.22 Gb/s (19.0 Mb/s per MPI process).
 found header scidac-checksum, will now read the message
 # Scidac checksums for gaugefield conf.0016:
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 …
 #   Calculated            : A = 0x749b8361 B = 0x83121925.
 Error -1 while reading gauge field from conf.0016
 Aborting...

The calculated checksum is indeed not the right one (it should be A = 0x7ff6d7a6, B = 0xe2f48b53).

  • with tmLQCD_master and lime:

the input configuration is read correctly, the first produced conf is written in 12 s and then it stops, or is toooooo slooooowww, or I do not know what, but no configuration is written in the next 10 hours (then my job was killed)....:

 # Trajectory is accepted.
 # Writing gauge field to .conf.tmp.
 # Constructing LIME writer for file .conf.tmp for append = 0

  • with tmLQCD_master and lemon_1:

Same error as with the previous tmLQCD version: the input configuration is not read properly:

 # Trying to read gauge field from file conf.0016 in double precision.
 # Constructing LEMON reader for file conf.0016 ...
 found header xlf-info, will now read the message
 found header ildg-format, will now read the message
 found header ildg-binary-data, will now read the message
 # Time spent reading 382 Mb was 235 ms.
 # Reading speed: 1.63 Gb/s (25.4 Mb/s per MPI process).
 found header scidac-checksum, will now read the message
 # Scidac checksums for gaugefield conf.0016:
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 For gauge file conf.0016, calculated and stored values for SciDAC checksum A do not match.
 #   Calculated            : A = 0x749b8361 B = 0x83121925.
 Error -1 while reading gauge field from conf.0016
 Aborting...

The wrong checksum is always the same.

The conclusion is that I cannot produce 2+2 configurations from a starting conf.

Last test: David sent me a configuration produced on Fermi and it cannot be read either. So I think the problem
is not with the input configuration.

Do you know how to solve these problems (or at least one of them: having the tmLQCD master working with either lime only or lemon only would already be great)? I am annoyed because at the moment I cannot be of any help in generating configurations...

Cheers,

Mariane

Owner

kostrzewa commented Jun 4, 2013

Dear Mariane,

the input configuration is read correctly, the first produced conf is written in 12 s and then it stops, or is toooooo slooooowww, or I do not know what, but no configuration is written in the next 10 hours (then my job was killed)....

I can confirm this behaviour with LIME. On JuQueen at least, reading and writing usually work for me, but I've had cases where the process just locked up and sat there for hours.

I've had some other problems with LEMON sometimes reporting wrong checksums, but this hasn't occurred for quite a while now. I'll send the commit hash to the mailing list.

Owner

kostrzewa commented Apr 24, 2017

I guess we can close this.

@kostrzewa kostrzewa closed this Apr 24, 2017
