JuQueen I/O test conclusions #243

Open
kostrzewa opened this Issue Mar 14, 2013 · 81 comments

Owner

kostrzewa commented Mar 14, 2013

This issue will collect conclusions from the I/O test.

Owner

kostrzewa commented Mar 14, 2013

@urbach, @deuzeman

I ran the tests on the pra073 allocation, using about 0.5 RD in total (probably a little less than that). My tests haven't fully completed, but after reading and writing about a PB of data I haven't had a single failure with either LEMON or LIME. Note that I used the unmodified LEMON version, which still had the potential integer overflow bug! (I did this on purpose to have a baseline.)

The tests involve 9 test configurations, each of which is first read 5 times in a row, then read once more, written to disk, read back, and compared.

I tested:

L48T96

512 ranks, hybrid, LEMON
32768 ranks, pure MPI, LEMON
512 ranks, hybrid, LIME
32768 ranks, pure MPI, LIME - aborted; this was just too slow, but there were no problems in the 40 minutes it ran for

L64T128

1024 ranks, hybrid, LEMON
65536 ranks, pure MPI, LEMON
1024 ranks, hybrid, LIME

I have to say that the GPFS is really impressive. It might use some sort of very large RAM cache because even with 3 jobs reading and writing concurrently I get speeds of several GB/s! This, however, might also be a bit of a limitation of this test. Maybe it would be worthwhile to introduce random waiting periods to make sure that the cache has been written to disk.
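
For reference, the core of one read/write/compare cycle described above boils down to something like the stand-alone sketch below. This is not the actual test code (test_io.c); it uses plain MPI-IO and a toy byte sum in place of the real SciDAC checksum, just to illustrate the structure:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* toy checksum standing in for the SciDAC CRC pair */
static unsigned long byte_sum(const char *buf, size_t n) {
  unsigned long s = 0;
  for (size_t i = 0; i < n; ++i) s += (unsigned char)buf[i];
  return s;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1 << 20;                    /* 1 MiB per rank */
  char *out = malloc(n), *in = malloc(n);
  memset(out, rank & 0xff, n);
  MPI_Offset off = (MPI_Offset)rank * n;
  MPI_File fh;

  /* collective write: each rank writes its slice at its own offset */
  MPI_File_open(MPI_COMM_WORLD, "io_test.bin",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_File_write_at_all(fh, off, out, n, MPI_BYTE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  /* collective reread of the same slice, then compare checksums */
  MPI_File_open(MPI_COMM_WORLD, "io_test.bin",
                MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_File_read_at_all(fh, off, in, n, MPI_BYTE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  if (byte_sum(out, n) != byte_sum(in, n))
    fprintf(stderr, "rank %d: checksum mismatch after reread\n", rank);

  free(out);
  free(in);
  MPI_Finalize();
  return 0;
}

The real test of course goes through the LEMON/LIME readers and writers and the proper SciDAC checksums, but the write/reread/compare pattern is the same.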

Owner

kostrzewa commented Mar 14, 2013

@urbach For the LIME run it takes a very long time (~180 s or more for 48x96) to write configurations. Could the lock-up you saw have merely appeared to be a lock-up because of this slowness?

Contributor

urbach commented Mar 14, 2013

hmm, I don't know actually. I tried now with the new lemon version and so far it seems to be working.

Contributor

deuzeman commented Mar 14, 2013

That's sounding very promising! Any indication of the performance difference between the hybrid and pure MPI codes?

As an aside, I just got an email from David about their issues. It seems he sent it just to me, so I think it would be good to share this.

From his description:

The point is that we have got IO errors when using lemon and a 24^3x48 lattice
in Fermi (BGQ) with 512 mpi processes and 64 openmp threads.
I don't know if that is expected or not, but I thought the bug you were trying
to catch happens only when the local lattice is too large...
Indeed, that was what we had seen up to now: this type of error occurred when
running a 48^3x96 lattice in the same (or double) partition.

And the associated error message:

WARNING, writeout of .conf.tmp returned no error, but verification discovered errors.
For gauge file .conf.tmp, calculated and stored values for SciDAC checksum A do not match.
Calculated : A = 0xbc7d4996 B = 0x00f2cc1a.
Read from LIME headers: A = 0xefb54a75 B = 0xcd6124fc.
Potential disk or MPI I/O error. Aborting...

So it seems the old "lemon" bug is back. Have either of you seen this?

Contributor

deuzeman commented Mar 14, 2013

Ah, and do we know if there was a firmware upgrade during the last maintenance cycle? Could it be that this still has to be done at Fermi? Wishful thinking here...

Owner

kostrzewa commented Mar 14, 2013

That's a very large partition to run a 24^3x48 on..

  • I guess I'll try a smaller volume too just to be on the safe side.
  • It would also be wise to repeat the test at CINECA.
  • I will also try the test on one rack but with the L48T96 volume.
  • Finally, these were all runs with 4D parallelization. Maybe the fact that they are using 3D is the culprit?

That's sounding very promising! Any indication of the performance difference between the hybrid and pure MPI codes?

The performance difference is roughly a factor of two with the hybrid code being faster. Hybrid LIME performance is actually not that terrible (about a factor of 4-6 I guess)

So it seems the old "lemon" bug is back. Have either of you seen this?

Hmm... one of the test configurations that I used (the hot start test of Dxx in gregorio's ARCH, configuration 1192) had a mismatch, but this was probably caused during the writing of that configuration.

Ah, and do we know if there was a firmware upgrade during the last maintenance cycle? Could it be that this still has to be done at Fermi? Wishful thinking here...

I'm sure good records are kept of what exactly changed during the various upgrades the machine has undergone so far.

Owner

kostrzewa commented Mar 14, 2013

Hybrid LIME performance is actually not that terrible (about a factor of 4-6 I guess)

actually, no, the writing performance is absolutely abysmal! (a factor of 80 or so...)

Owner

kostrzewa commented Mar 14, 2013

An interesting measurement: how the performance scales from midplane to rack:

midplane

Reading gauge field conf.1199 for reread test. Iteration 1, reread 4
# Constructing LEMON reader for file conf.1199 ...
# Time spent reading 6.12 Gb was 2.52 s.
# Reading speed: 2.42 Gb/s (4.73 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199:
#   Calculated            : A = 0x87166557 B = 0xabe0e0e9.
#   Read from LIME headers: A = 0x87166557 B = 0xabe0e0e9.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.

# Writing gauge field to conf.1199.copy. Iteration 1, reread 4
# Constructing LEMON writer for file conf.1199.copy for append = 0
# Time spent writing 6.12 Gb was 1.70 s.
# Writing speed: 3.60 Gb/s (7.03 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
#   Calculated            : A = 0x87166557 B = 0xabe0e0e9.
# Write completed, verifying write...
# Constructing LEMON reader for file conf.1199.copy ...
# Time spent reading 6.12 Gb was 1.83 s.
# Reading speed: 3.34 Gb/s (6.52 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
#   Calculated            : A = 0x87166557 B = 0xabe0e0e9.
#   Read from LIME headers: A = 0x87166557 B = 0xabe0e0e9.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Write successfully verified.

rack

Reading gauge field conf.1199 for reread test. Iteration 1, reread 4
# Constructing LEMON reader for file conf.1199 ...
# Time spent reading 19.3 Gb was 3.61 s.
# Reading speed: 5.35 Gb/s (5.22 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199:
#   Calculated            : A = 0xce02a1a2 B = 0x96879c6f.
#   Read from LIME headers: A = 0xce02a1a2 B = 0x96879c6f.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.

# Writing gauge field to conf.1199.copy. Iteration 1, reread 4
# Constructing LEMON writer for file conf.1199.copy for append = 0
# Time spent writing 19.3 Gb was 2.88 s.
# Writing speed: 6.70 Gb/s (6.55 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
#   Calculated            : A = 0xce02a1a2 B = 0x96879c6f.
# Write completed, verifying write...
# Constructing LEMON reader for file conf.1199.copy ...
# Time spent reading 19.3 Gb was 1.33 s.
# Reading speed: 14.5 Gb/s (14.2 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
#   Calculated            : A = 0xce02a1a2 B = 0x96879c6f.
#   Read from LIME headers: A = 0xce02a1a2 B = 0x96879c6f.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Write successfully verified.
Owner

kostrzewa commented Mar 14, 2013

Hmm... I think I really need to add a delay there or something, because the reread is so much faster... I think there's a strong possibility we're reading from some cache... What's on disk might be completely borked.
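
Something like the sketch below would do as a first stab at that delay (not in the test code yet; the maximum delay is an arbitrary choice, and srand() is assumed to be seeded once at startup, e.g. with the time plus the MPI rank):

#include <stdlib.h>
#include <unistd.h>

/* sleep for a random number of seconds between writing a configuration
   and rereading it, to make it less likely we read straight from the cache */
static void random_pause(unsigned int max_seconds) {
  sleep((unsigned int)rand() % (max_seconds + 1));
}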

Owner

kostrzewa commented Mar 14, 2013

Any indication of the performance difference between the hybrid and pure MPI codes?

The performance difference is roughly a factor of two with the hybrid code being faster.

In terms of total runtime it's more like a factor of 4-5 though!

hybrid: 00:38:20
pure MPI: 02:37:39

Contributor

urbach commented Mar 15, 2013

I've now been running successfully with lemon since yesterday. IO checks are enabled and no problem has occurred so far. The time for a write of a 48^3x96 lattice is 2.5 seconds (the lime write, when it worked, took 195 seconds) (all on a 512-node partition).

I'm not sure on which firmware/software version FERMI runs. JUQUEEN is on V1R2M0. Judging from the uptime and kernel version of the login node on FERMI, I'd guess that FERMI is on a stoneage version...

Contributor

deuzeman commented Mar 15, 2013

To make sure: did we test what happened when reading a Lemon written ensemble with lime and vice versa? Or at least when reading in an older ensemble with the new version?

The reason I'm asking is that I had to do some manual manipulation of the block sizes. I'm a little worried that the I/O might be perfectly reversible (since writing and reading use the same data layout), but that the data layout within the file is permuted with respect to the ILDG definition. It would be rather bad to discover this a few months from now...

Owner

kostrzewa commented Mar 15, 2013

To make sure: did we test what happened when reading a Lemon written ensemble with lime and vice versa? Or at least when reading in an older ensemble with the new version?

The ensembles here were written with LEMON and I read them with both LIME and LEMON. I didn't test the converse.

As for the new version, I will test this later today, thanks for the heads up.

Contributor

deuzeman commented Mar 15, 2013

The ensembles here were written with LEMON and I read them with both LIME and LEMON. I didn't test the converse.

Thanks! I think that already covers the problem, actually. But it can't hurt to be thorough :).

Owner

kostrzewa commented Mar 15, 2013

Hmm. I just managed to crash a midplane on reading with LEMON... and another one just now.. it didn't even manage to begin reading..

I did rewrite this on top of the smearing branch now and I'm using the buffers framework.

Contributor

deuzeman commented Mar 15, 2013

Really? That's worrying and weird. The only point of interference would be the definition of g_gf, rather than the previous underlying buffer. But I fail to see how that could cause the problem. Any error messages?

Owner

kostrzewa commented Mar 15, 2013

No, it just locked up in two different places... I'll investigate some more. The scheduler was subsequently unable to kill the job and I guess the midplane was rebooted.

Owner

kostrzewa commented Mar 15, 2013

Oh, looky here in the MOTD:

*******************************************************************************
* Friday 15.3.13 13:36 GPFS read(!) access to /work hangs, write access is o.k.
*                - BG/Q jobs may abort due to IO errors
*                - BG/Q jobs may get stuck in REMOVE PENDING, when being cancelled
*                - front-end processes hang when reading files
*  The situation is expected to be solved by Saturday morning 16.3.13.
*******************************************************************************

Not the fault of the smearing codebase then! Maybe my pummelling of the I/O subsystem yesterday was problematic after all.

Contributor

deuzeman commented Mar 15, 2013

Ahuh... Doesn't seem to make a huge amount of sense, but I guess we shouldn't be too worried then?

Owner

kostrzewa commented Mar 15, 2013

Ahuh... Doesn't seem to make a huge amount of sense, but I guess we shouldn't be too worried then?

Well, sure. This means, though, that unless they fix it we can't really run without risking wasted computing time...

Contributor

deuzeman commented Mar 15, 2013

True, but it should only be half a day then.

Owner

kostrzewa commented Mar 15, 2013

By the way, this is what I've come up with as a filesystem test for now:

https://github.com/kostrzewa/tmLQCD/blob/IO_test/test_io.c

Contributor

deuzeman commented Mar 15, 2013

Neat! I've already pointed David at this discussion and I think this would be a good tool to diagnose the issues at Fermi. It's a very handy piece of code to have lying around going forward, too.

palao commented Mar 15, 2013

On Friday, March 15, 2013, 06:53:08, Albert Deuzeman wrote:

Neat! I've already pointed David at this discussion and I think this would be
a good tool to diagnose the issues at Fermi. It's a very handy piece of code
to have lying around going forward, too.


Hi,
Thanks for the code! I'll run it on Fermi and let you know.
Best,

David

Owner

kostrzewa commented Mar 17, 2013

Dear David,

thanks for running the test on FERMI. Please note that I've just updated the code and added a little bit of documentation (README.test_io), which should get you up and running. Just fetch my "IO_test" branch from GitHub.

Cheers!

Contributor

urbach commented Mar 17, 2013

for the record: on SuperMUC lemon appears to work and is 8 times or so faster than lime on a 512-node partition.

Owner

kostrzewa commented Mar 18, 2013

True, but it should only be half a day then.

still broken... :) They actually even disabled logins now...

Owner

kostrzewa commented Mar 20, 2013

So I ran the test for very small local volumes (24^3x48) on a whole rack and very large local volumes (96^3x192) on one midplane and there are no failures on JuQueen. I think we can trust LEMON to do the right thing! I will update my LEMON branch now and run the standard test again just to make sure that there is no problem due to the update that was done.

Contributor

deuzeman commented Mar 20, 2013

Excellent! I'll wait for that final test, then we can officially push the new version of the library.

Contributor

urbach commented Mar 21, 2013

sorry folks, bad news. For me the following occurred on 1024 nodes on juqueen with @deuzeman's lemon branch:

in conf.0132 the following checksum is stored:

<scidacChecksum>
  <version>1.0</version>
  <suma>8d8fd1f6</suma>
  <sumb>59dc0fbb</sumb>
</scidacChecksum>

in the log file I find the following:

# Scidac checksums for gaugefield .conf.tmp:
#   Calculated            : A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Write completed, verifying write...
# Constructing LEMON reader for file .conf.tmp ...
found header xlf-info, will now read the message
found header ildg-format, will now read the message
found header ildg-binary-data, will now read the message
# Time spent reading 6.12 Gb was 465 ms.
# Reading speed: 13.2 Gb/s (12.8 Mb/s per MPI process).
found header scidac-checksum, will now read the message
# Scidac checksums for gaugefield .conf.tmp:
#   Calculated            : A = 0x8d8fd1f6 B = 0x59dc0fbb.
#   Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.  
# Write successfully verified.
# Renaming .conf.tmp to conf.0132.

so far, so good. Now the newly started job:

#   Calculated            : A = 0x67f40d51 B = 0x34000ed4.
#   Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.

To me this looks like a problem of the filesystem?! But, and that is actually really bad, it would mean lemon-written configurations are not safe?!

Owner

kostrzewa commented Mar 21, 2013

There was a core-dump yesterday around the time you started the new job, maybe that's the culprit?

Contributor

urbach commented Mar 21, 2013

addendum: calculating the checksum on a cluster here in Bonn on the configuration gives:

# Scidac checksums for gaugefield conf.0132:
#   Calculated            : A = 0x8d8fd1f6 B = 0x59dc0fbb.
#   Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Reading ildg-format record:

which is correct. So, the file seems not to be corrupted, which is good news!

Contributor

urbach commented Mar 21, 2013

core dump is probably from the job being aborted due to the time limit...

Owner

kostrzewa commented Mar 21, 2013

core dump is probably from the job being aborted due to the wrong checksum, I guess...

Hmm, yes indeed that would make sense. Good news regarding the correctness though!

Contributor

urbach commented Mar 21, 2013

sorry, I was too fast, the previous job ran into the time limit and core dumped...

Contributor

deuzeman commented Mar 21, 2013

sorry, I was too fast, the previous job ran into the time limit and core dumped...

I was just wondering about that -- the code should exit gracefully when it finds an I/O error.

Anyway, this would be the first time I've heard of a reading malfunction... But another piece of good news is that the current error detection mechanism does in fact pick up on something like this. Perhaps we want to attempt some kind of recovery mechanism, though? Like sending the code to sleep for a short time and then attempting to read the file again? If there is an intermittent hardware (?) failure, the current behaviour gives the impression that the code is utterly broken somehow.

Contributor

urbach commented Mar 21, 2013

Or a stupid bug in the checksum computation?

Contributor

deuzeman commented Mar 21, 2013

Or a stupid bug in the checksum computation?

Potentially, I guess. But why would it show up only now?

Owner

kostrzewa commented Mar 21, 2013

Okay, so my test_io also reports a wrong checksum for this gauge configuration:

# For gauge file test_conf.0000, calculated and stored values for SciDAC checksum A do not match.
#   Calculated            : A = 0xd81214a7 B = 0x69b5ea37.
# For gauge file test_conf.0000, calculated and stored values for SciDAC checksum A do not match.
#   Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
Contributor

urbach commented Mar 21, 2013

What?? How is that possible? Does your test programme also compute the plaquette value?

Owner

kostrzewa commented Mar 21, 2013

And it reports the same wrong checksum for each read of this configuration. This might be a problem with the checksum computation after all?

What?? How is that possible? Does your test programme also compute the plaquette value?

It's on the to-do list to add that as an additional error situation. Shouldn't take more than half an hour to add and verify I guess. (going for lunch soon though so it will be around 13:30....)

Contributor

urbach commented Mar 21, 2013

but your checksum is different from the one computed in my run. I guess the parallelisation is different, isn't it? And if you take any of the other gauges available for that run, does the read fail as well? What does lime say?

Owner

kostrzewa commented Mar 21, 2013

It just read it correctly...

Reading gauge field test_conf.0000. Iteration 0
# Constructing LEMON reader for file test_conf.0000 ...
# Time spent reading 6.12 Gb was 1.89 s.
# Reading speed: 3.23 Gb/s (6.31 Mb/s per MPI process).
# Scidac checksums for gaugefield test_conf.0000:
#   Calculated            : A = 0x8d8fd1f6 B = 0x59dc0fbb.
#   Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
Contributor

urbach commented Mar 21, 2013

okay, so from the plaquette value we could learn whether it's the reading itself that is buggy, or the checksum computation.

Owner

kostrzewa commented Mar 21, 2013

but your checksum is different from the one computed in my run.

I know, that's really perplexing

I guess the parallelisation is different, isn't it?

midplane, EABCDT mapping

And if you take any of the other gauges available for that run, does the read fail as well?

In the current test_io run it read all four gauges correctly and wrote them back correctly. It's still in the first iteration of the test and it will reread them again in a minute or so.

What does lime say?

I've just recompiled the LIME test but I haven't run it yet.

Owner

kostrzewa commented Mar 21, 2013

okay, so from the plaquette value we could learn whether it's the reading itself that is buggy, or the checksum computation.

Yes, I'll have it ready after lunch, but we'll have to hope for a reading failure. I'll limit the number of reread tests to 1 so that it does as many reads as possible.

Owner

kostrzewa commented Mar 21, 2013

Is there a convenience function to extract the xlf-info at runtime to get at the stored plaquette value?
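
If not, something as simple as the following string parse would probably do (a sketch only; it assumes the xlf-info message contains a line of the form "plaquette = <value>", which I haven't double-checked, and parse_stored_plaquette is just a name I made up):

#include <stdio.h>
#include <string.h>

/* pull the stored plaquette out of an xlf-info message string;
   returns 0 on success, -1 if no plaquette entry was found */
static int parse_stored_plaquette(const char *xlf_info, double *plaq) {
  const char *p = strstr(xlf_info, "plaquette");
  if (p == NULL || sscanf(p, "plaquette = %lf", plaq) != 1)
    return -1;
  return 0;
}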

Owner

kostrzewa commented Mar 21, 2013

The LIME test seems to read them just fine, but I don't think this means anything given the way this problem seems to show up. Working on the plaquette stuff now...

Owner

kostrzewa commented Mar 21, 2013

Ah, I just finished the new version of the test code and now JuQueen has gone down again... fun!

Contributor

urbach commented Mar 21, 2013

oh yea, emergency boot... Sounds scary...!

Contributor

urbach commented Mar 21, 2013

JuQueen is back online. Let's hope it was a GPFS problem after all...!?

Owner

kostrzewa commented Mar 21, 2013

A hardware failure is seeming more and more likely... the scheduler seems to have crashed just now or something. Oh, actually no, it's back... weird!

Owner

kostrzewa commented Mar 21, 2013

Okay I'm having some weird problems which I can't really investigate further today. I'll follow up tomorrow.

Owner

kostrzewa commented Mar 22, 2013

Hmm... so the problem I'm seeing is that upon reading the gauge configuration, I get the wrong plaquette value (compared to the one saved in the xlf-info header) but the correct checksum... I have a feeling this has to do with basing the test on the smearing branch and some problems therewith on BG/Q. I will test this by reading the same configuration with a different program and computing the plaquette.

Contributor

deuzeman commented Mar 22, 2013

There were some changes to the plaquette calculation routine in the smearing branch to allow for the use of gauge_field_t, but it's a little surprising that this would fail on BG/Q. I'll need to have a good look what's causing this, because similar issues might crop up in other parts of the code that are less easily checked.

Owner

kostrzewa commented Mar 22, 2013

Well, I don't know whether that's really the reason because it works fine on Intel. I generate configurations with a "non-smearing" hmc_tm and read them with the smearing-based "test_io" just fine.

Contributor

deuzeman commented Mar 22, 2013

Yes, this would be the first time that I would see issues with the routine, so it would have to be a very specific BG/Q or parallelization related issue.

It's still strange to see the checksum match and the plaquette be wrong, though. My first suspicion would still be some issue with the serialization. Then again, since the checksum and plaquette are calculated from the same block of memory, why would there be a difference if the content of that memory is identical? It would have to be some race condition in the plaquette calculation (thread safety an issue, perhaps?) or some problem that the checksum happens to not pick up on. There wouldn't be anything glaring like spatial permutations that it's not sensitive to, would there?

Owner

kostrzewa commented Mar 22, 2013

Oh, I know what's going on... in a specific branch Carsten and I pushed a change to remove g_update_gauge_energy and g_update_rectangle_energy, and I was working under the false impression that this had already made its way into the code-base. I'm surprised I don't get the correct value for the first read, though... I will change this and try again!

Owner

kostrzewa commented Mar 22, 2013

I was therefore also saving the wrong plaquette in the xlf-info string in many instances!

Owner

kostrzewa commented Mar 22, 2013

Hmm.. actually, I guess for every remap this was set correctly, so this can't be the bug... darn. I'm trying anyway to see what happens.

Contributor

deuzeman commented Mar 22, 2013

Ah, that makes sense. But this does represent an incompatibility between the master branch and the current smearing implementation, right? Nothing too serious, but I'd need to go in and fix that.

Owner

kostrzewa commented Mar 22, 2013

Ah, that makes sense. But this does represent an incompatibility between the master branch and the current smearing implementation, right? Nothing too serious, but I'd need to go in and fix that.

No, it's not in the master branch. The change is to be pulled in with the EM splitting of the gauge action for nucleon gluon momentum calculations once that has been tested. (urbach/GluonMoment)

Owner

kostrzewa commented Mar 22, 2013

Hmm.. actually, I guess for every remap this was set correctly, so this can't be the bug... darn. I'm trying anyway to see what happens.

I left out some gauge exchange, I think that's the problem and would also explain why it works without MPI and why the checksum is correct.

Owner

kostrzewa commented Mar 22, 2013

Hmm.. actually, I guess for every remap this was set correctly, so this can't be the bug... darn. I'm trying anyway to see what happens.

I left out some gauge exchange, I think that's the problem and would also explain why it works without MPI and why the checksum is correct.

Yeah, that was the reason; the plaquette is computed correctly now. I should have guessed it, since the plaquette was computed but was consistently about 15% too low...

Contributor

urbach commented Mar 22, 2013

so, are the checksum and plaquette correct now?

Owner

kostrzewa commented Mar 22, 2013

so, are the checksum and plaquette correct now?

Well, the test certainly works now, and in the job that I ran before lunch no failures were reported. However, as we've seen above, there can be two consecutive runs in which the same configuration is first read incorrectly and then, subsequently, correctly. We need to wait for a failure to show up in the test that is queued right now to confirm whether or not the plaquette is correct even though the checksum is wrong, so that we could say, for instance, that the checksum computation is to blame.

What should we do? Continue the simulation and confirm externally that the configurations are correct or should we do some more testing? There was a LoadLeveler crash this morning so I don't know how well the problems are under control now. Also, nodes keep failing and reappearing...

In addition, I would be in favour of slowly adding fail-safety to the code now (as we all know, this will become more and more important as machines grow). Albert's suggestion of waiting a few seconds and then trying again is certainly a good first step. The same goes for writing: if a write verification fails, wait and verify again; if it fails again, write again, verify again, and also compare against the old checksum. If the new and old checksums differ, something serious must have happened and the job should abort.
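
Roughly what I have in mind for the write side (just a sketch; write_gauge, verify_gauge, checksum_equal and fatal_error are hypothetical stand-ins for the existing routines, and the sleep length is arbitrary):

#include <unistd.h>
#include <stdint.h>

typedef struct { uint32_t suma, sumb; } scidac_checksum_t;                /* hypothetical */

extern void write_gauge(const char *fn, scidac_checksum_t *cs);           /* hypothetical: write + compute checksum */
extern int  verify_gauge(const char *fn, const scidac_checksum_t *cs);    /* hypothetical: reread, nonzero if checksums match */
extern int  checksum_equal(const scidac_checksum_t *a, const scidac_checksum_t *b);
extern void fatal_error(const char *msg);                                 /* hypothetical abort wrapper */

int failsafe_write(const char *fn, const scidac_checksum_t *old_cs) {
  scidac_checksum_t written;

  write_gauge(fn, &written);
  if (verify_gauge(fn, &written)) return 0;     /* verification OK */

  sleep(10);                                    /* wait, then verify again */
  if (verify_gauge(fn, &written)) return 0;

  write_gauge(fn, &written);                    /* write again, verify again */
  if (verify_gauge(fn, &written)) return 0;

  /* still failing: compare the freshly computed checksum with the old one;
     if they differ, something serious happened in memory -> abort */
  if (!checksum_equal(&written, old_cs))
    fatal_error("gauge field checksum changed between write attempts");
  return -1;
}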

Contributor

urbach commented Mar 22, 2013

I would continue the simulation now and see how it goes...!

Owner

kostrzewa commented Mar 23, 2013

Okay, so I've prepared everything to continue the simulation, I'll start it now.

Contributor

urbach commented Mar 23, 2013

good! Would you share the directory such that I can also have a look from time to time?

Owner

kostrzewa commented Mar 23, 2013

My directories should be accessible to pra073, $WORK/runs/nf2/iwa.../, the jobscript and input file are in the corresponding directory in my $HOME

Owner

kostrzewa commented Mar 23, 2013

Hmm.. job abort... I must have gotten the directories wrong. Oh, user name pra07309

Owner

kostrzewa commented Mar 23, 2013

okay, running now, silly me

Owner

kostrzewa commented Mar 25, 2013

Just had another lemon failure... I'll try to see if a second reading will be correct. If so, I think we really need the failsafe features... otherwise we'll have to deal with many unnecessary job aborts. Also, the LEMON code produces far too much output in this case... I know that potentially certain processes could crash and output therefore not be displayed... but it would be nice to not have to dig through 1024 lines of output...

Owner

kostrzewa commented Mar 25, 2013

Hmm okay, so this was fortuitous, because the trajectory was not accepted and the previous configuration was therefore the same. Comparing the two, they have the same checksum written into the header. I'll check whether I can now successfully continue.

Contributor

deuzeman commented Mar 25, 2013

otherwise we'll have to deal with many unnecessary job aborts

I completely agree. The original idea behind this draconian measure was that we'd want to fix whatever was causing problems if they ever occurred. But it seems there are many occasions where we are just dealing with a temporary hardware failure and there isn't really much one can do except wait a little and try again. Aborting the whole job is simply counterproductive here.

While running tmLQCD on our local cluster here in Bern, both with c-lime and LEMON, I actually get quite a few I/O failures. It probably has to do with the file system not quite being up to the task, though interestingly LEMON is actually more stable than c-lime. But still, if I don't turn I/O checks off, the code aborts after about 20 or 30 trajectories on average. That just makes things unusable, especially if all I can really do is restart the job.

My proposal is that we replace the current check with a simple retry loop with a counter. If a write/read cycle fails, we increase the counter and put the code to sleep for some period of time -- say five or ten seconds. That way, there is time for any I/O spikes to clear up. Only when this fails, say, five times in a row do we abort as we do currently -- in that case, the I/O issues are apparently truly severe.
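
Concretely, something along these lines (a sketch only; read_gauge_checked is a hypothetical stand-in for the existing read-plus-checksum-verification path, and the retry count and sleep time are just the numbers from above, nothing tuned):

#include <unistd.h>

#define MAX_IO_RETRIES 5
#define IO_RETRY_SLEEP 10   /* seconds */

/* hypothetical: reads the file and returns 0 if the checksum verifies */
extern int read_gauge_checked(const char *filename);

int read_gauge_with_retries(const char *filename) {
  for (int attempt = 0; attempt < MAX_IO_RETRIES; ++attempt) {
    if (read_gauge_checked(filename) == 0)
      return 0;                    /* success */
    sleep(IO_RETRY_SLEEP);         /* give a transient I/O spike time to clear up */
  }
  return -1;                       /* persistent failure: abort as we do now */
}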

Contributor

deuzeman commented Mar 25, 2013

Comparing the two, they have the same checksum written into the header.

I'm actually quite surprised that it seems to be more or less consistently the reading that is breaking down. I'd expect the issues to occur for writing, especially since the speed differential between the two is actually rather small.

Owner

kostrzewa commented Mar 30, 2013

Grr, I'm getting annoyed now. Yesterday evening a job failed with a reading error. The previous configuration was verified successfully (conf.0228, for those with access to the PRACE account on JuQueen), but attempting to run the job from this configuration onwards now seems to fail consistently... I've tried twice and will try again...

Owner

kostrzewa commented Mar 30, 2013

So to continue the job I've backtracked two trajectories and have continued from conf.0227 without problems. I've backed up the faulty conf.0228 so we can check it later on.

Contributor

deuzeman commented Apr 3, 2013

During the last phone conference, I promised to send around a mail on the status of Lemon. While it's clear we're still having some I/O issues, I have the impression that none of them are related to the recent changes in Lemon. Would everybody agree that I declare it stable?

Owner

kostrzewa commented Apr 3, 2013

I agree, I have been using Carsten's build of tmLQCD and this AFAIK uses the LEMON version with your integer fix. The problems on BG/Q are problems of BG/Q. It is clear, however, that we need to add some more fault-tolerance machinery to all parts of the program suite.


Owner

kostrzewa commented May 6, 2013

@urbach

sorry folks, bad news. For me the following occurred on 1024 nodes on juqueen with @deuzeman's lemon branch:

incidentally, are you sure you compiled the newest version of Lemon (on your prace account)? I was seeing FS trouble in Zeuthen with the newest version for anything other than 1D parallelization. I then tested on BG/Q and it didn't work at all (i.e. never). Albert and I therefore concluded that the new LEMON version had not actually been tested at all.
