BG/Q HMC Lemon IO crash, 16x4 #149

Closed
kostrzewa opened this Issue · 3 comments

@kostrzewa
Owner

I'm seeing intermittent crashes during LEMON writer construction, so far only with the 16x4 parallelization. They do not occur on every run; when they do, the output looks like this:

# Writing gauge field to .conf.tmp.
# Constructing LEMON writer for file .conf.tmp for append = 0
Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1478): MPI_Bcast(buf=0x1fbfffb064, count=1, MPI_INT, root=-1076188541, comm=0x84000007) failed
PMPI_Bcast(1440): Invalid root (value given was -1076188541)
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: terminated by signal 6
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 0

I haven't investigated the cause at all yet, but I'm posting this issue in case someone else comes across the problem.

@urbach
Owner

Still a problem?

@kostrzewa
Owner

It hasn't happened since. Since this was at the very beginning of testing on BG/Q, it might have had to do with the incomplete state of IO at the time.

@deuzeman
Owner

I've had a look at the code to see where this could originate, but it seems the MPI_Bcast is indeed happening somewhere in the I/O implementation. The root number is obviously nonsense, so this does look like a configuration issue. I would vote to change this from a bug to something like Won't fix...
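
For anyone who runs into this again, a guard along the following lines could help narrow down where the bad root comes from. This is only a sketch: `root_is_valid` and `check_bcast_root` are hypothetical helpers, not part of LEMON or tmLQCD, and the assumption is that the caller already knows the communicator size (for example from `MPI_Comm_size`). A broadcast root has to be a rank in the range 0 to size - 1, so the value in the log above can never be valid, whatever the partition size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper (not from LEMON or tmLQCD): a root rank handed to a
 * collective such as MPI_Bcast must satisfy 0 <= root < comm_size. */
static bool root_is_valid(int root, int comm_size)
{
    return comm_size > 0 && root >= 0 && root < comm_size;
}

/* Hypothetical wrapper: returns 0 if the root is usable, -1 otherwise.
 * On failure it reports the call site so the bad value can be traced
 * before the MPI library aborts the job. */
static int check_bcast_root(int root, int comm_size, const char *where)
{
    assert(where != NULL);
    if (!root_is_valid(root, comm_size)) {
        fprintf(stderr, "%s: invalid broadcast root %d (communicator size %d)\n",
                where, root, comm_size);
        return -1;
    }
    return 0;
}
```

Calling `check_bcast_root(root, size, "writer construction")` just before the broadcast would turn the abort into a message naming the call site. A negative root that changes from run to run may point to an uninitialized variable, but that is a guess and not something confirmed here.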

@deuzeman deuzeman closed this