BG/Q HMC Lemon IO crash, 16x4 #149

kostrzewa opened this Issue Sep 11, 2012 · 3 comments



kostrzewa commented Sep 11, 2012

I'm seeing weird crashes during LEMON writer construction, but so far only with the 16x4 parallelization. They do not occur in every run...

# Writing gauge field to .conf.tmp.
# Constructing LEMON writer for file .conf.tmp for append = 0
Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1478): MPI_Bcast(buf=0x1fbfffb064, count=1, MPI_INT, root=-1076188541, comm=0x84000007) failed
PMPI_Bcast(1440): Invalid root (value given was -1076188541)
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: terminated by signal 6
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 0

I haven't investigated what could be causing this, but I'm posting the issue in case someone else comes across the problem.


urbach commented Feb 12, 2013

Still a problem?


kostrzewa commented Feb 13, 2013

It hasn't happened since. Since this was at the very beginning of testing on BG/Q, it might have had to do with the incomplete state of IO at the time.


deuzeman commented Feb 14, 2013

I've had a look at the code to see where this could originate, but it seems the MPI_Bcast is indeed happening somewhere in the I/O implementation. The root number is obviously nonsense, so this does look like a configuration issue. I would vote to change this from a bug to something like Won't fix...
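For anyone who hits this again: MPI requires the root of a broadcast to be a rank inside the communicator, i.e. 0 <= root < size, which is why root=-1076188541 is rejected with "Invalid root". A minimal sketch of a guard that could be placed in front of a suspect broadcast is shown here. Both function names are hypothetical and are not part of LEMON or tmLQCD, and the sketch assumes the caller already obtained the communicator size (for example from MPI_Comm_size).

```c
#include <stdio.h>

/* Hypothetical helper, not part of LEMON or tmLQCD: returns 1 if `root`
   is a legal root rank for a collective on a communicator with
   `comm_size` ranks, and 0 otherwise. */
int bcast_root_is_valid(int root, int comm_size)
{
    return comm_size > 0 && root >= 0 && root < comm_size;
}

/* Hypothetical wrapper: prints a diagnostic and returns -1 for a bad
   root, or returns 0 when the root is usable. A real caller would go on
   to call MPI_Bcast only when this returns 0. */
int check_bcast_root(int root, int comm_size, const char *where)
{
    if (!bcast_root_is_valid(root, comm_size)) {
        fprintf(stderr,
                "%s: invalid broadcast root %d (communicator size %d)\n",
                where, root, comm_size);
        return -1;
    }
    return 0;
}
```

A check like this does not fix whatever leaves the root uninitialized, but it would name the call site in the output instead of leaving only the abort from inside PMPI_Bcast.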

@deuzeman deuzeman closed this Apr 9, 2013
