BG/Q HMC Lemon IO crash, 16x4 #149

Closed
kostrzewa opened this Issue Sep 11, 2012 · 3 comments

Owner

kostrzewa commented Sep 11, 2012

I'm seeing weird crashes during LEMON writer construction, but so far only with the 16x4 parallelization. They do not occur on every run...

# Writing gauge field to .conf.tmp.
# Constructing LEMON writer for file .conf.tmp for append = 0
Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1478): MPI_Bcast(buf=0x1fbfffb064, count=1, MPI_INT, root=-1076188541, comm=0x84000007) failed
PMPI_Bcast(1440): Invalid root (value given was -1076188541)
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: terminated by signal 6
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 0

I haven't investigated what could be causing this, but I'm posting the issue in case someone else comes across the problem.

Contributor

urbach commented Feb 12, 2013

Still a problem?

Owner

kostrzewa commented Feb 13, 2013

It hasn't happened since. Since this was at the very beginning of testing on BG/Q, it might have had to do with the incomplete state of IO at the time.

Contributor

deuzeman commented Feb 14, 2013

I've had a look at the code to see where this could originate, but it seems the MPI_Bcast is indeed happening somewhere in the I/O implementation. The root number is obviously nonsense, so this does look like a configuration issue. I would vote to change this from a bug to something like Won't fix...
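For anyone who hits this again: on an ordinary (intra-)communicator, the root passed to MPI_Bcast has to be a rank in the range 0 to size-1, which is why the value -1076188541 from the log is rejected. Below is a minimal sketch of a sanity check that could sit in front of a broadcast to report where a bad root comes from. The helper names `root_is_valid` and `check_bcast_root` are hypothetical and are not part of LEMON or tmLQCD; the sketch deliberately avoids MPI calls so that it compiles on its own.

```c
#include <stdio.h>

/* Hypothetical helper (not part of LEMON or tmLQCD): returns 1 if `root`
 * is a usable broadcast root for a communicator with `comm_size` ranks,
 * i.e. 0 <= root < comm_size, and 0 otherwise. */
static int root_is_valid(int root, int comm_size)
{
    return root >= 0 && root < comm_size;
}

/* Hypothetical wrapper: prints a diagnostic naming the call site when the
 * root is out of range. Returns 0 if the root is usable, -1 otherwise. */
static int check_bcast_root(const char *where, int root, int comm_size)
{
    if (!root_is_valid(root, comm_size)) {
        fprintf(stderr,
                "%s: invalid broadcast root %d (communicator has %d ranks)\n",
                where, root, comm_size);
        return -1;
    }
    return 0;
}
```

In a real MPI code, `comm_size` would come from MPI_Comm_size on the communicator in question, and the check would run just before the MPI_Bcast call, so a failure points at the caller rather than at the MPI library.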

@deuzeman deuzeman closed this Apr 9, 2013
