I'm seeing weird crashes during LEMON writer construction, but so far only with the 16x4 parallelization. It does not occur for every single run...
# Writing gauge field to .conf.tmp.
# Constructing LEMON writer for file .conf.tmp for append = 0
Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1478): MPI_Bcast(buf=0x1fbfffb064, count=1, MPI_INT, root=-1076188541, comm=0x84000007) failed
PMPI_Bcast(1440): Invalid root (value given was -1076188541)
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: terminated by signal 6
2012-09-11 15:52:37.265 (WARN ) [0x400011a8b10] :81834:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 0
I haven't investigated what could be causing this yet, but I'm posting the issue in case someone else comes across the problem.
Still a problem?
It hasn't happened since. Since this was at the very beginning of testing on BG/Q, it might have had to do with the incomplete state of IO at the time.
I've had a look at the code to see where this could originate, and it seems the MPI_Bcast is indeed happening somewhere in the I/O implementation. The root value (-1076188541) is obviously nonsense, since a valid root has to be a rank between 0 and the communicator size minus one, so this does look like a configuration issue. I would vote to change this from a bug to something like Won't fix...
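If this ever comes back, a cheap guard in front of the broadcast would at least turn the abort into a readable message. The following is only a sketch of that idea, not code from LEMON or from this project: `bcast_root_is_valid` and `check_bcast_root` are hypothetical names, and the communicator size is passed in as a plain integer so the check can be shown without MPI. In real code the size would come from `MPI_Comm_size`, and the `MPI_Bcast` call would follow only when the check passes.

```c
#include <stdio.h>

/* Hypothetical helper (not part of LEMON or MPI): returns 1 if `root` is a
 * valid rank for a communicator holding `comm_size` ranks, 0 otherwise. */
int bcast_root_is_valid(int root, int comm_size)
{
    return comm_size > 0 && root >= 0 && root < comm_size;
}

/* Hypothetical wrapper: prints a diagnostic and returns nonzero when the
 * root is out of range, so the caller can bail out cleanly before the
 * broadcast instead of letting the MPI library abort the job. */
int check_bcast_root(const char *where, int root, int comm_size)
{
    if (!bcast_root_is_valid(root, comm_size)) {
        fprintf(stderr, "%s: invalid broadcast root %d (communicator size %d)\n",
                where, root, comm_size);
        return 1;
    }
    return 0;
}
```

With the value from the log above, `check_bcast_root("writer construction", -1076188541, size)` returns 1 for any communicator size, while a root of 0 passes whenever the communicator has at least one rank. This only reports the symptom; it does not explain where the bad root comes from.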