Restart issues in 2.4.0rc2 #56

bbarker · 2015-04-13T13:36:41Z

Restart tests are failing on my CentOS 7.1 system with 2.4.0rc2; the failures don't seem to occur with 2.3.1. In both cases I used the default ./configure (no options), and I've included config.log and config.status here.
Additionally, the core files and checkpoint images are available.

Verifying there is enough disk space ...
== Tests ==
dmtcp1         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17746 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:
***** Copied checkpoint images to /tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/dmtcp-autotest-505920597
FAILED
               root-pids: [17773] msg: restart error, 1 expected, 0 found, running=0
dmtcp2         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp3         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17833 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:PASSED; ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17883 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:PASSED
dmtcp4         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp5         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.18128 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:
***** Copied checkpoint images to /tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/dmtcp-autotest-505920597
FAILED
               root-pids: [18164] msg: restart error, 2 expected, 1 found, running=0
syscall-tester ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
file1          ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcpaware1    ckpt:PASSED rstr:
FAILED (first process rec'd signal 11) (core.18229 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:y



^Cdmtcpaware1    FAILED
               root-pids: [] msg: failed to write 's' to coordinator (pid: 17740)
CLEANUP ERROR: failed to write 'k' to coordinator (pid: 17740)
SHUTDOWN() failed
make: *** [check] Error 1

bbarker · 2015-04-13T14:49:29Z

I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment; after getting rid of these and rebuilding the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).

The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.

gc00 · 2015-04-14T02:25:37Z

Hi Brandon,
Thanks very much for your report. Right now, we don't have direct access
to a CentOS 7 system. We've been testing on CentOS 6 and Red Hat 6 so far.
Is there a chance that you could provide a guest account?
In the meantime, some information that might help us is:

First, from the root directory of DMTCP, could you send us the output of:
    make display-build-env
Second, could you build DMTCP as follows:
    ./configure CFLAGS="-g -O0" CXXFLAGS="-g -O0"
    make -j
and then:
    make tidy
    ulimit -c unlimited
    bin/dmtcp_launch -i6 test/dmtcp1
    bin/dmtcp_restart ckpt_dmtcp1_*.dmtcp
    [ This should generate a core dump ]
    gdb test/dmtcp1 core*
    (gdb) thread apply all bt full
and send us the output from GDB.

Thanks very much,

Gene

gc00 · 2015-04-14T20:51:46Z

Hi all,
Just a quick status report on DMTCP with CentOS 7. So far what I'm seeing
is that restart is failing when we have multiple threads or processes (via fork).
So, it's failing for the dmtcp3 and dmtcp5 tests.
Interestingly, although dmtcp_restart fails, if I run it under gdb,
it starts up fine:
gdb --args dmtcp_restart ...
Typically this means that there's something about the memory layout,
and gdb forces a different memory layout (or initializes memory differently).
I looked at the memory layout of the dmtcp3 process. Here's what I'm
seeing: vsyscall is on the last of the mapped pages.
Kapil, I think you may have said something about that being significant.
Let me know if there are some other tests that you'd like me to run.
[ I'm editing out most of the memory map below. There's a simpler analysis below. - @gc00 ]

[dmtcp@euca-128-84-11-199 dmtcp]$ cat /proc/22243/maps
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00600000-00601000 r--p 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00601000-00602000 rw-p 00001000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
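
For anyone who wants to compare layouts before checkpoint and after restart, here is a minimal C++ sketch that parses lines in the /proc/<pid>/maps format shown above and pulls out the address range and the trailing label ([heap], [stack], [vdso], [vsyscall]). The names Region, parse_maps_line and read_maps are made up for this illustration and are not part of DMTCP. It also assumes labels and paths contain no spaces, which holds for the map above but not in general.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One parsed line of /proc/<pid>/maps (hypothetical helper type).
struct Region {
    std::uint64_t start = 0;
    std::uint64_t end = 0;
    std::string perms;
    std::string label;  // "[heap]", "[vdso]", a file path, or empty
};

// Parse a line such as
//   "7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]".
// Returns false if the line does not have the expected layout.
bool parse_maps_line(const std::string& line, Region& out) {
    std::istringstream in(line);
    std::string range, offset, dev, inode;
    if (!(in >> range >> out.perms >> offset >> dev >> inode)) return false;
    const std::size_t dash = range.find('-');
    if (dash == std::string::npos) return false;
    try {
        out.start = std::stoull(range.substr(0, dash), nullptr, 16);
        out.end = std::stoull(range.substr(dash + 1), nullptr, 16);
    } catch (const std::exception&) {
        return false;  // address fields were not valid hex
    }
    out.label.clear();
    in >> out.label;  // optional trailing field; stays empty if absent
    return true;
}

// Read every parsable region from a maps file (default: this process).
std::vector<Region> read_maps(const std::string& path = "/proc/self/maps") {
    std::vector<Region> regions;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        Region r;
        if (parse_maps_line(line, r)) regions.push_back(r);
    }
    return regions;
}
```

Calling read_maps() once with ASLR on and once with it off, and diffing the [heap], [stack] and [vdso] ranges, gives a quick view of which regions actually move between runs.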

gc00 · 2015-04-14T22:47:31Z

And I can now report that when address randomization is turned off,
DMTCP works. I set:
sudo bash -c 'echo 0 > /proc/sys/kernel/randomize_va_space'
and that makes DMTCP work.

Probably the effect of GDB is to turn off the randomization.
Next, I have to make it work again, with address randomization.
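
A minimal sketch for checking the two knobs involved here, assuming a Linux system: the system-wide setting in /proc/sys/kernel/randomize_va_space, and the per-process ADDR_NO_RANDOMIZE personality flag. As far as I know, GDB sets that flag by default for programs it launches (its disable-randomization setting), which would match the guess above. The function names are invented for illustration and are not part of DMTCP.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <sys/personality.h>

// System-wide ASLR level: 0 = off, 1 = partial, 2 = full.
// Returns -1 if the file cannot be read or parsed.
int system_aslr_level(
    const std::string& path = "/proc/sys/kernel/randomize_va_space") {
    std::ifstream f(path);
    int level = -1;
    if (!(f >> level)) return -1;
    return level;
}

// True if the calling process runs with ADDR_NO_RANDOMIZE set.
// personality(0xffffffff) only queries the current persona; it
// does not change it.
bool aslr_disabled_for_this_process() {
    const int persona = personality(0xffffffff);
    return persona != -1 && (persona & ADDR_NO_RANDOMIZE) != 0;
}
```

If aslr_disabled_for_this_process() returns true only when the program is started from GDB, that supports the explanation that GDB, and not the debugger's memory initialization, is what changes the layout.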

Best,

Gene

karya0 · 2015-04-15T04:09:39Z

@bbarker: Can you manually checkpoint/restart the test and provide us the output as @gc00 suggested above?

karya0 · 2015-04-15T04:10:24Z

@gc00: Now that you can reproduce the bug, can you provide the output of dmtcp_restart ckpt_dmtcp1*.dmtcp?

gc00 · 2015-04-15T08:40:16Z

Hi all, Since dmtcp-2.3.1 was running correctly, I ran 'git bisect' on this, in the CentOS 7 distro. I found the bug, and a simple one-line bug fix. For a one line fix, I won't do a pull request, but I'd like to analyze the implications of this fix here, and together we'll decide if that is enough.
The bug was introduced in: 1a7d8db
In that commit, mtcp_check_vdso() was turned off by default. ./configure CFLAGS=-DENABLE_VDSO_CHECK turns that function on again, and then the bug goes away.
@karya0: The #ifdef was your code. What is your preference? Shall we permanently remove the #ifdef ENABLE_VDSO_CHECK, and add a comment that CentOS7 requires mtcp_check_vdso()? Or do you want to leave the #ifdef in the code, and add a #define? I would suggest removing the #ifdef and adding a comment.
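
To make the two options concrete, here is a small C++ sketch of how a call guarded by #ifdef ENABLE_VDSO_CHECK behaves. This is not the DMTCP source: check_vdso_stub and restart_prologue are hypothetical stand-ins, and the real mtcp_check_vdso() does actual work that is not shown in this thread.

```cpp
#include <cassert>

// Hypothetical stand-in for the real check; it only counts calls.
static int vdso_checks_run = 0;
void check_vdso_stub() { ++vdso_checks_run; }

// Returns true if the guarded call was compiled in.
// Built with -DENABLE_VDSO_CHECK the stub runs; built without
// the flag, the call is removed by the preprocessor.
bool restart_prologue() {
#ifdef ENABLE_VDSO_CHECK
    check_vdso_stub();
    return true;
#else
    return false;
#endif
}
```

Removing the #ifdef makes the call unconditional, while keeping it and adding a #define leaves a switch in the code that a later build could turn off again by accident.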

gc00 · 2015-04-15T10:49:24Z

P.S.: For ./configure CFLAGS=-DENABLE_VDSO_CHECK to work for you above, you will first have to do: git pull --rebase. Another one of those annoying bugs that crept in while we were preparing for the release (and this one was my fault).

gc00 · 2015-04-15T10:57:07Z

And there appears to be one remaining bug exposed in CentOS 7. The dlopen1 test is failing. I'll look into it.

gc00 · 2015-04-15T11:03:39Z

And I meant to write to do: git pull --rebase before ./configure CFLAGS=-DENABLE_VDSO_CHECK. Time to get some more rest before I become too incoherent.

karya0 · 2015-04-15T21:55:38Z

@gc00: The bug isn't quite related to ASLR. Apparently, if the variable type is "char*", JNOTE tries to automatically dereference the variable and print it as a string. However, when restoring the heap, curBrk was declared with type "char*" and points to the current (unmapped) heap. As expected, JNOTE tried to dereference curBrk and segfaulted.

Turning ASLR off hid the bug: the current and original values of brk() were then the same, so no JNOTE was called and the invalid memory area was never dereferenced.
I'll shortly push the fix to github.

I think I also understand the dlopen bug and have created a separate issue (#57) to track it.

gc00 · 2015-04-15T22:51:31Z

> I'll shortly push the fix to github.
> I think I also understand the dlopen bug and ....

Very efficient! Thanks.

During restart, we try to restore the heap as it existed at the time of ckpt. A JNOTE was trying to print the original and current values of brk. The variable holding the current brk (curBrk) was of type 'char*' and so JNOTE tried to print the corresponding string on the console. Since the current heap had been unmapped by mtcp_restart, accessing "curBrk" generated a segfault. The fix is to change the type of curBrk to "void*", thus forcing JNOTE to print the value of curBrk instead of dereferencing curBrk itself.
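
The char* versus void* difference can be reproduced with plain C++ streams. This sketch assumes JNOTE forwards its arguments to an ostream-style operator<< (the macro itself is not shown in this thread); the helper names are invented for the example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Streaming a char* prints the C string it points to, so the
// stream reads the pointed-to memory. If that memory is unmapped,
// this is where the segfault happens.
std::string format_as_string(const char* p) {
    std::ostringstream out;
    out << p;
    return out.str();
}

// Streaming the same pointer as void* prints only the address;
// the pointed-to memory is never touched.
std::string format_as_address(const char* p) {
    std::ostringstream out;
    out << static_cast<const void*>(p);
    return out.str();
}
```

With a valid string both calls are safe and simply print different things. With a pointer into an unmapped heap, only format_as_address is safe, which is the effect of changing curBrk to void*.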

gc00 · 2015-04-16T21:22:23Z

@karya0 and @jiajuncao: The bug fix by @karya0 definitely fixes DMTCP on CentOS 7. However, @jiajuncao has also been seeing a random bug on restart at Stampede (with MVAPICH). He can take the same checkpoint image and restart many times. A bug appears about 20% of the time. From the core image, we see that memory is corrupted on restart (but only 20% of the time).
We then tested at Stampede/MVAPICH by including the function mtcp_check_vdso(). During about 15 or 20 tests, we did not observe the bug on restart.
I propose to remove the #ifdef ENABLE_VDSO_CHECK and to change the corresponding comment to: // If mtcp_check_vdso isn't called, CentOS 7 fails on dmtcp3, dmtcp5, others
Do you agree, @karya0 ? Thanks.

gc00 · 2015-04-16T22:12:34Z

@karya0: I've now created pull request #60 to fix this issue.

karya0 · 2015-04-17T05:09:01Z

Before we enable mtcp_check_vdso, let's verify the underlying cause of the bug. Which memory addresses are causing the segfault? I am sure there is a different fix that doesn't involve vDSO. Note that vDSO and ASLR are two separate issues, and part of the reason I didn't like mtcp_check_vdso is that it tries to handle both. vDSO handling should be done strictly by the newer code, and we should create an mtcp_check_aslr to handle ASLR. But before we do any of that, let's verify the problem and the faulty addresses.

gc00 · 2015-04-17T09:04:39Z

@karya0: @jiajuncao and @rohgarg both have accounts at Stampede, and have both observed this bug there. They were using code that included the commit a4d67bf . They'll be the best ones to examine this newer version of the bug with you (the version that has only been observed on Stampede so far).

karya0 closed this as completed in 91a7fdb Apr 15, 2015

gc00 reopened this Apr 16, 2015

gc00 mentioned this issue Apr 16, 2015

Fix random restart bug; seen in MVAPICH @ Stampede #60

Closed

rohgarg closed this as completed Apr 17, 2015

rohgarg reopened this Apr 17, 2015

gc00 mentioned this issue Apr 17, 2015

Failing on --enable-m32 in CentOS 7 and CentOS 6.6 #61

Closed

gc00 modified the milestone: 2.4.0 release Apr 26, 2015

bbarker mentioned this issue May 8, 2015

tcsh test on a centos 7.1 system (dup of #104) #98

Open

karya0 modified the milestones: 2.4.0 release, 2.5.0 Jul 30, 2015
