Restart issues in 2.4.0rc2 #56

Open
bbarker opened this issue Apr 13, 2015 · 16 comments
@bbarker

bbarker commented Apr 13, 2015

I'm having issues with restarts passing on my CentOS 7.1 system; these issues don't seem to be present in 2.3.1; in both cases, I've used the default ./configure (no options), but I've included the config.log and config.status here.
Additionally the .cores and checkpoints are available.

Verifying there is enough disk space ...
== Tests ==
dmtcp1         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17746 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:
***** Copied checkpoint images to /tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/dmtcp-autotest-505920597
FAILED
               root-pids: [17773] msg: restart error, 1 expected, 0 found, running=0
dmtcp2         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp3         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17833 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:PASSED; ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17883 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:PASSED
dmtcp4         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp5         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.18128 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:
***** Copied checkpoint images to /tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/dmtcp-autotest-505920597
FAILED
               root-pids: [18164] msg: restart error, 2 expected, 1 found, running=0
syscall-tester ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
file1          ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcpaware1    ckpt:PASSED rstr:
FAILED (first process rec'd signal 11) (core.18229 copied to DMTCP_TMPDIR:/tmp/dmtcp-brandon@dhcp-rhodes-736.redrover.cornell.edu/) retry:y



^Cdmtcpaware1    FAILED
               root-pids: [] msg: failed to write 's' to coordinator (pid: 17740)
CLEANUP ERROR: failed to write 'k' to coordinator (pid: 17740)
SHUTDOWN() failed
make: *** [check] Error 1
@bbarker
Author

bbarker commented Apr 13, 2015

I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment; after getting rid of these and rebuilding the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).

The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.

@gc00
Contributor

gc00 commented Apr 14, 2015

Hi Brandon,
Thanks very much for your report. Right now, we don't have direct access
to a CentOS 7 system. We've been testing on CentOS 6 and Red Hat 6 so far.
Is there a chance that you could provide a guest account?
In the meantime, some information that might help us is:

First, from the root directory of DMTCP, could you send us the output of:
make display-build-env
Second, could you build DMTCP as follows:
./configure CFLAGS="-g -O0" CXXFLAGS="-g -O0"
make -j
and then:
make tidy
ulimit -c unlimited
bin/dmtcp_launch -i6 test/dmtcp1
bin/dmtcp_restart ckpt_dmtcp1_*.dmtcp
[ This should generate a core dump ]
gdb test/dmtcp1 core*
(gdb) thread apply all bt full
and send us the output from GDB.

Thanks very much,

  • Gene


@gc00
Contributor

gc00 commented Apr 14, 2015

Hi all,
Just a quick status report on DMTCP with CentOS 7. So far what I'm seeing
is that restart is failing when we have multiple threads or processes (via fork).
So, it's failing for the dmtcp3 and dmtcp5 tests.
Interestingly, although dmtcp_restart fails, if I run it under gdb,
it starts up fine:
gdb --args dmtcp_restart ...
Typically this means that there's something about the memory layout,
and gdb forces a different memory layout (or initializes memory differently).
I looked at the memory layout of the process when running dmtcp3.
What stands out is that vsyscall is the last of the mapped pages.
Kapil, I think you may have said something about that being significant.
Let me know if there are some other tests that you'd like me to run.
[ I'm editing out most of the memory map below. There's a simpler analysis below. - @gc00 ]

[dmtcp@euca-128-84-11-199 dmtcp]$ cat /proc/22243/maps
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00600000-00601000 r--p 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00601000-00602000 rw-p 00001000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
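
For anyone who wants to pull the same special regions out of a maps listing programmatically, here is a minimal C++ sketch. It is not DMTCP code: `MapEntry`, `parseMapsLine`, and `printSpecialRegions` are illustrative names, and the parser assumes the usual layout of /proc/<pid>/maps (address range, permissions, offset, device, inode, optional label). It keeps only the first token of the label field.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// One parsed entry of /proc/<pid>/maps (field names are illustrative).
struct MapEntry {
  std::uint64_t start = 0;
  std::uint64_t end = 0;
  std::string perms;
  std::string label;  // e.g. "[heap]", "[vdso]", or a file path; may be empty
};

// Parse a single maps line such as
// "7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]".
// Returns false if the line does not have the expected shape.
bool parseMapsLine(const std::string& line, MapEntry& out) {
  std::istringstream in(line);
  std::string range, offset, dev, inode;
  if (!(in >> range >> out.perms >> offset >> dev >> inode)) return false;
  const std::string::size_type dash = range.find('-');
  if (dash == std::string::npos) return false;
  try {
    out.start = std::stoull(range.substr(0, dash), nullptr, 16);
    out.end = std::stoull(range.substr(dash + 1), nullptr, 16);
  } catch (const std::exception&) {
    return false;  // address fields were not valid hexadecimal
  }
  out.label.clear();
  in >> out.label;  // optional trailing field; stays empty if absent
  return true;
}

// Print the bracketed regions ([heap], [stack], [vdso], ...) of this process.
void printSpecialRegions(std::ostream& os) {
  std::ifstream maps("/proc/self/maps");
  std::string line;
  MapEntry e;
  while (std::getline(maps, line)) {
    if (parseMapsLine(line, e) && !e.label.empty() && e.label[0] == '[') {
      os << e.label << " " << std::hex << e.start << "-" << e.end
         << std::dec << "\n";
    }
  }
}
```

Calling `printSpecialRegions(std::cout)` in two separate runs and comparing the output is a quick way to see which regions move between runs.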

@gc00
Contributor

gc00 commented Apr 14, 2015

And I can now report that when address randomization is turned off,
DMTCP works. I set:
sudo bash -c 'echo 0 > /proc/sys/kernel/randomize_va_space'
and that makes DMTCP work.

Probably the effect of GDB is to turn off the randomization.
Next, I have to make it work again, with address randomization.

Best,

  • Gene
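
GDB's `disable-randomization` setting is on by default, which is consistent with the guess above. A rough C++ sketch for checking both the system-wide setting and the per-process flag follows. The function names are made up for illustration and none of this is DMTCP code. It assumes Linux with glibc, and it uses the `personality(0xffffffff)` idiom to query the current persona without changing it.

```cpp
#include <fstream>
#include <string>
#include <sys/personality.h>

// Map the documented values of /proc/sys/kernel/randomize_va_space to text.
std::string describeAslrSetting(int value) {
  switch (value) {
    case 0: return "disabled";
    case 1: return "stack, mmap base and vdso randomized";
    case 2: return "full (also randomizes the heap/brk)";
    default: return "unknown";
  }
}

// Read the system-wide setting; returns -1 if the file cannot be read.
int readSystemAslrSetting() {
  std::ifstream f("/proc/sys/kernel/randomize_va_space");
  int v = -1;
  if (!(f >> v)) return -1;
  return v;
}

// True if this process runs with ADDR_NO_RANDOMIZE set, which is the flag
// a parent (for example a debugger) can set before exec to disable ASLR
// for one program only.
bool randomizationDisabledForThisProcess() {
  const int persona = personality(0xffffffff);  // query only, no change
  return persona != -1 && (persona & ADDR_NO_RANDOMIZE) != 0;
}
```

Printing `describeAslrSetting(readSystemAslrSetting())` together with `randomizationDisabledForThisProcess()` from inside and outside GDB should show whether the two environments really differ in this respect.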


@karya0
Member

karya0 commented Apr 15, 2015

@bbarker: Can you manually checkpoint/restart the test and provide us the output as @gc00 suggested above?

@karya0
Member

karya0 commented Apr 15, 2015

@gc00: Now that you can reproduce the bug, can you provide the output of dmtcp_restart ckpt_dmtcp1*.dmtcp?

@gc00
Contributor

gc00 commented Apr 15, 2015

Hi all, Since dmtcp-2.3.1 was running correctly, I ran 'git bisect' on this, in the CentOS 7 distro. I found the bug, and a simple one-line bug fix. For a one line fix, I won't do a pull request, but I'd like to analyze the implications of this fix here, and together we'll decide if that is enough.
The bug was introduced in: 1a7d8db
In that commit, mtcp_check_vdso() was turned off by default. ./configure CFLAGS=-DENABLE_VDSO_CHECK turns that function on again, and then the bug goes away.
@karya0: The #ifdef was your code. What is your preference? Shall we permanently remove the #ifdef ENABLE_VDSO_CHECK, and add a comment that CentOS7 requires mtcp_check_vdso()? Or do you want to leave the #ifdef in the code, and add a #define? I would suggest removing the #ifdef and adding a comment.
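
To make the two options concrete, here is a small C++ sketch of how a compile-time guard of this kind behaves. It is only an illustration: `placeholderVdsoCheck` and `restartSteps` are hypothetical names, and the real mtcp_check_vdso() is not reproduced here. Only the macro name ENABLE_VDSO_CHECK comes from the discussion above.

```cpp
#include <string>

// Hypothetical placeholder standing in for the real check.
void placeholderVdsoCheck(std::string& log) { log += "vdso-check;"; }

// Record the steps in order so the effect of the macro is visible.
std::string restartSteps() {
  std::string log = "start;";
#ifdef ENABLE_VDSO_CHECK
  // Compiled in only when the build passes -DENABLE_VDSO_CHECK.
  placeholderVdsoCheck(log);
#endif
  log += "restore;";
  return log;
}
```

Built without the flag, `restartSteps()` skips the check entirely. Removing the `#ifdef`/`#endif` pair makes the call unconditional, which is the option suggested above.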

@gc00
Contributor

gc00 commented Apr 15, 2015

P.S.: For ./configure CFLAGS=-DENABLE_VDSO_CHECK to work for you above, you will first have to do: git pull --rebase. Another one of those annoying bugs that crept in while we were preparing for the release (and this one was my fault).

@gc00
Contributor

gc00 commented Apr 15, 2015

And there appears to be one remaining bug exposed in CentOS 7. The dlopen1 test is failing. I'll look into it.

@gc00
Contributor

gc00 commented Apr 15, 2015

And I meant to write: do git pull --rebase before ./configure CFLAGS=-DENABLE_VDSO_CHECK. Time to get some more rest before I become too incoherent.

@karya0
Member

karya0 commented Apr 15, 2015

@gc00: The bug isn't quite related to ASLR. Apparently, if the variable type is "char*", JNOTE tries to automatically dereference the variable and print it as a string. However, when restoring the heap, curBrk was declared with type "char*" and points to the current (unmapped) heap. So JNOTE tried to dereference curBrk and segfaulted.

Turning ASLR off hid the bug: the current and original values of brk() were then the same, so JNOTE was never called and the invalid memory area was never dereferenced.
I'll shortly push the fix to github.

I think I also understand the dlopen bug and have created a separate issue (#57) to track it.
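
The reason ASLR mattered can be sketched in a few lines of C++. This is an illustration under assumptions, not the DMTCP restart code: `currentBrk` and `brkMismatchNote` are hypothetical helpers, and the note is built only when the saved and current break differ, mirroring the conditional logging described above.

```cpp
#include <sstream>
#include <string>
#include <unistd.h>

// Current program break, as reported by sbrk(0).
void* currentBrk() { return sbrk(0); }

// Build a diagnostic only if the break moved since it was saved.
// With ASLR off the two values match, so this path (and any bug in it)
// is never exercised.
std::string brkMismatchNote(const void* savedBrk, const void* nowBrk) {
  if (savedBrk == nowBrk) return "";
  std::ostringstream msg;
  // Streaming a void* prints the address and never reads the memory.
  msg << "brk changed: saved=" << savedBrk << " current=" << nowBrk;
  return msg.str();
}
```

A faulty statement inside the `savedBrk != nowBrk` branch stays invisible on any system where the break happens to be restored to the same address.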

@karya0 karya0 closed this as completed in 91a7fdb Apr 15, 2015
@gc00
Contributor

gc00 commented Apr 15, 2015

> I'll shortly push the fix to github.
> I think I also understand the dlopen bug and ....
Very efficient! Thanks.

karya0 added a commit that referenced this issue Apr 16, 2015
During restart, we try to restore the heap as it existed at the time of
ckpt.  A JNOTE was trying to print the original and current values of
brk. The variable holding the current brk (curBrk) was of type 'char*'
and so JNOTE tried to print the corresponding string on the console.
Since the current heap had been unmapped by mtcp_restart, accessing
"curBrk" generated a segfault. The fix is to change the type of
curBrk to "void*", thus forcing JNOTE to print the value of curBrk
instead of dereferencing curBrk itself.
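
The overload behaviour this commit message relies on can be seen in isolation with standard C++ streams. The sketch below assumes JNOTE forwards its arguments to a stream-style formatter, as the description suggests. `formatAsCharPtr` and `formatAsVoidPtr` are illustrative names, not part of DMTCP.

```cpp
#include <sstream>
#include <string>

// Streaming a char*: the stream treats it as a NUL-terminated string and
// reads the bytes it points to. If the memory is unmapped, this faults.
std::string formatAsCharPtr(const char* p) {
  std::ostringstream out;
  out << p;
  return out.str();
}

// Streaming the same pointer as void*: only the address is formatted;
// the pointed-to memory is never touched.
std::string formatAsVoidPtr(const void* p) {
  std::ostringstream out;
  out << p;
  return out.str();
}
```

Passing an address that is not mapped to `formatAsVoidPtr` is safe, while passing it to `formatAsCharPtr` is undefined behaviour and typically ends in signal 11, matching the failures reported at the top of this issue.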
@gc00
Contributor

gc00 commented Apr 16, 2015

@karya0 and @jiajuncao: The bug fix by @karya0 definitely fixes DMTCP on CentOS 7. However, @jiajuncao has also been seeing a random bug on restart at Stampede (with MVAPICH). He can take the same checkpoint image and restart many times. A bug appears about 20% of the time. From the core image, we see that memory is corrupted on restart (but only 20% of the time).
We then tested at Stampede/MVAPICH by including the function mtcp_check_vdso(). During about 15 or 20 tests, we did not observe the bug on restart.
I propose to remove the #ifdef ENABLE_VDSO_CHECK and to change the corresponding comment to: // If mtcp_check_vdso isn't called, CentOS 7 fails on dmtcp3, dmtcp5, others
Do you agree, @karya0 ? Thanks.

@gc00
Contributor

gc00 commented Apr 16, 2015

@karya0: I've now created pull request #60 to fix this issue.

@karya0
Member

karya0 commented Apr 17, 2015

Before we enable mtcp_check_vdso, let's verify the underlying cause of the bug. Which memory addresses are causing the segfault? I am sure there is a different fix that doesn't involve vDSO. Note that vDSO and ASLR are two separate issues, and part of the reason I didn't like mtcp_check_vdso is that it tries to handle both. vDSO handling should be done strictly by the newer code, and we should create an mtcp_check_aslr to handle ASLR. But before we do any of that, let's verify the problem and the faulty addresses.

@gc00
Contributor

gc00 commented Apr 17, 2015

@karya0: @jiajuncao and @rohgarg both have accounts at Stampede, and have both observed this bug there. They were using code that included the commit a4d67bf . They'll be the best ones to examine this newer version of the bug with you (the version that has only been observed on Stampede so far).
