Restart issues in 2.4.0rc2 #56
Comments
I was a bit worried that the issue was caused by an unusual set of LDFLAGS, CPPFLAGS, etc. in my build environment, but the problem persisted after I removed these and rebuilt. Finally, I tested on a clean CentOS 7 virtual machine (and earlier, by accident, on a CentOS 6.5 or 6.6 system). The verdict: the problem with 2.4.0rc2 has occurred in every case so far on CentOS 7, but it was fine on CentOS 6.x.
Hi Brandon, First, from the root directory of DMTCP, could you send us the output of: Thanks very much,
On Mon, Apr 13, 2015 at 06:36:41AM -0700, Brandon Elam Barker wrote:
Hi all, [dmtcp@euca-128-84-11-199 dmtcp]$ cat /proc/22243/maps
And I can now report that when address randomization is turned off, Probably the effect of GDB is to turn off the randomization. Best,
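To check whether address randomization is active on a test machine, the kernel setting can be read from `/proc/sys/kernel/randomize_va_space` (0 means off, 1 or 2 means on). GDB by default disables randomization for programs it launches, which fits the observation above. The following is only a minimal sketch; the helper names `parse_aslr_mode` and `read_aslr_mode` are hypothetical and not part of DMTCP.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: interpret the text found in
// /proc/sys/kernel/randomize_va_space.
// Returns 0 (randomization off), 1 or 2 (on), or -1 if the text
// is not one of those values.
int parse_aslr_mode(const std::string& text) {
    std::istringstream in(text);
    int mode = -1;
    if (!(in >> mode) || mode < 0 || mode > 2) {
        return -1;
    }
    return mode;
}

// Hypothetical helper: read the setting from the given path.
// Returns -1 if the file cannot be opened (e.g. on a non-Linux system).
int read_aslr_mode(
    const std::string& path = "/proc/sys/kernel/randomize_va_space") {
    std::ifstream file(path);
    if (!file) {
        return -1;
    }
    std::string line;
    std::getline(file, line);
    return parse_aslr_mode(line);
}
```

Calling `read_aslr_mode()` before and after changing the setting is a quick way to confirm which mode a restart was actually tested under.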
On Mon, Apr 13, 2015 at 07:49:30AM -0700, Brandon Elam Barker wrote:
@gc00: Now that you can reproduce the bug, can you provide the output of
Hi all, Since dmtcp-2.3.1 was running correctly, I ran 'git bisect' on this, in the CentOS 7 distro. I found the bug, and a simple one-line bug fix. For a one line fix, I won't do a pull request, but I'd like to analyze the implications of this fix here, and together we'll decide if that is enough.
P.S.: For
And there appears to be one remaining bug exposed in CentOS 7. The dlopen1 test is failing. I'll look into it.
And I meant to write to do:
@gc00: The bug isn't really related to ASLR. Apparently, if a variable's type is "char*", JNOTE automatically dereferences it and prints it as a string. When restoring the heap, curBrk was declared as "char*" and pointed to the current (unmapped) heap, so JNOTE tried to dereference curBrk and segfaulted. Turning off ASLR hid the bug: the current and original values of brk() were then the same, so no JNOTE was called and the invalid memory was never dereferenced. I think I also understand the dlopen bug and have created a separate issue (#57) to track it.
During restart, we try to restore the heap as it existed at the time of ckpt. A JNOTE was trying to print the original and current values of brk. The variable holding the current brk (curBrk) was of type 'char*' and so JNOTE tried to print the corresponding string on the console. Since the current heap had been unmapped by mtcp_restart, accessing "curBrk" generated a segfault. The fix is to change the type of curBrk to "void*", thus forcing JNOTE to print the value of curBrk instead of dereferencing curBrk itself.
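The difference between the two pointer types can be illustrated with plain C++ streams. This is only a sketch under the assumption that the logging macro forwards its arguments to a stream-style `operator<<`; it is not DMTCP's actual JNOTE code, and the function names are hypothetical.

```cpp
#include <sstream>
#include <string>

// Streaming a char* selects the C-string overload of operator<<,
// which reads the bytes the pointer refers to. If the pointer
// refers to unmapped memory, that read is what segfaults.
std::string format_as_char_ptr(const char* p) {
    std::ostringstream out;
    out << p;
    return out.str();
}

// Streaming a void* selects the pointer overload, which prints the
// address itself and never reads the memory behind it.
std::string format_as_void_ptr(const void* p) {
    std::ostringstream out;
    out << p;
    return out.str();
}
```

With a valid string such as `"heap"`, the first function returns the text and the second returns an implementation-defined rendering of the address. With a pointer into an unmapped region, only the second is safe to call, which is why changing the declared type avoids the crash.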
@karya0 and @jiajuncao: The bug fix by @karya0 definitely fixes DMTCP on CentOS 7. However, @jiajuncao has also been seeing a random bug on restart at Stampede (with MVAPICH). He can take the same checkpoint image and restart many times. A bug appears about 20% of the time. From the core image, we see that memory is corrupted on restart (but only 20% of the time).
Before we enable
@karya0: @jiajuncao and @rohgarg both have accounts at Stampede, and have both observed this bug there. They were using code that included the commit a4d67bf . They'll be the best ones to examine this newer version of the bug with you (the version that has only been observed on Stampede so far).
I'm having trouble getting restarts to pass on my CentOS 7.1 system; these issues don't seem to be present in 2.3.1. In both cases I used the default ./configure (no options), and I've included the config.log and config.status here.
Additionally the .cores and checkpoints are available.