Permalink
Fetching contributors…
Cannot retrieve contributors at this time
192 lines (161 sloc) 7.77 KB
This is intended as one of a series of informal documents to describe
and partially document some of the more subtle DMTCP data structures
and algorithms. These documents are snapshots in time, and they
may become somewhat out-of-date over time (and hopefully also refreshed
to re-sync them with the code again). (Updated May, 2015)
The topics on debugging include:
A. PRINT DEBUGGING
B. STOPPING GDB BEFORE JASSERT CAUSES PROCESS EXIT
C. LAUNCHING A PROCESS UNDER DMTCP CONTROL USING GDB
D. DEBUGGING ON RESTART USING GDB ATTACH
E. TRICKS FOR DEBUGGING WHILE INSIDE GDB
===
A. PRINT DEBUGGING
There are several ways to debug. If you want "print debugging", with
a trace of the major operations, do:
./configure --enable-debug
make -j clean
make -j check-dmtcp1
and look in /tmp/dmtcp-USER@HOST for files of the form jassertlog.*,
which contain copies of calls to JTRACE/JWARN/JNOTE inside DMTCP,
with one file per process, and per exec into a new program.
If there are multiple processes, then a log from each process will
be in its own file in /tmp/dmtcp-USER@HOST/jassertlog*
===
B. STOPPING GDB BEFORE JASSERT CAUSES PROCESS EXIT
If debugging under GDB, and the process exits before you can examine
the cause, try:
(gdb) break FNC
[ for FNC among: exit, _exit, _Exit, mtcp_abort ;
The function _exit is used when Jassert() exits. ]
===
C. LAUNCHING A PROCESS UNDER DMTCP CONTROL USING GDB
JASSERT doesn't generate a core dump by default (because it calls _exit()).
Exporting the environment variable DMTCP_ABORT_ON_FAILED_ASSERT will cause
it to call abort(). This can be useful when debugging distributed processes.
Example:
DMTCP_ABORT_ON_FAILED_ASSERT=1 bin/dmtcp_launch a.out
gdb a.out core
When used with dmtcp_restart, it is important to embed
DMTCP_ABORT_ON_FAILED_ASSERT in the checkpoint image by first using it with
dmtcp_launch, as above. After that, a restarted process will also generate
a core dump.
===
If the bug occurs during launch or possibly in checkpoint-resume, try:
gdb --args dmtcp_launch -i5 test/dmtcp1
Note that GDB will first give you control under 'dmtcp_launch'.
(1) If you want to stop GDB when dmtcp1 enters its main() routine:
(gdb) break execvp
(gdb) run
(gdb) break main
(gdb) continue
(2) If you want to stop GDB when the DMTCP constructor takes control
in dmtcp1 (even before entering the main() routine of dmtcp1), do:
(gdb) break 'dmtcp::DmtcpWorker::DmtcpWorker()'
[ and say yes to: "breakpoint pending on future shared library load" ]
(gdb) run
===
D. DEBUGGING ON RESTART USING GDB ATTACH
Ideally, we would like to just use: gdb --args dmtcp_restart ckpt_*.dmtcp
This does not work (unless you just want to debug the command dmtcp_restart
itself). But the command dmtcp_restart overlays the old memory from
ckpt_*.dmtcp into the current running 'dmtcp_restart' process. As you would
expect, GDB has difficulty following this memory overlay. Therefore,
the best way to debug a restarted process is to use the 'attach' feature
of GDB.
If you want to debug DMTCP functions also, be sure to compile DMTCP with
debugging support. A convenient recipe from the root directory of DMTCP is:
./configure CXXFLAGS="-g3 -O0" CFLAGS="-g3 -O0"
make -j clean
make -j
There are four methods available. Most users will prefer the first,
or possibly the second method. The methods are in order of increasing
interest for developers of DMTCP. Make sure that you compiled DMTCP
with debugging support if you want to also see symbols for DMTCP functions.
1. DMTCP_GDB_ATTACH_ON_RESTART:
2. DMTCP_RESTART_PAUSE2:
3. DMTCP_RESTART_PAUSE:
4. USING GDB WITH MTCP (primarily for developers):
The four methods follow:
1. DMTCP_GDB_ATTACH_ON_RESTART:
In the past, using 'gdb attach' was as easy as:
dmtcp_restart ckpt_a.out_*.dmtcp &
gdb a.out `pgrep -n a.out`
Unfortunately, recent Linux distros have forbidden this type of
'gdb attach' because it can expose security holes. To get around
this on a per-process basis, DMTCP provides the special
environment variable DMTCP_GDB_ATTACH_ON_RESTART:
DMTCP_GDB_ATTACH_ON_RESTART=1 dmtcp_restart ckpt_a.out_*.dmtcp &
gdb a.out `pgrep -n a.out`
2. DMTCP_RESTART_PAUSE2:
If it's important to attach with GDB early in the restart process, then
DMTCP provides this and the following alternatives. First,
MTCP_RESTART_PAUSE2 pauses DMTCP immediately after all the threads
(including the checkpoint thread) have been restored:
DMTCP_RESTART_PAUSE2=1 bin/dmtcp_launch a.out
bin/dmtcp_command --checkpoint
dmtcp_restart ckpt_a.out_*.dmtcp &
# DMTCP will then pause for 15 seconds, and wait for a GDB attach.
3. DMTCP_RESTART_PAUSE:
Next, if you need to attach with GDB _very_ early in the restart process
(usually needed only by DMTCP developers), do:
DMTCP_RESTART_PAUSE=1 bin/dmtcp_launch a.out
# Continue as above.
In this case, you will attach at a time when only the primary thread
exists, and the DMTCP checkpoint thread is not yet distinguished from
the original user thread executing main().
Note that to attach early, you must set the environment variable during
dmtcp_launch. This is needed in order to embed the environment variable
in the memory that is then saved in ckpt_a.out_*.dmtcp. If you set the
environment variable only during restart, then the memory overlay
of dmtcp_restart will overwrite the environment variable that you set.
*** NOTE: The above restriction for DMTCP_RESTART_PAUSE / DMTCP_RESTART_PAUSE2
*** may be relaxed in the future, to allow setting before dmtcp_restart.
4. USING GDB WITH MTCP (primarily for developers):
For developers only, 'gdb attach' can be useful _very_, _very_ early ---
during the initial control of MTCP, the lowest layer of DMTCP.
Here are some instructions for this case.
For low-level debugging on restart, try:
cd src/mtcp
make tidy
// Optionally modify a file in the mtcp directory
make -f Makefile.debug
make gdb
// Follow instructions: e.g., paste 'add-symbol-file ...' msg into GDB
// If the bug is still further, then use the 'hang or crash' trick. Add:
{int x=1; while(x);}
// and look for whether it hangs (reaches this far) or crashes
// Move the while loop until you locate the bug.
// If GDB is fragile, a trick that may work is to run without GDB,
// and then attach at the 'while' loop. Alternatively, if in GDB:
(gdb) detach
gdb test/dmtcp1 PID # for the appropriate pid
// GDB should be more robust after this.
===
E. TRICKS FOR DEBUGGING WHILE INSIDE GDB
Sometimes, you will want GDB to stop exactly at a certain line of code.
If that line occurs early, execution may have already passed that line
before you can attach. One trick to stop GDB there is to add a line of code
(even into DMTCP itself):
{ int x = 1; while (x) {}; }
After that, use one of the above techniques for attaching with GDB, and:
(gdb) where
(gdb) print x = 0
(gdb) next
Most of DMTCP is written in C++. If you want to list or set a breakpoint
in a DMTCP function, it helps to search for full function signatures of
non-static functions by giving a substring:
(gdb) info functions initialize
(gdb) break 'dmtcp::Util::initializ<TAB>
Note that during dmtcp_launch, the first thread will be the
primary thread (the thread that executes the user's function main()).
The second thread created is the checkpoint thread. You'll recognize the
difference because the user's primary thread will have main() on the stack;
and the checkpoint thread will have the function checkpointhread() on the stack.
After dmtcp_restart, the primary thread will not necessarily
be the first of the threads displayed by GDB's 'info threads' command.
However, the threads can still be distinguished by examining the stack
for each thread:
primary thread: main()
checkpoint thread: checkpointhread()
other user thread: anything