Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error on restart because of a failure to lseek on a ckpt-ed file #381

Closed
rohgarg opened this issue May 4, 2016 · 14 comments
Closed

Error on restart because of a failure to lseek on a ckpt-ed file #381

rohgarg opened this issue May 4, 2016 · 14 comments
Milestone

Comments

@rohgarg
Copy link
Contributor

rohgarg commented May 4, 2016

@ankitcse07 reports that he's seeing the following error on restart when using the --ckpt-open-files option:

[117000] ERROR at fileconnection.cpp:913 in areFilesEqual; REASON='JASSERT(_real_lseek(fd, 0, SEEK_SET) == 0) failed'
fd = 21
(strerror((*__errno_location ()))) = Illegal seek

The error message suggests that the fd we are seeking on is not a regular file. On further investigation, he noted that _fds[0], equal to 21, on restart, is mapped to some socket according to proc-fd. However, the _path field points to a regular file, /vel/AGARG/demos/dmtcp_demos/c_testbench/dmtcp_veloce_examples/hello_world_dmtcp_2apis_cm_master/veloce.log/veloce_runtime.log/TRACE_RTS.txt.

I'm creating this issue to track this bug.

@rohgarg
Copy link
Contributor Author

rohgarg commented May 4, 2016

Here's a theory that might explain what's going on: At checkpoint time, the fd 21 was mapped to a file. On restart, the file plugin remembers this information and restores the fd in FileConnection::postRestart(). However, @ankitcse07's plugin (or perhaps some other DMTCP plugin) maps the fd 21 back to a socket at a later stage, before the next barrier. As a result, the call to lseek fails in FileConnection::refill().

I think we need to figure out if and when the fd's mapping changed from a file to a socket.

@ankitcse07
Copy link
Collaborator

@rohgarg your theory is true, 21 is first assigned to file TRACE_RTS and then again TcpConnection::PostRestart also has same fd (21) which is assigned back. Can you direct me how to debug this further

Breakpoint 3, dmtcp::TcpConnection::postRestart (this=0x7fc22ffad808) at ipc/socket/socketconnection.cpp:489
489 in ipc/socket/socketconnection.cpp
(gdb) p _fds[0]
$21 = (int &) @0x7fc22ffb3400: 21
(gdb) bt
#0 dmtcp::TcpConnection::postRestart (this=0x7fc22ffad808) at ipc/socket/socketconnection.cpp:489
#1 0x00007fc22f5228c4 in dmtcp::ConnectionList::postRestart (this=0x7fc22ffa7308) at ipc/connectionlist.cpp:469
#2 0x00007fc22f55ec4a in dmtcp::SocketConnList::postRestart (this=0x7fc22ffa7308) at ipc/socket/socketconnlist.cpp:110
#3 0x00007fc22f51ff08 in dmtcp::ConnectionList::eventHook (this=0x7fc22ffa7308, event=DMTCP_EVENT_RESTART, data=0x0)
at ipc/connectionlist.cpp:118
#4 0x00007fc22f55e648 in dmtcp_SocketConnList_EventHook (event=DMTCP_EVENT_RESTART, data=0x0) at ipc/socket/socketconnlist.cpp:20
#5 0x00007fc22f52a4a1 in dmtcp_event_hook (event=DMTCP_EVENT_RESTART, data=0x0) at ipc/ipc.cpp:46
#6 0x00007fc22fb96856 in dmtcp_event_hook (event=DMTCP_EVENT_RESTART, data=0x0) at veloceEmulation.cpp:136
#7 0x00007fc22ede38aa in dmtcp::DmtcpWorker::eventHook (event=DMTCP_EVENT_RESTART, data=0x0) at dmtcpworker.cpp:558
#8 0x00007fc22edf5874 in dmtcp::callbackPostCheckpoint (isRestart=true, mtcpRestoreArgvStartAddr=0x0) at mtcpinterface.cpp:88
#9 0x00007fc22ee01417 in dmtcp::ThreadList::waitForAllRestored (thread=0x7fc22ff96008) at threadlist.cpp:650
#10 0x00007fc22edffda2 in checkpointhread (dummy=0x0) at threadlist.cpp:351
#11 0x00007fc22edf7646 in pthread_start (arg=0x7fc22ffb2f48) at threadwrappers.cpp:159
#12 0x00007fc22d18d7b6 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc22edf72dd in clone_start (arg=0x7fc22ff96008) at threadwrappers.cpp:68
#14 0x00007fc22eb954bc in clone_start (arg=0x7fc22ffb2e88) at pid/pid_miscwrappers.cpp:110
#15 0x00007fc22d47a9cd in clone () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb)

@ankitcse07
Copy link
Collaborator

I can see that just prior to dmtcp_checkpoint call fd 21 is still pointing to TRACE_RTS, however when it comes out from this call mapping chnages to socket

ls -l /proc/13151/fd/
total 0

l-wx------ 1 agarg medrd 64 May 5 08:38 21 -> /misc/vel/AGARG/demos/dmtcp_demos/c_testbench/dmtcp_veloce_examples/v1/hello_world_dmtcp_2apis_cm_master/veloce.log/veloce_runtime.log/TRACE_RTS.txt

After dmtcp_checkpoint call

ls -l /proc/13151/fd/
total 0

lrwx------ 1 agarg medrd 64 May 5 08:38 2 -> socket:[268200]
lrwx------ 1 agarg medrd 64 May 5 08:38 20 -> socket:[268200]
l-wx------ 1 agarg medrd 64 May 5 08:38 21 -> socket:[269042]

@ankitcse07
Copy link
Collaborator

I was able to figure out the issue. There is some issue in mapping pipe call to socketpair call, it seems that one of the fds in socketpair is left dead but that is still visible to DMTCP infrastructure resulting in overlapping fds between TcpSocketConnection and FileConnection fds. At the time of checkpoint DMTCP stores the fd for fileconnection in checkpoint database but when it comes to drain of socket results in marking it as dead socket. At restart first fileconnection is restored back but at the time and then again same fd of socket connection is restored back however a dead socket.

So it lseek on the fd it reverts with the above error. I changed the definition of pipe in src/miscwrappers.cpp to always use pipe from libc rather than socketpair and it worked like a charm.

I am uploading a small test reproducing the issue. Let me know if you are able to reproduce otherwise I can provide you a webex for the same.

bug_pipe_to_socketpair.tar.gz

@rohgarg
Copy link
Contributor Author

rohgarg commented May 5, 2016

Thanks for chasing this down, @ankitcse07! This is really useful.
It gives us a stronger motivation to get started on issue #295. :-)

@karya0, here's another test case where promoting pipes to socketpairs breaks semantics.

@rohgarg rohgarg added this to the 3.0.0 milestone May 5, 2016
@ankitcse07
Copy link
Collaborator

@rohgarg fixing this issue is critical for us as this is occurring for very basic case , in fact support of pipe semantics is also important because I found hangs in my system if pipe call was not mapped to socketpair, however mapping to socket pair lead to this issue.

Let me know what could be the timeline for the fix

@rohgarg
Copy link
Contributor Author

rohgarg commented May 5, 2016

You wrote:

... support of pipe semantics is also important because I found hangs in my system if pipe call was not mapped to socketpair, however mapping to socket pair lead to this issue.

Earlier you wrote:

I changed the definition of pipe in src/miscwrappers.cpp to always use pipe from libc rather than socketpair and it worked like a charm.

I'm a little confused by these two statements. Could you please explain the difference between the scenario where it hangs versus the scenario where it works? (Obviously, modifying the pipe wrapper is not a solution because DMTCP doesn't support checkpoint-restart of pipes, and a restarted process may hang up on trying to read from a dead (pipe) fd. The solution could work for the cases when a pipe is short-lived, in which case you could disable checkpointing until the pipe gets closed.)

FYI: I'm able to reproduce the issue with your test program locally.

@ankitcse07
Copy link
Collaborator

ankitcse07 commented May 5, 2016

I meant the attached test worked like charm but my software system broke. Also Rohan I would be stuck without this fix as I had done lot of architecture changes in my plugin which makes this fix critical. So do you already have thought of strategy to fix issues with pipe

@rohgarg
Copy link
Contributor Author

rohgarg commented May 5, 2016

Looking at your test program and description, I think, there are two separate issues here:

  • incorrect handling of socket-pairs, which leads to the socket plugin and the file plugin storing the same fd mapped to two different resources; and
  • promoting a pipe to a socket-pair, which is a cause of a whole lot of other bugs, see Application deadlock #294 for example.

Obviously, the first issue is an artifact of DMTCP promoting pipes to socket-pairs, but I believe it's separate issue. Separating out these concerns will help us unblock you soon by addressing the (possibly) easier part first. That said, figuring out the right way to support pipes natively is definitely our goal in the longer term. As for the timeline, I will try to push some thing out by this weekend.

(I could be wrong, of course, and it may turn out that the first issue is irrelevant.)

@rohgarg
Copy link
Contributor Author

rohgarg commented May 6, 2016

@ankitcse07: Could you please try the following patch?

diff --git a/src/popen.cpp b/src/popen.cpp
index e008514..3cc3a96 100644
--- a/src/popen.cpp
+++ b/src/popen.cpp
@@ -114,7 +114,7 @@ FILE *popen(const char *command, const char *mode)
          child_std_end, it has been already closed by the dup2 syscall
          above.  */
       if (fd != child_std_fd) {
-        _real_fclose(it->first);
+        fclose(it->first);
       }
     }
     _dmtcpPopenPidMap.clear();
@@ -161,7 +161,7 @@ int pclose(FILE *fp)
   }
   _unlock_popen_map();

-  if (pid == -1 || _real_fclose(fp) != 0) {
+  if (pid == -1 || fclose(fp) != 0) {
     return -1;
   }

It fixes the popen()-related wrappers. It should (hopefully) help unblock you for the time being. I'm working on adding support for pipe. I can't give you a definite timeline for that yet, but I'll try to push for this weekend.

@ankitcse07
Copy link
Collaborator

@rohgarg Thanks for the patch, due you intend to commit this patch or this is just a work around for the time being to unblock me

@ankitcse07
Copy link
Collaborator

@rohgarg Thanks Rohan patch is working fine

@rohgarg
Copy link
Contributor Author

rohgarg commented May 6, 2016

This patch will be pushed upstream.

@rohgarg
Copy link
Contributor Author

rohgarg commented May 22, 2016

Fixed via #382.

@rohgarg rohgarg closed this as completed May 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants