Error on restart because of a failure to lseek on a ckpt-ed file #381
Here's a theory that might explain what's going on: At checkpoint time, the fd 21 was mapped to a file. On restart, the file plugin remembers this information and restores the fd in I think we need to figure out if and when the fd's mapping changed from a file to a socket.
@rohgarg Your theory is correct: fd 21 is first assigned to the file TRACE_RTS, and then TcpConnection::postRestart restores the same fd (21) again. Can you point me to how to debug this further?
Breakpoint 3, dmtcp::TcpConnection::postRestart (this=0x7fc22ffad808) at ipc/socket/socketconnection.cpp:489
I can see that just prior to the dmtcp_checkpoint call, fd 21 is still pointing to TRACE_RTS; however, when the call returns, the mapping changes to a socket.
ls -l /proc/13151/fd/
l-wx------ 1 agarg medrd 64 May 5 08:38 21 -> /misc/vel/AGARG/demos/dmtcp_demos/c_testbench/dmtcp_veloce_examples/v1/hello_world_dmtcp_2apis_cm_master/veloce.log/veloce_runtime.log/TRACE_RTS.txt
After the dmtcp_checkpoint call:
ls -l /proc/13151/fd/
lrwx------ 1 agarg medrd 64 May 5 08:38 2 -> socket:[268200]
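The manual `ls -l /proc/<pid>/fd` check above can also be done from inside the process, which may be handy when logging from a plugin around the checkpoint call. The following is a minimal, Linux-only sketch; `fd_target` is a hypothetical helper name and is not part of DMTCP.

```cpp
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not a DMTCP API): returns what an fd of the calling
// process points to, as shown by `ls -l /proc/self/fd`, e.g. a file path,
// "socket:[inode]" or "pipe:[inode]". Returns an empty string on failure.
std::string fd_target(int fd) {
  std::string link = "/proc/self/fd/" + std::to_string(fd);
  char buf[4096];
  // readlink does not NUL-terminate, so keep the returned length.
  ssize_t n = readlink(link.c_str(), buf, sizeof(buf));
  if (n < 0) {
    return "";
  }
  return std::string(buf, static_cast<size_t>(n));
}
```

Calling `fd_target(21)` immediately before and after the checkpoint call would show whether the mapping changed from the log file to a socket.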
I was able to figure out the issue. There is a problem in mapping the pipe call to a socketpair call: one of the fds in the socketpair appears to be left dead but is still visible to the DMTCP infrastructure, which results in overlapping fds between TcpSocketConnection and FileConnection. At checkpoint time, DMTCP stores the fd for the FileConnection in the checkpoint database, but draining the socket marks it as a dead socket. At restart, the FileConnection is restored first, and then the same fd is restored again for the socket connection, this time as a dead socket. So when lseek is called on the fd, it fails with the above error. I changed the definition of pipe in src/miscwrappers.cpp to always use pipe from libc rather than socketpair, and it worked like a charm. I am uploading a small test that reproduces the issue. Let me know if you are able to reproduce it; otherwise I can set up a WebEx session to show it.
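To illustrate why an fd that ends up referring to a socket breaks the file restore path, here is a small standalone sketch (not DMTCP code; all function names are made up for illustration). It shows that lseek fails with ESPIPE on either end of a socketpair, which would be consistent with the restart error if fd 21 is a socket by the time the file plugin seeks on it.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Illustrative helper: errno left by lseek(fd, 0, SEEK_CUR), or 0 if the
// seek succeeds.
int lseek_errno(int fd) {
  errno = 0;
  return lseek(fd, 0, SEEK_CUR) == static_cast<off_t>(-1) ? errno : 0;
}

// Creates a socketpair (what a promoted pipe would be backed by) and
// reports the lseek errno on one end. Returns -1 if setup fails.
int lseek_errno_on_socketpair() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
    return -1;
  }
  int err = lseek_errno(sv[0]);
  close(sv[0]);
  close(sv[1]);
  return err;
}

// Same check on a plain libc pipe, for comparison.
int lseek_errno_on_pipe() {
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  int err = lseek_errno(fds[0]);
  close(fds[0]);
  close(fds[1]);
  return err;
}
```

Neither a pipe nor a socket is seekable, so the lseek failure itself does not depend on the promotion; the problem described above is that the fd number is shared between a file connection and a dead socket connection.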
Thanks for chasing this down, @ankitcse07! This is really useful. @karya0, here's another test case where promoting pipes to socketpairs breaks semantics.
@rohgarg Fixing this issue is critical for us, as it occurs in a very basic case. Support for pipe semantics is also important: I found hangs in my system when the pipe call was not mapped to socketpair, yet mapping it to socketpair leads to this issue. Please let me know what the timeline for a fix could be.
You wrote:
Earlier you wrote:
I'm a little confused by these two statements. Could you please explain the difference between the scenario where it hangs versus the scenario where it works? (Obviously, modifying the pipe wrapper is not a solution because DMTCP doesn't support checkpoint-restart of pipes, and a restarted process may hang up on trying to read from a dead (pipe) fd. The solution could work for the cases when a pipe is short-lived, in which case you could disable checkpointing until the pipe gets closed.)
FYI: I'm able to reproduce the issue with your test program locally.
I meant that the attached test worked like a charm, but my software system broke. Also, Rohan, I would be stuck without this fix: I have made a lot of architecture changes in my plugin, which makes this fix critical. Do you already have a strategy in mind for fixing the issues with pipe?
Looking at your test program and description, I think, there are two separate issues here:
Obviously, the first issue is an artifact of DMTCP promoting pipes to socket-pairs, but I believe it's a separate issue. Separating out these concerns will help us unblock you soon by addressing the (possibly) easier part first. That said, figuring out the right way to support pipes natively is definitely our goal in the longer term. As for the timeline, I will try to push something out by this weekend. (I could be wrong, of course, and it may turn out that the first issue is irrelevant.)
@ankitcse07: Could you please try the following patch?
diff --git a/src/popen.cpp b/src/popen.cpp
index e008514..3cc3a96 100644
--- a/src/popen.cpp
+++ b/src/popen.cpp
@@ -114,7 +114,7 @@ FILE *popen(const char *command, const char *mode)
child_std_end, it has been already closed by the dup2 syscall
above. */
if (fd != child_std_fd) {
- _real_fclose(it->first);
+ fclose(it->first);
}
}
_dmtcpPopenPidMap.clear();
@@ -161,7 +161,7 @@ int pclose(FILE *fp)
}
_unlock_popen_map();
- if (pid == -1 || _real_fclose(fp) != 0) {
+ if (pid == -1 || fclose(fp) != 0) {
return -1;
}
It fixes the popen()-related wrappers. It should (hopefully) help unblock you for the time being. I'm working on adding support for pipe. I can't give you a definite timeline for that yet, but I'll try to push for this weekend.
@rohgarg Thanks for the patch. Do you intend to commit this patch, or is it just a workaround to unblock me for the time being?
@rohgarg Thanks, Rohan, the patch is working fine.
This patch will be pushed upstream.
Fixed via #382.
@ankitcse07 reports that he's seeing the following error on restart when using the --ckpt-open-files option:
The error message suggests that the fd we are seeking on is not a regular file. On further investigation, he noted that _fds[0], equal to 21, on restart, is mapped to some socket according to proc-fd. However, the _path field points to a regular file, /vel/AGARG/demos/dmtcp_demos/c_testbench/dmtcp_veloce_examples/hello_world_dmtcp_2apis_cm_master/veloce.log/veloce_runtime.log/TRACE_RTS.txt.
I'm creating this issue to track this bug.
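As a diagnostic aid, the mismatch described here (a recorded file path, but an fd that is actually a socket) can be detected with fstat before seeking. The sketch below is only an illustration of that check; `FdKind`, `classify_fd` and `checked_lseek` are hypothetical names and do not reflect how DMTCP's file plugin is implemented.

```cpp
#include <cerrno>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical classification of what an fd currently refers to.
enum class FdKind { Regular, Socket, Fifo, Other, Invalid };

FdKind classify_fd(int fd) {
  struct stat st;
  if (fstat(fd, &st) != 0) {
    return FdKind::Invalid;  // e.g. EBADF: the fd is not open
  }
  if (S_ISREG(st.st_mode)) return FdKind::Regular;
  if (S_ISSOCK(st.st_mode)) return FdKind::Socket;
  if (S_ISFIFO(st.st_mode)) return FdKind::Fifo;
  return FdKind::Other;
}

// Seeks to an absolute offset only if the fd is a regular file. For an
// invalid fd, errno is left as set by fstat; for any other non-regular fd,
// errno is set to ESPIPE without calling lseek.
off_t checked_lseek(int fd, off_t offset) {
  FdKind kind = classify_fd(fd);
  if (kind == FdKind::Invalid) {
    return static_cast<off_t>(-1);
  }
  if (kind != FdKind::Regular) {
    errno = ESPIPE;
    return static_cast<off_t>(-1);
  }
  return lseek(fd, offset, SEEK_SET);
}
```

A check like this would not fix the overlapping fd, but it could turn the failure into a clearer report that the fd no longer refers to the expected file.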