flux-job attach not behaving nicely when detached from debugger #2599
Comments
I sent an email to @jdelsign to get his opinion about this. He may have seen this misbehavior from other resource managers. From yesterday's discussion, we may be able to get around this issue by installing a signal handler for But if we ignore it, one concern would be |
@dongahn, I tried to reproduce this issue simply by running a job with some output and attaching to I also had no problem putting |
Sorry, I spoke too early. The |
Hmmm. Ok. Let me try this w/ gdb as well. Maybe looking at the difference between totalview and gdb, the problem becomes more clear.
Yes, fg/bg of the processes spawned at the shell should work fine. The problem for me was when flux-job was forked and exec'ed by totalview. Presumably, this flux-job process doesn't belong to the process group of the top-level shell. When I detached the controlling totalview from this process, it became orphaned, with its PPID becoming 1, which appeared to put it outside the purview of job control at the top shell. |
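The orphaning described above can be sketched in a few lines of Python. This is only an illustration, not the TotalView launch path: the function name `orphan_and_report` is hypothetical, an intermediate process stands in for the debugger, and the grandchild stands in for the flux-job process. The new parent is typically PID 1, but on systems with a subreaper it can be a different process, so the sketch only reports the value.

```python
import os
import time


def orphan_and_report(timeout=5.0):
    """Orphan a grandchild and return (intermediate_pid, grandchild's new PPID).

    Hypothetical helper for illustration only.
    """
    r, w = os.pipe()
    mid = os.fork()
    if mid == 0:
        # Intermediate process: stands in for the debugger that launched the tool.
        os.close(r)
        my_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: stands in for the process left behind after detach.
            deadline = time.time() + timeout
            # Wait until the kernel has reparented us away from the intermediate.
            while os.getppid() == my_pid and time.time() < deadline:
                time.sleep(0.01)
            os.write(w, str(os.getppid()).encode())
            os._exit(0)
        # Exit right away so the grandchild is orphaned.
        os._exit(0)
    os.close(w)
    os.waitpid(mid, 0)          # reap the intermediate process
    new_ppid = int(os.read(r, 32))
    os.close(r)
    return mid, new_ppid


if __name__ == "__main__":
    mid, new_ppid = orphan_and_report()
    print(f"intermediate pid {mid}, grandchild reparented to {new_ppid}")
```

Because the shell that started the intermediate process never created the grandchild as one of its own jobs, it has no entry for it in `jobs`, which is consistent with the behaviour reported here.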
Ah, that is the difference then. I was running |
From @jdelsign:
|
I used the following python script to orphan the

import os
import sys
if not os.fork():
    os.execvp(sys.argv[1], sys.argv[1:])

$ flux mini submit -n1 sh -c 'while :; do sleep 3; date; done'
242212667392
(flux-PYQbrh) grondo@fluke108:~/git/f.git$ python3 ./fork.py flux job attach 242212667392
(flux-PYQbrh) grondo@fluke108:~/git/f.git$ Wed Jan 8 11:59:52 PST 2020
Wed Jan 8 11:59:55 PST 2020
Wed Jan 8 11:59:58 PST 2020
Wed Jan 8 12:00:01 PST 2020
Wed Jan 8 12:00:04 PST 2020
As you can see the process wasn't stopped, and was still able to write output to the terminal, though it wasn't processing input. The So I still don't have a good reproducer for the TV issue. When running this test with |
@grondo: You are almost there. While
ahn1@5bd83ef62aec:/usr/src$ flux mini submit -n1 sh -c 'while :; do sleep 3; date; done'
28957474816
ahn1@5bd83ef62aec:/usr/src$ python3 ./fork.py flux job attach 28957474816
ahn1@5bd83ef62aec:/usr/src$ MPIR_being_debugged: 0
Wed Jan 8 20:53:46 UTC 2020
Wed Jan 8 20:53:49 UTC 2020
Wed Jan 8 20:53:52 UTC 2020
ahn1@5bd83ef62aec:/usr/src$ ps x
PID TTY STAT TIME COMMAND
1 pts/0 Ss 0:00 /bin/bash
9927 pts/0 S 0:00 start -s 2
9928 pts/0 Sl 0:00 /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-9927-gsYBCE --setattr=tbon.endpoint=ipc://%B/req
9929 pts/0 Sl 0:00 /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-9927-gsYBCE --setattr=tbon.endpoint=ipc://%B/req
9977 pts/0 S 0:00 /bin/bash
9983 pts/0 S 0:00 /usr/libexec/flux/flux-shell 28957474816
9984 pts/0 S 0:00 sh -c while :; do sleep 3; date; done
9991 pts/0 T 0:00 job attach 28957474816
9995 pts/0 S 0:00 sleep 3
9996 pts/0 R+ 0:00 ps x |
That is essentially what I did above. Just a hunch, have you reproduced this behavior outside of Docker on the Mac?
|
I haven't tried this. But this may be it. (Either Docker on Mac or the Linux kernel level being used in Ubuntu: Linux 5bd83ef62aec 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux) |
I was able to reproduce this problem under Docker on both Linux and MacOS. It turns out that I'm confused why the process is even trying to read from stdin -- it seems like a process in the "background" (i.e. without access to the tty) shouldn't register activity on the stdin file descriptor, but maybe I'm wrong about that. In testing, setting I'll try to post a proposed fix soon. |
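For readers following the "background" question above: a process can ask whether it is currently in the terminal's foreground process group by comparing its own process group with the one the tty reports. The sketch below is a hedged illustration, not code from flux-core; the helper name `in_foreground` is an assumption made for this example.

```python
import os


def in_foreground(fd=0):
    """Return True if our process group is the foreground group of the tty on fd.

    Hypothetical helper for illustration only.
    """
    try:
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # fd is not a terminal, or it is not our controlling terminal.
        return False


if __name__ == "__main__":
    print("foreground" if in_foreground() else "background or no tty")
```

Under POSIX job control, a read from the controlling terminal by a process outside the foreground group normally results in SIGTTIN being sent to its process group; if SIGTTIN is ignored or blocked, or the process group is orphaned, the read fails with EIO instead.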
Oh cool!
Does
Maybe a kernel difference or pseudo terminal semantics difference under docker run like you suspected.
How did you do this out of curiosity? |
Yes.
At first I tried adding a signal watcher for What worked better was to just add signal (SIGTTIN, SIG_IGN); In the |
BTW, I learned while working on this issue that |
I think it makes sense to add this to make the command tostop-terminal-free. |
Ok, I will propose a PR that ignores both |
Thanks! |
Problem: Under certain circumstances, flux-job attach being sent into the background causes the program to be stopped without the ability to resume it by bringing to foreground, nor via SIGCONT.

Specifically, this has been seen to occur when detaching a debugger like TotalView. The flux-job process is then orphaned (reparented to init) so is not under purview of shell job control, yet input to the terminal still registers as stdin activity, resulting in immediate SIGTTIN. If the process is continued with SIGCONT, it attempts to read from stdin and gets SIGTTIN again, and so on.

Ignoring SIGTTIN appears to make this situation better. The program gets errors when reading from stdin when in the background, effectively ignoring this input.

Ignoring SIGTTOU is done for completeness. This will allow flux-job attach to write to the terminal from the background, even for ttys with the TOSTOP output mode set (though this is thought to be rare).

Fixes flux-framework#2599
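As a rough illustration of the approach in this commit message, here is the same idea in Python using the standard `signal` module. The actual change is in the C source of flux-job; the function name `ignore_tty_stop_signals` is hypothetical and exists only for this sketch.

```python
import signal


def ignore_tty_stop_signals():
    """Ignore SIGTTIN and SIGTTOU; return the previous handlers.

    Hypothetical helper for illustration only. Must be called from the
    main thread, as required by signal.signal().
    """
    previous = {}
    for signum in (signal.SIGTTIN, signal.SIGTTOU):
        # With SIG_IGN installed, a background read from the tty fails with
        # an error instead of stopping the process, and background writes
        # are allowed even if the tty has TOSTOP set.
        previous[signum] = signal.signal(signum, signal.SIG_IGN)
    return previous


if __name__ == "__main__":
    ignore_tty_stop_signals()
    print(signal.getsignal(signal.SIGTTIN) == signal.SIG_IGN)
```

Returning the previous handlers lets a caller restore the default dispositions later, for example before exec'ing a child that should keep normal job-control behaviour.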
I launched a job under totalview control like
totalview -oldui -verbosity error --args flux mini run -N 2 -n 2 -o stop-tasks-in-exec ./sleep_print 360 5
Once the job started running, I group-detached from the job so totalview is completely detached both from the parallel program and flux-job itself and then I exited the tool. The flux job attach process continued to print out the outputs until I entered something into the terminal. But once I typed in something into the controlling terminal, the system puts flux job attach into the stop state. Probably, SIGTSTP is raised, which caused this process to stop.

This is problematic, though, because it turned out this process doesn't show up as a job with the jobs command in the original shell (so that I could switch it back to the foreground). Sending SIGCONT doesn't continue it either.

I didn't see this issue on srun. So there must be some tricks that it uses... Maybe handle SIGTSTP more gracefully?