Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-job attach not behaving nicely when detached from debugger #2599

Closed
dongahn opened this issue Jan 3, 2020 · 17 comments
Closed

flux-job attach not behaving nicely when detached from debugger #2599

dongahn opened this issue Jan 3, 2020 · 17 comments

Comments

@dongahn
Copy link
Member

dongahn commented Jan 3, 2020

I launched a job under totalview control like

totalview -oldui -verbosity error --args flux mini run -N 2 -n 2 -o stop-tasks-in-exec ./sleep_print 360 5

Once the job started running, I group-detached from the job so totalview is completely detached both from the parallel program and flux-job itself and then I exited the tool. flux job attach process continued to print out the outputs until I entered something into the terminal. But once I typed in something into the controlling terminal, the system puts flux job attach into the stop state. Probably, SIGTSTP is raised, which caused this process to stop.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ahn1     14083  0.0  0.2  58156  5080 pts/0    T    01:50   0:00 flux-job attach 250646959751168

This is problematic, though, because it turned out this process doesn't show up as a job with the jobs command in the original shell (so that I could switch it back to the foreground). Sending SIGCONT doesn't continue it either.

I didn't see this issue on srun. So there must be some tricks that it uses... Maybe handle SIGTSTP more gracefully?

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

I sent an email to @jdelsign to get his opinion about this. He may have seen this misbehavior from other resource managers.

From yesterday's discussion, we may be able to get around this issue by installing a signal handler for SIGTSTP within flux-job and ignore this signal. Unlike SIGSTOP, SIGTSTP should be maskable.

But if we ignore it, one concern would be stdin. We probably don't want a background process to access stdin interfacing with the foreground process.

@grondo
Copy link
Contributor

grondo commented Jan 8, 2020

@dongahn, I tried to reproduce this issue simply by running a job with some output and attaching to flux job attach with gdb, then detaching. I wasn't able to reproduce the issue in this manner. Am I obviously missing something here?

I also had no problem putting flux job attach in the background and bringing it back into the foreground.

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

But if we ignore it, one concern would be stdin. We probably don't want a background process to access stdin interfacing with the foreground process.

Sorry, I spoke too early. The SIGTSTP signal handler essentially is so that the background process doesn't interface with stdin...

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

@dongahn, I tried to reproduce this issue simply by running a job with some output and attaching to flux job attach with gdb, then detaching. I wasn't able to reproduce the issue in this manner. Am I obviously missing something here?

Hmmm. Ok. Let me try this w/ gdb as well. Maybe looking at the difference between totalview and gdb, the problem becomes more clear.

I also had no problem putting flux job attach in the background and bringing it back into the foreground.

Yes, fg/bg of the processes spawned at the shell should work fine.

The problem for me was when flux-job was forked and exec'ed by totalview. Presumably, this flux-job process doesn't belong to the process group of the top level shell. And when I detached the controlling totalview from this process, it became orphaned with its PPID becomes 1, which appeared to be outside the purview of job control at the top shell.

@grondo
Copy link
Contributor

grondo commented Jan 8, 2020

it became orphaned with its PPID becomes 1, which appeared to be outside the purview of job control at the top shell.

Ah, that is the difference then. I was running flux job attach from one shell and attaching gdb from another.

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

From @jdelsign:

This one I'm not sure I understand 100%. I do know that job control in the shell (to arbitrate access to stdin) is a trick business. It's been a while since I looked at it in detail, but IIRC, the shell manages a "pid" property in the tty so that the kernel knows which process(es) currently own(s) the TTY. If a process tries to read from stdin but doesn't own the TTY at that moment, it is sent a SIGTSTP.

TotalView doesn't do anything special for job control, but maybe it should. Dunno.

@grondo
Copy link
Contributor

grondo commented Jan 8, 2020

I used the following python script to orphan the flux job attach process and try to reproduce the original issue reported above:

import os
import sys
if not os.fork():
    os.execvp(sys.argv[1], sys.argv[1:])
$ flux mini submit -n1 sh -c 'while :; do sleep 3; date; done'
242212667392
(flux-PYQbrh) grondo@fluke108:~/git/f.git$ python3 ./fork.py flux job attach 242212667392
(flux-PYQbrh) grondo@fluke108:~/git/f.git$ Wed Jan  8 11:59:52 PST 2020
Wed Jan  8 11:59:55 PST 2020
Wed Jan  8 11:59:58 PST 2020
Wed Jan  8 12:00:01 PST 2020
Wed Jan  8 12:00:04 PST 2020
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
grondo   18066  0.0  0.0  68484  2888 pts/5    S    12:00   0:00 /g/g0/grondo/git/f.git/src/cmd/.libs/lt-flux-job attach 242212667392

As you can see the process wasn't stopped, and was still able to write output to the terminal, though it wasn't processing input. The flux job attach process was reparented to init and couldn't be foregrounded.

So I still don't have a good reproducer for the TV issue.

When running this test with srun, were you able to foreground the srun process after detaching from TV?

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

@grondo: You are almost there. While flux-job is printing to the terminal, type any character to the terminal and enter at the top shell so that SIGTSTP will be raised.

ahn1@5bd83ef62aec:/usr/src$ flux mini submit -n1 sh -c 'while :; do sleep 3; date; done'
28957474816
ahn1@5bd83ef62aec:/usr/src$ python3 ./fork.py flux job attach 28957474816
ahn1@5bd83ef62aec:/usr/src$ MPIR_being_debugged: 0
Wed Jan  8 20:53:46 UTC 2020
Wed Jan  8 20:53:49 UTC 2020
Wed Jan  8 20:53:52 UTC 2020

ahn1@5bd83ef62aec:/usr/src$ ps x
  PID TTY      STAT   TIME COMMAND
    1 pts/0    Ss     0:00 /bin/bash
 9927 pts/0    S      0:00 start -s 2
 9928 pts/0    Sl     0:00 /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-9927-gsYBCE --setattr=tbon.endpoint=ipc://%B/req
 9929 pts/0    Sl     0:00 /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-9927-gsYBCE --setattr=tbon.endpoint=ipc://%B/req
 9977 pts/0    S      0:00 /bin/bash
 9983 pts/0    S      0:00 /usr/libexec/flux/flux-shell 28957474816
 9984 pts/0    S      0:00 sh -c while :; do sleep 3; date; done
 9991 pts/0    T      0:00 job attach 28957474816
 9995 pts/0    S      0:00 sleep 3
 9996 pts/0    R+     0:00 ps x

@grondo
Copy link
Contributor

grondo commented Jan 8, 2020

That is essentially what I did above.

Just a hunch, have you reproduced this behavior outside of Docker on the Mac?

(flux-LquY9G):~/git/f.git$ flux mini submit -n1 sh -c 'while :; do sleep 3; date; done'
90764738560
(flux-LquY9G):~/git/f.git$ python3 ./fork.py flux job attach 90764738560
(flux-LquY9G):~/git/f.git$ Wed Jan  8 14:02:12 PST 2020
Wed Jan  8 14:02:15 PST 2020
Wed Jan  8 14:02:18 PST 2020
Wed Jan  8 14:02:21 PST 2020
ps auxw | grep attach
grondo   26066  0.0  0.0  68484  2888 pts/5    S    14:02   0:00 /g/g0/grondo/git/f.git/src/cmd/.libs/lt-flux-job attach 90764738560
grondo   26115  0.0  0.0   9096   668 pts/5    S+   14:02   0:00 grep attach
(flux-LquY9G):~/git/f.git$ Wed Jan  8 14:02:27 PST 2020
Wed Jan  8 14:02:30 PST 2020
Wed Jan  8 14:02:33 PST 2020
kill -INT 26066
flux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach
(flux-LquY9G):~/git/f.git$ Wed Jan  8 14:02:36 PST 2020
kill -INT 26066
28.270s: job.exception type=cancel severity=0 interrupted by ctrl-C
flux-job: task(s) exited with exit code 143

@dongahn
Copy link
Member Author

dongahn commented Jan 8, 2020

Just a hunch, have you reproduced this behavior outside of Docker on the Mac?

I haven't tried this. But this maybe it. (Either Docker on Mac or the Linux kernel level being used in Ubuntu: Linux 5bd83ef62aec 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux)

@grondo
Copy link
Contributor

grondo commented Jan 9, 2020

I was able to reproduce this problem under Docker on both Linux and MacOS.
Perhaps some difference in the way the pty works under docker run.

It turns out that flux job attach is being stopped by SIGTTIN. When you try to send SIGCONT to the stopped process, it continues, immediately tries to read from stdin and is sent SIGTTIN again.

I'm confused why the process is even trying to read from stdin -- it seems like a process in the "background" (i.e. without access to the tty) shouldn't register activity on the stdin file descriptor, but maybe I'm wrong about that.

In testing, setting SIGTTIN to SIG_IGN worked around this problem, though I also had to add some other minor fixes to stop watching stdin on the first error.

I'll try to post a proposed fix soon.

@dongahn
Copy link
Member Author

dongahn commented Jan 9, 2020

Oh cool!

I'm confused why the process is even trying to read from stdin

Does flux-job attach have stdin forwarding?

it seems like a process in the "background" (i.e. without access to the tty) shouldn't register activity on the stdin file descriptor, but maybe I'm wrong about that.

Maybe a kernel difference or pseudo terminal semantics difference under docker run like you suspected.

setting SIGTTIN to SIG_IGN worked around this problem

How did you do this out of curiosity?

@grondo
Copy link
Contributor

grondo commented Jan 9, 2020

Does flux-job attach have stdin forwarding?

Yes.

How did you do this out of curiosity?

At first I tried adding a signal watcher for SIGTTIN in the hopes that we could shut down the stdin watcher from there. However, for some reason this didn't work. The handler was never called and instead the flux-job process spun on read() from stdin (getting an error from each iteration).

What worked better was to just add

signal (SIGTTIN, SIG_IGN);

In the flux-job attach code.

@grondo
Copy link
Contributor

grondo commented Jan 9, 2020

BTW, I learned while working on this issue that SIGTTOU is only generated when the terminal has the TOSTOP mode set (stty tostop). This is not the default. A process can also ignore SIGTTOU so that it is not stopped even when the TOSTOP terminal mode is active. Not sure if that makes sense for flux job attach.

@dongahn
Copy link
Member Author

dongahn commented Jan 9, 2020

Not sure if that makes sense for flux job attach.

I think it makes sense to add this to make the command tostop-terminal-free.

@grondo
Copy link
Contributor

grondo commented Jan 9, 2020

Ok, I will propose a PR that ignores both SIGTTIN and SIGTTOU in flux-job attach, along with some other fixes I found along the way. Thanks!

@dongahn
Copy link
Member Author

dongahn commented Jan 9, 2020

Thanks!

grondo added a commit to grondo/flux-core that referenced this issue Jan 10, 2020
Problem: Under certain circumstances, flux-job attach being sent
into the background causes the program to be stopped without the
ability to resume it by bringing to foreground, nor via SIGCONT.
Specifically, this has been seen to occur when detaching a debugger
like TotalView. The flux-job process is then orphaned (reparented to
init) so is not under purview of shell job control, yet input to the
terminal still registers as stdin activity, resulting in immediate
SIGTTIN. If the process is continued with SIGCONT, it attempts to
read from stdin and gets SIGTTIN again, and so on.

Ignoring SIGTTIN appears to make this situation better. The program
gets errors when reading from stdin when in the background, effectively
ignoring this input.

Ignoring SIGTTOU is done for completeness. This will allow flux-job
attach to write to the terminal from the background, even for ttys
with the TOSTOP output mode set (though this is thought to be rare)

Fixes flux-framework#2599
@mergify mergify bot closed this as completed in 20bb6c1 Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants