Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang when exiting subshell #7060

Closed
dnicolson opened this issue May 29, 2020 · 29 comments
Closed

Hang when exiting subshell #7060

dnicolson opened this issue May 29, 2020 · 29 comments
Labels
bug Something that's not working as intended
Milestone

Comments

@dnicolson
Copy link
Contributor

If the following code is run as go run subshell.go, the terminal session in iTerm2 can become unresponsive and cause the fish process to run at 99% CPU.

package main

import "os"
import "os/exec"

func main() {
	cmd := exec.Command(os.Getenv("SHELL"), "--login")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Stdin = os.Stdin
	cmd.Run()
}
  1. go run subshell.go
  2. Hit Control-C
  3. Hit enter
  4. Hit enter

This results in the following output before the terminal session becomes unresponsive:

$ go run subshell.go
$ signal: interrupt
$ ⏎
$
$
$

The output of the Terminal app is similar, except the terminal session closes and fish stops running.

Performing the above steps with fish 2.7.1 does not have the same issue.

fish 3.1.2
Darwin Kernel Version 19.5.0: root:xnu-6153.121.1~7/RELEASE_X86_64 x86_64
xterm-256color
iTerm 3.3.11beta1

@mqudsi
Copy link
Contributor

mqudsi commented May 29, 2020

That just launches a bash environment for me and nothing breaks (tested under conhost and kitty).

What is your $SHELL?

@dnicolson
Copy link
Contributor Author

dnicolson commented May 30, 2020

$SHELL is /usr/local/bin/fish.

@mqudsi

This comment has been minimized.

@mqudsi
Copy link
Contributor

mqudsi commented May 31, 2020

Ok, I think this one is caused by the fact that fish doesn't forcibly create a new process group with control of the terminal. When you run the go code in your shell, the shell creates a new process group for the go process. go is the only command in this process group and is therefore the process group leader.

When go launches fish, its a naive execution that doesn't do any form of job control. fish is launched into go's process group and inherits all its handles. The go process is still the process group leader, and the process group has full control of the terminal.

When the go process exits (when you hit ^C), the process group is now orphaned (it has no members whose parents are members of another process group in the same session session). Whatever shell you ran that command in (bash, fish, etc) was waiting for go to exit so that it could retake control of the terminal. And retake control of the terminal it does, since it was the parent of that process group leader.

Two different issues are coming into play here:

  1. fish doesn't create a new process group for itself on startup (except under some exceptional circumstances, e.g. firejail) so the shell that spawned go is free to regain control of the terminal
  2. when go exits and is reaped by its parent and the subshell fish becomes a member of an orphaned process group that no longer has control of the terminal, its call to read/readch from STDIN triggers a SIGTTIN (attempting to read from a tty it does not control). The fish sighandler catches this and basically does nothing fruitful about it, ultimately letting the read call continue after failing (with EIO, I believe) immediately afterwards.

The first point is somewhat deliberate, although it might be worth revisiting given the code cleanup that we've done since that time. The basic gist is that by not creating our own process group, we assume that we are being launched by a shell or something else smart enough to do its own job control. But under that assumption, the right thing to do with a SIGTTIN would be to suspend (SIGSTOP) ourselves until we've been given control of the terminal again (not sure how that would happen apart from being directly fg'd by the user, but that is the assumption). If we didn't override the SIGTTIN or we handled it by cleaning up and then pausing, it would be OK (for some definition of that word): the fish subshell, no longer in control of the terminal and not in the foreground, would stop trying to read from STDIN or write to STDOUT and CPU usage would remain at 0.

But what ends up happening is that we (similarly no longer in control of the terminal and similarly no longer in the foreground) ignore the SIGTTIN, keep trying to read from STDIN, keep getting SIGTTIN, rinse and repeat - i.e. the worst of both worlds.

(If we did create our own process group on launch, we'd still be in control of the terminal although go wouldn't be any more and you can imagine the problems that would cause if the go process tried to read/write to the terminal. It's really a lose-lose.)

@ridiculousfish
Copy link
Member

See this blast from the past.

fish has special logic to try to notice when it is orphaned. Perhaps it needs to be updated.

@ridiculousfish ridiculousfish added this to the fish-future milestone May 31, 2020
@mqudsi
Copy link
Contributor

mqudsi commented May 31, 2020

Ah yes, that SO post. I remember coming across it many moons ago!

One option, even if not the most elegant, is to just SIGHUP; logically the situation isn’t too different from a terminal window closing on us, with similarly slim-to-no chance of regaining control of the tty. (We wouldn’t do this automatically on SIGTTIN, we’d have to first ascertain we’ve been actually orphaned, otherwise we could have been backgrounded instead - in which case we should probably SIGSTOP until we have control of the terminal again).

@ridiculousfish
Copy link
Member

This is pretty amusing:

  1. fish launches go launches fish. Call these "fish1" and "fish2". go and fish2 are in the same process group.
  2. SIGINT is delivered. go exits. fish2 is interactive, it swallows SIGINT. Now fish2 is orphaned.
  3. fish1 sees that go has exited and reclaims the terminal.
  4. fish2 soon after decides to exit gracefully - probably it thinks stdin is closed. One of the last things it does is "helpfully" restore the foreground process group to what it initially saw, with is go's orphaned group.
  5. This in turn steals the tty from fish1 without fish1 noticing. So we get the SIGTTIN.

The right fixes here are:

  1. Probably fish should not restore the foreground process group if it is orphaned. But it's hard to know that for sure - probably kill(0, getpgrp()) and check for ESRCH.

  2. Definitely fish should be better about handling SIGTTIN, but it's not obvious how. The usual approach when you get SIGTTIN is to wait until you have ownership of the terminal, but here ownership won't ever arrive.

@mqudsi
Copy link
Contributor

mqudsi commented May 31, 2020

See my comment above about alternative options. We were posting simultaneously.

I think an easy win is on graceful exit to use tcgetpgrp before using tcsetpgrp. If we’re no longer in control of the tty (pgrp doesn’t match ours or we get ENOTTY) then we shouldn’t issue a tcsetpgrp. (I also don’t see how tcsetpgrp can succeed if “fish2” issues it and doesn’t have control of the tty and is merely a process in the same session?)

@ridiculousfish
Copy link
Member

I also don’t see how tcsetpgrp can succeed if “fish2” issues it and doesn’t have control of the tty and is merely a process in the same session?

Heh, it's a bit of Unix sillyness. tcsetpgrp is allowed by a non-foreground proc if SIGTTOU is blocked, which fish does. So then that's the right idea here - unblock SIGTTOU and then tcsetpgrp will fail. Clever.

@ridiculousfish
Copy link
Member

Fixed as 3ae91f1. Awesome debugging @mqudsi! Process groups sure are "fun".

@ridiculousfish
Copy link
Member

Also thanks to @dnicolson for filing.

@mqudsi
Copy link
Contributor

mqudsi commented May 31, 2020

Wow, that was fast! 👍

I do wish man pages were more explicit:

If tcsetpgrp() is called by a member of a background process group in its session, and the calling process is not blocking or ignoring SIGTTOU, a SIGTTOU signal is sent to all members of this background process group.

What I'm not seeing is explicit mention that the call will actually fail, in addition to the SIGTTOU being sent.

More worryingly, FreeBSD's man page makes no mention of the SIGTTOU exception:

If the process has a controlling terminal, the tcsetpgrp() function sets the foreground process group ID associated with the terminal device to pgrp_id. The terminal device associated with fd must be the controlling terminal of the calling process and the controlling terminal must be currently associated with the session of the calling process. The value of pgrp_id must be the same as the process group ID of a process in the same session as the calling process.

The tcsetpgrp() function will fail if:

  • [EBADF] The fd argument is not a valid file descriptor.
  • [EINVAL] An invalid value of pgrp_id was specified.
  • [ENOTTY] The calling process does not have a controlling terminal, or the file represented by fd is not the controlling terminal, or the controlling terminal is no longer associated with the session of the calling process.
  • [EPERM] The pgrp_id argument does not match the process group ID of a process in the same session as the calling process.

@mqudsi
Copy link
Contributor

mqudsi commented May 31, 2020

Update: that's funny, despite it being not mentioned in the FreeBSD man pages, your SIGTTOU fixes it on FreeBSD but not on WSL/Linux?

On FreeBSD reader_force_exit() is called in the fish2 process by the event loop in reader.cpp after event_needing_handling->is_eof() returns true, then the patched SIGTTOU code kicks in and causes the tcsetpgrp call to fail, but on WSL that never triggers and the read() call just keeps returning early with EINTR caused by the SIGTTIN triggered by the read itself - the patched code is never even reached. I need to run this on a real linux to see if it behaves the same.

EDIT: Just WSL. sigh

@ridiculousfish
Copy link
Member

Yeah WSL is probably different. I forget where I learned about the creepy-action-at-a-distance of SIGTTOU <-> tcsetpgrp but it's basically there just so shells don't break.

@zanchey zanchey added the bug Something that's not working as intended label Jun 1, 2020
@dnicolson
Copy link
Contributor Author

@ridiculousfish that change fixed the original example.

I am still getting hangs though, it looks like the problem relates to binaries:

  1. go build subshell.go
  2. ./subshell
  3. Hit up
  4. Hit Control-C
  5. Hit enter

@ridiculousfish ridiculousfish reopened this Jun 2, 2020
@ridiculousfish
Copy link
Member

@dnicolson I cannot reproduce hangs with your latest steps. Is it possible that "$SHELL" is referring to a fish without the fix? Perhaps you have not yet installed from master?

@dnicolson
Copy link
Contributor Author

I think the original commands I used to install the latest fish from master were:

$ brew unlink fish
$ brew install fish --HEAD
$ echo $SHELL
/usr/local/bin/fish
$ /usr/local/bin/fish -v
fish, version 3.1.2-771-g3ae91f197

Here is an asciinema recording, although the problem doesn't exist when recording:
https://asciinema.org/a/xk7Xg2VcDKITYOiDJxcE2WRIs

Different terminals give different results:

  • iTerm2 results in hung fish process
  • kitty just closes the window
  • Terminal shows some history/session output then closes the session

All terminals perform correctly if step 3 is omitted.

@mqudsi
Copy link
Contributor

mqudsi commented Jun 3, 2020

I can't reproduce it on FreeBSD but can on Fedora (I'm guessing @ridiculousfish tried it on a mac).

The spawned fish instance demonstrates the same behavior WSL was displaying, bouncing between SIGTTIN and read and never tries to exit (not reaching the patched code).

mqudsi added a commit to mqudsi/fish-shell that referenced this issue Jun 3, 2020
This is the only way to break out of a tight SIGTTIN/EINTR loop that
causes fish to take up 100% cpu when its process group is orphaned and
no longer has control of the tty. See fish-shell#7060.
@mqudsi
Copy link
Contributor

mqudsi commented Jun 3, 2020

This branch checks for SIGTTIN (sigint_checker_t adapted for use with other signals) and fixes the problem with the inner fish instance getting stuck in SIGTTIN, but I think it breaks the parent fish instance (which also gets a SIGTTIN but previously used to ignore it, I think?): https://github.com/mqudsi/fish-shell/tree/sigttin-checks

mqudsi added a commit to mqudsi/fish-shell that referenced this issue Jun 3, 2020
This is the only way to break out of a tight SIGTTIN/EINTR loop that
causes fish to take up 100% cpu when its process group is orphaned and
no longer has control of the tty. See fish-shell#7060.
@ridiculousfish
Copy link
Member

@mqudsi your proposed changes look great to me!

@mqudsi
Copy link
Contributor

mqudsi commented Jun 4, 2020

@ridiculousfish thanks, but unfortunately it breaks the host fish session somehow; as mentioned it too gets a SIGTTIN and this patch stops it from being able to pretend it didn't happen. It was 2 am though, so I called it a night.

The parent fish session is supposed to be stuck waiting for the child to exit and when ctrl-c is pressed, it takes that as a signal to resume execution. But it's actually not in control of the terminal because the spawned fish instance is still running in the pgrp that was given control of the tty. In particular, is_foreground() in proc.cpp returns true (because the job was created in the foreground) but the checks on lines 1030-1035 only enumerate processes known to exist (e.g. created by fish):

fish-shell/src/proc.cpp

Lines 1029 to 1035 in 4dff15b

if (is_foreground() && is_completed()) {
// Set $status only if we are in the foreground and the last process in the job has
// finished and is not a short-circuited builtin.
auto &p = processes.back();
if (p->status.normal_exited() || p->status.signal_exited()) {
parser.set_last_statuses(get_statuses());
}

We assume this is OK because term_steal() is called which regains control of the tty (which causes SIGTTIN to ultimately fire in the child, because it's been forcibly stripped of tty control), but the sequence of keys pressed by the user creates a race condition where fish1 detects go exited and calls term_steal() but fish2 starts a new loop to redraw the prompt when ctrl-c is not a no-op, taking control of the tty back! They race for control of the terminal, and even though fish2 gives control back to fish1 before exiting, it's too late because fish1 has also run into SIGTTIN (because ownership of the ctty changed in between the call to term_steal() and the call to read()) and forces itself to close (returning the controlling tty to init/sshd/whatever).

Before the SIGTTIN patch in branch, both fish sessions would consume 100% cpu while bouncing between read/eint and SIGTTIN but the parent fish would break out of that tight loop when the child fish instance lost the tussle for control. With the patch, they both surrender -- just one of them too soon.

@ridiculousfish
Copy link
Member

Nice detective work. We should understand whether/how bash handles this case.

mqudsi added a commit that referenced this issue Jun 20, 2020
This makes it possible to expand the signals checked by the type. I can't merge
the sigttin fixes for #7060 yet because they introduce new breakage, but this
will make merging any future fix easier.
mqudsi added a commit to mqudsi/fish-shell that referenced this issue Jun 20, 2020
This is the only way to break out of a tight SIGTTIN/EINTR loop that
causes fish to take up 100% cpu when its process group is orphaned and
no longer has control of the tty. See fish-shell#7060.
mqudsi added a commit to mqudsi/fish-shell that referenced this issue Jun 29, 2020
This is the only way to break out of a tight SIGTTIN/EINTR loop that
causes fish to take up 100% cpu when its process group is orphaned and
no longer has control of the tty. See fish-shell#7060.
@dnicolson
Copy link
Contributor Author

On macOS Big Sur (20A5343i) the behaviour is a little different, the session terminates now instead of hanging with v3.1.2.

Using the latest version (3.1.2-1186-g86b02278b), on both Catalina and Big Sur random characters are output at the end:

$ go run subshell.go
$ signal: interrupt
$
[?2004l

Should this be a new issue?

@faho
Copy link
Member

faho commented Aug 9, 2020

random characters are output at the end:

Those aren't random characters. The disable sequence for bracketed paste is \e[?2004l, so most likely something is swallowing the escape (\e), causing the rest to be displayed.

Weird that it's the disable sequence tho because fish fires that on preexec (before executing a commandline) or exit. It enables bracketed paste for itself.

@dnicolson
Copy link
Contributor Author

I'm not sure if it's entirely random, but I was able to produce other characters such as:
?3200m0(lB⏎
??22000044lh
04l

@faho
Copy link
Member

faho commented Aug 9, 2020

I'm not sure if it's entirely random, but I was able to produce other characters such as:

Those are scrambled escape sequences.

Colors look like \e[30m (black in this case), "reset" is \e(B\e[m.

This seems either like something is still reading here, or multiple things are writing.

@ridiculousfish
Copy link
Member

So we still have a race - this time it's between the child fish responding to SIGINT, and the parent fish attempting to reclaim the tty. What ends up happening is that child fish mucks with the tty modes and then exits, and parent fish is left with a bad tty.

I think the simplest workaround here is to just arrange for fish to be in its own pgroup - that's what happens with bash.

@ridiculousfish
Copy link
Member

That's what I did. It's no good trying to run interactively in someone else's pgroup.

@faho
Copy link
Member

faho commented Dec 6, 2020

This could also fix #5699.

ridiculousfish added a commit to ridiculousfish/fish-shell that referenced this issue Apr 6, 2021
When fish starts, it notices which pgroup owns the tty, and then it
restores that pgroup's tty ownership when it exits. However if fish does
not own the tty, then (on Mac at least) the tcsetpgrp call triggers a
SIGSTOP and fish will hang while trying to exit.

The first change is to ignore SIGTTOU instead of defaulting it. This
prevents the hang; however it risks re-introducing fish-shell#7060.

The second change somewhat mitigates the risk of the first: only do the
restore if the initial pgroup is different than fish's pgroup. This
prevents some useless calls which might potentially steal the tty from
another process (e.g. in fish-shell#7060).
ridiculousfish added a commit that referenced this issue Apr 6, 2021
When fish starts, it notices which pgroup owns the tty, and then it
restores that pgroup's tty ownership when it exits. However if fish does
not own the tty, then (on Mac at least) the tcsetpgrp call triggers a
SIGSTOP and fish will hang while trying to exit.

The first change is to ignore SIGTTOU instead of defaulting it. This
prevents the hang; however it risks re-introducing #7060.

The second change somewhat mitigates the risk of the first: only do the
restore if the initial pgroup is different than fish's pgroup. This
prevents some useless calls which might potentially steal the tty from
another process (e.g. in #7060).
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something that's not working as intended
Projects
None yet
Development

No branches or pull requests

5 participants