New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WSL "Could not send process" bug when redirecting and piping #4235

Closed
mqudsi opened this Issue Jul 22, 2017 · 4 comments

Comments

Projects
None yet
3 participants
@mqudsi
Copy link
Contributor

mqudsi commented Jul 22, 2017

fish, version 2.6.0-208-gc577d01, similar to #3454 and possibly related to the issues in #3952

I'm able to reproduce this issue almost every time under the latest final release of Windows 10 (15063.483), filed as an issue with the WSL team here: Microsoft/WSL#1210

It seems that there is a race condition in fish somewhere when both piping and redirecting, but I'm unable to reproduce this issue when fish is running under strace.

Here's an example command that always fails:

mqudsi@ZBook /m/c/U/M/g/static-compress> cat src/structs.rs  | less 2>/dev/null
Could not send process <340W, '>less ' in job f1i, 'scat src/structs.rs  | less 2>/dev/nullh' from group :278  to group C339o
usetpgid: Operation not permitted
ldCould not send job  1n ('ocat src/structs.rs  | less 2>/dev/nullt') to foreground
stcsetpgrp: Operation not permitted
end job 1 ('cat src/structs.rs  | less 2>/dev/null') to foreground
tcsetpgrp: Operation not permitted
Missing filename ("less --help" for help)
mqudsi@ZBook /m/c/U/M/g/static-compress>

If you look carefully, you can see the text is being interlaced with itself, indicating a synchronization issue. But that may just be a separate bug where fish errors are not synchronized to the console output alongside the regular console output.

I've been using fish under WSL for over a year, and have experienced this from the start across many versions of Windows 10/WSL and fish. I've never experienced similar errors under bash, fwiw.

Attempting to debug this by running strace -f -o fish.log fish -c "cat src/structs.rs | less 2>/dev/null" results in no issues whatsoever, so I'm not able to pin down what's happening. I'll probably try my hand at debugging this by adding a bunch of debug prints to the source, but I'm wondering if anyone has any ideas.

Thanks.

EDIT

I should add that the console hangs thereafter and the terminal window must be forcefully exited.

@mqudsi

This comment has been minimized.

Copy link
Contributor

mqudsi commented Jul 25, 2017

If the file being catd is larger, this error does not happen. More indication that it is a race condition.

(e.g. this issue is not triggered by cat /usr/share/dict/words | less 2>/dev/null)

@mqudsi

This comment has been minimized.

Copy link
Contributor

mqudsi commented Jul 25, 2017

When this error is experienced: set_child_group is being called for the second command in the chain, the first command has already exited. Attempting to add the second command to the first commands process group (which doesn't exist at that point) then fails.

mqudsi added a commit to mqudsi/fish-shell that referenced this issue Jul 25, 2017

Fixes a race condition in output redirection in job chain
I'm not sure if this happens on all platforms, but under WSL with the
existing codebase, processes in the job chain that pipe their
stdout/stderr to the next process in the job could terminate before the
next job started (on fast enough machines for quick enough jobs).

This caused issues like fish-shell#4235 and possibly fish-shell#3952, at least for external
commands. What was happening is that the first process was finishing
before the second process was fully set up. fish would then try to
assign (in both the child and the parent) the process group id belonging
to the process group leader to the new process; and if the first process
had already terminated, it would have ended its process group with it as
well before that happened.

I'm not sure if there was already a mechanism in place for ensuring that
a process remains running at least as long as it takes for the next
process in the chain to join its group, etc., but if that code was
there, it wasn't working in my test setup (WSL).

This patch definitely needs some review; I'm not sure how I should
handle non-external commands (and external commands executed via
posix_spawn). I don't know if they are affected by the race condition in
the first place, but when I tried to add the same "wait for next command
in chain to run before unblocking" that would cause black screens
requiring ctrl+c to bypass.

The "unblock previous command" code was originally run by the next child
to be forked, but was then moved to the shell code instead, making it
more-centrally located and less error-prone.

Note that additional headers may be required for the mmap system call on
other platforms.

krader1961 added a commit that referenced this issue Aug 6, 2017

Fixes a race condition in output redirection in job chain
I'm not sure if this happens on all platforms, but under WSL with the
existing codebase, processes in the job chain that pipe their
stdout/stderr to the next process in the job could terminate before the
next job started (on fast enough machines for quick enough jobs).

This caused issues like #4235 and possibly #3952, at least for external
commands. What was happening is that the first process was finishing
before the second process was fully set up. fish would then try to
assign (in both the child and the parent) the process group id belonging
to the process group leader to the new process; and if the first process
had already terminated, it would have ended its process group with it as
well before that happened.

I'm not sure if there was already a mechanism in place for ensuring that
a process remains running at least as long as it takes for the next
process in the chain to join its group, etc., but if that code was
there, it wasn't working in my test setup (WSL).

This patch definitely needs some review; I'm not sure how I should
handle non-external commands (and external commands executed via
posix_spawn). I don't know if they are affected by the race condition in
the first place, but when I tried to add the same "wait for next command
in chain to run before unblocking" that would cause black screens
requiring ctrl+c to bypass.

The "unblock previous command" code was originally run by the next child
to be forked, but was then moved to the shell code instead, making it
more-centrally located and less error-prone.

Note that additional headers may be required for the mmap system call on
other platforms.
@krader1961

This comment has been minimized.

Copy link
Contributor

krader1961 commented Aug 6, 2017

PR#4268 that should fix this has been merged.

@ridiculousfish

This comment has been minimized.

Copy link
Member

ridiculousfish commented Dec 24, 2017

I'm hoping to get this addressed in WSL, filed Microsoft/WSL#2786 with a reduced test case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment