Local terminal stdin detached if ThreadingGroup run includes sleep #1812

Closed
jensenak opened this Issue Jun 25, 2018 · 22 comments

@jensenak

jensenak commented Jun 25, 2018

I'm using a threading group to run shell commands. After running a script that includes a sleep, the local terminal is left with stdin detached (keystrokes not visible on command line) and the terminal has to be reset.

I've tried this quite a few times and find that it only happens with ThreadingGroups (SerialGroups are fine). The sleep command can appear anywhere in a one-liner (first command, middle, or last) and can be joined with either semicolons or double ampersands. All of the commands run as expected, but the terminal remains in a bad state.

Strangely, if the previous run exited with an uncaught exception, the terminal will not be affected.

To reproduce:

from fabric import ThreadingGroup as Group

# raise ValueError()
remotes = Group("host1.example.com", "host2.example.com")
result = remotes.run("echo 1; sleep 1; echo 2")

Run the script above. After it exits, type something on the command line. If you see no output, press Ctrl-C and type reset followed by Enter. To see the post-exception behavior, uncomment the raise line, run the code, comment the line out again, and run twice more. The first successful run will leave the terminal in a good state. The second will leave stdin detached.

I discovered this issue with sleep in my testing, but it's possible other commands may have the same effect. There's also a chance I'm just doing something wrong. If that's the case, my apologies.

My setup:
python 3.6.4
fabric 2.1.3
OSX 10.13.5, connecting to Ubuntu 14.04

@bitprophet

Member

bitprophet commented Jun 28, 2018

See #1814 as possible second reproducible problem case.

This sounds like a legit bug to me and I'm not sure what's causing it offhand. Smells like it could be a generic Unix issue with terminal pipes being attached to multiple subprocesses at once, or (esp going by #1814's example) a race condition around pipe state, or something like that.

Will try to reproduce and muddle out a cause/solution.

@bitprophet

Member

bitprophet commented Jun 28, 2018

Also, this probably requires fixing at the Invoke level and may be purely in its domain (insofar as I simply haven't done much with threading in a pure-Invoke context yet; but see e.g. pyinvoke/invoke#194 - it's a thing that should happen there too). In that case I'll move this to a ticket there, and the Fabric "fix" would be to upgrade one's Invoke once the fix is released.

@bitprophet bitprophet added this to the 2.0.x milestone Jun 28, 2018

@nicktimko

nicktimko commented Jun 28, 2018

I was on Ubuntu 16.04.2 connecting to the same.

@bitprophet

Member

bitprophet commented Jul 20, 2018

Another report of same issue in #1829. This is in my next-bugfix milestone and I'll be focusing on that hopefully next OSS day (Mon).

@bitprophet

Member

bitprophet commented Jul 30, 2018

I just tried to reproduce this (2.0 branch, Python 3.6.4, macOS 10.12) and was unable to, unfortunately. First tried double-localhost, then two separate remote cloud instances, no dice either way; my terminal's fine afterwards.

Going to try on a Linux container in a bit just in case that helps, but since the OP was also on macOS, not sure it'll make a difference. I'll also try running it in a loop to see if it's only an occasional repro.

I'll also try it on 2.1 in case we somehow introduced it in 2.1, tho this seems very unlikely.

@jensenak @nicktimko are you reproducing this 100% of the time? 50%? 5%?

@nicktimko

nicktimko commented Jul 30, 2018

@bitprophet on 2.1.3 it was happening in my actual workflow fairly often (>80%, I was also going in parallel to 6 servers, not 2), though in my contrived example from #1814 it's much lower, maybe 20%. I can try to come up with a Docker setup, or failing that a Vagrant setup to repro.

@jensenak

jensenak commented Jul 30, 2018

@bitprophet This has been 100% of the time for me. Just to be sure, I started a fresh virtualenv with only fabric installed. I tested 2.0, 2.1, and 2.2. The example code I pasted produced the described behavior each time. In all tests, I was connecting to Ubuntu 14.04 remotes.

I am on a different version of OSX (10.13). Perhaps that's related? Though @nicktimko was not on OSX at all.

In case another version is an issue, here's what pip freeze looked like in my virtualenv:

asn1crypto==0.24.0
bcrypt==3.1.4
cffi==1.11.5
cryptography==2.3
fabric==2.2.1
idna==2.7
invoke==1.1.0
paramiko==2.4.1
pyasn1==0.4.4
pycparser==2.18
PyNaCl==1.2.1
six==1.11.0

Seeing as all these were installed as dependencies of fabric 2.2, I'd expect your versions will look similar.

If there's more I can do to help, I'm more than willing. Just not really sure where else to look.

@nicktimko

nicktimko commented Jul 30, 2018

What commit should I test with; have you made any changes recently that might affect things? I'll try with the freeze above, you could also provide another frozen reqs.txt and I can see if that works/doesn't for me.

@bitprophet

Member

bitprophet commented Jul 30, 2018

@nicktimko @jensenak Thanks for the extra info. I'll keep trying to repro it here; at 20% I'd definitely not have tried it enough to trigger. My remotes have been Mac and some older Debians, I can try Ubuntu Trusty in case it's somehow specific to that (which would be odd, but hey, this whole thing is odd.)

Also, what are your local shell environments? Mine is zsh in the builtin Terminal.app (again, macOS 10.12), inside tmux. I'll also try some permutations around that angle in a bit.

@bitprophet

Member

bitprophet commented Jul 30, 2018

AHA. This seems to be bash specific! Was still unable to repro under zsh outside of tmux, but the moment I try under bash, I immediately get the symptoms mentioned. Ditto inside of tmux, so tmux has no bearing - it's a shell thing.

Why this would behave differently under bash vs zsh, I have no immediate idea. Could be specific to how they are implemented, or (seems more likely) maybe something in my zsh dotfiles is preventing the problem? Will have to dig...though identifying a solution on the Python side is necessary either way most likely.

EDIT: also, the reproduction happens even when connecting to my localhost's sshd multiple times concurrently, which isn't too surprising. So the remote end seems to not matter.

@bitprophet

Member

bitprophet commented Jul 30, 2018

Also, I tried to verify the note about "previous run excepting prevents issue for next run only" but that did not occur for me; I get the behavior every time regardless.

@bitprophet

Member

bitprophet commented Jul 30, 2018

Moar: I removed the sleep to see what would happen; I am still able to reproduce, though it's now slightly more intermittent (though because this is not something easy to repro in an automatic loop, it's all by hand reproduction, meaning low number of test cases, meaning true % of occurrence is gonna be real hard to measure accurately.)

That's also good, the fewer weird-as-heck triggers, the better. This smells like it should be some basic, dumb threading mistake somewhere, which would typically not be impacted by anything specific on the remote or local end aside from length of time making a race condition (or w/e) more likely.

@bitprophet

Member

bitprophet commented Jul 30, 2018

Wondering if this is related to pyinvoke/invoke#552 which boils down to Invoke's exception handling thread subclass (used in ThreadingGroup here) possibly having screwed up thread-death detection.

Gonna have to make sure I understand that (its potential fix, pyinvoke/invoke#553, wasn't an insta-merge since it seemed odd that we'd have gotten something apparently functional, so wrong) and then see whether applying it makes this symptom disappear.

@nicktimko

nicktimko commented Jul 30, 2018

I removed the sleep to see what would happen; I am still able to reproduce, though it's now slightly more intermittent

Sounds like the test case I had, where I needed to hit it a few times before it bugged out. Seems like you have a good handle on it.

@jensenak

jensenak commented Jul 30, 2018

I noticed today that I also couldn't reproduce the Exception behavior I had described a month ago... unfortunately I don't remember what I was doing then. :/

I am indeed running bash here. Good find. The fact that the problem is intermittent without the sleep makes me wonder if this is a race condition of some kind.

@bitprophet

Member

bitprophet commented Jul 30, 2018

You say that, but now I can't repro it again, or at least it's VERY intermittent. Putting the sleep back in makes it come up a lot more often. Gotta love race conditions.

Looking at that Invoke issue, the reporter even mentions a messed-up terminal as a symptom; but strangely I cannot reproduce that symptom even under bash, with his code. Still would not be surprised if the root cause is the same (has to do with a few things around thread death and stdin being closed, or perhaps set back to linewise buffering, properly before exit).

@bitprophet

Member

bitprophet commented Jul 31, 2018

Checking the spots the other issue mentioned, against the repro case here:

  • the ExceptionHandlingThread.is_dead bit doesn't seem to matter, it's coming up presumably-correct, which makes some sense as it's intended to handle exceptions in the thread and none of these cases actually handle exceptions. is_dead is False for all 3 worker threads (stdin/out/err) when I would expect it to be.
  • the assertion that we're not properly closing the subprocess' stdin feels closer to the mark; if that leaves the controlling terminal's stdin attached to a now-dead file descriptor or something...? (I really should know better what happens in this case anyways.)
    • Except...in Fabric's case, there is no local subprocess and no direct passthrough of file descriptors, so that can't be the case.
    • Meaning the issue is more likely to be something else?

Trying another tack: what exactly has changed about the terminal environment after the bug shows up? Running stty -a under bash both with and without the bug present, the differences I can see are:

  • lflags: bugged-out terminal has -icanon, -echo, -pendin (vs regular term where those all lack a minus sign). Not echoing certainly seems like an issue, assuming that's what that means.
  • iflags: bugged-out has -ixany and ignpar (the first example of something being set, not unset, in the bad setup)
  • oflags and cflags identical, as are cchars (I'd be real weirded out if control chars had changed...)

According to man stty:

  • icanon controls ERASE and KILL processing; probably not a big diff (tho why this is set or unset might be interesting)
  • echo is what it sounds like, whether to echo, and is clearly the biggest practical issue of the bug.
  • pendin states whether input (presuming stdin) is pending after a canonical switch (and since icanon is clearly flipped...yea) and will be re-input when a read becomes pending or more input arrives. Not clear why this matters, or why it's set normally and unset when bugged (I'd have expected the latter, if anything.)
  • ixany allows any character to 'start output' (and when not set, only allows START. ok?)
  • ignpar means to ignore (or unset, to not ignore) characters with parity errors.
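For reference, those same flag bits can be inspected from Python's stdlib termios constants. A small illustration (my own, not Invoke code) of how cbreak mode maps onto the `-icanon -echo` diff above:

```python
# Illustration: cbreak mode clears ICANON and ECHO in the local-mode
# flag word, which matches the "-icanon -echo" seen in the bugged
# terminal's stty output.
import termios

def describe_lflag(lflag):
    """Report the canonical-mode and echo bits of a local-mode flag word."""
    return {
        "icanon": bool(lflag & termios.ICANON),
        "echo": bool(lflag & termios.ECHO),
    }

# A typical "sane" terminal has both bits set...
sane = termios.ICANON | termios.ECHO | termios.ISIG
# ...and cbreak mode (what tty.setcbreak applies) clears both.
bugged = sane & ~(termios.ICANON | termios.ECHO)

print(describe_lflag(sane))    # both True
print(describe_lflag(bugged))  # both False
```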

All in all it feels like some higher level 'mode' is being applied to the terminal, similar to how we set stdin to character-buffered reading to let us read 1 byte at a time instead of waiting til the user mashes enter.

Which sounds like the behavior under display (sort of...), and which I was wondering about earlier; but reading the code in question (because that Invoke patch mentions it as well, though re: thread death), the mode change is phrased as a contextmanager, so it should always be getting unset regardless of how we break out of that loop. But I'll need to triple check that now.
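A minimal sketch of that contextmanager pattern (hypothetical code mirroring the described behavior, not Invoke's actual implementation):

```python
import contextlib
import os
import sys
import termios
import tty

@contextlib.contextmanager
def character_buffered(stream=sys.stdin):
    """Snapshot-modify-restore sketch: switch a terminal to cbreak
    (character-buffered) mode for the duration of the block, then
    restore the snapshotted settings. No-ops for non-tty streams."""
    try:
        fd = stream.fileno()
        is_tty = os.isatty(fd)
    except (AttributeError, OSError):
        is_tty = False
    if not is_tty:
        yield  # piped/captured stream: nothing to do
        return
    old_settings = termios.tcgetattr(fd)  # snapshot current terminal state
    tty.setcbreak(fd)                     # clears ICANON and ECHO
    try:
        yield
    finally:
        # Restore the snapshot. Note that when two threads each run this
        # against the one controlling terminal, the second thread's
        # snapshot is taken *after* the first setcbreak.
        termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
```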

@bitprophet

Member

bitprophet commented Jul 31, 2018

Minor: simply running stty echo to re-enable echo is sufficient to 'fix' a terminal, even if icanon, pendin etc. are still unset. Doesn't really help but hey, good to know I guess.

@bitprophet

Member

bitprophet commented Jul 31, 2018

OK! I think I figured it out, while staring at that contextmanager: it's probably because the contextmanager snapshots current terminal state for restore at close of block. But what are we doing in this case? We're running two separate high level threads, each of which is running its own copy of this context manager!

And in Invoke, while we intend to be threadsafe, we don't currently test anything but our own low-level IO threads; 99% of the "thread-safety" is simply the use of self-contained object state instead of Fabric 1's horribad global module state. So this particular bit of state-keeping is not ever run concurrently with itself (partly because the "state" is literally the controlling terminal, of which there is only ever one, so...global state...).

I haven't 100% proven it yet (about to) but there's no way this isn't it. The thread that runs second is highly likely to snapshot the controlling terminal attributes after the first thread has already set it into character-buffer mode; then, if that second thread also finishes second (again, likely but not certain) it "restores" the bad state, effectively undoing the first thread's restoration.
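Here's a toy model of that interleaving (pure Python, no tty; a shared variable stands in for the controlling terminal's attributes, and sequential enter/exit calls stand in for the thread timing):

```python
import contextlib

TERMINAL_MODE = "line-buffered"  # stands in for the controlling terminal state

@contextlib.contextmanager
def cbreak():
    """Toy version of the snapshot-modify-restore context manager."""
    global TERMINAL_MODE
    snapshot = TERMINAL_MODE   # snapshot whatever is current at entry...
    TERMINAL_MODE = "cbreak"
    try:
        yield
    finally:
        TERMINAL_MODE = snapshot  # ...and "restore" it on exit

# Thread 1 enters first and snapshots the good state.
t1 = cbreak(); t1.__enter__()
# Thread 2 enters while thread 1 is active: its snapshot is already "cbreak".
t2 = cbreak(); t2.__enter__()
# Thread 1 finishes first and restores the good state...
t1.__exit__(None, None, None)
# ...then thread 2 finishes and "restores" the bad state over it.
t2.__exit__(None, None, None)
print(TERMINAL_MODE)  # "cbreak" -- the terminal is left in the bad mode
```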

@bitprophet

Member

bitprophet commented Jul 31, 2018

Confirmed that the ECHO flag, for example, is definitely being snapshotted by the non-first contextmanager, then restored by same. Working on a solution, which I think will end up just being "try to figure out if setcbreak looks already applied, and no-op in that case instead of doing the snapshot-modify-restore dance".

Should have the intended effect, is marginally cleaner to boot (never runs setcbreak more than once), and avoids a corner case where a naive fix might always just set ECHO etc. to "on" – which would break in situations where the stream in question is tty-like but was already set to not echo. (Unlikely, sure, but probably not impossible.)
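A sketch of that approach (again hypothetical code illustrating the described fix, not the actual Invoke patch): detect a cbreak-looking terminal up front and skip the snapshot-modify-restore dance, so only the first (outermost) user of the terminal ever snapshots and restores.

```python
import contextlib
import os
import sys
import termios
import tty

def looks_cbreak(lflag):
    """True when the local-mode flag word already looks cbreak-like
    (ECHO and ICANON both cleared), i.e. some other context manager
    has presumably already switched the terminal over."""
    return not (lflag & (termios.ECHO | termios.ICANON))

@contextlib.contextmanager
def character_buffered(stream=sys.stdin):
    try:
        fd = stream.fileno()
        is_tty = os.isatty(fd)
    except (AttributeError, OSError):
        is_tty = False
    if not is_tty:
        yield  # not a real terminal: nothing to snapshot or restore
        return
    attrs = termios.tcgetattr(fd)
    if looks_cbreak(attrs[3]):  # attrs[3] is the lflag word
        yield  # a sibling thread owns the terminal state; no-op
        return
    tty.setcbreak(fd)
    try:
        yield
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, attrs)
```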

@bitprophet

Member

bitprophet commented Jul 31, 2018

Since this is an Invoke-only issue I'm gonna give it a home on that tracker - I'm expecting to get a test and fix done for this soon but if y'all have anything else to add please pop on over to pyinvoke/invoke#559

To be crystal clear, once that's fixed it should be out in Invoke 1.0.2/1.1.1 (and possibly 1.2.0 if I get that out at the same time) and no Fabric upgrades should be necessary, only Invoke.

@bitprophet bitprophet closed this Jul 31, 2018

@sinwoobang

sinwoobang commented Aug 2, 2018

@bitprophet Great! It works after upgrading Invoke :)
Thank you for your effort.
