Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid deadlocks while joining the output streamer #817

Merged
merged 3 commits into from
Aug 12, 2022

Conversation

rmartin16
Copy link
Member

@rmartin16 rmartin16 commented Aug 9, 2022

Interrupting the joining of a thread with CTRL+C can lead to deadlocking certain versions of Python (python/cpython#89437).

Changes:

  • Replaces calls to join() the streamer thread with waiting on is_alive().

  • Instantiate the streamer thread as a daemon to prevent Python itself trying to join the thread.

  • Remove all logic from streamer thread that is not necessary to stream output.

    • This limits any race conditions between the streamer finishing and Python exiting.
    • This was noticeably a problem when the streamer thread was trying to print the return code.
  • As seen in Python itself, give some cushion of time for the CTRL+C to propagate to child processes.

  • Fixes On Windows, pressing Ctrl-C during an Android build or run causes briefcase to hang #809

CC: @mhsmith

PR Checklist:

  • All new features have been tested
  • All new features have been documented
  • I have read the CONTRIBUTING.md file
  • I will abide by the code of conduct

@rmartin16 rmartin16 force-pushed the thread-interrupt branch 2 times, most recently from 73f6367 to c744656 Compare August 9, 2022 20:09
@rmartin16 rmartin16 marked this pull request as ready for review August 9, 2022 20:28
streamer_deadline = time() + 3
while output_streamer.is_alive() and time() < streamer_deadline:
sleep(0.1)
self._log_return_code(popen_process.poll())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This waiting around for the streamer is rather arbitrary.

In any expected case, is_alive() will return False and we continue on.

It's at least theoretically possible is_alive() returns True. This should mean the streamer is flooding the console with output or readline() is stuck waiting for EOF......or something worse.

That said, waiting around forever (like we were) would probably be worse....and moving forward prematurely presents the possibly to mix subprocess output with briefcase output....although, at that point, the output wouldn't likely be very valuable anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is a bit arbitrary; but it's kinda forced on us by the limitations of the tools we have available. I can't see any obvious improvement to the approach that doesn't also involve deprecating Python 3.7 and 3.8 on Windows.

The one minor cosmetic improvement I can see is to add some console output to track when these conditions occur.

mock_sub.cleanup.assert_called_once_with("testing", popen_process)


def test_stuck_streamer(mock_sub, popen_process, monkeypatch, capsys):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely open to any feedback for potential improvement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - this is hard to test; this is about as good as it's going to get AFAICT.

I've pushed a change that mocks time.time(); this allows us to shortcut the loop so it doesn't take a full 3 seconds to finish. Ideally, we'd mock time.sleep() as well - but if you do that, the parent thread never releases control to the child thread (at least, it doesn't in my testing), so some sleeping is required to test the behaviour of the subthread.

@rmartin16 rmartin16 force-pushed the thread-interrupt branch 2 times, most recently from 9b80045 to 6f61a78 Compare August 10, 2022 17:16
Copy link
Member

@freakboy3742 freakboy3742 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, and works well in my testing. I'll wait for @mhsmith to confirm it works for him (since he was the original reporter), then merge once he's given a thumbs up.

streamer_deadline = time() + 3
while output_streamer.is_alive() and time() < streamer_deadline:
sleep(0.1)
self._log_return_code(popen_process.poll())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is a bit arbitrary; but it's kinda forced on us by the limitations of the tools we have available. I can't see any obvious improvement to the approach that doesn't also involve deprecating Python 3.7 and 3.8 on Windows.

The one minor cosmetic improvement I can see is to add some console output to track when these conditions occur.

mock_sub.cleanup.assert_called_once_with("testing", popen_process)


def test_stuck_streamer(mock_sub, popen_process, monkeypatch, capsys):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - this is hard to test; this is about as good as it's going to get AFAICT.

I've pushed a change that mocks time.time(); this allows us to shortcut the loop so it doesn't take a full 3 seconds to finish. Ideally, we'd mock time.sleep() as well - but if you do that, the parent thread never releases control to the child thread (at least, it doesn't in my testing), so some sleeping is required to test the behaviour of the subthread.

capsys.readouterr().out == "output line 1\n"
"\n"
"output line 3\n"
"Stopping...\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - this is really weird; I'm not seeing the CI failure on any of my test configurations. Investigating...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the source of the problem. Since the streamer thread isn't doing any substantial work, it was completely draining and exiting before we did our second stop check. The emergence of this behavior will be completely non-deterministic, as it depends on OS-level thread scheduling; however, we can strongly encourage the right behavior by adding a sleep to gently suggest to the OS that the streamer thread should suspend for a bit.

@codecov
Copy link

codecov bot commented Aug 11, 2022

Codecov Report

Merging #817 (e982758) into main (abdcef3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted Files Coverage Δ
src/briefcase/integrations/subprocess.py 100.00% <100.00%> (ø)
src/briefcase/console.py 100.00% <0.00%> (ø)
src/briefcase/integrations/android_sdk.py 99.09% <0.00%> (ø)
src/briefcase/platforms/android/gradle.py 93.20% <0.00%> (+0.20%) ⬆️

Copy link
Member

@mhsmith mhsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine for me in both MSYS2 and cmd, when interrupting both Gradle and Logcat. Just one maintainability comment:

Comment on lines 402 to +403
self.cleanup(label, popen_process)
output_streamer.join()
streamer_deadline = time.time() + 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout should always be the same duration as the timeout in cleanup, so both numbers should come from a single source.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please help me understand why these waits should be the same?

As best I can tell, they will happen sequentially and quite independently of each other. Further, I was initially considering making this final wait shorter....lest the user gets frustrated that their first CTRL+C wasn't working out and they just started sending ever more...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you're quite right, I got confused by the threading. The first timeout is to shut down the subprocess, and the second timeout is to shut down the thread reading the process output, which may be delayed slightly in buffers. So this is fine the way it is.

@rmartin16
Copy link
Member Author

I didn't want to add any more changes to this at this point but the originating issue of interrupting an Android build or the emulator got me trying to interrupt arbitrary tasks.

I realized the output streamer is masking CTRL+C; therefore, a lot of upstream code is perceiving a failure leading to alerting the user and creating a briefcase log. Instead, the streamer should probably be re-raising the KeyboardInterrupt to generate the CTRL+C response that happens outside of a Wait Bar (and the streamer more generally). I can pursue this in a separate PR...although IMO that probably shouldn't hold up a release.

Additionally, fyi, I found a real-world use-case where cleanup() actually calls kill(). Turns out Docker does not just exit in response to the terminate(). When I sent a CTRL+C while building an AppImage in Docker, I initially thought it didn't take because (as it turns out) 3 seconds is really long time when you're trying to decide if CTRL+C worked....so, I started sending more which just caused briefcase to completely abort. That may be at least some evidence that these waits should be decreased since users already asked briefcase to exit and it doesn't feel like that is happening in the moment. The log for Stopping... helps but in the case of an AppImage build, it gets lost in the flood of ongoing console logs. Alternatively, after the first time, a user may realize it did actually work and they just need to be more patient next time.

@freakboy3742
Copy link
Member

@rmartin16 Agreed that handling the KeyboardInterrupt in an output stream so that it doesn't generate a log would be preferable; but also agreed that (as well as any other improvements around streamer interaction) can be handled as future work.

@freakboy3742 freakboy3742 merged commit 2f1fd02 into beeware:main Aug 12, 2022
@rmartin16 rmartin16 deleted the thread-interrupt branch January 21, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On Windows, pressing Ctrl-C during an Android build or run causes briefcase to hang
3 participants