add default cancel_callback to handle common signals #862
Conversation
Force-pushed from 1ae7d69 to f273b43
Force-pushed from f273b43 to 50a676e
👍 This appears to me to be doing the right thing. There's a lot of testing work still to be done, of course, but it's important to show people the WIP to set expectations. I could see this being the final patch to the code itself.
I wrote a test here expecting to reproduce the original bug: https://github.com/ansible/ansible-runner/compare/devel...AlanCoding:container_kill_test?expand=1

However, this test was not able to replicate the issue. When the ansible-runner process is killed, the container dies with it. That is without this patch.
Co-authored-by: Alan Rominger <arominge@redhat.com>
@AlanCoding made a few tweaks to the test that were needed for it to run for me:

With those changes I found that the test passed on devel runner (without my patch). Suspecting that this might be a matter of timing, I added a five second sleep after that call.

I think that even with the five second sleep, the test still represents a very real situation that users could find themselves in. If anything, it seems likely that a user would cancel a job once it's well underway, more so than cancelling an instant after it launched.

@AlanCoding I went ahead and added your changeset into this PR. We can continue polishing it as needed, but it seems like a reliable reproducer of the issue we're trying to address here.
@samdoran @Shrews - Wanted to check in with you to see if you have any feedback / concerns about getting this in. This PR helps address a scenario we're seeing with awx where receptor sends an ansible-runner process a SIGTERM, which results in the container started by runner being orphaned. The net effect is that the awx job is listed as cancelled despite a container continuing to execute the associated job.

We would make use of a cancel_callback, except that runner is being invoked via the CLI by receptor. Since that's not an option, we thought we could introduce signal-handling behavior as a sane default implementation of the cancel_callback. Let us know what you think.
* already set under test/data/sleep/env/settings
We should also add unit/functional tests to ensure the cancel callback is set properly and that the handler is called as expected.
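One unit-level check along these lines might look roughly like the following sketch. The test name is made up; it only assumes the signal_handler helper this PR adds to ansible_runner.utils and that the test runs in the main thread.

```python
import signal

from ansible_runner.utils import signal_handler


def test_signal_handler_registers_term_and_int():
    # Remember whatever handlers were installed before the call so we can restore them.
    original_term = signal.getsignal(signal.SIGTERM)
    original_int = signal.getsignal(signal.SIGINT)
    try:
        callback = signal_handler()
        # No signal has been delivered yet, so the callback reports "not cancelled".
        assert callback() is False
        # Both signals should now be routed to the handler installed by signal_handler().
        assert signal.getsignal(signal.SIGTERM) is not original_term
        assert signal.getsignal(signal.SIGINT) is not original_int
    finally:
        signal.signal(signal.SIGTERM, original_term)
        signal.signal(signal.SIGINT, original_int)
```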
ansible_runner/interface.py
```python
logger.warning("Unable to set cancel_callback for signal handling; not in main thread")
cancel_callback = None
```
This could potentially be somewhat noisy. Anytime this is run in a thread using the default cancel callback, this warning will be emitted. Maybe use a Sentinel or some other check to make sure this is emitted only once.
Makes sense. I didn't want to silently disable the default, but I can see how this could get to be a bit much. I'll see if there's something I can do here.
I'm wrestling with this one a bit. I tried writing this up so that we had a flag indicating whether or not the warning has been issued already. But then I wondered: where would that flag need to be set so that separate runs were retrieving the same flag?

In the scenario you outlined, the warning is being generated anytime we're in a new thread. That seems to suggest that when we're in a thread, we need to set a flag so that future threads know not to send a warning. It seems like this would require sending a message via a queue, but I'm not sure. Again, I'm just not exactly sure how we would do this. Do you have thoughts on how to make this work?

How strongly do you feel about having this warning? It seems like it might be fine to leave it out. Just adding the signal handling behavior by itself is a big improvement, even if we can't always set one (and even if we don't notify the user that they're missing out).
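One way to realize the "emit it only once" idea is a module-level flag next to the helper, shared by every call in the same process. A rough sketch, where the flag name is hypothetical and the rest of the helper is elided:

```python
import logging
import threading

logger = logging.getLogger(__name__)

# Hypothetical module-level sentinel: every call in this process sees the same
# flag, so the warning below is only emitted the first time it is skipped.
_warned_about_signal_handling = False


def signal_handler():
    """Return a cancel callback, or None when we are not in the main thread."""
    global _warned_about_signal_handling

    if threading.current_thread() is not threading.main_thread():
        if not _warned_about_signal_handling:
            logger.warning("Unable to set cancel_callback for signal handling; not in main thread")
            _warned_about_signal_handling = True
        return None

    # ... install the SIGTERM/SIGINT handlers and return the real callback here ...
```

A flag like this only deduplicates within a single process; separate ansible-runner processes would still each warn once, which may be acceptable given the discussion above.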
(Pending feedback)
✅
I think I've about finalized my independent testing script:

```bash
rm -rf /tmp/ansible-runner
cd /tmp
git clone https://github.com/jladdjr/ansible-runner.git
# git clone https://github.com/alancoding/ansible-runner.git
cd ansible-runner
git checkout add_default_cancel_callback_for_cli
# git checkout container_kill_test
python3 -m venv env
source env/bin/activate
pip install -e .
pip install ansible-core
pip install mock
for i in `seq 10`; do py.test test/integration -k test_cli_kill_cleanup -n0 > /dev/null; echo $?; done
deactivate
```

I run 2 variations of this, where it runs the test in the branch I linked above, or it runs the tests as a part of this PR. Running my branch (with the test but without the fix) produces about 1 or 2 failures over the 10 runs. These failures are failures to tear down the container. Running with this branch consistently gives 0 failures.

That's the confirmation I was looking for to say that this does what was intended. Of course this is all with podman (what we care about for AWX).
@samdoran I think I've gone through most of the feedback at this point. Still waiting on some unit test results locally and from Zuul at the moment. It doesn't look like I have the perms needed to mark discussions as resolved, so I'll go through and drop a ✅ in places where I believe things have been addressed.
I only see two items of feedback left here:

@samdoran any thoughts on the warning flag? ^
@samdoran - @thenets and I started testing the newest implementation using the threading / signal library and are seeing flakiness. The unit test that we added here has gone from passing 100% of the time to passing 55% of the time. I'm not exactly sure where the flakiness is coming from, but I was wondering how strongly you felt about taking this approach compared to the original one using SignalHandler?

For the purposes of testing, I'm going to revert back to the SignalHandler approach. The threading / signal approach that I tried out today can be seen at 303794c.
Force-pushed from b920ce4 to 303794c
@samdoran - took a break and came back to make a fresh pass at the tests again. I very carefully re-ran the tests for the new version of cancel_callback using the threading / signal libraries and can confirm that I got a clean set of results this time, both from the runner unit tests and from awx testing. Here are ten back-to-back test runs of the new unit test added in this PR:

So, I think this is looking good. I reviewed everything and think the PR should be good-to-go again. The only remaining item to revisit (apart from a general review of how I implemented your suggested signal handler) is:
ansible_runner/interface.py
```diff
@@ -34,7 +34,8 @@
 from ansible_runner.utils import (
     dump_artifacts,
     check_isolation_executable_installed,
-    santize_json_response
+    santize_json_response,
+    signal_handler
```
Trailing commas make future diffs easier to read.
Suggested change:

```diff
-    signal_handler
+    signal_handler,
```
✅
@samdoran - can you provide more details on what went wrong here?
Force-pushed from e7cab39 to 0d7ee0e
Looks like maybe it was unable to reach the container registry while trying to push an image.
Force-pushed from 92fc156 to 34d0e76
recheck
```python
# clean up settings artifacts generated by test
if os.path.exists(env_dir):
    shutil.rmtree(env_dir)
```
If this is necessary, then the test is doing something wrong. At least add a FIXME note here that this can be removed once #847 is merged.
The current integration test fails on macOS due to an unrelated issue with the daemon library (just a guess). I would rather not merge tests that do not pass locally.

I think we also need unit tests of signal_handler() and functional tests to make sure the signals are handled as expected.

If we decide to keep this integration test, it needs a skipif so it will be skipped on platforms where the tests do not work.
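For reference, the kind of guard being suggested might look roughly like this; the platform check and reason string are illustrative, not the exact code in the PR.

```python
import sys

import pytest


@pytest.mark.skipif(
    sys.platform == 'darwin',
    reason='container/daemon handling used by this test is unreliable on macOS',
)
def test_cli_kill_cleanup():
    ...
```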
Can you provide some more details on the failure? How are you invoking the test? I needed to run with `-n0`.

The two main points where I've seen an assertion fail in the test are:

@AlanCoding I believe that you are running with a Docker for Mac setup. Can you confirm whether the tests fail for you when the fix is in place?
@samdoran has mentioned that the macOS failures are likely unrelated to this PR. Given that this fix addresses a significant error we're seeing in awx (cancel a job, and the job continues running in the EE container), I recommend we get this in as soon as we can and not block on existing / unrelated failures.
I'm on Fedora 34 with both podman and docker installed, and checking out the current branch I'm able to run the docker test.

Failing while waiting for the container to start sounds like it might be either a timing issue, or a sign that something more basic isn't working.
Yup, it works fine on Linux when run serially.
FYI, any test that uses |
@Shrews do you know what it is about OS X that prevents the use of the libraries / calls you mentioned?
LGTM!
```python
mocker.patch('ansible_runner.utils.threading.main_thread', return_value='thread0')
mocker.patch('ansible_runner.utils.threading.current_thread', return_value='thread1')

assert signal_handler() is None
```
👍
```python
assert signal_handler()() is False
assert mock_signal.call_args_list[0][0][0] == signal.SIGTERM
assert mock_signal.call_args_list[1][0][0] == signal.SIGINT
```
👍
```python
signal_handler()

with pytest.raises(AttributeError, match='Raised intentionally'):
```
👍
Fixes #861
--
If you invoke runner as a Python library, you have the option of specifying a cancel_callback (example). When calling runner via the CLI, however, there isn't an option to pass in a cancel_callback, so it defaults to being left undefined.
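For illustration, library-mode usage of cancel_callback looks roughly like the sketch below; the private_data_dir, playbook, and callback logic are invented for the example.

```python
import ansible_runner

# Hypothetical flag that some other part of the application flips
# when the job should be cancelled.
cancel_requested = False


def my_cancel_callback():
    # Runner polls this periodically; returning True asks it to cancel the job
    # and tear down anything it started (including process-isolation containers).
    return cancel_requested


r = ansible_runner.run(
    private_data_dir='/tmp/demo',
    playbook='test.yml',
    cancel_callback=my_cancel_callback,
)
print(r.status)  # e.g. 'canceled' if the callback returned True mid-run
```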
For the CLI, it seems like it would be reasonable to default to using a cancel_callback with a signal handler that would catch SIGINT and SIGTERM. This would allow runner to recognize an incoming signal as a request to gracefully shut down the running job, including ensuring that a container spun up as part of process isolation was stopped.
This PR updates runner to use a default cancel_callback when none is provided. The default cancel_callback catches both SIGINT and SIGTERM and gives the runner process (and other spawned processes, such as those started by a container runtime) a chance to exit gracefully.
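Conceptually, the default callback behaves like this simplified sketch: in the main thread it installs handlers for SIGTERM and SIGINT and returns a callable that reports whether either signal has arrived. This is an illustration of the idea, not the exact implementation in ansible_runner/utils.py.

```python
import signal
import threading


def default_cancel_callback_factory():
    """Build a cancel callback that reports whether SIGTERM/SIGINT arrived."""
    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is not threading.main_thread():
        return None

    signal_received = {'flag': False}

    def handler(signum, frame):
        signal_received['flag'] = True

    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)

    # Runner calls this periodically; True means "cancel the job now".
    return lambda: signal_received['flag']
```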
Steps to reproduce:

```bash
ansible-runner run --process-isolation --process-isolation-executable podman -p test.yml demo
```