Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow flux_subprocess_kill() on remote processes not yet running #2297

Merged
merged 6 commits into from Aug 12, 2019

Conversation

@grondo
Copy link
Contributor

commented Aug 10, 2019

This PR is somewhat related to #2275.

When signalling a set of remote processes, it is inconvenient to handle the case of processes for which a request has been sent, but a response has not yet been received. Currently, flux_subprocess_kill(3) will return EINVAL on these processes since they are not in the RUNNING state.

Instead of requiring the libsubprocess user to handle this case by tracking which remote processes have pending signals, this PR adds a signal_pending flag for remote subprocesses which tries to do the right thing.

If a user calls flux_subprocess_kill on a remote process which is in the INIT state, then the pending flag is set and an empty future is returned to the caller. Later, if that subprocess reaches the RUNNING state, then the signal is forwarded, and its response fulfills the future returned to the caller earlier.

If the subprocess fails then the future is fulfilled with an error.

For testing new options were added to the t/rexec/rexec test program, which then obviated the need for the signal specific t/rexec/rexec_signal, which was subsequently removed.

All this will make it much easier for the bulk exec code in the job-exec module to cancel a job due to an exception when the job is in the middle of starting up.

grondo added 2 commits Aug 10, 2019
Problem: When a remote process is still in the INIT state, i.e.
a remote exec request has been sent but the response not yet
received, it is inconvenient for a caller to signal that remote
process, since they must wait until the response is received
that puts that process into the RUNNING state.

Adjust flux_subprocess_kill() to allow a signal to be pending
for remote processes that are not yet RUNNING. The signal will
be forwarded once the state changes to RUNNING, or the signal
will be canceled with error if the process fails to exec or
for any other reason doesn't enter the RUNNING state.
Plug some small leaks in the test program under t/rexec/rexec.
This aids in debugging with valgrind.
@grondo grondo force-pushed the grondo:rexec-kill-pending branch from b570d43 to e695eca Aug 10, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Aug 11, 2019

Nice coverage! This looks OK to me although I admit I don't understand the code very well. Probably be good to get a second look by @chu11.

@chu11
chu11 approved these changes Aug 12, 2019
Copy link
Contributor

left a comment

Great idea for this fix. Everything looks good to me, just a minor nit I noticed. No big deal if we want to merge and get this in ASAP though.


void send_sigterm (flux_subprocess_t *p)
{
flux_future_t *f = flux_subprocess_kill (p, 15);

This comment has been minimized.

Copy link
@chu11

chu11 Aug 12, 2019

Contributor

minor nit, why 15 instead of SIGTERM macro?

This comment has been minimized.

Copy link
@grondo

grondo Aug 12, 2019

Author Contributor

I think it was just to avoid pulling in signal.h. Silly reason, I'll fix it.

grondo added 4 commits Aug 10, 2019
Add two new options to the rexec test program related to
testing flux_subprocess_kill().

 * -k, --kill:
   send SIGTERM to processes once they are running

 * -K, --kill-immediately:
   send SIGTERM to processes immediately after flux_rexec()
Switch t0005-rexec.t to use basic t/rexec/rexec test program
instead of specific rexec/rexec_signal program to drive the
signal tests, now that the -k, -K options have been added.
The basic rexec test program now handles the duties of the more
specific rexec_signal test. Remove rexec_signal since it is no
longer used.
Ensure that flux_subprocess_kill() on remote processes that are not
yet running works as expected.
@grondo grondo force-pushed the grondo:rexec-kill-pending branch from e695eca to 9c3d372 Aug 12, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2019

Ok, fixed that minor nit and pushed with a rebase.

@codecov-io

This comment has been minimized.

Copy link

commented Aug 12, 2019

Codecov Report

Merging #2297 into master will increase coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master    #2297      +/-   ##
==========================================
+ Coverage   80.88%   80.89%   +<.01%     
==========================================
  Files         214      214              
  Lines       33654    33681      +27     
==========================================
+ Hits        27222    27245      +23     
- Misses       6432     6436       +4
Impacted Files Coverage Δ
src/common/libsubprocess/remote.c 71.22% <85.71%> (+0.58%) ⬆️
src/common/libsubprocess/subprocess.c 89.93% <88.88%> (+0.72%) ⬆️
src/shell/shell.c 80.51% <0%> (-0.52%) ⬇️
src/common/libflux/message.c 80.5% <0%> (-0.38%) ⬇️
src/common/libsubprocess/local.c 79.79% <0%> (-0.35%) ⬇️
@chu11

This comment has been minimized.

Copy link
Contributor

commented Aug 12, 2019

awesome, thanks!

@chu11 chu11 merged commit 405bac8 into flux-framework:master Aug 12, 2019
4 checks passed
4 checks passed
Summary 1 potential rule
Details
codecov/patch 87.5% of diff hit (target 80.88%)
Details
codecov/project 80.89% (+<.01%) compared to ecf3f78
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2019

Thanks!

@grondo grondo deleted the grondo:rexec-kill-pending branch Aug 12, 2019
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.