Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-exec: support forced exit & misc fixes #1997

Merged
merged 3 commits into from Feb 10, 2019

Conversation

@chu11
Copy link
Contributor

commented Feb 8, 2019

Support a pdsh-ish mechanism to support forced exits on flux-exec when processes hang or are ignoring signals. Because flux-exec sends a signal on the first ctrl+c (which pdsh doesn't), I decided for it to behave a bit differently than pdsh. Instead, on the second ctrl+c, you get the "please ctrl+c within a second to exit" message. So it really takes 3 ctrl+c to force exit. I know it's a tad different, but it made sense to me.

Also fixed up a corner case where I foolishly output verbosity for each subprocess instead of for all subprocesses.

Also fixed up flux_subprocess_kill() to allow users to send signals multiple times. I have no idea why I didn't allow this at first, it makes no sense.

When the user sends a signal to sub-processes, a line of text
was output for every process the signal was sent to.  This is
excessive.  Instead, output only a single line indicating the
number of sub-processes the signal was sent to.
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 8, 2019

Instead, on the second ctrl+c, you get the "please ctrl+c within a second to exit" message. So it really takes 3 ctrl+c to force exit. I know it's a tad different, but it made sense to me

I agree with your decision here FWIW

@codecov-io

This comment has been minimized.

Copy link

commented Feb 8, 2019

Codecov Report

Merging #1997 into master will increase coverage by <.01%.
The diff coverage is 76.92%.

@@            Coverage Diff             @@
##           master    #1997      +/-   ##
==========================================
+ Coverage   80.17%   80.17%   +<.01%     
==========================================
  Files         195      195              
  Lines       35146    35151       +5     
==========================================
+ Hits        28178    28184       +6     
+ Misses       6968     6967       -1
Impacted Files Coverage Δ
src/common/libsubprocess/subprocess.c 87.57% <ø> (-0.15%) ⬇️
src/cmd/flux-exec.c 78.68% <76.92%> (+0.2%) ⬆️
src/modules/kvs/kvs.c 66.33% <0%> (-0.15%) ⬇️
src/common/libflux/mrpc.c 88.93% <0%> (+1.18%) ⬆️
src/common/libflux/response.c 82.71% <0%> (+1.23%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 8, 2019

Ooh, this is going to be a fun one to test ;-)

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 8, 2019

One other issue I've run into with flux exec is that it doesn't print an error when tasks fail. It does exit with non-zero status if any subprocess does, but it might be nice (I could probably be convinced otherwise) if it reported subprocesses that exit abnormally at the end with an error message like:

[0-1023]: Exit 1

The case I ran into I do think should be fixed -- I'm embarrassed to admit all my tasks were segfaulting and I had no idea until I listed my cwd. I'd argue flux exec should at least report back tasks that were killed due to a signal and resulted in a coredump.

Let me know if you'd like an issue open on this.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2019

Let me know if you'd like an issue open on this.

I'll open up an issue for this.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2019

hmmm coverage is < 80% on diff for obvious reasons. It wouldn't be hard to add a simple test that sends SIGINTs to flux-exec. I'll add one.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 9, 2019

repushed with 1 added unit test

@chu11 chu11 force-pushed the chu11:issue1981 branch from d94a528 to b0f3232 Feb 9, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 9, 2019

doh, had a bash-ism, repush

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 9, 2019

ugh, I'm wondering if a test of something so timing specific is a bad idea in travis. Maybe I shouldn't have bothered.

+# first sigint is to send it, it should be ignored
+# second & third sigint within 1 second should force-exit flux exec
+test_expect_success NO_CHAIN_LINT 'exec ctrl+c multiple times breaks out' '
+       flux exec -v -r all ${FLUX_SOURCE_DIR}/t/exec/ignore_sigint.sh 30 > sigint.out 2>&1 &
+        pid=$! &&
+        sleep 1 &&
+        kill -INT $pid &&
+        sleep 2 &&
+        kill -INT $pid &&
+        sleep 0.5 &&
+        kill -INT $pid &&
+        ! wait $pid &&
+        grep "sending signal" sigint.out &&
+        grep "interrupt" sigint.out
+
+'
@chu11 chu11 force-pushed the chu11:issue1981 branch from b0f3232 to a29456b Feb 9, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Feb 9, 2019

i removed the test, as I think something so timing based is doomed for travis. I could have added a small test just to make sure the "signal send to ..." verbosity message was covered, but decided that wasn't worth it. So diff coverage is a tiny bit bad.

gettimeofday (&now, NULL);

timersub (&now, &last, &diff);

This comment has been minimized.

Copy link
@grondo

grondo Feb 9, 2019

Contributor

Our monotime interface (libutil/monotime.h) or at least clock_gettime should be used here instead of gettimeofday, which is obsolesced by POSIX I think. See here

This comment has been minimized.

Copy link
@chu11

chu11 Feb 9, 2019

Author Contributor

ahh wasn't aware, updated & repushed

chu11 added 2 commits Feb 7, 2019
In the event a subprocess is stuck or ignoring SIGINT, allow
users the ability to force flux-exec to exit by sending two
SIGINTs (typically via Ctrl+C) in short succession.

Fixes #1981
Allow flux_subprocess_kill() to send multiple signals to
subprocesses.
@chu11 chu11 force-pushed the chu11:issue1981 branch from a29456b to 8b0e7fa Feb 9, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 10, 2019

Nice! Thanks!

@grondo grondo merged commit 9cd864e into flux-framework:master Feb 10, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.