Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-exec: fix exception handling of jobs in transition states #2309

Merged
merged 16 commits into from Aug 15, 2019

Conversation

@grondo
Copy link
Contributor

commented Aug 14, 2019

This PR fixes some of the issues hit by @dongahn's stress test described in #2275. It doesn't fix the race described in #2279, which requires a rework of the exception protocol.

I'm still trying to figure out exactly how to test some of these conditions, especially the one that triggers cancellation of a partially running job (pending commands are descheduled in the bulk-exec implementation). However, on this branch, the current version of the cancel stress test:

#!/bin/bash

NREPEATS=100
NJOBS=12

flux jobspec srun -n 2 hostname >j
for j in $(seq 1 ${NREPEATS}); do
   t/ingest/submitbench -f 24 -r 100 j || exit 1
   flux job list
   flux job list | grep -v USERID | xargs -L 1 flux job cancel
   echo "Iteration"
   sleep 1
done


flux dmesg|grep "job-manager.err"

No longer reproduces the "start response" error (which was a particularly dangerous case, since it meant the job was still running after resources were released), nor the "emit_event" errors.

I still do see, however, the connector-local EPIPE errors (probably expected and nothing to worry about. Tasks were killed before a message or event could be delivered over the connector-local socket), and the following error from sched-simple, possibly due to #2279?

2019-08-14T15:28:28.971563Z sched-simple.err[0]: alloc received before previous one handled
2019-08-14T15:28:28.971575Z sched-simple.err[0]: alloc: flux_respond_error: Invalid argument

grondo added 16 commits Aug 13, 2019
Problem: Only a `kill` method is currently exported by execution
implementation modules. This method could be construed as to either
implement cancelling a running job, or signalling a running job (i.e.
as in the `kill(2)` system call).

Provide an additional `cancel` method which is meant explicitly to
cancel pending work, or shells which have not yet started execution.
To send a signal to a running job, the `kill` method is still provided.

For the bulk_exec module, provide a cancel that removes all pending
commands and releases their resources immediately.
Problem: The `jobinfo_kill` function in job-exec module has a confusing
name, since it is meant to terminate or cancel executing jobs, not signal
them. Also, the function sends SIGTERM then SIGKILL to running shells,
but there is a race where shells that are still pending launch aren't
canceled/killed.

Rename the `jobinfo_kill` function to `jobinfo_cancel` to avoid some
naming confusion. Also, if the exec implementation offers a `cancel`
method, use this to cancel any to-be-launched shells before they
are started.

However, if there are commands still pending to be run, the function
doesn't cancel those.
Don't log errors when exec_kill() returns ENOENT. This just means there
were no subprocesses to kill, and is not an error condition.
fluxmod_ldadd was used in place of fluxmod_libadd in a couple
module Makefiles. Though it didn't seem to cause an issue, fix
it for consistency.
It may be necessary to associate aux data with a "bulk execution"
object. Add standard bulk_exec_aux_set() and bulk_exec_aux_get()
interface to the bulk-exec module.
Add accessors to the bulk_exec api to return the current number
of starting/running processes, and the total number of processes.
Add the ability to create mock exceptions in the job-exec module
during the initializing phase (before any shells are scheduled to
be launched), or starting phase, when some, but not all, shells
have been launched.

A mock exception is generated if a new jobspec key

 attributes.system.exec.bulkexec.mock_exception

is set to either "init" or "starting".

This allows corner cases for exception handling during normal
job execution to be tested.
Allow bulk_exec_kill() to send signals to pending processes
(in FLUX_SUBPROCESS_INIT), since this is supported by libsubprocess.

This will allow bulk_exec_kill() to properly clean up all
launched subprocesses, instead of just those that have finished
starting.
Problem: A subprocess which fails to enter the running state
doesn't increase the "completed" cound nor terminate an "job" in
the bulk execution module.

Refactor the code that checks for completion of a bulk execution
and count failed proceses as complete.
Problem: bulk_exec_kill() could return an empty composite future
which could never be fulfilled if there are processes on the
exec->processes list, but none are in the INIT or RUNNING state.

Check after looping through all futures to ensure there is at
least one child in the wait_all future and return ENOENT if not.
Problem: while a new exec job is "initializating" (creating namespace,
fetching kvs data {R, J}, etc) a job exception could come in and destroy
the initializing job.

Protect the jobinfo structure by taking a reference during initialization.
In the continuation for the initialization composite future, do not continue
with job execution if an exception was recieved.
Replace the ad-hoc flags for internal exec jobs with flags that
are slightly more clear. Rework when these flags are set and
how they are used in the job cleanup code, especially as it
relates to exception handling.
If an exception is received for a job that is already finalizing, then
there is nothing to do since the job is already cleaning up. Avoid calling
jobinfo_fatal() for this case.
Add a test option to cancel bulk execution after a number of processes
have been started.
Add a timeout of 3s for the future returned by bulk_exec_kill.

This prevents leaking future if all kill requests do not receive
a response for some reason. Additionally, this will generate a
timeout error for an exceptional condition.
Add tests to t2402-job-exec-dummy.t to exercise code handling
exceptions during initialization and starting phases in the
bulk-exec implementation.
@grondo grondo force-pushed the grondo:racy-cancel branch from 80e0082 to 84cde85 Aug 15, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Aug 15, 2019

Ok, most recent version of this PR fixes a few other races and leaks found while working on the original issue. I think it might be good enough for now assuming Travis passes.

Edit: Also added a method for triggering exceptions during real execution so that we get coverage of the code that handles the corner cases. (This is what triggered the finding of other bugs and leaks)

@codecov-io

This comment has been minimized.

Copy link

commented Aug 15, 2019

Codecov Report

Merging #2309 into master will increase coverage by 0.04%.
The diff coverage is 92.45%.

@@            Coverage Diff             @@
##           master    #2309      +/-   ##
==========================================
+ Coverage   80.83%   80.87%   +0.04%     
==========================================
  Files         214      214              
  Lines       33732    33800      +68     
==========================================
+ Hits        27266    27336      +70     
+ Misses       6466     6464       -2
Impacted Files Coverage Δ
src/modules/job-exec/job-exec.c 74.73% <100%> (+0.05%) ⬆️
src/modules/job-exec/exec.c 78.12% <87.17%> (+3.39%) ⬆️
src/modules/job-exec/bulk-exec.c 78.63% <94.73%> (+3.63%) ⬆️
src/modules/connector-local/local.c 73.26% <0%> (-0.29%) ⬇️
src/common/libflux/message.c 80.88% <0%> (+0.25%) ⬆️
src/common/libutil/veb.c 98.85% <0%> (+3.42%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member

commented Aug 15, 2019

Really nice, thorough job. Test coverage for this is great.

I've been unable to see any problems running the script in the description and some earlier variants. Ready to merge?

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Aug 15, 2019

Yes, ready.

@garlick garlick merged commit 0b39b27 into flux-framework:master Aug 15, 2019
4 checks passed
4 checks passed
Summary 1 potential rule
Details
codecov/patch 92.45% of diff hit (target 80.83%)
Details
codecov/project 80.87% (+0.04%) compared to 5052842
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.