job-exec: fix potential leak of job KVS namespaces #5805

Merged
2 commits merged into flux-framework:master on Mar 19, 2024

Conversation

@grondo (Contributor) commented Mar 18, 2024

This PR fixes the issue described in #5790: if a job is canceled or receives a fatal job exception while the KVS namespace create RPC is still in flight, the job-exec module loses track of the fact that a namespace exists for the job, and the namespace is leaked.

This PR also improves exec_kill error message handling to reduce a non-trivial amount of log noise for large jobs.

Problem: When terminating a job in the midst of starting, or when a
fatal job exception is raised on a job that has already been canceled,
the job-exec module may generate lots of log messages like:

 exec_kill: any (rank 4294967295): No such file or directory

In this case it is likely the subprocess could not be signaled, so no
broker rank is associated with the child future for which the error is
being printed.

It isn't valuable to print an error in this case, so just skip the
flux_log() when rank == FLUX_NODEID_ANY.

Similarly, it probably isn't useful to log an error if the errno is
ENOENT, since this just means the remote process is no longer running.
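A minimal sketch of the filtering described above, assuming a helper such as
exec_kill_log_error(); the function name and its arguments are illustrative,
not the actual bulk-exec.c code:

    #include <errno.h>
    #include <string.h>
    #include <syslog.h>
    #include <flux/core.h>

    /* Sketch: log an exec_kill failure only when it is actionable. */
    static void exec_kill_log_error (flux_t *h, uint32_t rank, int errnum)
    {
        /* No broker rank is associated with the child future: the
         * subprocess could not be signaled, so there is nothing
         * useful to report.
         */
        if (rank == FLUX_NODEID_ANY)
            return;
        /* ENOENT just means the remote process is no longer running. */
        if (errnum == ENOENT)
            return;
        flux_log (h, LOG_ERR, "exec_kill: rank %u: %s",
                  (unsigned int)rank, strerror (errnum));
    }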

codecov bot commented Mar 19, 2024

Codecov Report

Merging #5805 (10352e3) into master (2bfe617) will increase coverage by 0.00%.
The diff coverage is 86.66%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5805   +/-   ##
=======================================
  Coverage   83.33%   83.34%           
=======================================
  Files         509      509           
  Lines       82478    82485    +7     
=======================================
+ Hits        68736    68743    +7     
  Misses      13742    13742           
Files                               Coverage            Δ
src/modules/job-exec/bulk-exec.c    77.80% <100.00%>    (+0.18%) ⬆️
src/modules/job-exec/job-exec.c     77.00% <77.77%>     (-0.02%) ⬇️

... and 11 files with indirect coverage changes

@garlick (Member) left a comment

LGTM!

@@ -696,7 +696,8 @@ static flux_future_t * namespace_move (struct jobinfo *job)
{
Review comment (Member):

In the first line of the commit description in "fix potential namespace leak", there is a stray "a".

@grondo (Contributor, Author) replied:

Thanks! Fixed and will set MWP (merge-when-passing).

Problem: The job->has_namespace flag is only set after the response
for the KVS namespace creation RPC is received. This can cause a rare
namespace leak if a job exception occurs after the request is sent,
but before the response is received. The cleanup code then does not see
the has_namespace flag set, so the namespace is never deleted.

Set job->has_namespace just after sending the namespace create RPC, so
that namespace cleanup is forced if there is any chance the namespace
will exist when the job is complete.

Also, to avoid skipping namespace_delete() when some other previous
function fails (such as the final eventlog write or the copy of
the namespace contents), call flux_future_or_then(3) as well as
flux_future_and_then(3) in the chain of namespace cleanup futures.

Fixes flux-framework#5790
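
A minimal sketch of the two changes described in this commit message,
assuming a simplified struct jobinfo; the field and helper names here are
illustrative, not the actual job-exec code:

    #include <unistd.h>
    #include <flux/core.h>

    struct jobinfo {            /* simplified stand-in for job-exec's jobinfo */
        flux_t *h;
        char ns[64];            /* KVS namespace name for this job */
        uint8_t has_namespace;
    };

    /* Continuation registered with both flux_future_and_then(3) and
     * flux_future_or_then(3), so the namespace is removed whether or not
     * the previous step (eventlog write, namespace copy) succeeded.
     */
    static void namespace_delete (flux_future_t *fprev, void *arg)
    {
        struct jobinfo *job = arg;
        flux_future_t *f = flux_kvs_namespace_remove (job->h, job->ns);
        if (!f || flux_future_continue (fprev, f) < 0)
            flux_log_error (job->h, "failed to remove namespace %s", job->ns);
        flux_future_destroy (fprev);
    }

    static flux_future_t *namespace_create (struct jobinfo *job)
    {
        flux_future_t *f;
        if (!(f = flux_kvs_namespace_create (job->h, job->ns, geteuid (), 0)))
            return NULL;
        /* Mark the namespace as existing as soon as the create RPC has
         * been sent, not when the response arrives, so that cleanup is
         * attempted whenever there is any chance the namespace exists.
         */
        job->has_namespace = 1;
        return f;
    }

    /* Chain namespace cleanup after a prior future (e.g. the final
     * eventlog write) so it runs on both the success and error paths.
     */
    static flux_future_t *chain_cleanup (struct jobinfo *job, flux_future_t *f)
    {
        flux_future_t *fnext;
        if (!(fnext = flux_future_and_then (f, namespace_delete, job))
            || !(fnext = flux_future_or_then (f, namespace_delete, job)))
            return NULL;
        return fnext;
    }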
@grondo (Contributor, Author) commented Mar 19, 2024

Hm, the distcheck build failed here (test output below). Most likely unrelated, so I'll restart the build. It isn't readily apparent what failed, so we'll have to keep an eye out for this one.

 expecting success: 
  	jobid=$(flux submit --wait-event=exec.shell.init sleep inf)
  	watchers=$(get_update_watchers)
  	${UPDATE_WATCH} $jobid R > watch10A.out &
  	watchpidA=$! &&
  	wait_update_watchers $((watchers+1)) &&
  	update1=$(expiration_add $jobid 100)
  	${UPDATE_WATCH} $jobid R > watch10B.out &
  	watchpidB=$! &&
  	wait_update_watchers $((watchers+2)) &&
  	${WAITFILE} --count=2 --timeout=30 --pattern="expiration" watch10A.out &&
  	${WAITFILE} --count=1 --timeout=30 --pattern="expiration" watch10B.out &&
  	kill -s USR1 $watchpidA &&
  	wait $watchpidA &&
  	kill -s USR1 $watchpidB &&
  	wait $watchpidB &&
  	test_debug "echo watch10A: \"$(cat watch10A.out)\"" &&
  	test_debug "echo watch10B: \"$(cat watch10B.out)\"" &&
  	flux cancel $jobid &&
  	test $(cat watch10A.out | wc -l) -eq 2 &&
  	test $(cat watch10B.out | wc -l) -eq 1 &&
  	head -n1 watch10A.out | jq -e ".execution.expiration == 0.0" &&
  	tail -n1 watch10B.out | jq -e ".execution.expiration == ${update1}" &&
  	cat watch10B.out | jq -e ".execution.expiration == ${update1}"
  
  waiting for 2 watchers
  2
  waiting for 3 watchers
  3
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 1710864460.7675567, "expiration": 0.0, "nodelist": ["fv-az528-180"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 1710864460.7675567, "expiration": 1710864560, "nodelist": ["fv-az528-180"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 1710864460.7675567, "expiration": 0.0, "nodelist": ["fv-az528-180"]}}
  not ok 13 - job-info: update watch can be canceled (multiple watchers)

The mergify bot merged commit 1844972 into flux-framework:master on Mar 19, 2024.
33 checks passed
@grondo deleted the issue#5790 branch on March 19, 2024 at 17:49