job-exec: fix potential leak of job KVS namespaces #5805
Problem: When terminating a job in the midst of starting, or when a fatal job exception is raised on a job that has already been canceled, the job-exec module may generate lots of log messages like:

    exec_kill: any (rank 4294967295): No such file or directory

In this case it is likely the subprocess could not be signaled, so no broker rank is associated with the child future for which the error is being printed. It isn't valuable to print an error in this case, so just skip the flux_log() when rank == FLUX_NODEID_ANY.

Similarly, it probably isn't useful to log an error if the errno is ENOENT, since this just means the remote process is no longer running.
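The filtering rule described in this commit message can be sketched as a small predicate. This is only an illustration, not the code from the commit: `kill_error_is_loggable` and `EXAMPLE_NODEID_ANY` are hypothetical names, and the constant is assumed to equal the rank value 4294967295 printed in the log message above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FLUX_NODEID_ANY. The value is an assumption
 * taken from the rank shown in the log message (4294967295). */
#define EXAMPLE_NODEID_ANY UINT32_MAX

/* Return true if a failed kill for this rank/errno is worth logging.
 * - rank == ANY: no broker rank is associated with the failed future
 * - ENOENT: the remote process is simply no longer running */
static bool kill_error_is_loggable (uint32_t rank, int errnum)
{
    if (rank == EXAMPLE_NODEID_ANY)
        return false;
    if (errnum == ENOENT)
        return false;
    return true;
}
```

A caller would guard its logging call with this predicate, so that only failures tied to a real rank and an unexpected errno reach the log.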
Codecov Report
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5805   +/-   ##
=======================================
  Coverage   83.33%   83.34%
=======================================
  Files         509      509
  Lines       82478    82485    +7
=======================================
+ Hits        68736    68743    +7
  Misses      13742    13742
LGTM!
@@ -696,7 +696,8 @@ static flux_future_t * namespace_move (struct jobinfo *job)
{
In the first line of the commit description in "fix potential namespace leak", there is a stray "a".
Thanks! Fixed and will set MWP.
Problem: The job->has_namespace flag is only set after the response for the KVS namespace creation RPC is received. This can cause a rare namespace leak if a job exception occurs after the request is sent, but before the response is received. The cleanup code may not see that the has_namespace flag is set and the namespace is never deleted.

Set job->has_namespace just after sending the namespace create RPC, so that namespace cleanup is forced if there is any chance the namespace will exist when the job is complete.

Also, to avoid skipping namespace_delete() when some other previous function fails (such as the final eventlog write or the copy of the namespace contents), call flux_future_or_then(3) as well as flux_future_and_then(3) in the chain of namespace cleanup futures.

Fixes flux-framework#5790
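The second part of this commit message, running the delete step whether or not an earlier cleanup step succeeded, can be illustrated with a synchronous analogy. This sketch does not use the flux future API; every name in it (`example_step_f`, `example_cleanup`, `example_chain_always`, and so on) is hypothetical, and it only models the idea of registering the same continuation for both the success and the failure path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step type: returns 0 on success, -1 on failure. */
typedef int (*example_step_f) (void *arg);

struct example_cleanup {
    bool fail_copy;          /* simulate a failed copy of namespace contents */
    bool namespace_deleted;  /* set once the delete step has run */
};

/* Earlier cleanup step that may fail. */
static int example_copy (void *arg)
{
    struct example_cleanup *c = arg;
    return c->fail_copy ? -1 : 0;
}

/* Final cleanup step that must not be skipped. */
static int example_delete (void *arg)
{
    struct example_cleanup *c = arg;
    c->namespace_deleted = true;
    return 0;
}

/* Run prev, then run next on success ("and_then") and also on
 * failure ("or_then"). Returns -1 if either step failed. */
static int example_chain_always (example_step_f prev,
                                 example_step_f next,
                                 void *arg)
{
    int rc = prev (arg);
    if (next (arg) < 0)
        rc = -1;
    return rc;
}
```

With only a success continuation, a failed copy would end the chain before the delete step; calling `next` on both outcomes keeps the failure visible in the return value while still removing the namespace.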
Hm, the distcheck build failed here:
This PR fixes the issue described in #5790. If a job is canceled or gets a fatal job exception while the namespace create RPC is in flight, the job never records that a namespace exists for it, and the namespace is leaked.
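The race can be modeled with a small simulation. This is a sketch under stated assumptions, not job-exec code: `example_job` and the helper functions are hypothetical, and `namespace_exists` stands in for the state on the KVS side, which is assumed to exist once the create request has been sent.

```c
#include <stdbool.h>

struct example_job {
    bool has_namespace;     /* job's own record that cleanup is required */
    bool namespace_exists;  /* simulated KVS-side state */
};

/* Send the create request. With flag_on_send == true the flag is set
 * immediately (the behavior after this PR); with false it is only set
 * when the response arrives (the previous behavior). */
static void example_send_create (struct example_job *job, bool flag_on_send)
{
    job->namespace_exists = true;   /* request in flight, namespace may exist */
    if (flag_on_send)
        job->has_namespace = true;
}

/* Response handler: in the previous behavior this is the only place
 * the flag was set. */
static void example_create_response (struct example_job *job)
{
    job->has_namespace = true;
}

/* Cleanup after an exception. Returns true if the namespace leaked. */
static bool example_cleanup_leaks (struct example_job *job)
{
    if (job->has_namespace) {
        job->namespace_exists = false;  /* namespace deleted */
        job->has_namespace = false;
    }
    return job->namespace_exists;
}
```

If cleanup runs between `example_send_create()` and `example_create_response()`, the old ordering reports a leak while the new ordering does not.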
Also included in this PR is an improvement to the exec_kill error message handling to reduce a non-trivial amount of log noise for large jobs.