Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-perilog-run: improve usefulness of logging when prolog/epilog fails #4054

Merged
merged 2 commits into from
Jan 19, 2022

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Jan 19, 2022

When attempting to set up prolog/epilog support on a system, it was discovered that errors from remote prolog/epilog execution were being copied to the logs without an identifying rank or hostname, which made diagnosis of errors unnecessarily difficult.

This PR improves the logging of the flux perilog-run script when used to execute per-rank prolog and epilog by including the rank and hostname in any lines of stderr generated by these scripts, along with other small improvements.

Problem: When `flux perilog-run --exec-per-rank=CMD` fails, error
messages from individual ranks are not differentiated, so it is
difficult to tell on which rank/host the prolog or epilog error
occurred. Additionally, the default Python logging level is too low
and important "info" messages are dropped.

Adjust the execution of subprocesses so that stderr is captured,
and log any stderr from failed ranks. Also capture the instance
hostlist before running and use this to display the failing hostname
along with its rank for easier diagnosis.

Finally, promote the log message regarding draining nodes from
LOGGER.info to LOGGER.error, so that it is displayed by default.

Fixes flux-framework#4053
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable.

Problem: No tests of `flux perilog-run` ensure that errors generated
by the prolog/epilog are logged by the perilog-run script along with
their hostname and rank.

Add a test to ensure the above is true.
@grondo
Copy link
Contributor Author

grondo commented Jan 19, 2022

Thanks! Had to fix a broken && chain in the test and force pushed.

@codecov
Copy link

codecov bot commented Jan 19, 2022

Codecov Report

Merging #4054 (c854768) into master (76f00c5) will increase coverage by 0.32%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4054      +/-   ##
==========================================
+ Coverage   83.45%   83.77%   +0.32%     
==========================================
  Files         374      374              
  Lines       61397    54243    -7154     
==========================================
- Hits        51236    45440    -5796     
+ Misses      10161     8803    -1358     
Impacted Files Coverage Δ
src/cmd/flux-perilog-run.py 94.87% <100.00%> (+0.18%) ⬆️
src/common/libutil/setenvf.c 0.00% <0.00%> (-100.00%) ⬇️
src/common/libpmi/pmi_strerror.c 66.66% <0.00%> (-8.34%) ⬇️
src/common/libutil/wallclock.c 50.00% <0.00%> (-7.90%) ⬇️
src/cmd/builtin/heaptrace.c 57.44% <0.00%> (-5.52%) ⬇️
src/shell/pty.c 68.33% <0.00%> (-5.28%) ⬇️
src/common/libutil/xzmalloc.c 51.51% <0.00%> (-5.25%) ⬇️
src/common/libcontent/content-util.c 50.00% <0.00%> (-4.55%) ⬇️
src/common/libschedutil/hello.c 70.96% <0.00%> (-4.04%) ⬇️
src/cmd/builtin/content.c 73.87% <0.00%> (-3.99%) ⬇️
... and 303 more

@mergify mergify bot merged commit 4472e3f into flux-framework:master Jan 19, 2022
@grondo grondo deleted the issue#4053 branch January 19, 2022 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants