flux-perilog-run exits silently with failure when one or more ranks are not online #5904

Closed
grondo opened this issue Apr 19, 2024 · 1 comment

Comments

@grondo
Contributor

grondo commented Apr 19, 2024

While debugging flux-framework/flux-sched#1182 it took much longer than necessary to determine what was going on because the prolog was failing silently.

It so happens that flux-perilog-run.py checks for offline ranks so it can avoid targeting them with flux-exec(1):

    #  Check for any offline ranks and subtract them from targets.
    #  Optionally drain offline ranks with a unique message that prolog/epilog
    #  failed due to offline state:
    #
    offline = offline_ranks(handle) & ranks
    if offline:
        returncode = 1
        LOGGER.info("%s: %s: ranks %s offline. Skipping.", jobid, name, offline)
        ranks.subtract(offline)
        if args.drain_offline:
            drain(handle, offline, f"offline for {jobid} {name}")

I guess LOGGER.info() messages are not emitted by default, because the "Skipping" message does not appear without -v, hence the silent failure. Also, the prolog is set to fail (returncode = 1) if there are any offline ranks.
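To illustrate the suspected cause, here is a minimal standalone sketch using only the Python standard library. It assumes the script's logger runs at WARNING unless -v is given (that is Python's default for the root logger, but I have not confirmed how flux-perilog-run.py configures it). The helper `emitted_messages` and the logger names are hypothetical and exist only for this demonstration.

```python
import io
import logging


def emitted_messages(level):
    """Return the lines a logger set to `level` writes for one info and one error call."""
    stream = io.StringIO()
    # Hypothetical logger name, unique per level so the two runs don't share state.
    logger = logging.getLogger(f"perilog-demo-{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep output confined to our in-memory stream
    logger.handlers = [logging.StreamHandler(stream)]
    logger.info("ranks %s offline. Skipping.", "3-4")
    logger.error("ranks %s offline. Skipping.", "3-4")
    return stream.getvalue().splitlines()


# Assumed default (no -v): only the error-level message gets through.
default_output = emitted_messages(logging.WARNING)
# Assumed verbose mode (-v): both messages get through.
verbose_output = emitted_messages(logging.INFO)
```

At the WARNING level the info() call is dropped before it reaches any handler, which would match the behavior described above.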

This should be improved so that the error message is logged by default. Also, it would be helpful to emit a specific exception here instead of the generic "prolog failed with exit code=1" exception.
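One possible shape for that improvement, as a rough sketch only: log at error level so the message appears without -v, and return a note that the caller could attach to a more specific job exception. This is not the actual flux-perilog-run.py code or the eventual fix; plain Python sets stand in for the idset objects, and `subtract_offline` and its return convention are hypothetical.

```python
import logging

LOGGER = logging.getLogger("perilog-sketch")  # hypothetical logger name


def subtract_offline(jobid, name, ranks, offline_all):
    """Hypothetical helper: drop offline ranks from the targets.

    Returns (remaining_ranks, note). `note` is None when every target rank
    is online, otherwise a short description suitable for an exception note.
    """
    offline = ranks & offline_all
    if not offline:
        return ranks, None
    note = f"{name}: ranks {sorted(offline)} offline"
    # error() instead of info(), so the message is visible at default verbosity
    LOGGER.error("%s: %s. Skipping.", jobid, note)
    return ranks - offline, note


# Usage: ranks 2 and 3 are offline, so only 0 and 1 remain as targets.
remaining, note = subtract_offline("f123", "prolog", {0, 1, 2, 3}, {2, 3, 7})
all_online, no_note = subtract_offline("f123", "prolog", {0, 1}, {2, 3, 7})
```

How the note would actually be raised as a job exception is left out here, since that depends on the Flux API.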

@grondo
Contributor Author

grondo commented Apr 23, 2024

Fixed by #5910

@grondo closed this as completed Apr 23, 2024