Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not report failure when skipping previously-written PSF fits #2059

Merged
merged 1 commit into from Jun 27, 2023

Conversation

marcelo-alvarez
Copy link
Contributor

Solves #1980 by not increasing the error count on ranks corresponding to already-completed PSF fits written by a previous job. Instead, a message is written to the log noting that there is nothing to do for each such rank and the arc job returns the same error code as it would have if those ranks had completed those fits.

I tested this by:

  1. taking the outputs from a previously successful arc exposure (EXPID) processing
  2. removing one of the psf files (i.e. fit-psf-z0-EXPID.fits)
  3. rerunning the arc*.slurm script

The resulting log files are in

/global/cfs/cdirs/desi/users/malvarez/spectro/redux/desispec-spxmidjobfail/specex-midjobfail-20230512-all/run/scripts/night/20230512/arc-20230512-00180239-a0123456789-9829453.log

and

/global/cfs/cdirs/desi/users/malvarez/spectro/redux/desispec-spxmidjobfail/main-20230512-all/run/scripts/night/20230512/arc-20230512-00180239-a0123456789-9829449.log

for this branch and main, respectively. Both main and this branch successfully generated the missing psf file, but only the latter returned success.

@sbailey please have a look and merge if appropriate, thanks.

@marcelo-alvarez marcelo-alvarez linked an issue Jun 1, 2023 that may be closed by this pull request
@akremin akremin requested review from akremin and removed request for sbailey June 5, 2023 22:38
Copy link
Member

@akremin akremin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Marcelo, the changes look good and thank you for the tests.

Can you comment more on why it used to throw an error? Is it ever true that an error can occur such that cmd won't have a camera in it's keys, and run() should be logging an error?

Following the git history it looks like this "if camera not in cmds" was introduced in a refactor you did, which is why I ask. If the original error was a misunderstanding, and there is never an error, then I'm happy with this solution to the current issue.

@marcelo-alvarez
Copy link
Contributor Author

Hi @akremin, thanks for the comment and question. I don't know about whether the original error was a misunderstanding or if it is every true that an error can occur such that cmd won't have a camera in it's keys. @sbailey or @julienguy might be able to answer that.

@marcelo-alvarez
Copy link
Contributor Author

@sbailey do you have any insight on the question of whether the original "if camera not in cmds" error I introduced in a refactor was due to a misunderstanding because there is never an error? I don't have enough context on the overall use cases for specex to say whether this was a misunderstanding because such an error never happens or rather just because it caused the situation that this PR attempts to fix. Thanks.

@sbailey
Copy link
Contributor

sbailey commented Jun 27, 2023

I think the previous behavior was just a misunderstanding. Although it could be reasonable to consider it an error if you ask the code to do something that is already done, elsewhere in desispec we treat "already done" as a non-error so that the code just skips anything that is already done and completes anything that is left to do. This makes it easier to recover from system-induced failures by just re-running the identical command again without having to edit it down to only the cameras left to do (a very early version of the pipeline required editing commands to recover from failures and that was a pain...). Merging now.

@sbailey sbailey merged commit 4915752 into main Jun 27, 2023
24 checks passed
@sbailey sbailey deleted the specex-midjobfail branch June 27, 2023 23:06
sbailey added a commit that referenced this pull request Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

arc jobs not cleanly recovering from mid-job failures
3 participants