Reduce submission times with lower sleep and slurm state caching #2228
This PR reduces the sleep time between job submissions from 0.5s to 0.1s. It also introduces caching of Slurm job ID states, which can be checked before job submission instead of querying `sacct`. Since querying one ID or ten IDs takes roughly the same amount of time, I chose to re-query all job IDs whenever any job ID in the list is missing from the cache. This lets us refresh the information without losing time and simplifies the code.

The reason we need to check `sacct` is that Slurm has two databases: an operational one that tracks current jobs, and a long-term one that stores job history. Roughly 10 minutes after a job stops, it is purged from the operational database and can only be identified via the long-term database using `sacct`. At that point, a new job submitted to depend on the old job with `--dependency=afterok:<JOBID>` will be refused, since `<JOBID>` is no longer in the operational database and Slurm cannot set up the dependency without that information. Thus, we need to filter out these finished jobs and only pass running or pending jobs to Slurm as dependencies.

I tested this by running one night in my personal prod using:
> desi_proc_night -n 20240409 &>>/global/cfs/cdirs/desi/spectro/redux/kremin/run/20240409.log
The queue was filled with jobs with proper dependencies handled:
And the code is correctly using the cached states rather than querying `sacct`:

> grep "get_queue_states_from_qid" 20240409.log
And the
sbatch
calls were correct in that they still contained the dependencies (since they were all just submitted and therefore not complete).`> grep "sbatch" 20240409.log