Reduce submission times with lower sleep and slurm state caching #2228

Merged: sbailey merged 4 commits into main from optional_depchecks on Apr 30, 2024

akremin (Member) commented Apr 29, 2024

This PR reduces the sleep time between job submissions from 0.5s to 0.1s. It also introduces caching of Slurm jobid states, which can be checked before job submission instead of querying sacct. Since querying one ID took roughly the same amount of time as querying ten, I chose to re-query all jobids in a list if any jobid in that list is not in the cache. This lets us refresh the cached information at no extra time cost and simplifies the code.
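A minimal sketch of the "re-query everything on any cache miss" policy described above. It is not the code in this PR: the function name `get_states_with_cache`, the `query_func` callable (standing in for whatever actually calls sacct), and the plain-dict cache are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List


def get_states_with_cache(
    qids: Iterable[int],
    query_func: Callable[[List[int]], Dict[int, str]],
    cache: Dict[int, str],
) -> Dict[int, str]:
    """Return {qid: state}, using the cache only if every qid is cached.

    query_func is a hypothetical stand-in for the sacct query; it takes a
    list of job ids and returns a {qid: state} dict.
    """
    qids = [int(q) for q in qids]

    # Every id is already known: skip the expensive query entirely.
    if all(q in cache for q in qids):
        return {q: cache[q] for q in qids}

    # At least one id is missing. Querying one id costs about as much as
    # querying all of them, so refresh the whole list in a single call.
    fresh = query_func(qids)
    cache.update(fresh)

    # Ids the query did not report are marked UNKNOWN (an assumption of
    # this sketch) and are deliberately not cached.
    return {q: cache.get(q, "UNKNOWN") for q in qids}
```

Passing the cache in as an argument, rather than using a module-level global, is a choice made here only to keep the sketch easy to exercise in isolation.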

The reason we need to check sacct is that Slurm has two databases: an operational one that tracks current jobs, and a long-term one that stores job information. Roughly 10 minutes after a job stops, it is purged from the operational database and can only be identified via the long-term database using sacct. At that point, a new job submitted with --dependency=afterok:<JOBID> will be refused, since <JOBID> is no longer in the operational database and Slurm cannot resolve a dependency on it. Thus, we need to drop these finished jobs from the dependency list and only pass running or pending jobs to Slurm.
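As a rough illustration of that filtering step, the sketch below builds an sbatch argument list that keeps only dependencies still pending or running. The helper name `build_sbatch_command` and the `ACTIVE_STATES` set are assumptions made for this example, not the names used in the PR, and the sketch does not handle failed dependencies, which would need separate treatment.

```python
from typing import Dict, List

# Assumption for this sketch: only these states mean the job is still in
# Slurm's operational database and can be named in --dependency.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def build_sbatch_command(script: str, dep_states: Dict[int, str]) -> List[str]:
    """Build an sbatch command, dropping dependencies that have finished.

    dep_states maps each dependency's job id to its last known state.
    """
    active = [str(qid) for qid, state in dep_states.items()
              if state in ACTIVE_STATES]

    cmd = ["sbatch", "--parsable"]
    if active:
        # Multiple job ids are joined with ':' after afterok.
        cmd.append("--dependency=afterok:" + ":".join(active))
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical job ids and script name, for illustration only.
    print(build_sbatch_command(
        "example.slurm",
        {101: "COMPLETED", 102: "PENDING", 103: "RUNNING"},
    ))
```

If every dependency has already completed, the `--dependency` flag is omitted entirely and the job is submitted without one.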

I tested this by running one night in my personal prod using:
> desi_proc_night -n 20240409 &>>/global/cfs/cdirs/desi/spectro/redux/kremin/run/20240409.log

The queue was filled with jobs with proper dependencies handled:

JOBID            ST USER      NAME          NODES TIME_LIMIT       TIME  SUBMIT_TIME          QOS             START_TIME           FEATURES       NODELIST(REASON

24969241         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:22  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969239         PD kremin    ztile-6200-t  1          25:00       0:00  2024-04-29T16:41:21  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969238         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:21  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969237         PD kremin    ztile-20356-  1          25:00       0:00  2024-04-29T16:41:20  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969235         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:19  gpu_regular     N/A                  gpu&a100       (Dependency)   

[...]

24969212         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:41:02  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969211         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:59  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969208         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:58  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969194         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:52  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969224         PD kremin    nightlyflat-  1           5:00       0:00  2024-04-29T16:41:12  regular_1       N/A                  cpu            (Dependency)   
24969187         PD kremin    psfnight-202  1           5:00       0:00  2024-04-29T16:40:47  regular_1       N/A                  cpu            (Dependency)   
24969186         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:43  regular_1       N/A                  cpu            (Dependency)   
24969182         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:36  regular_1       N/A                  cpu            (Dependency)   
24969180         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:31  regular_1       N/A                  cpu            (Dependency)   
24969179         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:28  regular_1       N/A                  cpu            (Dependency)   
24969177         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:26  regular_1       N/A                  cpu            (Dependency)   
24969175         PD kremin    ccdcalib-202  2          15:00       0:00  2024-04-29T16:40:25  regular_1       N/A                  cpu            (Priority)     

And the code is correctly using the cached states rather than querying sacct:
> grep "get_queue_states_from_qid" 20240409.log

INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969177, 24969179, 24969180, 24969182, 24969186]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969187]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969187]) are cached. Using cached values.

[...]

INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969233]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969235]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969238]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.

And the sbatch calls were correct in that they still contained the dependencies (since the jobs they depend on had all just been submitted and were therefore not yet complete).

```
> grep "sbatch" 20240409.log

INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/ccdcalib-20240409-00235120-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235125-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235126-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235127-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235128-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235129-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969177:24969179:24969180:24969182:24969186', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/psfnight-20240409-00235125-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969187', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/flat-20240409-00235141-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969187', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/flat-20240409-00235142-a0123456789.slurm']

[...]

INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-20329.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969233', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/20329/20240409/ztile-20329-thru20240409.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-20356.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969235', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/20356/20240409/ztile-20356-thru20240409.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-6200.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969238', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/6200/20240409/ztile-6200-thru20240409.slurm']
```

@akremin akremin requested a review from sbailey April 29, 2024 23:56
sbailey (Contributor) commented Apr 30, 2024

This looks good, though I'm itching to write a few unit tests for it. I'll leave it open until either we have added some tests exercising the caching, or we need to launch a miniprod and/or make a tag and need this merged even without unit tests.

@sbailey sbailey merged commit 2dc018b into main Apr 30, 2024
26 checks passed
@sbailey sbailey deleted the optional_depchecks branch April 30, 2024 19:46