Reduce submission times with lower sleep and slurm state caching #2228
This PR reduces the sleep time between job submissions from 0.5s to 0.1s. It also introduces caching of Slurm job ID states, which can be checked before job submission instead of querying `sacct`. Since querying one ID or ten IDs takes roughly the same amount of time, I chose to re-query all job IDs whenever any job ID in the list is missing from the cache. This lets us refresh the information without losing time and simplifies the code.

The reason we need to check `sacct` is that Slurm has two databases: an operational one that tracks current jobs, and a long-term one that stores job history. Roughly 10 minutes after a job stops, it is purged from the operational database and can only be identified via the long-term database using `sacct`. At that point, a new job submitted to depend on the old job with `--dependency=afterok:<JOBID>` will be refused, since `<JOBID>` is no longer in the operational database and Slurm cannot set up the dependency without that information. Thus, we need to filter out these finished jobs and only pass running or pending jobs to Slurm as dependencies.

I tested this by running one night in my personal prod using:
> desi_proc_night -n 20240409 &>>/global/cfs/cdirs/desi/spectro/redux/kremin/run/20240409.log
The queue was filled with jobs with proper dependencies handled:
And the code is correctly using the cached states rather than querying `sacct`:

> grep "get_queue_states_from_qid" 20240409.log
And the
sbatch
calls were correct in that they still contained the dependencies (since they were all just submitted and therefore not complete).`> grep "sbatch" 20240409.log