
Memory constraints and load balancing #806

Merged: 27 commits, Sep 19, 2019

Conversation

@tskisner (Member) commented Aug 9, 2019

Add memory constraints to the spectral grouping and redshift tasks. Also add functionality to the nersc_job_size function that detects large load imbalance and reduces the job size to compensate.
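Roughly, the imbalance handling amounts to something like the sketch below. The function and argument names, the threshold, and the scaling rule are illustrative only, not the actual nersc_job_size code:

```python
# Illustrative sketch only -- not the real nersc_job_size implementation.
def shrink_for_imbalance(nodes, task_runtimes, imbalance_limit=2.0):
    """Reduce a requested node count when per-task runtimes are very uneven.

    task_runtimes: estimated runtime per task (hypothetical input).
    If the slowest task exceeds the mean by more than imbalance_limit,
    most nodes would sit idle waiting for it, so request a smaller job.
    """
    if not task_runtimes:
        return nodes
    mean_time = sum(task_runtimes) / len(task_runtimes)
    imbalance = max(task_runtimes) / mean_time
    if imbalance > imbalance_limit:
        nodes = max(1, int(nodes / imbalance))  # never go below one node
    return nodes
```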

@julienguy (Contributor) commented:

For the spectra task, you could scale memory with the sum of the number of targets over the matching healpix_frame rows instead of just the number of matching rows; ntargets is a column of the table. This would be a modest improvement, given that the number of targets per frame is fixed, but still an improvement, because the number of targets from a frame that intersect a pixel varies and is also a function of nside.
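Something along these lines; the healpix_frame table and ntargets column come from the comment above, while the "pixel" column name and the per-target constant are placeholders:

```python
import numpy as np

# Sketch of the suggested estimate. Assumes healpix_frame behaves like an
# astropy Table with "pixel" and "ntargets" columns; gb_per_target is made up.
def spectra_task_memory_gb(healpix_frame, pixel, gb_per_target=0.001):
    rows = healpix_frame[healpix_frame["pixel"] == pixel]
    ntargets_total = int(np.sum(rows["ntargets"]))
    return ntargets_total * gb_per_target
```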

@sbailey (Contributor) commented Aug 9, 2019

@julienguy although the output file scales with the number of targets, we read in the entire frame, since the unused targets are likely to be needed by a neighboring healpix that will be processed soon. So the memory scales with the number of frames (for loading the inputs) and the number of targets covered by the healpix (for constructing the output).

I'd like to test this after the cori I/O outage before merging. Thanks.
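As a back-of-the-envelope illustration of that two-term scaling, with entirely made-up constants:

```python
# Two-term memory model described above: inputs scale with the number of
# frames read (whole frames, even for unused targets), outputs with the
# number of targets covered by the healpix. Constants are placeholders.
def grouping_memory_gb(n_frames, n_targets, gb_per_frame=0.5, gb_per_target=0.001):
    input_gb = n_frames * gb_per_frame
    output_gb = n_targets * gb_per_target
    return input_gb + output_gb
```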

@tskisner (Member, Author) commented Aug 9, 2019

In addition to what @sbailey said, I also set the memory requirement based only on the frames in healpix_frame with a matching pixel; I do not consider the "state=1" readiness, since at the beginning of a fresh production this regrouping job may be scheduled before any of the frames are ready. So the time / memory estimate is a worst case.

@tskisner (Member, Author) commented Aug 9, 2019

I just rebased this against master after the merge of #805

… bug fixes. Add memory and time requirements to spectral regrouping tasks.
.format(nproc)
)
nworker = nproc
taskproc = nproc // nworker
Review comment (Contributor) on the lines above:

Changing this line to

taskproc = min(nproc // nworker, task_classes[tasktype].run_max_procs())

fixes a problem where serial tasks (like fiberflatnight) could be given communicators with multiple ranks that all tried to do the same thing and clobbered each other. That fixed the fiberflatnight step; I'm now running a larger production from the beginning with that one-line change and will report back if it has any other unintended consequences.

Test production at NERSC in /global/cscratch1/sd/sjbailey/desi/svdc2019d/spectro/redux/v11

Follow-up review comment (Contributor):

The larger test is still running, but so far the only problems have been the logfile redirection timeouts and the failures that cascade from there due to dependencies. I'll commit this change.

Review comment (Member Author):

The logic in run.py around this section is wrong, and I had a local commit Saturday on cori to fix it, but did not push because there was another issue... I'll try to reconcile this with my changes and open a PR against this branch.

Stephen Bailey and others added 8 commits September 1, 2019 20:25
…n and use that for both run_task_list and dryrun.
…time rather than trying to deduce this from the MPI communicator size. Also fix the use of the realtime queue on cori.
Move common calculation of runtime distribution to a separate function
Implement a "desi_pipe status" command
@sbailey (Contributor) commented Sep 19, 2019

I reverted the starfit --ncpu 1 change, since that made the step slower rather than faster. For memory reasons that step is allocated 4 cores anyway (16 hypercores on KNL), and without --ncpu 1 it uses multiprocessing parallelized 16x to run faster.

(attached image: distribution plot)

The tail of the distribution is driven by how many standard stars are on a tile, which is variable (e.g. SV MWS tiles have a lot of standard stars).

This does use the MPI+multiprocessing combination, which has been fragile in the past, though only when we tried to use the fancier "copy on write" features of multiprocessing, which don't work with Cray MPI. Empirically, this step's simpler MPI+multiprocessing usage has been working robustly, with only one case of a timeout failure.
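For reference, the simpler pattern being described is one multiprocessing pool per MPI rank, passing work by ordinary pickling rather than copy-on-write sharing. The worker function, task list, and pool size below are placeholders:

```python
from mpi4py import MPI
import multiprocessing as mp

def fit_one_star(task):
    # Placeholder worker; the real per-star fitting would go here.
    return task

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    my_tasks = list(range(comm.rank, 100, comm.size))  # round-robin split of 100 dummy tasks
    # Each MPI rank runs its own 16-process pool; arguments and results are
    # pickled, avoiding the copy-on-write tricks that break with Cray MPI.
    with mp.Pool(processes=16) as pool:
        results = pool.map(fit_one_star, my_tasks)
```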

The current state of affairs is that the pipeline runs through the spectra step unless it hits a bad node or srun problem, but it is still mis-tuned for the redshift fitting step, which consistently gets killed by slurm. The logs just say "srun: error: nid12668: task 1216: Aborted" without mentioning "memory" or "oom", but past cases of this message were also tied to memory problems.

Even though this branch still doesn't succeed at running end to end (even when there aren't any NERSC I/O or node problems), it does get up to the last step, so I'm inclined to merge it and then restart the redshift task optimization in a new branch.

@sbailey (Contributor) commented Sep 19, 2019

For the record: the test failures in this branch were due to unrelated QA code and were fixed in PR #814 and follow-up work on master.

@tskisner (Member, Author) commented:

That plan sounds good to me (merge sooner rather than later). Many things about the timing estimates in the code that were previously completely wrong (and only worked by happy coincidence) are now fixed and clearer.
