Rationalise discovery of available memory and CPUs #2619
Conversation
…#3039) This adds support for detecting resource (CPU and memory) limits set using `cgroups`. This makes dask resource detection play more nicely with container systems (e.g. `docker`), rather than detecting the host's available memory and CPUs. This also centralizes all queries about the host platform in a single module (`distributed.platform`), with top-level constants defined for common usage.
A few fixes for resource detection using cgroups: - The directory for determining CPU availability isn't standardized across Linux distros; it could be either `cpuacct,cpu` or `cpu,cpuacct`. We now check for both. - When allotted fractional CPUs (e.g. 1.5), we now round up. Also adds tests for both CPU and memory limit detection under cgroups, by monkeypatching in fake files. Fixes #3053
Incorporate tools from Dask Distributed for discovering memory limits, preserving Git history for the copied file.
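Dask Distributed's approach to memory-limit discovery can be sketched like so. This is a hedged simplification, not the copied file: the cgroup path is the v1 layout, the `total` parameter is added here purely for testability, and the `psutil` dependency is assumed to be available (as it is in both Dask Distributed and DIALS).

```python
import sys


def memory_limit(total=None):
    """Sketch: total system memory, lowered by any smaller cgroup-v1
    limit or RLIMIT_RSS soft limit (following Dask Distributed)."""
    if total is None:
        import psutil  # assumed dependency, as in Dask Distributed

        total = psutil.virtual_memory().total
    limit = total

    # cgroup v1 memory limit (Linux only); an absent file, or a huge
    # sentinel value meaning "unlimited", leaves the limit untouched.
    if sys.platform == "linux":
        try:
            with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
                cgroup_limit = int(f.read())
            if 0 < cgroup_limit < limit:
                limit = cgroup_limit
        except OSError:
            pass

    # A smaller RLIMIT_RSS soft limit also caps us, where supported
    # (RLIM_INFINITY is negative, so it fails the `0 < soft` guard).
    try:
        import resource

        soft, _hard = resource.getrlimit(resource.RLIMIT_RSS)
        if 0 < soft < limit:
            limit = soft
    except (ImportError, ValueError, OSError):
        pass

    return limit
```

The design point is that each probe can only lower the limit, never raise it, so the result is always the tightest constraint actually visible to the process.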
Incorporate tools from Dask for discovering available CPUs, preserving Git history for the copied file.
Handle failures of os.cpu_count() and psutil.Process().cpu_affinity() more explicitly. If none of the available methods yields a value for the number of CPUs, default to 1.
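That fallback chain might look something like the sketch below. Note this is illustrative, not the actual change: `os.sched_getaffinity` stands in here for `psutil.Process().cpu_affinity()` so that the sketch needs only the standard library.

```python
import os


def cpu_count():
    """Sketch: prefer the affinity mask (it respects taskset and
    cgroup cpusets), then os.cpu_count(), defaulting to 1 if neither
    method yields a value."""
    count = None

    try:
        # Not available on all platforms (e.g. absent on macOS).
        count = len(os.sched_getaffinity(0))
    except (AttributeError, OSError):
        pass

    if not count:
        count = os.cpu_count()  # documented to return None on failure

    # If none of the available methods produced a value, default to 1.
    return count if count else 1
```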
I also think we need to add a comment somewhere that the static limits (i.e. CPU_COUNT, MEMORY_LIMIT) should be used in preference to recalculating. If we had a DIALS contributors' guide, I would put it there.
I also note that this is logically a generous limit, since it is the full RAM of the machine: how we use this is an interesting question. However, you are explicitly not changing the behaviour in this regard, so this comment is out of context, and what you are suggesting here is probably a substantial improvement on what came before.
Thank you for the contribution, this is a substantial body of effort.
Hoping that merging main into this branch will have sorted the test failures that seem unrelated to the changes here.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2619      +/-   ##
==========================================
- Coverage   78.49%   78.40%    -0.10%
==========================================
  Files         609      611        +2
  Lines       75057    75284      +227
  Branches    10712    10746       +34
==========================================
+ Hits        58919    59028      +109
- Misses      13970    14076      +106
- Partials     2168     2180       +12
Rationalise the discovery of system resources (CPUs and available virtual memory) across DIALS, using tooling adapted from Dask and Dask Distributed. Introduces dials.util.system.CPU_COUNT and dials.util.system.MEMORY_LIMIT. In cases where there is a large memory footprint, do not resort to adding swap space into the total available memory to calculate the appropriate number of multiprocessing tasks.

---------

Co-authored-by: Jim Crist <jcrist@users.noreply.github.com>
Co-authored-by: Albert DeFusco <albert.defusco@me.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: crusaderky <crusaderky@gmail.com>
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Co-authored-by: Samantha Hughes <shughes-uk@users.noreply.github.com>
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
Co-authored-by: Johan Olsson <johan@jmvo.se>
At present, there are various different approaches within DIALS for discovering the available memory and CPUs on the system. This leads to some patchy behaviour in certain environments, most notably when running on a Slurm-managed cluster node, which uses cgroups to establish the memory allocated to the job. At present, for example, `dials.integrate` will discover a Slurm node's entire memory and try to use it, rather than the memory limit imposed by the scheduler, and so it will frequently be killed by the OOM killer.

Rather than reinvent the wheel and contribute to the proliferation of standards within DIALS for the discovery of resources, I've tried here to strip out all the existing memory and CPU discovery methods and replace them with tools cribbed from Dask and Dask Distributed, on the basis that the devs there have thought longer and harder about cross-platform compatibility than we have. I've also tried to preserve the commit history from those two projects, both to give credit where credit is due and to provide some history that explains why things are done in this way.

As a side-effect of these changes, I've removed the fall-back option within `dials.integrate` to treat swap space as a simple extension of virtual memory and try to allocate to it. Swapping by design doesn't seem to be a good idea and, in trying to test these changes, I've not actually been able to manufacture a situation in which it actually works: on a RHEL8 workstation with 16GB of RAM and 8GB of swap, the OOM killer still intervenes when `dials.integrate` exceeds the available virtual memory. And, of course, there's no swap anyway in an HPC context, so it's irrelevant there.

I've tested these changes fairly extensively using a Slurm cluster and a PC running RHEL8 and have no concerns, but would welcome tests on other platforms if people have the time and are willing.
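As a rough illustration of how centralised limits like these can be consumed without ever counting swap, consider choosing a multiprocessing task count from a per-task memory footprint. The values and the `n_parallel_tasks` helper below are entirely hypothetical stand-ins, not part of this PR.

```python
# Illustrative stand-ins for dials.util.system.CPU_COUNT and
# dials.util.system.MEMORY_LIMIT on a node with a 16 GiB cgroup cap.
CPU_COUNT = 8
MEMORY_LIMIT = 16 * 2**30


def n_parallel_tasks(memory_per_task, cpu_count=CPU_COUNT, memory_limit=MEMORY_LIMIT):
    """Hypothetical helper: respect both the CPU count and the memory
    limit, never treating swap as extra headroom."""
    # How many tasks fit in the (scheduler-imposed) memory limit.
    fits_in_memory = max(1, memory_limit // memory_per_task)
    return int(min(cpu_count, fits_in_memory))


print(n_parallel_tasks(4 * 2**30))   # 4 GiB per task -> prints 4
print(n_parallel_tasks(64 * 2**30))  # oversized task -> prints 1
```

Because MEMORY_LIMIT here reflects the cgroup cap rather than the host's total RAM, the same call on a Slurm node stays inside the scheduler's allocation instead of inviting the OOM killer.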