
Rationalise discovery of available memory and CPUs #2619

Merged · 23 commits · Feb 29, 2024
Conversation

@benjaminhwilliams (Member) commented Feb 27, 2024

At present, there are several different approaches within DIALS for discovering the available memory and CPUs on the system. This leads to patchy behaviour in certain environments, most notably when running on a Slurm-managed cluster node, where cgroups establish the memory allocated to the job. For example, dials.integrate will discover a Slurm node's entire memory and try to use it, rather than respecting the memory limit imposed by the scheduler, and so it is frequently killed by the OOM killer.

Rather than reinvent the wheel and contribute to the proliferation of standards within DIALS for discovery of resources, I've tried here to strip out all the existing memory and CPU discovery methods and replace them with tools cribbed from Dask and Dask Distributed, on the basis that the devs there have thought longer and harder about cross-platform compatibility than we have. I've also tried to preserve the commit history from those two projects, both to give credit where credit is due and also to provide some history that explains why things are done in this way.

As a side-effect of these changes, I've removed the fall-back option within dials.integrate to treat swap space as a simple extension of virtual memory and try to allocate to it. Swapping by design doesn't seem to be a good idea and, in trying to test these changes, I've not been able to manufacture a situation in which it actually works — on a RHEL8 workstation with 16 GB of RAM and 8 GB of swap, the OOM killer still intervenes when dials.integrate exceeds the available virtual memory. And, of course, there's no swap anyway in an HPC context, so it's irrelevant there.

I've tested these changes fairly extensively using a Slurm cluster and a PC with RHEL8 and have no concerns but would welcome other tests on other platforms if people have the time and are willing.
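To illustrate the cgroup-aware approach described above, here is a minimal, stdlib-only sketch of memory-limit discovery. This is not the code merged in this PR (which adapts Dask Distributed's implementation), and it covers only the cgroup v1 path; the function name is illustrative.

```python
import os
import sys


def memory_limit():
    """Usable memory in bytes, honouring a cgroup (v1) limit if present."""
    try:
        # Total physical memory via POSIX sysconf (unavailable on Windows).
        limit = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        limit = None
    if sys.platform == "linux":
        try:
            with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
                cgroup_limit = int(f.read())
            # Prefer the cgroup limit when it is smaller than host memory,
            # e.g. on a Slurm node where the scheduler caps the job.
            if limit is None or 0 < cgroup_limit < limit:
                limit = cgroup_limit
        except OSError:
            pass  # No cgroup v1 memory controller (e.g. cgroup v2 hosts)
    return limit
```

This is the behaviour the PR description calls for: the scheduler-imposed limit wins over the host total, so dials.integrate would no longer plan around memory it cannot actually use.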

jcrist and others added 19 commits September 10, 2019 11:26
…#3039)

This adds support for detecting resource (CPU and memory) limits set
using `cgroups`. This makes dask resource detection play nicer with
container systems (e.g. `docker`), rather than detecting the host memory
and CPUs available.

This also centralizes all queries about the host platform to a single
module (`distributed.platform`), with top-level constants defined for
common usage.
A few fixes for resource detection using cgroups:

- The directory for determining CPU availability isn't standardized
across Linux distros; it could be either `cpuacct,cpu` or `cpu,cpuacct`.
We now check for both.
- When allotted fractional cpus (e.g. 1.5), we now round up.

Also adds tests for both CPU and memory limit detection under cgroups,
by monkeypatching in fake files.

Fixes #3053
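The quota check this commit describes can be sketched as a stand-alone function (illustrative only, not the merged code; the `bases` parameter exists purely so the fake-file testing style mentioned above is easy to reproduce):

```python
import math
import os


def cgroup_cpu_limit(bases=("/sys/fs/cgroup/cpuacct,cpu",
                            "/sys/fs/cgroup/cpu,cpuacct")):
    """Return the cgroup v1 CPU quota as a whole number of CPUs, or None.

    Checks both directory spellings used by different distros, and rounds
    fractional allocations (e.g. 1.5 CPUs) up rather than down.
    """
    for base in bases:
        try:
            with open(os.path.join(base, "cpu.cfs_quota_us")) as f:
                quota = int(f.read())
            with open(os.path.join(base, "cpu.cfs_period_us")) as f:
                period = int(f.read())
        except OSError:
            continue  # This spelling not present; try the other one
        if quota > 0 and period > 0:
            return math.ceil(quota / period)
    return None
```

With a quota of 150000 µs against a 100000 µs period (1.5 CPUs), this returns 2, matching the round-up behaviour in the commit message.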
Incorporate tools from Dask Distributed for discovering memory limits,
preserving Git history for the copied file.
Incorporate tools from Dask for discovering available CPUs, preserving
Git history for the copied file.
If os.cpu_count() or psutil.Process().cpu_affinity() fail, handle them
more explicitly. If none of the available methods yields a value for the
number of CPUs, default to 1.
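The fallback chain this commit describes might look something like the following sketch (not the merged code; psutil is treated as optional here):

```python
import os


def cpu_count():
    """Number of CPUs available to this process, defaulting to 1."""
    count = None
    try:
        import psutil  # Optional; reports the process's CPU affinity

        affinity = psutil.Process().cpu_affinity()
        if affinity:
            count = len(affinity)
    except Exception:
        pass  # psutil missing, or affinity unsupported on this platform
    if count is None:
        count = os.cpu_count()  # May itself return None
    return count if count else 1  # Last-resort default of a single CPU
```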
@graeme-winter (Contributor) left a comment


I also think we need to add a comment somewhere that the static limits (i.e. CPU_COUNT, MEMORY_LIMIT) should be used in preference to recalculating. If we had a DIALS contributors guide, I would put it there.

I also note that this is logically a generous limit, since it is the full RAM of the machine: how we use this is an interesting question. However you are explicitly not changing the behaviour in this regard so this comment is out of context & what you are suggesting here is probably a substantial improvement on what came before.

Thank you for the contribution, this is a substantial body of effort.
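The reviewer's point about preferring the static limits can be illustrated with a small sketch (the detection stub below is a placeholder, not the real logic):

```python
import os


def _cpu_count():
    # Placeholder for the full detection chain (affinity, cgroups, ...).
    return os.cpu_count() or 1


# Computed once at import time; callers should use the module-level
# constant rather than re-running detection on every call.
CPU_COUNT = _cpu_count()
```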

Review comments on:
- newsfragments/2619.misc
- src/dials/algorithms/refinement/refiner.py
- src/dials/util/system.py
@benjaminhwilliams (Member, Author) commented

Hoping that merging main into this branch will have sorted test failures that seem unrelated to changes here.


codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 44.80198%, with 223 lines in your changes missing coverage. Please review.

Project coverage is 78.40%. Comparing base (e36f939) to head (85c455f).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2619      +/-   ##
==========================================
- Coverage   78.49%   78.40%   -0.10%     
==========================================
  Files         609      611       +2     
  Lines       75057    75284     +227     
  Branches    10712    10746      +34     
==========================================
+ Hits        58919    59028     +109     
- Misses      13970    14076     +106     
- Partials     2168     2180      +12     

@benjaminhwilliams benjaminhwilliams merged commit 6372cee into main Feb 29, 2024
16 of 18 checks passed
@benjaminhwilliams benjaminhwilliams deleted the system-info branch February 29, 2024 14:12
jbeilstenedmands pushed a commit to jbeilstenedmands/dials that referenced this pull request Mar 8, 2024
Rationalise the discovery of system resources (CPUs and available virtual memory) across DIALS, using tooling adapted from Dask and Dask Distributed.

Introduces dials.util.system.CPU_COUNT and dials.util.system.MEMORY_LIMIT.

In cases where there is a large memory footprint, do not resort to adding swap space into the total available memory to calculate the appropriate number of multiprocessing tasks.

---------

Co-authored-by: Jim Crist <jcrist@users.noreply.github.com>
Co-authored-by: Albert DeFusco <albert.defusco@me.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: crusaderky <crusaderky@gmail.com>
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Co-authored-by: Samantha Hughes <shughes-uk@users.noreply.github.com>
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
Co-authored-by: Johan Olsson <johan@jmvo.se>