
Rationalise discovery of available memory and CPUs #2619

Merged · 23 commits · Feb 29, 2024
Conversation

@benjaminhwilliams (Member) commented Feb 27, 2024

At present, there are several different approaches within DIALS for discovering the available memory and CPUs on the system. This leads to patchy behaviour in certain environments, most notably when running on a Slurm-managed cluster node, where cgroups establish the memory allocated to the job. For example, dials.integrate will discover a Slurm node's entire memory and try to use it, rather than respecting the memory limit imposed by the scheduler, and so it is frequently killed by the OOM killer.

Rather than reinvent the wheel and contribute to the proliferation of standards within DIALS for discovery of resources, I've tried here to strip out all the existing memory and CPU discovery methods and replace them with tools cribbed from Dask and Dask Distributed, on the basis that the devs there have thought longer and harder about cross-platform compatibility than we have. I've also tried to preserve the commit history from those two projects, both to give credit where credit is due and also to provide some history that explains why things are done in this way.

As a side-effect of these changes, I've removed the fall-back option within dials.integrate to treat swap space as a simple extension of virtual memory and try to allocate to it. Swapping by design doesn't seem to be a good idea and, in trying to test these changes, I've not been able to manufacture a situation in which it actually works — on a RHEL8 workstation with 16 GB of RAM and 8 GB of swap, the OOM killer still intervenes when dials.integrate exceeds the available virtual memory. And, of course, there's no swap anyway in an HPC context, so it's irrelevant there.

I've tested these changes fairly extensively using a Slurm cluster and a PC with RHEL8 and have no concerns but would welcome other tests on other platforms if people have the time and are willing.
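To illustrate the cgroup-aware approach described above, here is a minimal, stdlib-only sketch of memory-limit discovery. This is not the code merged in this PR (which adapts Dask Distributed's implementation), and it covers only the cgroup v1 path; the function name is illustrative.

```python
import os
import sys


def memory_limit():
    """Usable memory in bytes, honouring a cgroup (v1) limit if present."""
    try:
        # Total physical memory via POSIX sysconf (unavailable on Windows).
        limit = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        limit = None
    if sys.platform == "linux":
        try:
            with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
                cgroup_limit = int(f.read())
            # Prefer the cgroup limit when it is smaller than host memory,
            # e.g. on a Slurm node where the scheduler caps the job.
            if limit is None or 0 < cgroup_limit < limit:
                limit = cgroup_limit
        except OSError:
            pass  # No cgroup v1 memory controller (e.g. cgroup v2 hosts)
    return limit
```

This is the behaviour the PR description calls for: the scheduler-imposed limit wins over the host total, so dials.integrate would no longer plan around memory it cannot actually use.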

jcrist and others added 19 commits September 10, 2019 11:26
…#3039)

This adds support for detecting resource (CPU and memory) limits set
using `cgroups`. This makes dask resource detection play nicer with
container systems (e.g. `docker`), rather than detecting the host memory
and CPUs available.

This also centralizes all queries about the host platform to a single
module (`distributed.platform`), with top-level constants defined for
common usage.
A few fixes for resource detection using cgroups:

- The directory for determining CPU availability isn't standardized
across Linux distros; it could be either `cpuacct,cpu` or `cpu,cpuacct`.
We now check for both.
- When allotted fractional cpus (e.g. 1.5), we now round up.

Also adds tests for both CPU and memory limit detection under cgroups,
by monkeypatching in fake files.

Fixes #3053
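The quota check this commit describes can be sketched as a stand-alone function (illustrative only, not the merged code; the `bases` parameter exists purely so the fake-file testing style mentioned above is easy to reproduce):

```python
import math
import os


def cgroup_cpu_limit(bases=("/sys/fs/cgroup/cpuacct,cpu",
                            "/sys/fs/cgroup/cpu,cpuacct")):
    """Return the cgroup v1 CPU quota as a whole number of CPUs, or None.

    Checks both directory spellings used by different distros, and rounds
    fractional allocations (e.g. 1.5 CPUs) up rather than down.
    """
    for base in bases:
        try:
            with open(os.path.join(base, "cpu.cfs_quota_us")) as f:
                quota = int(f.read())
            with open(os.path.join(base, "cpu.cfs_period_us")) as f:
                period = int(f.read())
        except OSError:
            continue  # This spelling not present; try the other one
        if quota > 0 and period > 0:
            return math.ceil(quota / period)
    return None
```

With a quota of 150000 µs against a 100000 µs period (1.5 CPUs), this returns 2, matching the round-up behaviour in the commit message.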
Incorporate tools from Dask Distributed for discovering memory limits,
preserving Git history for the copied file.
Incorporate tools from Dask for discovering available CPUs, preserving
Git history for the copied file.
If os.cpu_count() or psutil.Process().cpu_affinity() fail, handle them
more explicitly. If none of the available methods yields a value for the
number of CPUs, default to 1.
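The fallback chain this commit describes might look something like the following sketch (not the merged code; psutil is treated as optional here):

```python
import os


def cpu_count():
    """Number of CPUs available to this process, defaulting to 1."""
    count = None
    try:
        import psutil  # Optional; reports the process's CPU affinity

        affinity = psutil.Process().cpu_affinity()
        if affinity:
            count = len(affinity)
    except Exception:
        pass  # psutil missing, or affinity unsupported on this platform
    if count is None:
        count = os.cpu_count()  # May itself return None
    return count if count else 1  # Last-resort default of a single CPU
```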
@graeme-winter (Contributor) left a comment


I also think we need to add a comment somewhere that the static limits (i.e. CPU_COUNT, MEMORY_LIMIT) should be used in preference to recalculating. If we had a DIALS contributors guide, I would put it there.

I also note that this is logically a generous limit, since it is the full RAM of the machine: how we use this is an interesting question. However you are explicitly not changing the behaviour in this regard so this comment is out of context & what you are suggesting here is probably a substantial improvement on what came before.

Thank you for the contribution, this is a substantial body of effort.
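The reviewer's point about preferring the static limits can be illustrated with a small sketch (the detection stub below is a placeholder, not the real logic):

```python
import os


def _cpu_count():
    # Placeholder for the full detection chain (affinity, cgroups, ...).
    return os.cpu_count() or 1


# Computed once at import time; callers should use the module-level
# constant rather than re-running detection on every call.
CPU_COUNT = _cpu_count()
```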

Review comments on:
- newsfragments/2619.misc
- src/dials/algorithms/refinement/refiner.py
- src/dials/util/system.py
@benjaminhwilliams (Member, Author) commented

Hoping that merging main into this branch will have sorted test failures that seem unrelated to changes here.


codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 44.80198%, with 223 lines in your changes missing coverage. Please review.

Project coverage is 78.40%. Comparing base (e36f939) to head (85c455f).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2619      +/-   ##
==========================================
- Coverage   78.49%   78.40%   -0.10%     
==========================================
  Files         609      611       +2     
  Lines       75057    75284     +227     
  Branches    10712    10746      +34     
==========================================
+ Hits        58919    59028     +109     
- Misses      13970    14076     +106     
- Partials     2168     2180      +12     

@benjaminhwilliams benjaminhwilliams merged commit 6372cee into main Feb 29, 2024
16 of 18 checks passed
@benjaminhwilliams benjaminhwilliams deleted the system-info branch February 29, 2024 14:12
jbeilstenedmands pushed a commit to jbeilstenedmands/dials that referenced this pull request Mar 8, 2024
Rationalise the discovery of system resources (CPUs and available virtual memory) across DIALS, using tooling adapted from Dask and Dask Distributed.

Introduces dials.util.system.CPU_COUNT and dials.util.system.MEMORY_LIMIT.

In cases where there is a large memory footprint, do not resort to adding swap space into the total available memory to calculate the appropriate number of multiprocessing tasks.

---------

Co-authored-by: Jim Crist <jcrist@users.noreply.github.com>
Co-authored-by: Albert DeFusco <albert.defusco@me.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: crusaderky <crusaderky@gmail.com>
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Co-authored-by: Samantha Hughes <shughes-uk@users.noreply.github.com>
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
Co-authored-by: Johan Olsson <johan@jmvo.se>