Slurm reports the warning slurmstepd: Exceeded step memory limit at some point. in the standard error stream, in many cases where it does not seem justified.
I am using Slurm version 14.11.8 on CentOS Linux release 7.1.1503 (Core). All of the cluster's nodes have at least 128GB of RAM installed.
Here are two examples illustrating this:
Example 1. Script to create a 50MB file while requesting 50MB of RAM: produced the warning, despite the low RAM usage reported by sacct.
Example 2. Script to create a 50MB file while requesting 100MB of RAM: ran without a warning, as expected.
According to sacct, each job used about 1MB of RAM (MaxRSS) and roughly 200MB of virtual memory.
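The scripts themselves did not survive in this copy of the issue. A minimal sketch of what Example 1 probably looked like (the dd command, file name, and output paths are assumptions, not the original script):

```shell
#!/bin/bash
#SBATCH --mem=50                 # request 50MB of RAM (--mem defaults to MB)
#SBATCH --output=make_file.out
#SBATCH --error=make_file.err

# Create a 50MB file of zeros. dd itself needs only ~1MB of RSS,
# but every written page lands in the kernel page cache.
dd if=/dev/zero of=test_50MB.bin bs=1M count=50
```

Example 2 would be identical except for #SBATCH --mem=100.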
Where I really run into trouble is with 5-60GB files, which are quite common in bioinformatics. I don't want to over-request RAM just to avoid a warning that has no effect on the results; that would unnecessarily tie up resources.
When I tried creating 5GB files while requesting 5GB of RAM, I got the same slurmstepd: Exceeded step memory limit at some point. warning. Increasing the requested RAM to 10GB made the warning go away, but that seems quite excessive: in no case was more than 2MB of RAM actually used, according to the MaxRSS column reported by sacct.
By the way, at the other end of the spectrum, creating files of up to 20MB while requesting only 2MB of RAM produced no warnings.
@molecules, sorry! This is the LLNL upstream Slurm repo, but main development has moved to schedmd.com. You can ask this question on the slurm-dev mailing list or report the issue to SchedMD at http://bugs.schedmd.com/.
However, I will note that the limit described by "step memory limit" in this error message is not necessarily related to the RSS of your process. The limit is provided and enforced by the cgroup plugin, and memory cgroups track not only the RSS of the tasks in your job but also file cache, mmap'd pages, etc. If I had to guess, you are hitting the memory limit due to page cache. In that case, you can probably just ignore the error: hitting the limit most likely triggered memory reclaim, which freed cached pages (it should not be a fatal error).
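You can see this accounting yourself: the kernel reports page cache separately from RSS in each memory cgroup's memory.stat file. A sketch, assuming Linux with either cgroup v1 or v2 mounted (the candidate paths below are common defaults, not Slurm-specific; Slurm's own job cgroups live deeper in the hierarchy):

```shell
# Print the cache vs. RSS counters from the first readable memory.stat.
# cgroup v1 uses "cache"/"rss" keys; cgroup v2 uses "file"/"anon".
show_memstat() {
  for f in /sys/fs/cgroup/memory/memory.stat \
           /sys/fs/cgroup/system.slice/memory.stat \
           /sys/fs/cgroup/user.slice/memory.stat; do
    if [ -r "$f" ]; then
      grep -E '^(cache|rss|file|anon) ' "$f" && return 0
    fi
  done
  echo "no readable memory.stat found"
}
show_memstat
```

For a running job, the "cache" (or "file") counter will grow with every page written to disk, even though MaxRSS stays tiny — which matches the ~1MB MaxRSS reported above.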
If you'd like to avoid the error, and you are only writing data out and don't need it cached, you could try posix_fadvise(2) with POSIX_FADV_DONTNEED, which hints to the VM that you are not going to read the pages you are writing out again.
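For shell-script jobs where calling posix_fadvise(2) directly isn't practical, GNU dd exposes the same hint via oflag=nocache. A sketch, assuming GNU coreutils (the file name is an assumption):

```shell
# Write a 50MB file and ask the kernel to drop the written pages from
# the page cache; GNU dd implements oflag=nocache using posix_fadvise.
dd if=/dev/zero of=big_output.bin bs=1M count=50 oflag=nocache
```

With the cache dropped as it is written, the job's memory cgroup usage should stay close to the process RSS, so a tight --mem request no longer trips the step memory limit.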