
Odd slurmstepd: Exceeded step memory limit at some point. warning when it isn't expected #54

Closed
molecules opened this issue Sep 14, 2015 · 2 comments

Comments

@molecules

Slurm prints the warning slurmstepd: Exceeded step memory limit at some point. to the standard error stream in many cases where it doesn't seem justified.

I am using slurm version 14.11.8, running on CentOS Linux release 7.1.1503 (Core). All the nodes of the cluster have at least 128GB of RAM installed.

Here are two examples illustrating this:
Example 1. A script that creates a 50MB file while requesting 50MB of RAM; it produces the warning despite the low RAM usage reported by sacct.
Example 2. A script that creates a 50MB file while requesting 100MB of RAM; it runs without a warning, as expected.

Script:

#!/bin/env bash
#SBATCH -J create_50MB_file_mem_50M
#SBATCH -o create_50MB_file_mem_50M.o_%j
#SBATCH -e create_50MB_file_mem_50M.e_%j
#SBATCH --mem 50M
dd if=/dev/zero of=file.50MB.mem_50M.txt count=51200 bs=1024

Standard Error:

51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 0.459204 s, 114 MB/s
slurmstepd: Exceeded step memory limit at some point.

Script:

#!/bin/env bash
#SBATCH -J create_50MB_file_mem_100M
#SBATCH -o create_50MB_file_mem_100M.o_%j
#SBATCH -e create_50MB_file_mem_100M.e_%j
#SBATCH --mem 100M
dd if=/dev/zero of=file.50MB.mem_100M.txt count=51200 bs=1024

Standard Error:

51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 0.587496 s, 89.2 MB/s

According to sacct, each job used about 1MB of RAM (MaxRSS) and about 200MB of virtual memory (MaxVMSize); the query that produces these columns is sketched after the table:

       JobID                     JobName     MaxRSS  MaxVMSize   ReqMem     State
------------ --------------------------- ---------- ---------- -------- ---------
       90020    create_50MB_file_mem_50M                           50Mn COMPLETED
 90020.batch                       batch      1044K    204168K     50Mn COMPLETED
       90022   create_50MB_file_mem_100M                          100Mn COMPLETED
 90022.batch                       batch      1040K    204168K    100Mn COMPLETED
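
For reference, a sacct query along these lines produces the columns above (the field names are standard sacct format fields; the job IDs are the ones from this report):

sacct -j 90020,90022 --format=JobID,JobName,MaxRSS,MaxVMSize,ReqMem,State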

Miscellaneous additional notes:

Where I really get into issues is dealing with 5-60GB files, which is quite common in bioinformatics. I don't want to over-request RAM just to avoid a warning that has no effect on the results. That would unnecessarily tie up resources.

When I tried creating 5GB files while requesting 5GB of RAM, I got the same slurmstepd: Exceeded step memory limit at some point. warning. Increasing the requested RAM to 10GB made the warning go away, but that seems quite excessive, since no job ever used more than 2MB of RAM (according to the MaxRSS column reported by sacct).

By the way, on the other end of the spectrum, creating files of up to 20MB while requesting only 2MB of RAM produced no warnings.

@grondo
Member

grondo commented Sep 14, 2015

@molecules, sorry! This is the LLNL upstream Slurm repo, but main development has moved to schedmd.com. You can ask this question on the slurm-dev mailing list or report the issue to SchedMD at http://bugs.schedmd.com/.

However, I will note that the memory limit described by "step memory limit" in this error message is not necessarily related to the RSS of your process. The limit is provided and enforced by the cgroup plugin, and memory cgroups track not only the RSS of the tasks in your job but also file cache, mmap'd pages, etc. If I had to guess, you are hitting the memory limit because of page cache. In that case you can probably just ignore the error, since hitting the limit most likely only triggered memory reclaim, which freed cached pages (it should not be a fatal error).
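
If you want to confirm that, you can peek at the job's memory cgroup on the compute node while the job runs. This is only a sketch: the path below assumes cgroup v1 and the uid_<uid>/job_<jobid>/step_<stepid> hierarchy that the task/cgroup plugin creates, so it may need adjusting for your site:

# Assumed cgroup v1 layout created by Slurm's task/cgroup plugin; adjust the path for your site.
CG=/sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}
cat "$CG"/step_*/memory.max_usage_in_bytes                 # peak charge seen by the cgroup (RSS + page cache + ...)
grep -E '^total_(rss|cache)' "$CG"/step_*/memory.stat      # breakdown: how much of that charge is page cache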

If you'd like to avoid the error, and you're only writing data out and don't want it cached, you could try playing with posix_fadvise(2) using the POSIX_FADV_DONTNEED advice, which hints to the VM that you aren't going to read back the pages you're writing out.
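
For the dd reproducer above, recent GNU coreutils can issue that hint for you: dd's nocache flag is implemented with posix_fadvise(POSIX_FADV_DONTNEED). Something along these lines (behavior varies a bit between coreutils versions) should keep the written pages from piling up in the step's page cache:

# fdatasync flushes the data before dd exits so the DONTNEED advice can actually drop the cached pages
dd if=/dev/zero of=file.50MB.txt count=51200 bs=1024 oflag=nocache conv=fdatasync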

@molecules
Author

Closed as bugs should be reported to http://bugs.schedmd.com/ instead. Thanks everyone!
