
PBS memory excess #178

Closed

apatlpo opened this issue Oct 18, 2018 · 15 comments

Labels: bug

@apatlpo

apatlpo commented Oct 18, 2018

For a given calculation I am consistently exceeding the memory I asked for in the PBS script:

dask-worker.o1838772:=>> PBS: job killed: mem job total 115343360 kb exceeded limit 115343360 kb

This is surprising because the memory required by all workers in one job is much smaller than the limit above.
The Dask dashboard indicates nominal memory usage.

It seems that it is the cache memory that is the problem:

[figure "datarmor_mem": node memory usage over time; the red line marks the node's total memory]

Do you have any suggestions as to how I could diagnose what is going wrong in more detail?

@guillaumeeb
Member

This is weird. If PBS kills your job, it means that the processes launched inside it are consuming that memory one way or another. What is your setup: dask-jobqueue parameters and compute node characteristics?

One workaround could be to just book all the available memory of a compute node using the resource_spec kwarg. This way you won't be able to exceed the limit!
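
A minimal sketch of that workaround, assuming a 28-core node with roughly 120GB of RAM (take the real values from your cluster documentation or pbsnodes):

from dask_jobqueue import PBSCluster

# all numbers below are assumptions for a 28-core node with ~120GB of RAM;
# the point is that memory and the mem in resource_spec both match the node
# total, so nothing running in the job can exceed the PBS limit
cluster = PBSCluster(cores=28, processes=4, memory='120GB',
                     resource_spec='select=1:ncpus=28:mem=120GB')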

But there's still something weird here. What kind of data are you loading or accessing?

@apatlpo
Author

apatlpo commented Oct 18, 2018

Each job is launched on one node of the cluster.
Each node has a bit more than 120 GB of memory (red line on the figure above).

Here is how I launched the dask cluster:

cluster = PBSCluster(cores=4, processes=2, memory='50G', resource_spec='select=1:ncpus=28:mem=110GB')

I am loading binary data and rewriting it in zarr format.

@apatlpo
Author

apatlpo commented Oct 19, 2018

I wonder whether this is related to dask/dask#3530.

What is surprising is that I did not run into this type of issue several months ago.

@jhamman
Member

jhamman commented Oct 19, 2018

@apatlpo - can I suggest looking at Dask's dashboard while your workers are dying? This is often helpful for determining how much memory individual tasks are using.
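
As a complement to the dashboard, here is a small sketch that asks each worker process for its resident memory, assuming a client connected to the PBSCluster from earlier in the thread (psutil is already a dependency of distributed):

from dask.distributed import Client
import psutil

client = Client(cluster)  # assumes the PBSCluster created earlier in the thread

def worker_rss_gb():
    # resident set size of the calling worker process, in GB
    return psutil.Process().memory_info().rss / 1e9

# runs the function on every worker and returns {worker_address: value}
print(client.run(worker_rss_gb))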

@apatlpo
Author

apatlpo commented Oct 19, 2018

I am not sure I follow your comment.
As indicated in the initial post, the dashboard indicates very moderate memory usage.
Maybe I wasn't clear.

@guillaumeeb
Member

What I was suggesting above is to use exactly the maximum number of bytes allowed in the resource_spec. You can probably get the value with the pbsnodes command.

This could avoid the PBS kill and show you the next problem.

Otherwise, could you provide a small reproducible example? Or at least the code leading to this problem? What are you using to read the binary input data?
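
A rough sketch of how that value could be looked up from Python, assuming PBS Pro output (the node name below is hypothetical and the attribute name may differ on other PBS flavours):

import subprocess

# 'node001' is a placeholder; PBS Pro reports node memory in a
# "resources_available.mem" line, other PBS variants may name it differently
out = subprocess.run(['pbsnodes', 'node001'],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if 'mem' in line:
        print(line.strip())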

guillaumeeb added the bug label on Oct 27, 2018
@guillaumeeb
Member

@apatlpo any update?

See also dask/distributed#1949; do you use subprocesses?

@apatlpo
Author

apatlpo commented Nov 13, 2018

No update, sorry, except that we just realised @lanougue is having similar issues on the same cluster (datarmor).

@apatlpo
Author

apatlpo commented Nov 13, 2018

Here is a small reproducible example:

import dask.array as da
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, processes=2, memory='50G',
                     resource_spec='select=1:ncpus=28:mem=110GB')
cluster.scale(8)
client = Client(cluster)

scratch = '/home/c11-data/Test_aponte/'
for i in range(1000):
    x = da.random.normal(size=(20000, 20000), chunks=(20000, 100))
    print('%i , mean=%f' % (i, x.mean().compute()))
    # dies after approximately 150 steps when writing files
    x.to_zarr(scratch + 'debug/x%04d.zarr' % i)

This leads to a crash similar to that described above.

Note that there is NO crash when data storage is commented out.
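
To check whether it really is the page cache that grows during the writes, here is a hedged, Linux-only sketch that reads /proc/meminfo on each worker (assuming the client connected to the cluster in the snippet above):

def cached_gb():
    # page cache reported by the kernel ("Cached" is given in kB), in GB
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('Cached:'):
                return int(line.split()[1]) / 1e6
    return None

# client is the dask.distributed Client from the snippet above
print(client.run(cached_gb))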

@guillaumeeb
Member

Just tested this, with no problem on the CNES cluster. After 150 steps, my server still has 115 GB of free memory, 1.7 GB in cache, and around 3 GB consumed by Dask workers.

Still running after 230 iterations, and I did not book complete nodes for every worker, so each worker has only select=1:ncpus=4:mem=20GB in my test.

Could you try with only cluster = PBSCluster(cores=4, processes=2, memory='20G')? It should fail fast on your cluster. This again seems related to your system, but I have no idea how...

@lanougue

Hi, it does not seem to be Dask related. With the following code I get exactly the same behaviour as @apatlpo.
I am working on the same cluster; it looks like a machine problem when writing data to disk...

import numpy as np
import xarray as xr
import time

x = np.arange(8192)
y = np.arange(8192)
for i in range(150):
    # write a ~0.5 GB array to disk on each iteration
    a = xr.DataArray(np.random.rand(8192, 8192), dims=('x', 'y'),
                     coords={'x': x, 'y': y})
    a.to_netcdf('mat{}.nc'.format(i), engine='netcdf4', format='NETCDF4_CLASSIC')
    time.sleep(5)

@guillaumeeb
Member

As it seems related to your system, I propose to close this issue.

@apatlpo
Author

apatlpo commented Nov 15, 2018

sure, done

@jlevy44

jlevy44 commented Jul 19, 2019

Was this a PBS issue? I've also been running into similar problems.

@lesteve
Member

lesteve commented Nov 14, 2019

For completeness, https://discourse.pangeo.io/t/very-big-memory-load-when-using-fast-parallel-file-system/160/4 seems to indicate that it was a problem with the cluster-specific configuration rather than Dask.
