
PBS memory excess #178

Closed

apatlpo opened this issue Oct 18, 2018 · 15 comments

Labels: bug

@apatlpo

apatlpo commented Oct 18, 2018

For a given calculation I am consistently exceeding the memory I asked for in the PBS script:

dask-worker.o1838772:=>> PBS: job killed: mem job total 115343360 kb exceeded limit 115343360 kb

This is surprising because the memory required by all workers in one job is much smaller than the limit above.
The Dask dashboard indicates nominal memory usage.

It seems that it is the cache memory that is the problem:

[figure "datarmor_mem": node memory usage over time; the red line marks the node's total memory]

Do you have any suggestions as to how I could diagnose what is going wrong in more detail?

@guillaumeeb
Member

This is weird. If PBS kills your job, it means that the processes launched inside it are consuming that memory one way or another. What is your setup: dask-jobqueue parameters and compute node characteristics?

One workaround could be to just book all the available memory of a compute node using the resource_spec kwarg. This way you won't be able to exceed the limit!
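
A minimal sketch of that workaround, assuming a 28-core node with roughly 120GB of RAM (take the real values from your cluster documentation or pbsnodes):

from dask_jobqueue import PBSCluster

# all numbers below are assumptions for a 28-core node with ~120GB of RAM;
# the point is that memory and the mem in resource_spec both match the node
# total, so nothing running in the job can exceed the PBS limit
cluster = PBSCluster(cores=28, processes=4, memory='120GB',
                     resource_spec='select=1:ncpus=28:mem=120GB')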

But there's still something weird here. What kind of data are you loading or accessing?

@apatlpo
Author

apatlpo commented Oct 18, 2018

Each job is launched on one node of the cluster.
Each node has a bit more than 120 GB of memory (red line on the figure above).

Here is how I launched the dask cluster:

cluster = PBSCluster(cores=4, processes=2, memory='50G', resource_spec='select=1:ncpus=28:mem=110GB')

I am loading binary data and rewriting it in zarr format.

@apatlpo
Author

apatlpo commented Oct 19, 2018

I wonder whether this is related to dask/dask#3530.

What is surprising is that I did not run into this type of issue several months ago.

@jhamman
Member

jhamman commented Oct 19, 2018

@apatlpo - can I suggest looking at Dask's dashboard while your workers are dying? This is often helpful for determining how much memory individual tasks are using.
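
As a complement to the dashboard, here is a small sketch that asks each worker process for its resident memory, assuming a client connected to the PBSCluster from earlier in the thread (psutil is already a dependency of distributed):

from dask.distributed import Client
import psutil

client = Client(cluster)  # assumes the PBSCluster created earlier in the thread

def worker_rss_gb():
    # resident set size of the calling worker process, in GB
    return psutil.Process().memory_info().rss / 1e9

# runs the function on every worker and returns {worker_address: value}
print(client.run(worker_rss_gb))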

@apatlpo
Author

apatlpo commented Oct 19, 2018

I am not sure I follow your comment.
As indicated in the initial post, the dashboard indicates very moderate memory usage.
Maybe I wasn't clear.

@guillaumeeb
Member

What I was suggesting above is to use exactly the maximum number of bytes allowed in the resource_spec. You can probably get the value with the pbsnodes command.

This could avoid the PBS kill and show you the next problem.

Otherwise, could you provide a small reproducible example? Or at least the code leading to this problem? What are you using to read the binary input data?
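
A rough sketch of how that value could be looked up from Python, assuming PBS Pro output (the node name below is hypothetical and the attribute name may differ on other PBS flavours):

import subprocess

# 'node001' is a placeholder; PBS Pro reports node memory in a
# "resources_available.mem" line, other PBS variants may name it differently
out = subprocess.run(['pbsnodes', 'node001'],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if 'mem' in line:
        print(line.strip())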

guillaumeeb added the bug label on Oct 27, 2018
@guillaumeeb
Member

@apatlpo any update?

See also dask/distributed#1949; do you use subprocesses?

@apatlpo
Author

apatlpo commented Nov 13, 2018

No update, sorry, except that we just realised @lanougue is having similar issues on the same cluster (datarmor).

@apatlpo
Author

apatlpo commented Nov 13, 2018

Here is a small reproducible example:

import dask.array as da
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, processes=2, memory='50G',
                     resource_spec='select=1:ncpus=28:mem=110GB')
cluster.scale(8)
client = Client(cluster)

scratch = '/home/c11-data/Test_aponte/'
for i in range(1000):
    x = da.random.normal(size=(20000, 20000), chunks=(20000, 100))
    print('%i , mean=%f' % (i, x.mean().compute()))
    # dies after approximately 150 steps when writing files
    x.to_zarr(scratch + 'debug/x%04d.zarr' % i)

This leads to a crash similar to that described above.

Note that there is NO crash when data storage is commented out.
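
To check whether it really is the page cache that grows during the writes, here is a hedged, Linux-only sketch that reads /proc/meminfo on each worker (assuming the client connected to the cluster in the snippet above):

def cached_gb():
    # page cache reported by the kernel ("Cached" is given in kB), in GB
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('Cached:'):
                return int(line.split()[1]) / 1e6
    return None

# client is the dask.distributed Client from the snippet above
print(client.run(cached_gb))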

@guillaumeeb
Member

Just tested this, with no problem on the CNES cluster. After 150 steps, my server still has 115 GB of free memory, 1.7 GB in cache, and around 3 GB consumed by Dask workers.

Still running after 230 iterations, and I did not book complete nodes for every worker, so each worker has only select=1:ncpus=4:mem=20GB in my test.

Could you try with only cluster = PBSCluster(cores=4, processes=2, memory='20G')? It should fail fast on your cluster. This again seems related to your system, but I have no idea how...

@lanougue

Hi, it does not seem to be Dask related. With the following code I get exactly the same behaviour as @apatlpo.
I am working on the same cluster; it looks like a machine problem when writing data to disk...

import numpy as np
import xarray as xr
import time

x = np.arange(8192)
y = np.arange(8192)
for i in range(150):
    # write a ~0.5 GB array to disk on each iteration
    a = xr.DataArray(np.random.rand(8192, 8192), dims=('x', 'y'),
                     coords={'x': x, 'y': y})
    a.to_netcdf('mat{}.nc'.format(i), engine='netcdf4', format='NETCDF4_CLASSIC')
    time.sleep(5)

@guillaumeeb
Member

As it seems related to your system, I propose to close this issue.

@apatlpo
Author

apatlpo commented Nov 15, 2018

sure, done

@jlevy44

jlevy44 commented Jul 19, 2019

Was this a PBS issue? I've also been running into similar problems.

@lesteve
Member

lesteve commented Nov 14, 2019

For completeness, https://discourse.pangeo.io/t/very-big-memory-load-when-using-fast-parallel-file-system/160/4 seems to indicate that it was a problem with the cluster-specific configuration rather than Dask.
