PBS memory excess #178
This is weird. If PBS kills your job, this means that the processes launched inside it are consuming that memory one way or another. What is your setup: dask-jobqueue params and compute node characteristics? One workaround could be to just book all the available memory of a compute node using the resource_spec option. But there's still something weird with this. What kind of data are you loading or accessing?
Each job is launched on 1 node of the cluster. Here is how I launched the dask cluster:
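The snippet itself was not preserved in this thread. A minimal sketch of a one-node-per-job PBSCluster launch, where queue name, core count, memory, and walltime are assumed placeholder values rather than the ones actually used:

```python
# Hypothetical sketch of a one-node-per-job PBSCluster launch; all parameter
# values below (queue, cores, memory, walltime) are placeholders, not the
# values actually used in this issue.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    queue="mpi_1",          # placeholder queue name
    cores=28,               # one full node's cores (assumed)
    memory="110GB",         # memory requested per job (assumed)
    walltime="04:00:00",
    local_directory="$TMPDIR",
)
cluster.scale(4)            # request 4 jobs, i.e. 4 nodes in this setup
client = Client(cluster)
```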
I am loading binary data and rewriting it in zarr format.
I wonder if this is related to dask/dask#3530. What is surprising is that I was not running into this type of issue several months ago.
@apatlpo - can I suggest looking at Dask's dashboard while your workers are dying? This is often helpful in determining how much memory individual tasks are using.
I am not sure I follow your comment.
What I was suggesting above is to use exactly the maximum number of bytes allowed in the resource_spec. You can probably get the value using the pbsnodes command. This could avoid PBS killing the job and show you the next problem. Otherwise, could you provide a small reproducible example? Or maybe at least the code leading to this problem? What are you using to read the binary input data?
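A sketch of that workaround, assuming the limit quoted in the kill message below is the node's full memory and assuming 28 cores per node (check pbsnodes for the real figures on your cluster):

```python
# Hypothetical illustration: book a full node by putting the node's maximum
# memory into resource_spec. 115343360 kb is the limit quoted in the PBS kill
# message in this issue; ncpus=28 is an assumed core count.
from dask_jobqueue import PBSCluster

resource_spec = "select=1:ncpus=28:mem=115343360kb"
cluster = PBSCluster(cores=28, memory="110GB", resource_spec=resource_spec)
```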
@apatlpo any update? See also dask/distributed#1949. Do you use subprocesses?
No update, sorry, except that we just realised @lanougue is having similar issues on the same cluster (datarmor).
Here is a small reproducible example:
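The example itself is not reproduced in the thread. A hedged sketch of the kind of loop described, writing a sizeable random dask array to zarr on each iteration, with array shape, chunking, and the output path as placeholders:

```python
# Hypothetical reconstruction of the reproducible example: generate random data
# with dask and store it to zarr in a loop. Array shape, chunking, and the
# /scratch path are placeholders.
import dask.array as da
from dask.distributed import Client

client = Client()  # or Client(cluster) when running on the PBSCluster above

for i in range(300):
    data = da.random.random((4000, 4000, 10), chunks=(4000, 4000, 1))
    # The crash is reported to occur only when this store step is enabled.
    da.to_zarr(data, f"/scratch/test_{i}.zarr")
```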
This leads to a crash similar to the one described above. Note that there is NO crash when the data storage step is commented out.
Just tested this: no problem on the CNES cluster. After 150 steps, my server still has 115GB of free memory, 1.7GB in cache, and around 3GB consumed by dask workers. Still running after 230 iterations, and I did not book complete nodes for every worker, so each worker has only select=1:ncpus=4:mem=20GB in my test. Could you try with only
Hi, it does not seem to be dask related. With the following code I get exactly the same behaviour as @apatlpo.
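The code is not reproduced in the thread either. A dask-free sketch along those lines, with file sizes and the output path as assumptions:

```python
# Hypothetical dask-free version of the test: repeatedly write large binary
# files with plain numpy. The page cache filled by these writes can end up
# counted against the job's memory limit, which is what this issue appears
# to describe.
import numpy as np

for i in range(200):
    data = np.random.rand(4000, 4000, 10)         # ~1.3 GB per array
    data.tofile(f"/scratch/dummy_{i}.bin")         # plain binary write, no dask
```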
As it seems related to your system, I propose to close this issue.
Sure, done.
Was this a PBS issue? I've also been running into similar problems.
For completeness, https://discourse.pangeo.io/t/very-big-memory-load-when-using-fast-parallel-file-system/160/4 seems to indicate that it was a problem with the cluster-specific configuration rather than Dask.
For a given calculation I am consistently exceeding the memory I asked for in the PBS script:
dask-worker.o1838772:=>> PBS: job killed: mem job total 115343360 kb exceeded limit 115343360 kb
This is surprising because the memory required for all workers on one job is much smaller than the limit above.
The dask dashboard indicates nominal memory usage.
It seems that it is the cache memory that is the problem.
Do you have any suggestions as to how I could diagnose what is going wrong in more detail?
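For reference, one illustrative way to watch the resident versus cached split on a compute node while the job runs (a psutil loop; this is not something proposed in the thread itself):

```python
# Illustrative only: periodically log how much memory is used, cached, and
# available on the node, to see whether the page cache (rather than the dask
# workers) is what grows towards the PBS limit.
import time
import psutil

for _ in range(60):
    vm = psutil.virtual_memory()
    cached = getattr(vm, "cached", 0)            # 'cached' is Linux-specific
    print(f"used={vm.used / 2**30:.1f} GiB  "
          f"cached={cached / 2**30:.1f} GiB  "
          f"available={vm.available / 2**30:.1f} GiB")
    time.sleep(10)
```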