
Cluster config for Ifremer #292

Closed
lesteve opened this issue Jul 19, 2019 · 4 comments · Fixed by #294

Comments

@lesteve (Member) commented Jul 19, 2019

@lanougue, if you have some time to contribute the configuration for your Ifremer cluster, that would be great; you told me at the French Pangeo workshop that it took you some time and effort to get it working.

Either open a PR adding it to docs/source/examples.rst, or paste the Python code you used to create your *Cluster object in this issue.

@lanougue commented Jul 19, 2019

Hi @lesteve,
Yes, here is my config in jobqueue.yaml:

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options
    # cores and processes are kept equal so that each worker runs a single thread;
    # multi-threaded workers can cause netCDF file-access errors
    cores: 28
    processes: 28
    memory: 120GB
    interface: ib0
    death-timeout: 900 # should be large if many workers are launched
    local-directory: $TMPDIR

    # PBS resource manager options
    queue: mpi_1
    project: myPROJ
    walltime: '48:00:00'
    extra: []
    env-extra: []
    resource-spec: select=1:ncpus=28:mem=120GB
    job-extra: ['-m n']

and then a few lines to create the client:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()
cluster.start_workers(Nw)  # Nw = the number of workers to launch
client = Client(cluster)

Before I specified local-directory, it took several minutes for all the workers to be ready, so I increased death-timeout to 900. Now that local-directory is set to $TMPDIR, the workers start very quickly and the timeout can be reduced.
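
For the doc snippet, roughly the same setup can also be passed directly to PBSCluster as keyword arguments instead of living in jobqueue.yaml. This is only a sketch assuming the parameter names of the dask-jobqueue release current at the time; the queue, project, paths and resource spec are the Ifremer-specific values copied from the config above:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=28,                   # keep cores == processes so workers stay single-threaded
    processes=28,
    memory="120GB",
    interface="ib0",
    death_timeout=900,
    local_directory="$TMPDIR",
    queue="mpi_1",
    project="myPROJ",
    walltime="48:00:00",
    resource_spec="select=1:ncpus=28:mem=120GB",
    job_extra=["-m n"],
)
cluster.start_workers(Nw)       # Nw = number of worker jobs, as above
client = Client(cluster)

Keeping everything in jobqueue.yaml, as above, has the advantage that PBSCluster() with no arguments picks up the site defaults automatically.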

@lesteve (Member, Author) commented Jul 19, 2019

Thanks a lot! I'll turn that into a snippet we can use in the doc.

Just a quick question: can you confirm that -m n is to avoid receiving emails? (That is what I got from googling.)

@lanougue commented Jul 19, 2019

Just a quick question: can you confirm that -m n is to avoid receiving emails? (That is what I got from googling.)

No idea! But if Google says so... :)

Something else that I use and that you could add to the doc: when I want some functions defined in myfile1.py and myfile2.py to be preloaded on the workers, I create the cluster with

cluster = PBSCluster(extra=['--preload /home1/datahome/username/mydir/myfile1.py',
                            '--preload /home1/datahome/username/mydir/myfile2.py'])
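
For context, the preload file is just an ordinary Python module that each worker imports at startup. A minimal hypothetical example (the file name and helper are made up, not taken from the Ifremer setup) could look like:

# myfile1.py -- hypothetical preload module: each dask worker imports it at
# startup, so the functions defined here are available on the workers without
# being shipped with every task graph.
import os

def scratch_path(filename):
    """Resolve a path inside the worker's local scratch directory."""
    return os.path.join(os.environ.get("TMPDIR", "/tmp"), filename)

def dask_setup(worker):
    # Optional hook: the distributed preload machinery calls dask_setup(worker)
    # once when the module is preloaded on a worker.
    print("preload active on", worker.address)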

@lanougue commented Jul 19, 2019

Something else: the configuration I gave you runs the maximum number of workers (28) per node, which means about 4 GB of memory per worker. That can lead to memory issues (and killed workers).
If more memory is needed per worker, decreasing the cores and processes numbers in the configuration should help; see the sketch below.
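
As an illustrative sketch of that trade-off (numbers chosen for the example, with the jobqueue.yaml above left unchanged so the PBS job still requests the full node): halving cores and processes gives 14 single-threaded workers sharing the same 120 GB job, i.e. roughly 8.5 GB per worker instead of about 4.3 GB, at the price of leaving half of the node's cores idle.

from dask_jobqueue import PBSCluster

# Override only the worker layout; the resource spec from jobqueue.yaml still
# requests select=1:ncpus=28:mem=120GB, so the 120GB is now split between
# 14 worker processes (~8.5GB each) instead of 28 (~4.3GB each).
cluster = PBSCluster(cores=14, processes=14)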
