Working towards best practices for packages based on dask #5452
I would say that this depends on the technical sophistication of your users. Assuming that sophistication is sometimes low, I would say that you should probably expose chunking optionally, and otherwise help to provide guides to dask's auto-chunking if possible. Assuming that you're dealing with geotiffs, it would be good to use the tile size to inform automatic chunking.
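A minimal sketch of the "expose chunking optionally" idea, assuming a hypothetical open_raster helper; the zeros array stands in for real geotiff I/O and the tile size is an illustrative placeholder:

```python
import numpy as np
import dask.array as da

def open_raster(path, chunks="auto", tile_size=(512, 512)):
    # Hypothetical reader: the zeros array stands in for the real geotiff I/O.
    data = np.zeros((4096, 4096), dtype="float32")
    if chunks is None:
        # Fall back to the file's internal tile size so each dask chunk
        # lines up with whole tiles (tile_size is a placeholder value here).
        chunks = tile_size
    return da.from_array(data, chunks=chunks)

arr = open_raster("scene.tif")                      # dask's auto-chunking decides
arr_tiled = open_raster("scene.tif", chunks=None)   # chunking follows the tile size
```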
This is done in an old PR to Xarray, which might be interesting. https://github.com/pydata/xarray/pull/2255/files#diff-6364b203943c799516f9ecba19e1b119R326
I think that downstream libraries should probably avoid setting this if possible. Creating the scheduler/client should be orthogonal to the libraries that create computations.
Ideally, they don't have to use submit/gather at all. Of course, as your question poses, this may not always be possible.
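A sketch of that orthogonality, assuming a hypothetical library function (analyze below) that only builds lazy graphs and leaves the scheduler/client choice to the caller:

```python
import dask.array as da

def analyze(arr):
    # Library code only builds a lazy graph; it never creates a client
    # or picks a scheduler.
    return (arr - arr.mean()) / arr.std()

data = da.random.random((1000, 1000), chunks=(250, 250))
result = analyze(data)

# The caller decides how it runs: threads locally, or a distributed
# Client they created themselves.
value = result.compute(scheduler="threads")
```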
So, to summarize your answers:
A few more (euro) cents from another Pytroll core dev. As I'm more focused on operational satellite processing chains, there are some points that I need to consider all the time, both when writing the software and when setting things up. Below are some thoughts that came to mind when I read the 3-item list @djhoese started with. There can be several completely independent chains running (we at FMI have 10+), and some of them need to share resources. So the chains need to be constrained in CPU and/or RAM usage. The first means setting both OMP_NUM_THREADS and the number of dask workers. As Dask by default uses all the resources available, we Pytroll devs need to advise the users with things like (a bit exaggerated, but not by much) "Try if reducing the number of workers helps". I guess the point here is: we (the library/application devs) need a simple way to give the user control over the amount of resources the software takes.
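A rough sketch, from the operator's side, of capping what one chain may use; the values are illustrative and assume dask's local threaded scheduler honors the num_workers setting:

```python
import dask
import dask.array as da

arr = da.random.random((2000, 2000), chunks=(500, 500))

# Illustrative values only: the operator, not the library, caps how many
# threads the local scheduler may use for this particular chain.
with dask.config.set(scheduler="threads", num_workers=2):
    total = arr.sum().compute()
```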
And any package using dask would need to come up with some sort of best practices for its own use cases. So maybe it isn't something that dask can do to help with any more than it already does, but dask's best practices for downstream packages would need to discuss this as something people should be concerned about.
Great stuff in here, thanks for starting this issue! I have one thing I would like to bring up; it's related to IO. In our application (satpy), we read a lot of data from disk and write at least as much to disk when we're done. However, it's not really practical for us (at my workplace) to have these data on shared storage, as the data volume is too big and would slow down the entire network, so we have the data locally on the servers. On the other hand, we need to balance the load of our servers, so dask.distributed would be very handy, but I'm not sure how to handle the reading and writing parts as they can only be done on one of the machines. I know that workers can be assigned some flags to represent the resources they have access to, but how do we set this up transparently in a library (satpy) that has to be runnable both with and without dask.distributed? And more generally, how can distributed support be added transparently to the library? (I think that last question was also raised indirectly by @djhoese.)
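Not satpy's API, just a sketch of the worker-resources mechanism mentioned above; the scheduler address, the LOCALDATA tag, and read_local_file are all placeholders:

```python
from dask.distributed import Client

def read_local_file(path):
    # Stand-in for I/O that only works on the machine that physically
    # holds the data.
    with open(path, "rb") as f:
        return f.read()

# Assumes workers on the data-holding machine were started with something
# like: dask-worker scheduler:8786 --resources "LOCALDATA=1"
client = Client("tcp://scheduler-host:8786")
future = client.submit(read_local_file, "/data/scene.dat",
                       resources={"LOCALDATA": 1})
payload = future.result()
```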
I would encourage satpy to not make constraints about setting up Dask workers. I think that this is handled best on an application by application basis.
Can they all use the same dask cluster? If so, that would load balance for you.
Typically people use a network file system for this sort of thing. If I/O is a bottleneck then you might consider compression, nicer file formats, doing more computation per run, or something similar.
I thought of some new ones this weekend and an update on another: 2a. A lot of scientific libraries that can benefit from dask are usually wrapping old Fortran or C code that is fast because of how it is implemented. Some also use OpenMP, but they are part of the existing users' workflow. Blindly setting things like OpenMP or BLAS thread counts could have major performance penalties for these libraries. Perhaps suggesting that sub-library developers take advantage of projects like https://github.com/joblib/threadpoolctl may be a good suggestion. I've never used it myself, but discovered it when worrying about how Satpy uses the pykdtree library (OpenMP-based). If we access pykdtree from dask workers there isn't a real need for OpenMP... but what if it is faster (leading to profiling, benchmarks, etc., only to find it doesn't make a big difference)?
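A small sketch of the threadpoolctl idea, assuming a wrapped routine that would otherwise spin up its own OpenMP/BLAS threads (the matrix multiply below is just a stand-in):

```python
import numpy as np
from threadpoolctl import threadpool_limits

def per_chunk_kernel(block):
    # Cap OpenMP/BLAS threads while this runs, on the assumption that the
    # surrounding dask workers already provide the parallelism.
    with threadpool_limits(limits=1):
        return block @ block.T

out = per_chunk_kernel(np.random.random((256, 256)))
```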
Were you doing anything fancy, like trying to reuse an event loop or state between tests? Or were you just using the public API? I would think that things would be fine as long as you're closing the cluster down between tests. Reports otherwise would be welcome.
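For reference, a minimal public-API pytest fixture along those lines (assuming pytest and dask.distributed; the names are illustrative):

```python
import pytest
from dask.distributed import Client, LocalCluster

@pytest.fixture
def client():
    # One local worker, in-process, and everything closed down again
    # between tests.
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
    client = Client(cluster)
    yield client
    client.close()
    cluster.close()
```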
No, nothing too fancy. When I originally wrote the tests I noticed that dask's own tests had some pytest fixtures for handling things, so I assumed that might be necessary and just mocked the Client in my tests. I was creating a LocalCluster with 1 worker, creating a Client with that cluster, and then closing them after use. I should note that it passed locally just fine, but on Travis it would fail.
So it is likely in the way we are starting our tests (using python setup.py test).
Those are documented at https://distributed.dask.org/en/latest/develop.html#writing-tests. If you're making assertions about the cluster, it may be helpful to use those helpers, but otherwise they may be too tricky to use.
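A sketch of the helper style documented there (the distributed test utilities are not public API, so treat this as illustrative only):

```python
from distributed.utils_test import gen_cluster

@gen_cluster(client=True)
async def test_submit(c, s, a, b):
    # c is a client, s the scheduler, a and b two workers, all torn down
    # automatically when the test finishes.
    future = c.submit(sum, [1, 2, 3])
    assert await future == 6
```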
We have one similar to this already.
It's deprecated (as far as I understand), but essentially calls unittest with the test suite specified in setup.py.
Just an FYI, scikit-learn is dealing with similar issues: scikit-learn/scikit-learn#14979.
This topic was brought up by @mrocklin on twitter: what best practices and guidelines should packages that depend on dask and other pydata packages follow? This was specifically brought up when talking about Satpy, which is highly dependent on xarray DataArrays using dask arrays underneath. So the question is, what kind of things could be presented/documented to people wanting to build a package dependent on dask/xarray/others? Some initial thoughts and questions I had:

- Setting OMP_NUM_THREADS and DASK_NUM_WORKERS (or the related dask.config.set parameter).
- Using get_client inside the package's utilities to get any configured client. Should a user be required to provide a client? If not provided, can/should a package create a local client? Is that too much magic happening without the user's knowledge (creating sub-processes, etc.)?

What other areas could these best practices include?
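For the get_client question above, one possible hedge in library code is to look for an existing client and otherwise fall back to the default local schedulers (maybe_get_client is a hypothetical helper, not satpy's or dask's API):

```python
from dask.distributed import get_client

def maybe_get_client():
    # Reuse a client the user already created; never create one implicitly.
    try:
        return get_client()
    except ValueError:
        return None
```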
CC @mraspaud (satpy core developer)