Putting lazy collections into Dask Arrays #5879

Closed

hameerabbasi opened this issue Feb 9, 2020 · 7 comments

@hameerabbasi (Contributor)

Hello. I was wondering if there was a mechanism for putting lazy arrays into Dask Arrays. I was looking into doing whole-program optimization for PyData/Sparse, which turns out to perform much better than local optimization. See https://github.com/tensor-compiler/taco and the related papers/talks for details.

However, before results are transmitted to another node, I'd like to "signal" to Dask that it should do the equivalent of .compute() on my lazy collection.

I was wondering if I could "turn off" parallelism on one node, as there will be parallelism inherent to the algorithm itself.

@mrocklin (Member) commented Feb 9, 2020

> However, before results are transmitted to another node, I'd like to "signal" to Dask that it should do the equivalent of .compute() on my lazy collection.

There is nothing in Dask itself, but you might be able to add this to serialization methods like the pickle protocol or Dask's custom serialization methods.
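A minimal sketch of the pickle-protocol route, assuming a hypothetical lazy wrapper; `LazyArray`, `_rebuild`, and the thunk-based "expression" are illustrative stand-ins, not PyData/Sparse's real API. The idea is that pickling, which Dask's default serialization relies on when moving data between nodes, forces evaluation first:

```python
import pickle

import numpy as np


def _rebuild(materialized):
    # After unpickling, only the concrete, already-computed result remains.
    return materialized


class LazyArray:
    def __init__(self, thunk):
        # `thunk` stands in for whatever deferred expression graph the library builds.
        self._thunk = thunk

    def compute(self):
        # Placeholder for the library's own (whole-program-optimized) evaluation.
        return self._thunk()

    def __reduce__(self):
        # Serializing the lazy object triggers compute(), so whatever Dask
        # ships to another node is the materialized result.
        return (_rebuild, (self.compute(),))


# Example: pickling evaluates the expression and serializes the concrete array.
payload = pickle.dumps(LazyArray(lambda: np.ones((3, 3))))
```

The same effect could be wired into Dask's custom serialization hooks instead of `__reduce__`; the pickle route is just the smallest version of the idea.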

> I was wondering if I could "turn off" parallelism on one node, as there will be parallelism inherent to the algorithm itself.

You can have workers that have only a single thread, but there is no way to temporarily claim the entire worker. One could maybe do something fancy with locks and semaphores, but there's nothing like that today and it might take some time to gather enough use cases to make a good design here.
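For reference, a sketch of the single-threaded-worker setup plus a coarse version of the "locks" idea. Worker and thread counts are illustrative, and `run_exclusive` / `expensive_parallel_kernel` are hypothetical names, not Dask APIs:

```python
from dask.distributed import Client, Lock

# One Dask task per worker, leaving the machine's cores free for the
# algorithm's own internal parallelism. CLI equivalent:
# `dask-worker <scheduler-address> --nthreads 1`.
client = Client(n_workers=4, threads_per_worker=1)


def expensive_parallel_kernel(x):
    # Hypothetical stand-in for an internally parallel routine.
    return x * 2


def run_exclusive(x):
    # Coarser variant of the "locks and semaphores" idea: a cluster-wide
    # lock so only one of these tasks runs at any given time.
    with Lock("exclusive-region"):
        return expensive_parallel_kernel(x)


futures = client.map(run_exclusive, range(8))
results = client.gather(futures)
```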

@mrocklin (Member) commented Feb 9, 2020

For other situations that already have parallelism (like BLAS), we tend to turn off parallelism in the user's library. It tends to be decently efficient most of the time to just let Dask be the only thing running in parallel.
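For example, capping the BLAS/OpenMP thread pools so Dask's threads don't oversubscribe the CPU; the environment variables must be set before the numerical libraries initialize, and threadpoolctl is a separate package:

```python
import os

# Static route: set before BLAS/OpenMP spin up their thread pools.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Dynamic route (pip install threadpoolctl): scope the limit to a block.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    pass  # BLAS-heavy work placed here runs single-threaded
```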

@hameerabbasi (Contributor, Author)

> There is nothing in Dask itself, but you might be able to add this to serialization methods like the pickle protocol or Dask's custom serialization methods.

This is a nice idea, and probably close to what I want either way.

> For other situations that already have parallelism (like BLAS), we tend to turn off parallelism in the user's library. It tends to be decently efficient most of the time to just let Dask be the only thing running in parallel.

I'll have to test both and see the performance difference in each case, but since I'm still in the exploratory phase, I can't say anything definitive yet. 😄

@mrocklin (Member) commented Feb 9, 2020

Usually parallelizing within an algorithm is harder than running it many times in parallel. If you're in a situation where you can use Dask to run things many times, it's probably not a bad idea to use Dask. That at least has been the experience so far.

@hameerabbasi (Contributor, Author)

> Usually parallelizing within an algorithm is harder than running it many times in parallel. If you're in a situation where you can use Dask to run things many times, it's probably not a bad idea to use Dask. That at least has been the experience so far.

I'll keep that in mind. 😄

@hameerabbasi (Contributor, Author)

One aspect of this I haven't yet mentioned (and the reason I'm reopening) is: is there a way to call .compute() on my collection before Dask assembles the results, and not just when data is shuffled between nodes?
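To make the question concrete, a sketch of what that behaviour would look like if expressed by hand with map_blocks; `materialize` is an illustrative helper, not an existing Dask feature, and the asker is essentially wondering whether something like this step could be triggered automatically before assembly:

```python
import dask.array as da


def materialize(block):
    # Compute lazy blocks; pass concrete blocks through unchanged.
    return block.compute() if hasattr(block, "compute") else block


x = da.ones((8, 8), chunks=(4, 4))            # placeholder for an array of lazy chunks
x = x.map_blocks(materialize, dtype=x.dtype)  # force per-chunk evaluation before assembly
result = x.sum().compute()
```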

@mrocklin (Member) commented Feb 10, 2020 via email
