Putting lazy collections into Dask Arrays #5879

Closed

hameerabbasi opened this issue Feb 9, 2020 · 7 comments

@hameerabbasi (Contributor)

Hello. I was wondering if there was a mechanism for putting lazy arrays into Dask Arrays. I was looking into doing whole-program optimization for PyData/Sparse, which turns out to perform much better than local optimization. See https://github.com/tensor-compiler/taco and the related papers/talks for details.

However, before results are transmitted to another node, I'd like to "signal" to Dask that it should do the equivalent of .compute() on my lazy collection.

I was wondering if I could "turn off" parallelism on one node, as there will be parallelism inherent to the algorithm itself.

@mrocklin (Member) commented Feb 9, 2020

> However, before results are transmitted to another node, I'd like to "signal" to Dask that it should do the equivalent of .compute() on my lazy collection.

There is nothing in Dask itself, but you might be able to add this to serialization methods like the pickle protocol or Dask's custom serialization methods.
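A minimal sketch of the pickle-protocol route, assuming a hypothetical lazy wrapper; `LazyArray`, `_rebuild`, and the thunk-based "expression" are illustrative stand-ins, not PyData/Sparse's real API. The idea is that pickling, which Dask's default serialization relies on when moving data between nodes, forces evaluation first:

```python
import pickle

import numpy as np


def _rebuild(materialized):
    # After unpickling, only the concrete, already-computed result remains.
    return materialized


class LazyArray:
    def __init__(self, thunk):
        # `thunk` stands in for whatever deferred expression graph the library builds.
        self._thunk = thunk

    def compute(self):
        # Placeholder for the library's own (whole-program-optimized) evaluation.
        return self._thunk()

    def __reduce__(self):
        # Serializing the lazy object triggers compute(), so whatever Dask
        # ships to another node is the materialized result.
        return (_rebuild, (self.compute(),))


# Example: pickling evaluates the expression and serializes the concrete array.
payload = pickle.dumps(LazyArray(lambda: np.ones((3, 3))))
```

The same effect could be wired into Dask's custom serialization hooks instead of `__reduce__`; the pickle route is just the smallest version of the idea.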

> I was wondering if I could "turn off" parallelism on one node, as there will be parallelism inherent to the algorithm itself.

You can have workers that have only a single thread, but there is no way to temporarily claim the entire worker. One could maybe do something fancy with locks and semaphores, but there's nothing like that today and it might take some time to gather enough use cases to make a good design here.
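For reference, a sketch of the single-threaded-worker setup plus a coarse version of the "locks" idea. Worker and thread counts are illustrative, and `run_exclusive` / `expensive_parallel_kernel` are hypothetical names, not Dask APIs:

```python
from dask.distributed import Client, Lock

# One Dask task per worker, leaving the machine's cores free for the
# algorithm's own internal parallelism. CLI equivalent:
# `dask-worker <scheduler-address> --nthreads 1`.
client = Client(n_workers=4, threads_per_worker=1)


def expensive_parallel_kernel(x):
    # Hypothetical stand-in for an internally parallel routine.
    return x * 2


def run_exclusive(x):
    # Coarser variant of the "locks and semaphores" idea: a cluster-wide
    # lock so only one of these tasks runs at any given time.
    with Lock("exclusive-region"):
        return expensive_parallel_kernel(x)


futures = client.map(run_exclusive, range(8))
results = client.gather(futures)
```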

@mrocklin (Member) commented Feb 9, 2020

For other situations that already have parallelism (like BLAS), we tend to turn off parallelism in the user's library. It tends to be decently efficient most of the time to just let Dask be the only thing running in parallel.
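For example, capping the BLAS/OpenMP thread pools so Dask's threads don't oversubscribe the CPU; the environment variables must be set before the numerical libraries initialize, and threadpoolctl is a separate package:

```python
import os

# Static route: set before BLAS/OpenMP spin up their thread pools.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Dynamic route (pip install threadpoolctl): scope the limit to a block.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    pass  # BLAS-heavy work placed here runs single-threaded
```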

@hameerabbasi (Contributor, Author)

> There is nothing in Dask itself, but you might be able to add this to serialization methods like the pickle protocol or Dask's custom serialization methods.

This is a nice idea, and probably close to what I want either way.

> For other situations that already have parallelism (like BLAS), we tend to turn off parallelism in the user's library. It tends to be decently efficient most of the time to just let Dask be the only thing running in parallel.

I'll have to test both and see the performance difference in each case, but since I'm still in the exploratory phase, I can't say anything definitive yet. 😄

@mrocklin (Member) commented Feb 9, 2020

Usually parallelizing within an algorithm is harder than running it many times in parallel. If you're in a situation where you can use Dask to run things many times, it's probably not a bad idea to use Dask. That at least has been the experience so far.

@hameerabbasi (Contributor, Author)

> Usually parallelizing within an algorithm is harder than running it many times in parallel. If you're in a situation where you can use Dask to run things many times, it's probably not a bad idea to use Dask. That at least has been the experience so far.

I'll keep that in mind. 😄

@hameerabbasi (Contributor, Author)

One aspect of this I haven't yet mentioned (and the reason I'm reopening) is: is there a way to call .compute() on my collection before Dask assembles the results, and not just when data is shuffled between nodes?
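To make the question concrete, a sketch of what that behaviour would look like if expressed by hand with map_blocks; `materialize` is an illustrative helper, not an existing Dask feature, and the asker is essentially wondering whether something like this step could be triggered automatically before assembly:

```python
import dask.array as da


def materialize(block):
    # Compute lazy blocks; pass concrete blocks through unchanged.
    return block.compute() if hasattr(block, "compute") else block


x = da.ones((8, 8), chunks=(4, 4))            # placeholder for an array of lazy chunks
x = x.map_blocks(materialize, dtype=x.dtype)  # force per-chunk evaluation before assembly
result = x.sum().compute()
```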

@mrocklin (Member) commented Feb 10, 2020 via email
