Integrate Dask Arrays properly #446

Open · phofl opened this issue Nov 30, 2023 · 14 comments
phofl (Collaborator) commented Nov 30, 2023

We can't currently go from DataFrames to arrays; #445 adds to_dask_array, but this is only a band-aid for now.

I think in an ideal world we would have an Array collection that captures something like .values; then you could do array computations and go back to a DataFrame collection. The downside is that an Array collection without methods is not very helpful.
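To make the idea concrete, here is a purely illustrative sketch of such an Array collection: an expression node records the `.values` conversion lazily, and nothing is materialized until compute. All class names here are invented for illustration; this is not the dask-expr API, and plain lists of dicts stand in for real frames and arrays.

```python
# Hypothetical sketch: an expression-based Array collection that captures
# a DataFrame -> array conversion lazily. All names are invented.

class Expr:
    """Base expression node; operands are child expressions or raw data."""
    def __init__(self, *operands):
        self.operands = operands

class FromPandas(Expr):
    """Leaf expression wrapping in-memory frame data (here: a list of rows)."""
    def compute(self):
        return self.operands[0]

class Values(Expr):
    """Array-flavoured expression: the `.values` of a frame expression."""
    def compute(self):
        rows = self.operands[0].compute()
        return [list(row.values()) for row in rows]

class ArrayCollection:
    """Thin collection around an expression; real methods would build more Exprs."""
    def __init__(self, expr):
        self.expr = expr
    def compute(self):
        return self.expr.compute()

frame = FromPandas([{"x": 1, "y": 2}, {"x": 3, "y": 4}])
arr = ArrayCollection(Values(frame))
print(arr.compute())  # [[1, 2], [3, 4]]
```

The point of the sketch is only that `.values` becomes a recorded expression rather than an eager conversion, so an optimizer could later rewrite it.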

I am not planning on working on this immediately, but wanted to collect thoughts on this topic.

cc @mrocklin @rjzamora

fjetter (Member) commented Dec 1, 2023

Mid-term, I would also like to have arrays implemented with the expression system, if only to get rid of HLGs (high-level graphs) entirely.

mrocklin (Member) commented Dec 1, 2023

I'm curious how much a symbolic Dask array would help xarray. Probably less than an expression system for xarray itself, but how much less? cc @dcherian, who can maybe help us think through this.

For example, probably the biggest benefit is more intelligent rechunking. Are chunks stored in the xarray data model or are they just given to the underlying duck arrays?
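As a toy illustration of what "more intelligent rechunking" has to reason about, the core bookkeeping for a 1-D rechunk is: for each output chunk, which slices of which input chunks is it assembled from? The sketch below is a simplified re-derivation of that idea, not dask's actual rechunk implementation.

```python
# Toy 1-D rechunk plan: for each output chunk, list the
# (input_chunk_index, start, stop) slices it is assembled from.
# Illustrative only; real dask rechunking is n-dimensional and
# also minimizes intermediate copies.

def rechunk_plan(old, new):
    assert sum(old) == sum(new), "chunkings must cover the same extent"
    # absolute start offset of each old chunk
    starts = [0]
    for size in old:
        starts.append(starts[-1] + size)
    plan, pos = [], 0
    for size in new:
        pieces, lo, hi = [], pos, pos + size
        for i in range(len(old)):
            a, b = starts[i], starts[i + 1]
            ov_lo, ov_hi = max(lo, a), min(hi, b)
            if ov_lo < ov_hi:  # this old chunk overlaps the output chunk
                pieces.append((i, ov_lo - a, ov_hi - a))
        plan.append(pieces)
        pos = hi
    return plan

print(rechunk_plan([4, 4], [2, 6]))
# [[(0, 0, 2)], [(0, 2, 4), (1, 0, 4)]]
```

An expression-level optimizer could push such plans back toward the read step, which is exactly the "set read-time chunk sizes" benefit discussed below.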

dcherian commented Dec 1, 2023

> how much benefit a symbolic Dask array helps xarray. Probably less than an expression system for xarray itself, but how much less?

A question I have struggled with. I don't think we really know without actually trying it out. To me, it seems like GroupBy is where an xarray-level system makes a lot of sense.

> probably the biggest benefit is more intelligent rechunking.

Absolutely; specifically, we'd want to set read-time chunk sizes that are adapted to the workload.

> Are chunks stored in the xarray data model or are they just given to the underlying duck arrays?

Just given to and taken from the underlying duck array, so an expression-based dask array would be easy to slot in.

mrocklin (Member) commented Dec 1, 2023

For groupby stuff, my recollection is that those optimizations are pretty local, right? You could maybe do that today with a little bit of logic in the current xarray code base? Maybe like how pandas does groupby?

mrocklin (Member) commented Dec 1, 2023

Mostly, my thought is that if we target the expression layer at Dask array, it's maybe more likely to get done (y'all seem busy), but I'm nervous about not capturing enough of the value.

If mostly all we care about is rechunking, and if xarray can use systems like flox intelligently without a fancy expression system, then we're good. If there are cases where xarray-specific knowledge is valuable, then it might not make as much sense to prioritize Dask array in expression form.

fjetter (Member) commented Dec 4, 2023

> not make as much sense to prioritize Dask array in expression

For code complexity alone it makes sense to implement this. I don't think we have to be very sophisticated w.r.t. optimizations, but getting rid of HLGs would be a huge benefit for maintainability.

For the HLG replacement I believe we don't need much more than a blockwise expr, and I suspect the Array class itself will be simpler since we don't have to deal with meta and divisions.

I would really like to avoid ending up with three different systems: low-level, HLG, and expressions. Arrays are still lagging behind in HLG adoption, which is already hurting us, and I believe we should not make the same mistake again.

I'd like to nuke HLGs in the next couple of months.
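To make "we don't need much more than a blockwise expr" concrete: a blockwise expression is essentially a node that, given chunked inputs, emits one task per chunk index. Below is a stripped-down sketch with a tiny recursive executor. All names are invented for illustration; the real dask-expr design differs in many details (keys, meta handling, optimization passes).

```python
# Minimal sketch of a blockwise expression and a toy executor.
# Hypothetical names; not the dask-expr API.

class FromChunks:
    """Leaf expression: data that is already split into chunks."""
    def __init__(self, name, chunks):
        self.name, self.chunks = name, chunks

    @property
    def npartitions(self):
        return len(self.chunks)

    def _layer(self):
        return {(self.name, i): chunk for i, chunk in enumerate(self.chunks)}

class Blockwise:
    """Apply `func` independently to each chunk: one task per chunk index."""
    def __init__(self, func, name, *inputs):
        self.func, self.name, self.inputs = func, name, inputs

    @property
    def npartitions(self):
        return self.inputs[0].npartitions

    def _layer(self):
        return {
            (self.name, i): (self.func,) + tuple((dep.name, i) for dep in self.inputs)
            for i in range(self.npartitions)
        }

def execute(graph, key):
    """Toy recursive scheduler: tasks are (func, *dependency_keys) tuples."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        return func(*(execute(graph, dep) for dep in deps))
    return task

x = FromChunks("x", [[1, 2], [3, 4]])
expr = Blockwise(sum, "sum-x", x)
graph = {**x._layer(), **expr._layer()}
print([execute(graph, ("sum-x", i)) for i in range(expr.npartitions)])  # [3, 7]
```

Because the blockwise node knows its structure symbolically, adjacent blockwise nodes can be fused without materializing a low-level graph, which is the maintainability win over HLGs.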

mrocklin (Member) commented Dec 4, 2023 via email

dcherian commented Dec 5, 2023

> If there are cases when xarray specific knowledge is valuable then it might not make as much sense to prioritize Dask array in expression form.

To me, it feels like most of the value is in automatically chunking for the user and removing that knob from most workloads. And today Xarray is engineered so that chunks are user-specified and provided to dask.array.

How easy would it be to prototype an expression system for Xarray? A few hours at AGU?

mrocklin (Member) commented Dec 5, 2023

> To me, it feels like most of the value is in automatically chunking for the user and removing that knob from most workloads. And today Xarray is engineered so that chunks are user-specified and provided to dask.array.

> How easy would it be to prototype an expression system for Xarray? A few hours at AGU?

That sounds possible-but-optimistic. I'd have more confidence in prototyping a numpy/dask.array system and dropping it into xarray.

Given your first answer above, I'm inclined to focus more on dask.array anyway, rather than xarray. Especially if the following is true:

> If mostly all we care about is rechunking and if xarray can use systems like flox intelligently without a fancy expression system then we're good

Do you need a full expression system to do groupbys well? If the answer is "no" and it's easy to do a little light rewriting, then my sense is that we don't pursue an xarray expression system and just do arrays. Although maybe there are other reasons to have high-level expressions for xarray, like if Earthmover wanted to record which queries were common within a customer base.

dcherian commented Dec 5, 2023

> Do you need a full expression system to do groupbys well? If the answer is "no" and it's easy to do a little light rewriting then my sense is that we don't pursue an xarray expression system, and just do arrays.

I think the answer is "probably not". I made a proposal for these higher-level objects that can return "preferred chunks". We would still need to propagate these chunking heuristics down to read time, but that seems like it could be done at the array level.

Sounds like you'd want to migrate dask.array to an expression system anyway, so let's prototype that and see where we get?
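One way to picture the "preferred chunks" handshake is as a tiny protocol: a downstream operation advertises the chunk size it would like, and the read step splits its extent accordingly. The sketch below is entirely hypothetical (it is not an existing xarray or dask API); `preferred_chunk_size` and `choose_read_chunks` are invented names.

```python
# Hypothetical "preferred chunks" handshake: a downstream op advertises a
# chunk size, and the read step adapts its read-time chunking to match.
# Invented names; not an existing xarray/dask protocol.

def choose_read_chunks(total_length, preferred):
    """Split total_length into roughly equal chunks near `preferred` in size."""
    n = max(1, round(total_length / preferred))
    base, extra = divmod(total_length, n)
    # distribute the remainder over the first `extra` chunks
    return [base + (1 if i < extra else 0) for i in range(n)]

class Reduction:
    # hypothetical downstream op advertising its preferred read chunking
    preferred_chunk_size = 250

print(choose_read_chunks(1000, Reduction.preferred_chunk_size))  # [250, 250, 250, 250]
```

In an expression system, this negotiation could happen during optimization, before any data is read, which is what "propagate these chunking heuristics down to read time" would amount to.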

mrocklin (Member) commented Dec 8, 2023

Looking through the API, I'm guessing that our hierarchy is mostly based on the following:

  • IO
  • Blockwise
  • TreeReductions
  • Slicing
  • Rechunk
  • MapOverlap

There are some other outliers like the following:

  • cumulative reductions
  • linalg
  • coarsen
  • ravel
  • ...

I'll bet that an initial implementation focuses on from_array, blockwise, reductions, and slicing. I think that we can deliver a lot of functionality with those.

Of course, blockwise and slicing are both pretty hairy and have, I suspect, bus factors of zero today. Someone (maybe me) probably has to go and do those first. Then it's probably easy for other people to come on afterwards.
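One reason slicing pays off in expression form is chunk pruning: a slice expression can determine which input chunks it touches and drop the rest from the graph entirely. A toy 1-D version of that calculation (illustrative only; real dask slicing also handles steps, negative indices, and n dimensions):

```python
# Toy sketch of slice-driven chunk pruning in 1-D: given chunk sizes and a
# [start, stop) slice, keep only the chunks the slice actually touches,
# with the local sub-slice of each. Not dask's slicing implementation.

def touched_chunks(chunks, start, stop):
    kept, offset = [], 0
    for i, size in enumerate(chunks):
        lo, hi = offset, offset + size
        if lo < stop and hi > start:  # this chunk overlaps the slice
            kept.append((i, max(start, lo) - lo, min(stop, hi) - lo))
        offset = hi
    return kept

# slice [5, 9) of an array chunked as [4, 4, 4] touches chunks 1 and 2 only
print(touched_chunks([4, 4, 4], 5, 9))  # [(1, 1, 4), (2, 0, 1)]
```

With that information symbolic, the IO expression upstream never even schedules the untouched chunks, which is the kind of optimization an HLG-era graph cannot express as cleanly.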

dcherian commented Dec 8, 2023

I wonder if @shoyer has thoughts on the advantages of building an expression system on xarray rather than dask.array. Such a thing would seem to resemble xarray-beam in some ways.

It seems like the classic issues are solvable at the dask.array level.

tomwhite commented Dec 8, 2023

Very interested to see this work! A couple of questions and thoughts (based on my work in Cubed):

  1. Is it a goal (or possible) for the expression system to be usable on alternative execution engines (in the way Cubed does), like Beam, Lithops, or Modal?
  2. Could the expression system model projected memory requirements? This seems like the right level to do that, and could be very valuable for improving the overall user experience.

mrocklin (Member) commented Dec 8, 2023

> Is it a goal (or possible) for the expression system to be usable on alternative execution engines (in the way Cubed does), like Beam, Lithops, or Modal?

It is not a goal of mine (I mostly just work on Dask and Coiled), but I don't think it would be hard to do. I recommend looking at the current dataframe implementation. Much of the code is about modeling the pandas API; then there are some methods/protocols like _layer which add Dask logic. One could easily add a _pandas_apply method everywhere, for example, to run things as pandas operations.

To be clear, I have no intention of spending energy to support other systems, but if other people want to come in and collaborate I'm sure we could make space for them.
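A minimal sketch of that pluggable-backend shape: the expression models the operation once, and per-backend methods attach execution logic. `_layer` echoes the protocol named above; `_eager_apply` is an invented stand-in for something like the suggested `_pandas_apply`, and none of this is dask-expr's actual API.

```python
# Sketch of pluggable execution backends on one expression type.
# `_eager_apply` is an invented stand-in for a per-backend method;
# not the dask-expr API.

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def _layer(self):
        # graph-flavoured backend: describe the work as a task
        return {"add": (lambda a, b: a + b, self.left, self.right)}

    def _eager_apply(self):
        # alternative backend: run the same operation immediately
        return self.left + self.right

expr = Add(2, 3)
print(expr._eager_apply())  # 5
```

A system like Cubed or Beam would then walk the same expression tree but call its own backend method at each node.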

> Could the expression system model projected memory requirements? This seems like the right level to do that, and could be very valuable for improving the overall user experience.

We know the size of every chunk and the dtype, so presumably yes, on a per-chunk basis.
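Concretely, for a dense chunk the projection is just elements times dtype itemsize, both of which the expression knows symbolically. A sketch of the arithmetic (`chunk_nbytes` is an invented helper name):

```python
# Per-chunk memory projection for a dense chunk: elements * dtype itemsize.
# `chunk_nbytes` is an illustrative helper, not an existing dask function.
from math import prod

def chunk_nbytes(chunk_shape, itemsize):
    """Projected bytes for one dense chunk of the given shape and itemsize."""
    return prod(chunk_shape) * itemsize

# a (1000, 1000) chunk of float64 (8 bytes per element) -> 8 MB
print(chunk_nbytes((1000, 1000), 8))  # 8000000
```

Summing this over the chunks an operation holds live at once would give the kind of projected peak that Cubed uses to bound memory.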
