
Add scan / prefix sum primitive #277

Open
dcherian opened this issue Jul 24, 2023 · 4 comments

Comments

@dcherian

If you're looking for something to do :), then scans would be a good thing to add.

Dask calls this "cumreduction" (terrible name!), and it's quite a useful primitive (Xarray uses it for ffill and bfill). It's also a fun algorithm to think about: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda (see the Blelloch, 1990 section).
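For intuition, here's a minimal NumPy sketch of the work-efficient exclusive scan described in that chapter (Blelloch, 1990). The function name is my own, and the sketch assumes a power-of-two input length; a production version would pad or split uneven inputs.

```python
import numpy as np

def blelloch_exclusive_scan(a, op=np.add):
    """Sketch of Blelloch's two-phase work-efficient exclusive scan.

    An up-sweep builds a tree of partial sums in place, then a
    down-sweep distributes them back down. Assumes len(a) is a
    power of two. Hypothetical helper, not Cubed's implementation.
    """
    x = np.array(a, copy=True)
    n = len(x)
    assert n & (n - 1) == 0, "sketch assumes power-of-two length"

    # Up-sweep (reduce) phase: combine pairs at increasing strides.
    d = 1
    while d < n:
        x[2 * d - 1 :: 2 * d] = op(x[d - 1 :: 2 * d], x[2 * d - 1 :: 2 * d])
        d *= 2

    # Down-sweep phase: clear the root, then push partial sums down,
    # swapping and combining at decreasing strides.
    x[-1] = 0
    d = n // 2
    while d >= 1:
        left = x[d - 1 :: 2 * d].copy()
        x[d - 1 :: 2 * d] = x[2 * d - 1 :: 2 * d]
        x[2 * d - 1 :: 2 * d] = op(left, x[2 * d - 1 :: 2 * d])
        d //= 2
    return x
```

For example, `blelloch_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8])` gives the exclusive prefix sums `[0, 1, 3, 6, 10, 15, 21, 28]`.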

@TomNicholas
Collaborator

TomNicholas commented Jul 24, 2023

Oops, I think I missed this one when creating the ChunkManagerEntrypoint in pydata/xarray#7019. It looks like calling Xarray's ffill or bfill on an array with chunked core dimensions will attempt to apply dask.array.cumreduction along the core dims, even if the underlying array is a cubed.Array.

@tomwhite
Member

Thanks @dcherian - great suggestion. It would be interesting to see how we could implement this in Cubed.

The Python Array API spec has a proposal for cumulative_sum, which is a special case of this operation.

Also, I noticed that bfill in Xarray uses flip twice, so #114 may be relevant (although it might be more efficient to avoid flip in this case: because of Zarr's regular-chunks requirement, flip turns the final chunk into the first, so the whole array would need rewriting).

@tomwhite
Member

I think the relevant part of the Nvidia doc is "39.2.4 Arrays of Arbitrary Size", which explains how to apply the algorithm to chunked (or blocked) arrays.

We could implement this by applying the NumPy cumsum operation (or the equivalent for the generalised operation) to each block, and then creating the auxiliary arrays of block totals (SUMS, in the Nvidia terminology) and their cumulative sums (INCR).
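As a serial sketch of that plan (function and variable names are my own, and plain lists of NumPy blocks stand in for Cubed's lazy arrays):

```python
import numpy as np

def chunked_cumsum(blocks):
    """Sketch of the blocked scan from section 39.2.4 of the Nvidia doc.

    1. Scan each block independently.
    2. SUMS: the total of each block (last element of each scan).
    3. INCR: exclusive scan of SUMS -- the offset each block needs.
    4. Add each block's offset to its local scan.
    """
    scanned = [np.cumsum(b) for b in blocks]             # per-block scan
    sums = np.array([s[-1] for s in scanned])            # SUMS
    incr = np.concatenate([[0], np.cumsum(sums)[:-1]])   # INCR
    return [s + off for s, off in zip(scanned, incr)]
```

Concatenating the result of `chunked_cumsum([...])` matches `np.cumsum` over the concatenated input, while each block only needs its own data plus a single scalar from INCR.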

Naively, INCR could have chunks of size one so that it could be mapped over the scanned blocks with a blockwise operation, but this would not scale well: with, say, 1000 chunks, a single task would have to write 1000 tiny chunk files when storing INCR as a Zarr file.

Instead, we could write INCR as a Zarr file with a single chunk; a task processing scanned block i would then load the entire INCR file but use only the i-th value to add to its chunk. This should work as long as the object store can handle many concurrent reads. Cubed has a map_direct operation that allows access to side inputs (here, INCR) for cases like this.
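In outline, each per-block task would look something like this (hypothetical names; a NumPy array stands in for reading the single-chunk INCR Zarr file):

```python
import numpy as np

def add_incr(scanned_block, block_index, incr_store):
    # Load the whole INCR array -- cheap, since it is one small chunk --
    # then apply only the offset belonging to this block.
    incr = np.asarray(incr_store)
    return scanned_block + incr[block_index]
```

Every task reads the same small object, so the load on the store is many concurrent reads of one file rather than one task writing many files.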

@dcherian
Author

Ah yes that figure in "39.2.4" is what I remember. Such a cool algorithm!

Here's the dask PR where I learnt of this: dask/dask#6675
