Allow explicit chunk size for to_zarr
#5953
Comments
Not necessarily. We would need to rechunk anyway. I think that using rechunk before to_zarr is likely as good as we can do here. I'm inclined to leave things as they are. It might make sense to add a line about this in the docstring of to_zarr. I may not fully understand things. Is there a particular reason why you're concerned about rechunking, or is this just a general concern about uncertainty around rechunking performance?
Thanks for the response. This is generally a concern about rechunking performance, but I could very well be off in my understanding. Maybe I need to look into rechunk. If the Dask chunk sizes are some multiple of the Zarr chunks, would it be possible for each Dask chunk to write to multiple Zarr chunks rather than needing to rechunk? Maybe this is quite minor in terms of performance... I would like to use delayed + dask array like this, but if I understand correctly, using delayed with other Dask collections should be avoided:

```python
import dask
import dask.array as da
from dask import delayed
import zarr

# Create a mutable on-disk store.
z = zarr.open("my_data.zarr", shape=(50000, 50000), chunks=(512, 512))
x = da.random.random((50000, 50000), chunks=(4096, 4096))

@delayed
def write_chunks(z, x, selection):
    z[selection] = x[selection]

tasks = []
for i in range(0, 50000, 4096):
    task = write_chunks(z, x, (slice(i, i + 4096), slice(i, i + 4096)))
    tasks.append(task)
dask.compute(tasks)
```

In this scenario, each Dask chunk is responsible for writing to 64 Zarr chunks (4096/512 = 8 per axis), and there shouldn't be issues with trying to write to the same chunks.
That sounds like a reasonable use case that is not currently supported. cc @jakirkham, who works in this space, in case he has interest.
Using Dask array with Dask delayed is definitely doable, but there are many ways to shoot yourself in the foot, and the way that you have proposed is one such way. I recommend not doing that. When you pass a Dask array into a delayed function, the Dask array will become a NumPy array.
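A tiny self-contained check of that behavior (the shapes here are arbitrary):

```python
import dask
import dask.array as da
from dask import delayed

x = da.ones((4, 4), chunks=(2, 2))

@delayed
def inspect(arr):
    # Dask computes collections passed as arguments, so by the time
    # this runs, `arr` is a concrete NumPy array, not a Dask array.
    return type(arr).__name__

print(dask.compute(inspect(x)))  # ('ndarray',)
```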
Yeah, sounds reasonable. Are you interested in doing a PR? 😉
I am interested in doing a PR! However, considering my first suggestion seems to be a foot-gun, I would appreciate some guidance on the approach.
Yeah, I think the normal way to do this would be to […]. If the chunks in Zarr are smaller than those in Dask and all Zarr chunks fit within Dask chunks, that may be safe to use without locking. However, any deviations will require that we use Zarr locking. I'm not totally sure how we will handle that when using the Distributed Scheduler.
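One possible direction is `da.store` with its `lock` argument (a sketch under stated assumptions: a manually created Zarr array and illustrative sizes, not code from this thread):

```python
import dask.array as da
import zarr

x = da.random.random((50000, 50000), chunks=(4096, 4096))
z = zarr.open("my_data.zarr", mode="w", shape=x.shape,
              chunks=(512, 512), dtype=x.dtype)

# lock=True guards writes with a single shared lock, which is safe
# when Dask and Zarr chunk boundaries do not line up, at the cost
# of write concurrency.
da.store(x, z, lock=True)
```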
Description
Currently working on a project with Zarr and Dask where the on-disk dataset has less-than-optimal chunk sizes for computation with Dask (specifically creating image tiles). Fortunately, dask.array.from_zarr allows you to specify the chunk sizes explicitly and optimize for computation, but dask.array.to_zarr does not allow the reverse when creating an array.
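For context, a small sketch of the asymmetry (paths and sizes are placeholders; the `chunks=` keyword on `to_zarr` is the hypothetical addition this issue asks for, not an existing argument):

```python
import dask.array as da

# Reading: chunks for computation can be chosen independently
# of the on-disk layout.
x = da.from_zarr("my_data.zarr", chunks=(4096, 4096))

# Writing: to_zarr currently uses the Dask chunks as the Zarr chunks;
# an explicit chunks keyword here (e.g. chunks=(512, 512)) is the
# hypothetical feature being requested.
x.to_zarr("out.zarr")
```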
Example

A workaround at the moment is:
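Something like the following (a sketch; the array, sizes, and path are illustrative):

```python
import dask.array as da

x = da.random.random((50000, 50000), chunks=(4096, 4096))

# Rechunk to the desired on-disk layout immediately before writing.
x.rechunk((512, 512)).to_zarr("my_data.zarr")
```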
However, it seems that there is likely a bit of overhead in rechunking to this extreme just to write to an on-disk store. Many thanks in advance for having a look at this. I love using Dask!