
Allow non-memory zarr stores in to_zarr with distributed #10422

Merged
merged 9 commits into dask:main on May 11, 2024

Conversation

GFleishman
Contributor

@GFleishman GFleishman commented Jul 20, 2023

In dask.array.core.to_zarr I swapped the check from distributed scheduler + MutableMapping to distributed scheduler + zarr.storage.MemoryStore, which seems to be the only zarr.storage type that is backed by memory, though I'm unsure about the LMDB store.

I also removed the import of MutableMapping, since this was the only place in all of dask.array.core that referenced it.
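The distinction matters because in zarr v2 essentially every store, including disk-backed ones like DirectoryStore, implements the MutableMapping interface, so the old isinstance check rejected disk-backed stores too. A minimal self-contained sketch of the before/after logic (using stand-in classes rather than the real zarr types, which are assumptions here):

```python
from collections.abc import MutableMapping

class FakeMemoryStore(dict):
    """Stand-in for zarr.storage.MemoryStore (backed by RAM); dict is a MutableMapping."""

class FakeDirectoryStore(MutableMapping):
    """Stand-in for zarr.storage.DirectoryStore (backed by disk)."""
    def __init__(self):
        self._keys = {}
    def __getitem__(self, k): return self._keys[k]
    def __setitem__(self, k, v): self._keys[k] = v
    def __delitem__(self, k): del self._keys[k]
    def __iter__(self): return iter(self._keys)
    def __len__(self): return len(self._keys)

def old_check_rejects(store):
    # Old logic: reject ANY MutableMapping under the distributed scheduler,
    # which sweeps up disk-backed stores as well.
    return isinstance(store, MutableMapping)

def new_check_rejects(store):
    # New logic: reject only stores that are actually memory-backed.
    return isinstance(store, FakeMemoryStore)

disk = FakeDirectoryStore()
mem = FakeMemoryStore()
print(old_check_rejects(disk))  # True  -- disk store wrongly rejected
print(new_check_rejects(disk))  # False -- disk store now allowed
print(new_check_rejects(mem))   # True  -- memory store still rejected
```

The point of the sketch: because both store types pass the MutableMapping check, only a type-specific check can distinguish memory-backed from disk-backed destinations.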

… zarr.storage.Store as long as it isn't MemoryStore
@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@github-actions github-actions bot added the array label Jul 20, 2023
@GFleishman
Contributor Author

Not sure about the etiquette/workflow, but maybe I should tag @fjetter since they have already looked at the issue?

Member

@jrbourbeau jrbourbeau left a comment


Thanks @GFleishman!

It looks like the code linter is unhappy -- could you run pre-commit to handle those? See https://docs.dask.org/en/stable/develop.html#code-formatting for more details

Also could you add a test to make sure we're covering this case (if we're not already)?

@jrbourbeau jrbourbeau changed the title Resolves iss10420 Allow non-memory zarr stores in to_zarr with distributed Jul 20, 2023
@GFleishman
Contributor Author

GFleishman commented Jul 21, 2023

@jrbourbeau I think everything is actually working now. The code linter did its thing and I added tests for two new conditions:

(1) calling to_zarr with a zarr array initialized with a directory store (should work)
(2) calling to_zarr with a zarr array initialized with a memory store (should produce a RuntimeError)

The original case, calling to_zarr with a zarr array initialized with no store at all (which defaults to a memory-backed array, actually a zarr.storage.KVStore), is covered by a test that was already there, right below the ones I added.

CI is still failing, but if you look at the error log it's not related to anything I added (something to do with dataframes). When I run my tests locally, they pass:

(bigstream) fleishmang@h07u01:confocal> ipython
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dask.tests.test_distributed as tests
   ...: from distributed import Client
   ...: c = Client()
   ...: tests.test_zarr_distributed_with_explicit_directory_store(c)
   ...: tests.test_zarr_distributed_with_explicit_memory_store(c)

In [2]:

If I understand correctly, if either test were to fail then I would get an AssertionError or a RuntimeError.

@GFleishman
Contributor Author

Just checking on this PR. I'm currently under the impression that the changes are sufficient and tested, but that CI has some kind of problem.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Sep 4, 2023
@aamster

aamster commented Apr 28, 2024

Following up on this. This bug prevents using to_zarr with a disk-backed zarr object as the destination when running on a local cluster. Can this please be merged in?

Contributor

github-actions bot commented Apr 29, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ± 0       15 suites  ±0   3h 23m 21s ⏱️ -21s
 13 123 tests + 2   12 191 ✅ + 1     931 💤 ±0  1 ❌ +1 
162 492 runs  +24  142 454 ✅ +15  20 037 💤 +8  1 ❌ +1 

For more details on these failures, see this check.

Results for commit a92d65f. ± Comparison against base commit 7ace31f.


@mrocklin
Member

CI issues persist. I agree that they don't seem related. Raised here: #11080.

Let's see what @fjetter has to say (if he has time, no guarantees)

@mrocklin
Member

@quasiben maybe this is something your team can help resolve?

@GFleishman
Contributor Author

@mrocklin Just checking if there is anything else needed from me to help this out?

@mrocklin
Member

@mrocklin Just checking if there is anything else needed from me to help this out?

Sorry for the long delay here. I've updated your test slightly (adding an importorskip("numpy")) and am merging.

Also @phofl it looks like there's a pandas nightly error (just some future warning that went away).

@mrocklin mrocklin merged commit d5a81bb into dask:main May 11, 2024
25 of 26 checks passed
Successfully merging this pull request may close these issues.

dask.array.core.to_zarr with distributed scheduler, MutableMapping should be ok if backed by disk
7 participants