Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array nesting: Add the ability to use N5-style nested layout #17

Closed
wants to merge 6 commits into from

Conversation

joshmoore
Copy link

Having many millions of chunk files in a single directory causes
significant performance issues for filesystems. This PR introduces
a "nested" boolean to the ZarrArray class and related helpers in
order to choose between the separators "." and "/". The main
downside of nested storage is that one must search through the
chunk index names in order to determine whether or not an array
is nested.

Discussion:

A few similar strategies exist in the zarr-python code base. NestedStorage was the original attempt with the downside that it was not composable with other stores. A newer version, FSStore, allows passing a "key_separator". Here, I've chosen the boolean to reduce the burden on the caller. A type other than "boolean" ("ChunkNamer"?) might be preferred. Note also that no other implementation that I know of currently tries to detect "nested or not nested" as we're doing here. It may be overkill.

Having many millions of chunk files in a single directory causes
significant performance issues for filesystems. This PR introduces
a "nested" boolean to the ZarrArray class and related helpers in
order to choose between the separators "." and "/". The main
downside of nested storage is that one must search through the
chunk index names in order to determine whether or not an array
is nested.
If a ZarrArray is created but no data is written (as with
bioformats2raw), then the subsequent open will fail since
no chunks can be found for a proper determination.
This commits uses a `Boolean` for storing the nested state
to detect whether or not a non-default value was requested.
If so, then the value is stored in the .zarray metadata and
will be detected on opening the array to prevent the time-
consuming workaround.
@joshmoore
Copy link
Author

Pushed a fairly significant change to store the state of the "nested" Boolean (true, false, null) in the .zarray metadata. This handles an edge-case that I ran into in which no data is written to the ZarrArray that came from createArray and rather a second call to openArray is used for writing data.

joshmoore added a commit to joshmoore/bioformats2raw that referenced this pull request Feb 8, 2021
Make use of the new ZarrArray.nested flag that's currently
open as a PR. This should significantly increase the performance
of reading existing zarrs (which also happens during downsampling)
when the number of chunk files reaches the millions.

see: bcdev/jzarr#17
@SabineEmbacher
Copy link
Collaborator

Dear Josh,

I have written down my thoughts on this topic here: #19
What do you think?
Can this be a viable way that maintains compatibility with the zarr specification v2?

Best Regards
Sabine

@SabineEmbacher
Copy link
Collaborator

By the way ...
at the moment I am not able to merge your pull request, because your branch is 6 commits behind bcdev/master and has conflicts.

Have a great evening!

@joshmoore
Copy link
Author

Thanks for the headsup, @SabineEmbacher! I'll do some more testing on my side and get things tidied up. (After reading #19...)

@joshmoore
Copy link
Author

There are several ideas in #19 that I still need to think through, but for the implementation in this PR, I'm leaning towards changing boolean nested to String keySeparator to match closer the n5-zarr and zarr-python implementations. Assuming I can do that and fix the commits from master, @SabineEmbacher, could you imagine getting this into a release and if so, what would you see the roadmap for that being?

@SabineEmbacher
Copy link
Collaborator

SabineEmbacher commented Mar 2, 2021

Dear Josh,

What we are talking about is the index separator char which shall be used to generate the chunk key. So this character in principle is not equal to a path separator but can be the same.
If a FileSystemStore is used, although a generated key corresponds to a file name, but in the end it is only a key string consisting of a series of indices which are separated from each other by a given character.
I would suggest as name "Chunk Index Separator Char" or something similar. It should be a name that makes it clear that this is not a general separator for the complete Zarr tree (all branches and leaves), but a separator that is used when creating the chunk keys of exactly this array.

I plan to implement the following:

When creating a zarr array:

  • add the index separator char to .zarray json (property name defined by zarr developers)
  • always write a 0-position chunk.

When opening an array:

  • try to read the separator character from the .zarray json.
  • if not available, try to find a 0-position chunk
  • if not available, at every read action try to find chunks with both variants until the situation is clarified. Standard separator list ["/", "."]

If you agree with this plan, you don't have to do anything more to the pull request. I will then implement it as planned in the near future.

Best Regards
Sabine

@joshmoore
Copy link
Author

So this character in principle is not equal to a path separator but can be the same.

True, and to be honest, I don't know how the Python implementations are dealing with this cross-platform!

If you agree with this plan, you don't have to do anything more to the pull request. I will then implement it as planned in the near future.

Sounds great. I will summarize your proposal on the Python side and we'll see if there are any further suggestions (e.g. for the name).

All the best,
~Josh

@chris-allan
Copy link
Contributor

Path separator consistency is one area which could possibly improved when working with jzarr. The Zarr specification is already quite explicit when it comes to "key" uniformity expectations (basically UNIX style path semantics):

Everything else is left up to the implementation and corresponding storage. Consequently, there is probably utility in establishing consistency and uniformity across the jzarr API when it comes to group keys in particular. Notably, there is currently com.bc.zarr.ZarrGroup.openArray(String) and com.bc.zarr.ZarrGroup.createSubGroup(String) but no com.bc.zarr.ZarrGroup.openSubGroup(String). This leaves navigating the hierarchy to the caller and exposes the caller to pathing implementation details.

@SabineEmbacher
Copy link
Collaborator

Consequently, there is probably utility in establishing consistency and uniformity across the jzarr API when it comes to group keys in particular. Notably, there is currently com.bc.zarr.ZarrGroup.openArray(String) and com.bc.zarr.ZarrGroup.createSubGroup(String) but no com.bc.zarr.ZarrGroup.openSubGroup(String). This leaves navigating the hierarchy to the caller and exposes the caller to pathing implementation details.

Good Point!

@SabineEmbacher
Copy link
Collaborator

com.bc.zarr.ZarrGroup.openSubGroup(String)

done

chris-allan pushed a commit to chris-allan/jzarr that referenced this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants