Array nesting: Add the ability to use N5-style nested layout #17

joshmoore · 2021-02-05T12:47:53Z

Having many millions of chunk files in a single directory causes
significant performance issues for filesystems. This PR introduces
a "nested" boolean to the ZarrArray class and related helpers in
order to choose between the separators "." and "/". The main
downside of nested storage is that one must search through the
chunk index names in order to determine whether or not an array
is nested.
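As a rough illustration of the two layouts, the sketch below joins the same chunk grid indices with each separator. The class and method names (`ChunkKeys`, `chunkKey`) are hypothetical and are not taken from the jzarr code base.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class ChunkKeys {
    /** Joins chunk grid indices into a store key using the given separator. */
    public static String chunkKey(int[] chunkIndex, String separator) {
        return Arrays.stream(chunkIndex)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining(separator));
    }

    public static void main(String[] args) {
        int[] index = {2, 0, 15};
        // Flat layout: every chunk file lives in the same directory.
        System.out.println(chunkKey(index, ".")); // 2.0.15
        // Nested (N5-style) layout: one directory level per dimension.
        System.out.println(chunkKey(index, "/")); // 2/0/15
    }
}
```

With the "/" separator, the number of entries in any one directory is bounded by the chunk count along a single dimension rather than by the total chunk count.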

Discussion:

A few similar strategies exist in the zarr-python code base. NestedStorage was the original attempt with the downside that it was not composable with other stores. A newer version, FSStore, allows passing a "key_separator". Here, I've chosen the boolean to reduce the burden on the caller. A type other than "boolean" ("ChunkNamer"?) might be preferred. Note also that no other implementation that I know of currently tries to detect "nested or not nested" as we're doing here. It may be overkill.

If a ZarrArray is created but no data is written (as with bioformats2raw), then the subsequent open will fail since no chunks can be found for a proper determination.

This commit uses a `Boolean` for storing the nested state to detect whether or not a non-default value was requested. If so, then the value is stored in the .zarray metadata and will be detected on opening the array to avoid the time-consuming workaround.
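A minimal sketch of the tri-state idea, assuming the metadata is held as a plain map: `null` means "not requested", so nothing is written and detection from chunk names is still needed on open. The class name `NestedFlag` and the metadata key `"nested"` are assumptions made for illustration; the thread does not fix the property name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedFlag {
    // Hypothetical metadata key, chosen only for this sketch.
    static final String KEY = "nested";

    /** Copies the metadata and adds the flag only if the caller explicitly chose a value. */
    public static Map<String, Object> withNested(Map<String, Object> zarray, Boolean nested) {
        Map<String, Object> out = new LinkedHashMap<>(zarray);
        if (nested != null) {
            out.put(KEY, nested);
        }
        return out;
    }

    /** Returns the stored flag, or null when it must still be detected from chunk names. */
    public static Boolean readNested(Map<String, Object> zarray) {
        Object value = zarray.get(KEY);
        return (value instanceof Boolean) ? (Boolean) value : null;
    }

    public static void main(String[] args) {
        Map<String, Object> base = new LinkedHashMap<>();
        base.put("zarr_format", 2);
        System.out.println(readNested(withNested(base, true))); // true
        System.out.println(readNested(withNested(base, null))); // null
    }
}
```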

joshmoore · 2021-02-08T20:47:26Z

Pushed a fairly significant change to store the state of the "nested" Boolean (true, false, null) in the .zarray metadata. This handles an edge case that I ran into in which no data is written to the ZarrArray that came from createArray and instead a second call to openArray is used for writing data.

Make use of the new ZarrArray.nested flag that's currently open as a PR. This should significantly increase the performance of reading existing zarrs (which also happens during downsampling) when the number of chunk files reaches the millions. see: bcdev/jzarr#17

SabineEmbacher · 2021-02-09T18:41:49Z

Dear Josh,

I have written down my thoughts on this topic here: #19
What do you think?
Can this be a viable way that maintains compatibility with the zarr specification v2?

Best Regards
Sabine

SabineEmbacher · 2021-02-09T18:45:36Z

By the way ...
at the moment I am not able to merge your pull request, because your branch is 6 commits behind bcdev/master and has conflicts.

Have a great evening!

joshmoore · 2021-02-09T19:57:22Z

Thanks for the heads-up, @SabineEmbacher! I'll do some more testing on my side and get things tidied up. (After reading #19...)

joshmoore · 2021-03-01T15:36:00Z

There are several ideas in #19 that I still need to think through, but for the implementation in this PR, I'm leaning towards changing `boolean nested` to `String keySeparator` to more closely match the n5-zarr and zarr-python implementations. Assuming I can do that and bring the branch up to date with master, @SabineEmbacher, could you imagine getting this into a release, and if so, what would you see as the roadmap for that?

SabineEmbacher · 2021-03-02T08:39:26Z

Dear Josh,

What we are talking about is the index separator character that is used to generate the chunk key. In principle this character is not the same thing as a path separator, although the two can coincide.
If a FileSystemStore is used, a generated key does correspond to a file name, but in the end it is only a key string consisting of a series of indices separated from each other by a given character.
I would suggest the name "Chunk Index Separator Char" or something similar. It should be a name that makes it clear that this is not a general separator for the complete Zarr tree (all branches and leaves), but a separator that is used when creating the chunk keys of exactly this array.

I plan to implement the following:

When creating a zarr array:

- add the index separator char to the .zarray JSON (property name defined by the Zarr developers)
- always write a 0-position chunk

When opening an array:

1. Try to read the separator character from the .zarray JSON.
2. If it is not available, try to find a 0-position chunk.
3. If that is not available either, try to find chunks with both variants at every read action until the situation is clarified. Standard separator list: ["/", "."]
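The open-time fallback described in this plan could be sketched roughly as follows for a file-system store. Everything here (`SeparatorDetection`, `detect`, `zeroChunkKey`) is a hypothetical illustration of the plan, not the eventual jzarr implementation; an empty result stands for the third step, where the decision is deferred to later reads.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class SeparatorDetection {
    // Standard separator list from the plan, probed in this order.
    static final List<String> CANDIDATES = Arrays.asList("/", ".");

    /** Key of the chunk at grid position 0 in every dimension, e.g. "0.0.0". */
    public static String zeroChunkKey(int rank, String separator) {
        return String.join(separator, Collections.nCopies(rank, "0"));
    }

    /**
     * Step 1: use the value from the metadata if present.
     * Step 2: otherwise look for a 0-position chunk file under each candidate.
     * Step 3: otherwise return empty, so the caller keeps trying both on each read.
     * For rank 1 the key is "0" for every separator, so the choice does not matter.
     */
    public static Optional<String> detect(String fromMetadata, Path arrayDir, int rank) {
        if (fromMetadata != null) {
            return Optional.of(fromMetadata);
        }
        for (String sep : CANDIDATES) {
            if (Files.isRegularFile(arrayDir.resolve(zeroChunkKey(rank, sep)))) {
                return Optional.of(sep);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Path flat = Files.createTempDirectory("zarr-flat");
        Files.createFile(flat.resolve("0.0"));
        System.out.println(detect(null, flat, 2).orElse("unknown"));

        Path empty = Files.createTempDirectory("zarr-empty");
        System.out.println(detect(null, empty, 2).orElse("unknown"));
    }
}
```

Always writing a 0-position chunk at creation time, as proposed above, is what makes step 2 reliable for arrays that carry no separator in their metadata.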

If you agree with this plan, you don't have to do anything more to the pull request. I will then implement it as planned in the near future.

Best Regards
Sabine

joshmoore · 2021-03-05T16:03:50Z

> So this character in principle is not equal to a path separator but can be the same.

True, and to be honest, I don't know how the Python implementations are dealing with this cross-platform!

> If you agree with this plan, you don't have to do anything more to the pull request. I will then implement it as planned in the near future.

Sounds great. I will summarize your proposal on the Python side and we'll see if there are any further suggestions (e.g. for the name).

All the best,
~Josh

chris-allan · 2021-03-26T11:24:48Z

Path separator consistency is one area which could possibly be improved when working with jzarr. The Zarr specification is already quite explicit when it comes to "key" uniformity expectations (basically UNIX style path semantics):

https://zarr.readthedocs.io/en/stable/spec/v2.html#hierarchies

Everything else is left up to the implementation and corresponding storage. Consequently, there is probably utility in establishing consistency and uniformity across the jzarr API when it comes to group keys in particular. Notably, there is currently com.bc.zarr.ZarrGroup.openArray(String) and com.bc.zarr.ZarrGroup.createSubGroup(String) but no com.bc.zarr.ZarrGroup.openSubGroup(String). This leaves navigating the hierarchy to the caller and exposes the caller to pathing implementation details.
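To make the UNIX-style key semantics concrete, here is a small sketch of path normalization along the lines of the linked section of the v2 specification (backslashes converted to forward slashes, leading and trailing slashes stripped, repeated slashes collapsed, "." and ".." segments rejected). The class `StorePaths` and its methods are hypothetical and not part of the jzarr API.

```java
import java.util.ArrayList;
import java.util.List;

public class StorePaths {
    /** Normalizes a logical path to forward-slash form without leading or trailing slashes. */
    public static String normalize(String path) {
        String unified = path.replace('\\', '/');
        List<String> segments = new ArrayList<>();
        for (String segment : unified.split("/")) {
            if (segment.isEmpty()) {
                continue; // drops leading, trailing and repeated slashes
            }
            if (segment.equals(".") || segment.equals("..")) {
                throw new IllegalArgumentException("Relative segment not allowed: " + segment);
            }
            segments.add(segment);
        }
        return String.join("/", segments);
    }

    /** Builds the key of a child group or array below a parent path. */
    public static String child(String parent, String name) {
        String p = normalize(parent);
        String n = normalize(name);
        return p.isEmpty() ? n : p + "/" + n;
    }

    public static void main(String[] args) {
        System.out.println(normalize("\\foo//bar/")); // foo/bar
        System.out.println(child("", "labels"));      // labels
        System.out.println(child("/0/", "labels"));   // 0/labels
    }
}
```

A helper of this kind inside the library would let callers pass group names without knowing how the store joins them.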

SabineEmbacher · 2021-04-06T07:09:08Z

> Consequently, there is probably utility in establishing consistency and uniformity across the jzarr API when it comes to group keys in particular. Notably, there is currently com.bc.zarr.ZarrGroup.openArray(String) and com.bc.zarr.ZarrGroup.createSubGroup(String) but no com.bc.zarr.ZarrGroup.openSubGroup(String). This leaves navigating the hierarchy to the caller and exposes the caller to pathing implementation details.

Good point!

SabineEmbacher · 2021-04-06T09:21:26Z

> com.bc.zarr.ZarrGroup.openSubGroup(String)

Done.

joshmoore added 3 commits February 5, 2021 13:40

Array nesting: Handle empty zarr arrays

e575e5b

Array nesting: temporarily bump SNAPSHOT

38fcf70

joshmoore mentioned this pull request Mar 5, 2021

Nested storage detection in Zarr V2 zarr-developers/zarr-python#707

Closed

joshmoore mentioned this pull request Mar 23, 2021

Array nesting: Add the ability to use N5-style nested layout glencoesoftware/jzarr#1

Closed

kkoz and others added 2 commits April 20, 2021 06:08

Modify nested detection loop

09638fc

Merge pull request #1 from kkoz/nested-loop-fix

f41cbbd

Modify nested detection loop

SabineEmbacher closed this Jun 8, 2021

chris-allan pushed a commit to chris-allan/jzarr that referenced this pull request Mar 14, 2024

Merge pull request bcdev#17 from joshmoore/unidata

8a0eedc

Unidata repository
