Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tensorstore support #26

Open
JoOkuma opened this issue Jan 18, 2023 · 15 comments
Open

tensorstore support #26

JoOkuma opened this issue Jan 18, 2023 · 15 comments
Labels
enhancement New feature or request NGFF OME-NGFF (OME-Zarr format) performance Speed and memory usage of the code
Milestone

Comments

@JoOkuma
Copy link
Member

JoOkuma commented Jan 18, 2023

Tensorstore could be used for the zarr backend.

A few advantages are:

  • It's faster.
  • It supports transposes and other transformations on the coordinate space without loading the data.
  • It loads the data lazily until requested through np.asarray
@JoOkuma JoOkuma added the enhancement New feature or request label Jan 18, 2023
@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 18, 2023

It supports transposes and other transformations on the coordinate space without loading the data.

This will be useful for alternative viewing in mehta-lab/recOrder#297.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 19, 2023

I'm not sure if I'm reading this from the TensorStore documentation for the file Key-Value Store driver correctly:

This driver is only supported on Windows 10 RS1 or later, due to its reliance on file operations with POSIX semantics.

Does this mean that it does not support other OS? Or this sentence only applies to Windows since Linux and macOS are naturally POSIX-compatible?

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 19, 2023

Another question related to #18 (comment):
How does TensorStore handle empty chunks? The zarr-python implementation supports sparse arrays.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 19, 2023

@JoOkuma @AhmetCanSolak From your experience, is TS mature enough to be the only Zarr backend for iohub? Is there any missing feature as compared to zarr-python that we might find blocking in the future?

@JoOkuma
Copy link
Member Author

JoOkuma commented Jan 19, 2023

@ziw-liu , they have the fill value parameter.

Or this sentence only applies to Windows since Linux and macOS are naturally POSIX-compatible?

I think that's the case, because I've been using on Linux for a while.

Is there any missing feature as compared to zarr-python that we might find blocking in the future?

Nothing comes to mind right now. I never tried using with datasets with multiple "wells", but ours light-sheet dataset sometimes have multiples keys, so it should work fine.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 19, 2023

ours light-sheet dataset sometimes have multiples keys, so it should work fine.

Are those multiple keys multiple zarrays or zgroups? I couldn't find documentation about managing dataset hierarchy in TS. Can you try with this example OME-NGFF HCS dataset?

Edit: aka the equivalent for this feature in zarr-python, which is essential for NGFF support.

@AhmetCanSolak
Copy link
Contributor

@JoOkuma @AhmetCanSolak From your experience, is TS mature enough to be the only Zarr backend for iohub? Is there any missing feature as compared to zarr-python that we might find blocking in the future?

I think it can be the only Zarr backend, we don't need to support pre-Windows10, we don't have any such machine at hub AFAIK. It is pretty feature complete also.

TensorStore supports nested Zarr groups. What do you want to test @ziw-liu with the example OME-NGFF HCS dataset? If you could give me more details, I am happy to help testing.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 19, 2023

What do you want to test @ziw-liu with the example OME-NGFF HCS dataset?

Just to see if TS can do basic reading/writing with it. I'm a bit confused as the words 'group' and 'hierarchy' don't seem to exist at all in the doc and issues.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 20, 2023

Also not sure about how the Zarr attributes (.zattrs file described in the Zarr spec) is handled by TS. The zarr-python implementation takes care of things like caching, consolidation, and (de)serialization. This feature is used by the NGFF as well as some of iohub's downstream packages (e.g. recOrder, microDL) to store metadata.

@AhmetCanSolak
Copy link
Contributor

The way I understand it (please correct me where I am wrong @JoOkuma), you can use zarr to open any group load any array with tensorstore. TS supports caching, (de)serialization and so. According to TS documentation, only filters are not supported: https://google.github.io/tensorstore/driver/zarr/index.html#limitations

I opened OME-NGFF HCS Dataset without any issues with following script:

import tensorstore as ts
import zarr


def main():
    store_path = "20200812-CardiomyocyteDifferentiation14-Cycle1.zarr"
    internal_path = "B/03/0/0"
    
    # open zarr store
    zs = zarr.open(store_path, mode='a')
    array = zs[internal_path]

    # read zarr
    dataset = ts.open({
        "driver": "zarr",
        "kvstore": {
            "driver": "file",
            "path": f"{store_path}/{internal_path}",
        },
        "metadata": {
            "shape": array.shape,
            "order": array.order,
            "dtype": array.dtype.str,
        }
    }, open=True).result()

    print(dataset)


if __name__ == '__main__':
    main()

Output:

TensorStore({
  'context': {
    'cache_pool': {},
    'data_copy_concurrency': {},
    'file_io_concurrency': {},
  },
  'driver': 'zarr',
  'dtype': 'uint16',
  'kvstore': {
    'driver': 'file',
    'path': '20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/0/',
  },
  'metadata': {
    'chunks': [1, 1, 2160, 2560],
    'compressor': {
      'blocksize': 0,
      'clevel': 5,
      'cname': 'lz4',
      'id': 'blosc',
      'shuffle': 1,
    },
    'dimension_separator': '/',
    'dtype': '<u2',
    'fill_value': 0,
    'filters': None,
    'order': 'C',
    'shape': [1, 2, 2160, 5120],
    'zarr_format': 2,
  },
  'transform': {
    'input_exclusive_max': [[1], [2], [2160], [5120]],
    'input_inclusive_min': [0, 0, 0, 0],
  },
})

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 20, 2023

you can use zarr to open any group load any array with tensorstore

Sorry I wasn't quite clear, by:

is TS mature enough to be the only Zarr backend for iohub

I meant to ask if TS is a complete implementation of the Zarr specification and iohub can stop depending on zarr-python if it used TS. And now it seems that TS only implemented the array storage protocol. Then essentially we still need zarr-python around, and adding TS is a '+1 dependency' move. I think it will significantly increase the complexity of code, as the thing (reading an array) we can do with one line via zarr-python now needs 10+ lines constructing an extra json object.

TS supports caching, (de)serialization and so.

I couldn't find how TS handles .zattrs as zarr-python does with zarr.attrs.Attributes. Can you provide a snippet for this too? (e.g. read the channel name OME metadata field from the example dataset).

@JoOkuma
Copy link
Member Author

JoOkuma commented Jan 20, 2023

I dug a little deeper into this in some more general use cases that we did not try in DEXP.

tensorstore loads zarr arrays and not groups / hierarchies, so it cant replace zarr-python completely, and it doesn't interact with .zattrs because they can live at a group level.

Therefore, I think it should be an optional backend, and it can be implemented later on top of the default zarr backend, I'm glad to help with this.
I'm ok with some iohub functionalities (e.g., lazy loading of arbitrary chunks) being exclusive to a tensorstore backend if it makes things simpler.

@JoOkuma
Copy link
Member Author

JoOkuma commented Jan 20, 2023

I don't think it's worth implementing our own code to transverse the zarr hierarchy to avoid depending on zarr-python.

tensorstore adds more complexity than the default zarr-python, and in my opinion, it shouldn't be a priority since it isn't a drop-in replacement.

@ziw-liu
Copy link
Collaborator

ziw-liu commented Jan 20, 2023

@JoOkuma @AhmetCanSolak Thanks for the input! For now I will keep using zarr-python. And I imagine adding tensorstore as an alternative array protocol can be done later without too much refactoring. Maybe something like:

from iohub.ngff import HCSDataset
import numpy as np

dataset = HCSDataset.open("hcs.zarr", mode="a")
image = np.random.randint((1, 2, 3, 1024, 1024), dtype=np.uint16)

# uses zarr-python
dataset["B/5/2/0"] = image
# uses tensorstore
write_future = dataset["B/5/3/0"].delayed_write(image)
write_future.result()

@AhmetCanSolak
Copy link
Contributor

I would suggest us to have TS as a backend option @ziw-liu in the future. Datasets read with iohub will be pass around for different purposes and I like to make sure once TS backend is selected all assignment/write/read functionalities using TS. It would be a bit of a refactoring, but I am happy to work on it together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request NGFF OME-NGFF (OME-Zarr format) performance Speed and memory usage of the code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants