
Read images from cloud storage #84

Open
CMCDragonkai opened this issue Oct 8, 2018 · 8 comments

Comments

CMCDragonkai commented Oct 8, 2018

I was wondering what the relationship is between this package and the function dask.array.image.imread that's already part of dask.

Especially since I noticed that dask.array.image.imread doesn't actually support remote data, so I couldn't give it an s3:// path.

jakirkham (Member) commented

Sure. So this package provides a suite of image processing tools in the spirit of scipy.ndimage. Loading image data into Dask is certainly one piece of functionality it provides, though most of the functionality consists of different kinds of image processing techniques that one may want to apply to the data once it is loaded into Dask Arrays (however one might choose to do that). It may be the case in the future that the existing image utilities in Dask are relocated to, or deprecated in favor of, dask-image, as was done with dask-ml, but we are not quite there yet.

To be more specific, dask-image includes a large variety of filters (e.g. smoothing, denoising, edge detection, Fourier, etc.), morphological operations, and some operations on label images. There is certainly room for this package to grow in these areas based on the needs of the community. The functionality here is designed specifically to handle the fact that not all of the data needed for an operation may be in the same chunk. So it will add overlap with neighbouring chunks for filters, or pull out the relevant pieces of different chunks when working with label images. Hopefully that clarifies what dask-image is trying to solve and how it differs from the functionality in Dask.
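
To make that concrete, here is a minimal sketch of applying one of those filters to a chunked array (the array size, chunking, and sigma are purely illustrative):

import dask.array as da
import dask_image.ndfilters

# Toy data: a chunked 2D array standing in for a large image.
x = da.random.random((4096, 4096), chunks=(1024, 1024))

# gaussian_filter handles the overlap between neighbouring chunks internally,
# so no manual stitching of chunk boundaries is needed.
smoothed = dask_image.ndfilters.gaussian_filter(x, sigma=2)
result = smoothed.compute()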

Loading image data is generally a hard problem (even outside of Dask) due to the large variety of formats, image format extensions or specializations for specific fields, the requirements of different imaging modalities, file constraints (size, dimensionality, ordering, etc.), compression, access patterns, encryption, etc. As a consequence there are more than a few libraries that can be used to load image data, with tradeoffs ranging from matching the on-disk format as closely as possible to smoothing out the differences between many formats by loading array data generically.

For imread specifically, dask-image has made some choices that are different from the imread function in Dask. These are done to improve graph construction performance. In either case, the actual data loading step is handed off to an external library. In Dask's case, this is scikit-image or a user defined function. In dask-image, this is PIMS, which then can use any of a number of different things including (but not limited to) scikit-image depending on what is installed and available to it. Both can be reasonable choices.
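
As a rough illustration of that difference (the file pattern here is hypothetical), both functions take a glob pattern but hand the per-file reading off to different libraries:

import dask.array.image
import dask_image.imread

a = dask.array.image.imread('images/*.png')   # per-file reading via scikit-image (or a user-supplied reader)
b = dask_image.imread.imread('images/*.png')  # per-file reading via PIMS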

Would be happy to discuss the particular problem you are working with, but would need a bit more detail: namely what image formats are involved, how the data is split up (if at all), whether authentication of some kind is needed, etc. There could be short-term solutions using things like dask.delayed, s3fs, s3fs-fuse, etc., which would give you a way to load the data you need to analyze now. Longer-term solutions would involve integrating these things in a way that is more approachable for users like yourself.
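
A rough sketch of the short-term dask.delayed + s3fs route (the bucket name, file pattern, and per-image shape/dtype below are hypothetical placeholders):

import dask
import dask.array as da
import imageio
import s3fs

fs = s3fs.S3FileSystem(anon=False)
filenames = sorted(fs.glob('my-bucket/images/*.tif'))

def read_one(path):
    # Open the remote object and decode it with a regular (non-dask) image reader.
    with fs.open(path, 'rb') as f:
        return imageio.imread(f, format='TIFF')

lazy = [da.from_delayed(dask.delayed(read_one)(p), shape=(512, 512), dtype='uint16')
        for p in filenames]
stack = da.stack(lazy, axis=0)  # one chunk per image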

jakirkham (Member) commented

Does that help? Any other questions or comments here?

skeller88 commented

I have a related question. What's the recommended way to use dask to read .tiff images stored on GCS? I posted my question here: https://stackoverflow.com/questions/58422292/is-it-possible-to-read-a-tiff-file-from-a-remote-service-with-dask

GenevieveBuckley (Collaborator) commented

So I don't really use google cloud storage, but here's where I'd start:

  1. Use gcsfs to get the remote filenames (see https://gcsfs.readthedocs.io/en/latest/). Check this is working with the simple text-file reading example they have.

  2. Try to read a single remote tiff file with a non-dask image library (maybe imageio, PIL, skimage, pims, whatever you use most often; for tiff images I think most or all of these will use Christoph Gohlke's tifffile to do the actual reading). Does this bit work? If so, great, and I'd try it with dask next (there's a small sketch of these first two steps after step 3).

  3. Try loading gcs images with dask. John has written a blog post that might be helpful here: https://blog.dask.org/2019/06/20/load-image-data
    You can try:

>>> import dask_image.imread
>>> x = dask_image.imread.imread('path/to/remote/location/*.tif')

which uses the pims library to read in images. I'm not really sure how well this plays with gcs, but if the first two steps above work this should work too.
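
A minimal sketch of steps 1 and 2 (the project, bucket, and paths are hypothetical):

import gcsfs
import imageio

# Step 1: connect and list the remote files.
fs = gcsfs.GCSFileSystem(project='my-project')
filenames = fs.ls('my-bucket/images/')

# Step 2: read a single remote tiff without dask.
with fs.open(filenames[0], 'rb') as f:
    img = imageio.imread(f, format='TIFF')

print(img.shape, img.dtype)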

Or, if what you're doing is a little more custom, you can take the same approach John does in the section "Lazily load images with Dask Array", which uses imageio.imread with dask.delayed to read the data (and optionally joins the pieces together with dask.array.block or dask.array.stack).

Finally, will you let me know how this goes? It's something that would be very helpful for other people to know too, so I'd like to add an example for this to the docs (or maybe see a post on it in the dask blog).

skeller88 commented Oct 17, 2019 via email

skeller88 commented Nov 15, 2019

Hi, I spun up a Kubernetes cluster and am running dask on it using the helm chart. I followed the dask blog post you shared as a template.

import time
import dask
import dask.array as da
import gcsfs
import imageio
import numpy as np
from distributed import Client

client = Client()

def read_filenames_from_gcs(filenames):
    bands = ["B02", "B03", "B04"]
    fs = gcsfs.GCSFileSystem(project='big_earth')

    def read(filename):
        # Download the three band files for this image and stack them into a single flat array
        imgs = []
        for band in bands:
            image_path = f"{filename}{filename.split('/')[-2]}_{band}.tif"
            r = fs.cat(image_path)  # raw bytes of the remote object
            imgs.append(imageio.core.asarray(imageio.imread(r, 'TIFF')))
        return np.stack(imgs, axis=-1).flatten()

    delayed_read = dask.delayed(read)
    # each image is 120 x 120, 3 bands total
    return [da.from_delayed(delayed_read(filename), shape=(14400 * 3,), dtype=np.uint16)
            for filename in filenames]

fs = gcsfs.GCSFileSystem(project='big_earth')
filenames = fs.ls("big_earth/raw_rgb/tiff")

imgs = read_filenames_from_gcs(filenames)
imgs = da.stack(imgs, axis=0)

at this point the array dimensions are:
[screenshot: dask array repr, shape (len(filenames), 43200), dtype uint16, one (1, 43200) chunk per image]

so I rechunk the array:

imgs_rechunked = imgs.rechunk((50, 43200))

[screenshot: dask array repr after rechunking to (50, 43200) chunks]

then I attempt to persist the images to the cluster, but the jupyter notebook crashes:

imgs_rechunked = client.persist(imgs_rechunked)

See anything that I'm obviously doing wrong?

mrocklin changed the title from "Comparison with dask.array.image.imread" to "Read images from cloud storage" on Jan 16, 2020
GenevieveBuckley (Collaborator) commented

I'm sorry I completely missed your last message @skeller88

My best guess is that you just don't have enough RAM available to persist the whole array in memory. https://examples.dask.org/array.html#Persist-data-in-memory
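
One quick sanity check (using the imgs_rechunked array from your snippet; the gigabyte conversion is just for readability) is to compare the array's total size against the memory available on your workers:

# Total bytes needed to hold the whole array in worker memory.
print(imgs_rechunked.nbytes / 1e9, "GB")

# Worker memory limits are visible on the dashboard, or via the scheduler info.
print(client.scheduler_info()["workers"])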

There's also a memory leak issue with persist and dask-distributed; you can keep an eye on that conversation over at dask/dask#2625.

skeller88 commented May 8, 2020 via email
