
image reading from s3 is broken with dask_image.imread in current pip / conda build #234

Open
kpasko opened this issue May 25, 2021 · 8 comments


kpasko commented May 25, 2021

What happened:
Reading an image from an S3 URL fails with the following deprecation message:

Directly reading images from URLs is deprecated since 3.4 and will no longer be supported two minor releases later. Please open the URL for reading and pass the result to Pillow, e.g. with PIL.Image.open(urllib.request.urlopen(url)).

What you expected to happen:
The image is read correctly from the S3 URL, maintaining backward compatibility.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/mycsv.csv')  # no problem

import dask_image.imread
img = dask_image.imread.imread('s3://mybucket/myimg.png')  # yes problem

Anything else we need to know?:

Environment:

dask 2021.5.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.5.0 pyhd8ed1ab_0 conda-forge
dask-image 0.6.0 pyhd8ed1ab_0 conda-forge
pillow 8.2.0 py39h5fdd921_1 conda-forge
pims 0.5 pyh9f0ad1d_1 conda-forge
s3fs 2021.5.0 pyhd8ed1ab_0 conda-forge

  • Python version: 3.9.4
  • Operating System: macOS Big Sur v11.3
@jakirkham (Member)

Think this is one of the cases that dask.array.image.imread handles well. Would try using that here


kpasko commented May 25, 2021

Same env, but with scikit-image installed:
dask.array.image.imread('s3://mybucket/myimg.png')

returns
No files found under name s3://mybucket/myimg.png

FWIW, this works fine:

import io

import boto3
import skimage.io

boto_session = boto3.Session()
s3 = boto_session.client("s3")
stream = s3.get_object(Bucket="mybucket", Key="myimg.png")["Body"]
data = io.BytesIO(stream.read())
img = skimage.io.imread(data)

@jakirkham (Member)

Hmm...interesting. Seem to recall that working in the past. Maybe it doesn't any longer

In any event, read_csv is doing lots of clever things, though maybe some of them could be repurposed to handle the image-loading case better.

For now a reasonable thing to do would be to just use dask.delayed to roll your own reader; a sketch follows.
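
Something along these lines (a rough, untested sketch: assumes s3fs is installed so fsspec can route the s3:// URLs, and that every image shares a known shape and dtype; the bucket, keys, shape, and dtype here are all made up):

import io

import dask
import dask.array as da
import fsspec
import skimage.io


@dask.delayed
def read_image(url):
    # fsspec dispatches "s3://" URLs to s3fs and yields a file-like object
    with fsspec.open(url, mode="rb") as f:
        return skimage.io.imread(io.BytesIO(f.read()))


# shape/dtype must be declared up front for from_delayed
urls = [f"s3://mybucket/frame_{i:03d}.png" for i in range(10)]
imgs = da.stack([
    da.from_delayed(read_image(u), shape=(512, 512, 3), dtype="uint8")
    for u in urls
])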


kpasko commented May 27, 2021

Looks like the dask.array.image code is using glob. I'd imagine a recent change (or version incompatibility) in s3fs/fsspec/boto3/etc. means s3:// paths are not being mounted locally and so can't be glob'd.
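
A quick illustration of why that fails (glob only searches the local filesystem):

import glob
glob.glob('s3://mybucket/myimg.png')  # -> [] since no local file matches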

My workaround is to use aws-data-wrangler to expand the glob string against S3 (could of course use a paginator or whatnot as well), and then:

import io
import dask.bytes
import skimage.io

def ski_read(fn):
    # read_bytes returns (sample, blocks); take the first file's first block
    output = dask.bytes.read_bytes(fn, include_path=False, sample=False)
    data = output[1][0][0]  # a dask.delayed bytes object
    return skimage.io.imread(io.BytesIO(data.compute()))

Have to imagine there's a cleaner way, but I wanted to avoid injecting boto3 clients or sessions into delayed/distributed calls. Putting the two halves together looks something like the sketch below.
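
A rough, untested sketch of the listing half (the bucket/prefix are made up, and wr.s3.list_objects accepting wildcards is an assumption about the aws-data-wrangler version):

import awswrangler as wr
import numpy as np

# hypothetical prefix; expand the glob against S3 rather than the local disk
fns = wr.s3.list_objects("s3://mybucket/imgs/*.png")

# eager per-file reads; ski_read computes each file's bytes up front
imgs = np.stack([ski_read(fn) for fn in fns])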

@jakirkham (Member)

Glad you figured out something that works :)

Yeah, that's the downside of dask.array.image.imread. I don't think it gets the same amount of attention as imread here, so I'm not too surprised that it has drifted out of sync.

Agree there's probably room to improve; handling cloud-based storage seems desirable. Currently we hand things off to PIMS. There's an open issue about handling URLs (soft-matter/pims#310), but not seeing one related to S3. Maybe worth raising?


kpasko commented May 27, 2021 via email

@jakirkham (Member)

Yep, that's true, though we already knew imread in dask-image didn't work. dask.array's did work previously (so the fact that it no longer does is news).


kpasko commented May 27, 2021 via email
