differences between gcsfs and fsspec behavior when opening gs:// url paths #476
If I use your recommended syntax, it doesn't work with distributed because the file is closed:

```python
from dask.distributed import Client

client = Client()
dsgcs.surface.mean().compute()
```
In that case you take responsibility for the file.
I don't understand what you mean. Surely you understand what I want to do here: open the file and compute on it with a distributed cluster. If I am doing it wrong, please tell me the recommended way to do this.
I would also still be interested in getting an answer to this question.
I just came across this while trying various opening options via intake-xarray (intake/intake-xarray#88).

@martindurant, I'm struggling with how best to do this for the intake-xarray case. If I understand correctly, the issue is needing to use a context manager to have access to OpenFile methods, right? But how do we do that in the case of intake-xarray, given the current syntax? This issue is not unique to GCSFS; it also applies to S3 and HTTP. The following snippet leads to the same traceback as in the first comment:

```python
import intake

uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
ds = intake.open_netcdf(uri,
                        xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
                        storage_options=dict(anon=True)
                        ).to_dask()
print(ds.h_li.mean())
```

```
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/api.py in _get_engine_from_magic_number(filename_or_obj)
    113         magic_number = filename_or_obj[:8]
    114     else:
--> 115         if filename_or_obj.tell() != 0:
    116             raise ValueError(
    117                 "file-like object read/write pointer not at zero "

AttributeError: 'OpenFile' object has no attribute 'tell'
```
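For what it's worth, the distinction behind that `AttributeError` can be shown without any cloud credentials, using fsspec's in-memory filesystem as a stand-in for s3:// or gs:// (file name here is illustrative): `fsspec.open` returns an `OpenFile`, a serializable description of how to open the file, not the file itself.

```python
import fsspec

# Create a small file on fsspec's in-memory filesystem.
with fsspec.open("memory://demo.h5", "wb") as f:
    f.write(b"not really hdf5")

# fsspec.open() gives back an OpenFile: it records filesystem + path + mode,
# but is not itself file-like -- hence no .tell() for xarray to call.
of = fsspec.open("memory://demo.h5", "rb")
print(type(of).__name__, hasattr(of, "tell"))   # OpenFile False

# Entering the OpenFile (the context-manager usage) yields the real file.
with of as f:
    pos, data = f.tell(), f.read()
print(pos, data)
```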
A partial workaround is using:

```python
import intake

uri = 'simplecache::s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
ds = intake.open_netcdf(uri,
                        chunks=dict(delta_time=20000),
                        xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
                        storage_options=dict(s3={'anon': True},
                                             # default_cache_type='all',
                                             # simplecache=dict(cache_storage="/tmp/atl06", same_names=True),
                                             )
                        ).to_dask()

# GatewayCluster
with Client(cluster) as client:
    result = ds['h_li'].mean().compute()
```

```
/srv/conda/envs/notebook/lib/python3.8/site-packages/h5py/_hl/files.py in make_fid()
    171     if swmr and swmr_support:
    172         flags |= h5f.ACC_SWMR_READ
--> 173     fid = h5f.open(name, flags, fapl=fapl)
    174 elif mode == 'r+':
    175     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (unable to open file: name = '/tmp/tmphulzch2c/96968a7bb03c66de5724914b4116a866819162a33560f08134284e08f670ad38', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
```

One possibility is using a globally available s3:// scratch location for the cache.
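As a sketch of that idea: simplecache already accepts a `cache_storage` option, and pointing it at a location every worker can see (a shared filesystem, or in principle an s3:// scratch bucket) is the relevant knob. A local-only illustration, with a `file://` source standing in for S3 (all paths here are hypothetical):

```python
import os
import tempfile

import fsspec

# A local file standing in for the remote object.
src_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "ATL06_demo.h5")
with open(src, "wb") as f:
    f.write(b"fake-hdf5-bytes")

# cache_storage must be visible to whichever process opens the cached copy.
# On a Dask cluster the default per-process /tmp directory is not, which is
# exactly the "No such file or directory" failure in the traceback above.
cache_dir = tempfile.mkdtemp()
with fsspec.open(
    f"simplecache::file://{src}",
    mode="rb",
    simplecache={"cache_storage": cache_dir, "same_names": True},
) as f:
    data = f.read()

print(data, os.listdir(cache_dir))
```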
Was this situation working before? I expect that xarray's open with fsspec paths or file objects must have been working for some time, so I'm not sure why this is different. We should figure out why intake-xarray is apparently getting this wrong.

I wouldn't mind making this happen - most of the code would stay the same, but we would need to be careful to read/write cache metadata often.
I'm not sure. I don't think remote HDF or NetCDF files without OpenDAP were really explored or tested before. I opened a PR in intake-xarray to explore possible fixes: intake/intake-xarray#93
Seems like this could be particularly powerful, especially for rechunker or pangeo-forge conversion workflows for datasets on legacy servers. To avoid getting too side-tracked with intake, here is the issue distilled to xarray plus s3fs/fsspec:

```python
import xarray as xr
import fsspec
import s3fs

url = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'

# Works
s3 = s3fs.S3FileSystem(anon=True)
openfile = s3.open(url)
ds = xr.open_dataset(openfile, group='gt1l/land_ice_segments', engine='h5netcdf', chunks={})

# AttributeError: 'OpenFile' object has no attribute 'tell'
openfile = fsspec.open(url)
ds = xr.open_dataset(openfile, group='gt1l/land_ice_segments', engine='h5netcdf', chunks={})
```
The file-like objects are also context managers: they are auto-closed when leaving the context.
I'm facing what I think is the same issue.
I would like to open a gcs path using fsspec's url resolver and then read it with xarray:
This raises the following error:

However, if I do the same thing with gcsfs, it works.
This feels like a bug. And it breaks my mental model of how fsspec works. I thought that fsspec was just dispatching to gcsfs based on url matching. Help me understand why that is not the case.
xref pydata/xarray#4591, which helped me discover this (but is about something different)
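For what it's worth, fsspec does dispatch to the protocol's implementation (gcsfs's filesystem class for gs://); the mismatch is only in what the two entry points hand back. A sketch with the memory protocol standing in for gs://: the filesystem's own `.open()` returns a real file-like object, while the top-level `fsspec.open()` returns an `OpenFile` wrapper.

```python
import fsspec

with fsspec.open("memory://f.nc", "wb") as f:
    f.write(b"x")

# fsspec.filesystem("gs") would return gcsfs's filesystem class here.
fs = fsspec.filesystem("memory")
direct = fs.open("f.nc", "rb")          # file-like: has .tell(), xarray is happy
wrapped = fsspec.open("memory://f.nc")  # OpenFile: no .tell(), xarray raises

print(hasattr(direct, "tell"), type(wrapped).__name__)
direct.close()
```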