trouble loading netcdf4 files with xarray on s3 #168
That is quite a traceback! |
Same problem for me, I can't read a netCDF on S3 using
>>> s3 = s3fs.S3FileSystem(key=os.environ['AWS_DS_AGENT_KEY_ID'],
...                        secret=os.environ['AWS_DS_AGENT_ACCESS_KEY'])
>>> fileobj = s3.open(s3_fp)
>>> nc = h5netcdf.File(fileobj, 'r', invalid_netcdf=True)
Traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in readinto(self, b)
1498 data = self.read()
-> 1499 b[:len(data)] = data
1500 return len(data)
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assign_scalar()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView._memoryviewslice.assign_item_from_object()
TypeError: an integer is required
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
SystemError: PyEval_EvalFrameEx returned a result with an error set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
<ipython-input-7-49848b8cfbdc> in <module>
----> 1 nc = h5netcdf.File(fileobj,'r', invalid_netcdf=True)
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, **kwargs)
603 else:
604 self._preexisting_file = mode in {'r', 'r+', 'a'}
--> 605 self._h5file = h5py.File(path, mode, **kwargs)
606 except Exception:
607 self._closed = True
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
392 fid = make_fid(name, mode, userblock_size,
393 fapl, fcpl=make_fcpl(track_order=track_order),
--> 394 swmr=swmr)
395
396 if swmr_support:
~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
168 if swmr and swmr_support:
169 flags |= h5f.ACC_SWMR_READ
--> 170 fid = h5f.open(name, flags, fapl=fapl)
171 elif mode == 'r+':
172 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
h5py/defs.pyx in h5py.defs.H5Fopen()
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()
~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
1234 from start of file, current location or end of file, resp.
1235 """
-> 1236 if not self.readable():
1237 raise ValueError('Seek only available in read mode')
1238 if whence == 0: |
If you can post the file somewhere public, I can try to find out what's going on. |
I am keen to see a way to do this without a fuse mount - here is an open file:
Produces roughly the same stack trace |
I am getting
however, the problem for you seems to be here:
The message implies that the data being inserted is the wrong size; it would be good to debug at that point to see what the buffer b and data contain. |
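For anyone wanting to look at the buffer at that point, here is a rough sketch of the kind of check that could be done. It is not s3fs code: `describe` and `readinto_bytes` are hypothetical helpers, and the `array.array("i", ...)` target only stands in for a buffer whose items are not single bytes. The assumption being illustrated is that viewing the target as unsigned bytes before the slice assignment avoids an item-type mismatch.

```python
import array


def describe(buffer):
    """Report what a debugger would show for the target buffer."""
    view = memoryview(buffer)
    return {"format": view.format, "itemsize": view.itemsize, "nbytes": view.nbytes}


def readinto_bytes(buffer, data):
    """Copy `data` (bytes) into `buffer`, whatever item type the buffer has.

    Hypothetical helper: the target is first viewed as a flat run of
    unsigned bytes, so the slice assignment is always bytes into bytes.
    """
    out = memoryview(buffer).cast("B")  # byte view over the same memory
    n = min(len(data), out.nbytes)      # never write past the end of the buffer
    out[:n] = data[:n]
    return n


if __name__ == "__main__":
    target = array.array("i", [0, 0])  # a buffer whose items are not bytes
    print(describe(target))
    print(readinto_bytes(target, b"\x01\x00\x00\x00"))
```

Printing the format and item size of b next to len(data) should show quickly whether the two sides of the assignment disagree.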
Yes - I think there were some recent updates that changed this behaviour to
allow h5netcdf with file-like objects
I was using these versions from pip to get the previously mentioned error:
h5netcdf-0.7.1
h5py-2.9.0
pytz-2018.9
xarray-0.12.0
…On Wed, Apr 3, 2019 at 9:36 AM Martin Durant ***@***.***> wrote:
I am getting
5
6 fobj = fs.open(s3path)
----> 7 ds = xr.open_dataset(fobj,engine='h5netcdf')
~/anaconda/envs/py36/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs)
345 else:
346 if engine is not None and engine != 'scipy':
--> 347 raise ValueError('can only read file-like objects with '
348 "default engine or engine='scipy'")
349 # assume filename_or_obj is a file-like object
ValueError: can only read file-like objects with default engine or engine='scipy'
however, the problem for you seems to be here:
-> 1499 b[:len(data)] = data
The message implies that the data being inserted is the wrong size;
it would be good to debug at that point to see what the buffer b and data
contain.
|
You are correct - if I check back through the stack trace I am getting the error here:
I will have a go at setting a breakpoint there and taking a look |
It is not entirely clear to me how all these libraries tie together, but it seems that
is calling out to the HDF5 C library function "H5Fopen", which expects a string file path, whereas name at this point is an s3fs.S3File object. Somehow, passing the "name" parameter invokes s3fs:core.py
where b is a However, even if that succeeded, I don't know how this would work anyway, given that the C library expects a string file path rather than a binary memory view.
If I get time I will take a look at why gcsfs is working |
OK, so something "new" :) btw: the difficulties with hdf are the main reason for interest in libraries like zarr (or zarr as a backend for netcdf), which is known to work well with s3fs/gcsfs/etc. It may or may not be a viable alternative for you. |
Thanks Martin - testing out a 'simple' (albeit slow) way of
converting netCDF that is already present in cloud storage into Zarr
format, without formally 'mirroring' the files locally somewhere.
Noting that if performance is crucial and you have the AWS budget, then
setting up AWS FSx is the obvious way to go (i.e.
https://jiaweizhuang.github.io/blog/fsx-experiments/)
…On Wed, Apr 3, 2019 at 10:32 PM Martin Durant ***@***.***> wrote:
OK, so something "new" :)
I would suspect that the memoryview has a complex type other than bytes,
and s3fs is trying to fill the buffer with bytes (although it doesn't
appear to be an exact multiple). readinto is very rarely used anywhere,
surprised to see it, but I suppose the memory must have been allocated in
C-land.
btw: the difficulties with hdf are the main reason for interest in
libraries like zarr (or zarr as a backend for netcdf), which is known to
work well with s3fs/gcsfs/etc. It may or may not be a viable alternative
for you.
|
@martindurant, I'm also hitting this as well. Was hoping to follow up on @rabernat's suggestion to include this in our testing of different options for accessing NetCDF4/HDF5 on s3 (in addition to Zarr and HSDS). I got the same error you did:
when I forgot to install |
@pbranson , did you manage to learn anything about this issue? |
Hm, with everything updated (h5netcdf , h5py, s3fs), the first invocation just worked:
but via xarray it does not. |
@martindurant - how are you invoking xarray? |
|
That's weird, because it does work with gcsfs. What xarray error are you getting? |
|
Could we point this discussion to a public file instead? That would make debugging easier for me. I don't have any credentials to try the file in question. When I try with the ERA5 public data, I can't even open it with h5py:
fs = s3fs.S3FileSystem(anon=True)
s3path = 'era5-pds/2008/01/data/air_temperature_at_2_metres.nc'
file_obj = fs.open(s3path)
h5 = h5py.File(file_obj, 'r')
|
OK, solved it - and it seems this only happens for some specific files! |
Sorry that it took so long for me to dig this out! |
I was excited to try this out, but my simple test below is not working for some reason:
import xarray as xr
import s3fs
import h5netcdf
print(xr.__version__)
print(s3fs.__version__)
print(h5netcdf.__version__)
fs = s3fs.S3FileSystem(anon=True)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
print(fileObj.info())
produces:
but then this causes the kernel to die:
ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})
@martindurant , any ideas? |
@rabernat , you verified this working for some other .nc files, correct?
has not caused an error for me yet, but it seems to be downloading everything and filling up memory (I know the file is extremely big), so it's possible that the metadata is laid out in a particularly unfriendly way. That still doesn't explain your crash. Perhaps would be better with I am looking into implementing #177 across all filesystems in fsspec, which would be just the thing for a case like this. |
@martindurant , yes!
fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})
works within a few seconds! |
Good, but also annoying! Making options that tend to work for most people most of the time is hard... |
(I suppose this is why you want to encode all the options required for smooth working of a particular dataset into a catalog...) |
@martindurant do you think this a
import xarray as xr
import s3fs
fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('nwm-archive/2010/201001110000.CHRTOUT_DOMAIN1.comp')
print(fileObj.size)
ds = xr.open_dataset(fileObj, engine='h5netcdf')
which fails with:
|
None of that traceback appears to be in s3fs - are you sure it loads OK from local? If yes, then finding the problem will be tricky, as apparently, any exception is being hidden. |
I'm curious how these caches behave with dask / distributed. Are the cache contents serialized, or is the cache cleared before pickling the file object? |
The files are not sent around at all. What you actually send is an OpenFile object ( https://github.com/dask/dask/blob/master/dask/bytes/core.py#L143 ), which only creates the S3FileSystem object in a |
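As a rough illustration of sending a lightweight reference around instead of file contents, the sketch below uses a hypothetical `LazyOpenFile` class that works on local paths only. It is not dask's OpenFile implementation; it just shows an object that pickles its recipe (path and mode) and opens the real file only when entered as a context manager.

```python
import pickle
import tempfile


class LazyOpenFile:
    """Hypothetical stand-in for a deferred-open file reference."""

    def __init__(self, path, mode="rb"):
        self.path = path
        self.mode = mode
        self._handle = None  # nothing is opened at construction time

    def __enter__(self):
        # The real file is opened only here, on whichever process uses it.
        self._handle = open(self.path, self.mode)
        return self._handle

    def __exit__(self, *exc):
        self._handle.close()
        self._handle = None

    def __getstate__(self):
        # Only the recipe is pickled, never a live handle or cached bytes.
        return {"path": self.path, "mode": self.mode}

    def __setstate__(self, state):
        self.path = state["path"]
        self.mode = state["mode"]
        self._handle = None


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example bytes")
    shipped = pickle.loads(pickle.dumps(LazyOpenFile(tmp.name)))
    with shipped as f:
        print(f.read())
```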
@rsignell-usgs , how did you comment from the future? :) |
I would exclude xarray here. What happens within h5py when it calls s3 is a bit of a mystery - perhaps more logging in s3fs would help, set logger "s3fs.core" to DEBUG and you'll get some. |
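A minimal sketch of turning that logger on with the standard library; the logger name "s3fs.core" is the one mentioned above, while the function name and the message format are just illustrative choices.

```python
import logging


def enable_s3fs_debug_logging():
    """Send DEBUG records from the "s3fs.core" logger to stderr."""
    # basicConfig attaches a stream handler to the root logger if none exists.
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    logger = logging.getLogger("s3fs.core")
    logger.setLevel(logging.DEBUG)
    return logger


if __name__ == "__main__":
    enable_s3fs_debug_logging()
```

Call this before opening the file, so that whatever s3fs logs during the h5py read calls ends up in the output.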
@martindurant , I can read the file locally with Thanks! |
I'm also really confused about how I managed to post a comment 5 hours from now. 🙄 |
PS: the file-system is serialised in this process, including directory listings. This is good or bad - you avoid potentially slow lookups when opening the file, but the instance is bigger. I notice that gcsfs does not preserve the listings cache. gcsfs came later and is, in some ways, better designed (hence my attempt to consolidate such things into fsspec). |
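To make the trade-off concrete, here is a toy sketch with a hypothetical `ListingFS` class (neither s3fs nor gcsfs code). The `keep_listings` flag is an invented switch that decides whether the directory-listing cache travels with the pickled instance.

```python
import pickle


class ListingFS:
    """Hypothetical filesystem object holding a directory-listing cache."""

    def __init__(self, keep_listings=True):
        self.keep_listings = keep_listings
        self.dircache = {}

    def ls(self, path):
        if path not in self.dircache:
            # Stand-in for a slow remote listing call.
            self.dircache[path] = [f"{path}/file{i}.nc" for i in range(3)]
        return self.dircache[path]

    def __getstate__(self):
        state = self.__dict__.copy()
        if not self.keep_listings:
            state["dircache"] = {}  # smaller pickle, but listings are redone later
        return state


if __name__ == "__main__":
    for keep in (True, False):
        fs = ListingFS(keep_listings=keep)
        fs.ls("bucket/data")
        payload = pickle.dumps(fs)
        print(keep, len(payload), pickle.loads(payload).dircache)
```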
If I take a slice from a netcdf opened with s3fs+h5netcdf is it doing some
form of byte range request or essentially downloading the entire file into
a memory cache and then slicing?
In which case we should always chunk on a file basis when using this method?
|
I don't know the internals of h5netcdf, but I would hope it's a range request. You could time reading a whole array versus reading a single value, but the result will not be linear, due to the fixed costs of each connection and metadata lookups. For a slice, it would depend on the exact layout and chunking. You may want to turn on s3fs debug logging.
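Besides timing, another way to see whether a reader asks for ranges or for everything is to count the bytes requested through the file-like object. The sketch below does this with a hypothetical `CountingReader` over in-memory data; it is only an illustration, and the same bookkeeping would have to be wrapped around the object returned by fs.open to say anything about a real file.

```python
import io


class CountingReader(io.BytesIO):
    """In-memory file that records the (offset, length) of every read."""

    def __init__(self, data):
        super().__init__(data)
        self.calls = []

    def read(self, size=-1):
        start = self.tell()
        chunk = super().read(size)
        self.calls.append((start, len(chunk)))
        return chunk

    def readinto(self, b):
        # Some readers fill a preallocated buffer instead of calling read().
        start = self.tell()
        n = super().readinto(b)
        self.calls.append((start, n))
        return n

    @property
    def bytes_read(self):
        return sum(length for _, length in self.calls)


if __name__ == "__main__":
    f = CountingReader(bytes(10_000))
    f.seek(1000)
    f.read(16)  # a small "slice" of the file
    print(f.calls, f.bytes_read, "of", len(f.getvalue()))
```

If the recorded total is close to the file size after reading a small slice, the whole file is effectively being pulled in.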
|
I'm working on allowing direct access to netcdf4/hdf5 file-like objects (pydata/xarray#2782). This seems to be working fine with gcsfs, but not s3fs (versions 0.2 from conda-forge). Here is a gist with the relevant code and error traceback:
https://gist.github.com/scottyhq/304a3c4b4e198776b8d82fb3a9f300e3
and an abbreviated traceback here:
any guidance as to what might be going on here would be appreciated!