
trouble loading netcdf4 files with xarray on s3 #168

Closed
scottyhq opened this issue Feb 27, 2019 · 36 comments · Fixed by #178

Comments

@scottyhq

I'm working on allowing direct access to netcdf4/hdf5 file-like objects (pydata/xarray#2782). This seems to be working fine with gcsfs, but not s3fs (version 0.2 from conda-forge). Here is a gist with the relevant code and error traceback:

https://gist.github.com/scottyhq/304a3c4b4e198776b8d82fb3a9f300e3

and an abbreviated traceback here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Documents/GitHub/xarray/xarray/backends/file_manager.py in acquire(self, needs_lock)
    166             try:
--> 167                 file = self._cache[self._key]
    168             except KeyError:

~/Documents/GitHub/xarray/xarray/backends/lru_cache.py in __getitem__(self, key)
     40         with self._lock:
---> 41             value = self._cache[key]
     42             self._cache.move_to_end(key)

KeyError: [<function _open_h5netcdf_group at 0x11d8b0ae8>, (<S3File grfn-content-prod/S1-GUNW-A-R-137-tops-20181129_20181123-020010-43220N_41518N-PP-e2c7-v2_0_0.nc>,), 'r', (('group', '/science/grids/data'),)]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.setitem_slice_assignment()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview_copy_contents()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 59941567)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

Any guidance as to what might be going on here would be appreciated!

@martindurant
Member

That is quite a traceback!
I am surprised that gcsfs worked, rather than that s3fs did not - hdf5 is a C-level reader that likes to have a real local file to read from. I have heard that they have tried to allow for Python file-like objects, but I don't know how that's implemented - apparently something is subtly different between the two file implementation classes.
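For context, h5py's Python file-like driver duck-types the object it is given: it only requires read/seek/tell (and will use readinto when present). Below is a minimal, illustrative sketch of that interface over an in-memory buffer; S3File and GCSFile expose the same methods, just backed by HTTP range requests instead of local bytes.

```python
import io

class MinimalFileLike:
    """Bare-bones file-like object with the methods h5py's
    Python file driver relies on: read, seek, tell."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def seek(self, loc, whence=0):
        return self._buf.seek(loc, whence)

    def tell(self):
        return self._buf.tell()

# Every valid HDF5 file starts with this 8-byte signature
f = MinimalFileLike(b"\x89HDF\r\n\x1a\n" + b"\x00" * 64)
print(f.read(8))  # b'\x89HDF\r\n\x1a\n'
```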

@leroygr

leroygr commented Mar 28, 2019

Same problem for me, I can't read a netCDF on S3 using s3fs with h5netcdf:

>>> s3 = s3fs.S3FileSystem(key=os.environ['AWS_DS_AGENT_KEY_ID'],
...                        secret=os.environ['AWS_DS_AGENT_ACCESS_KEY'])
>>> fileobj = s3.open(s3_fp)
>>> nc = h5netcdf.File(fileobj, 'r', invalid_netcdf=True)
Traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assign_scalar()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView._memoryviewslice.assign_item_from_object()

TypeError: an integer is required

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

[... the above "SystemError: PyEval_EvalFrameEx returned a result with an error set" block repeats many more times, identically ...]

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-7-49848b8cfbdc> in <module>
----> 1 nc = h5netcdf.File(fileobj,'r', invalid_netcdf=True)

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, **kwargs)
    603                 else:
    604                     self._preexisting_file = mode in {'r', 'r+', 'a'}
--> 605                     self._h5file = h5py.File(path, mode, **kwargs)
    606         except Exception:
    607             self._closed = True

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
    392                 fid = make_fid(name, mode, userblock_size,
    393                                fapl, fcpl=make_fcpl(track_order=track_order),
--> 394                                swmr=swmr)
    395 
    396             if swmr_support:

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    168         if swmr and swmr_support:
    169             flags |= h5f.ACC_SWMR_READ
--> 170         fid = h5f.open(name, flags, fapl=fapl)
    171     elif mode == 'r+':
    172         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

h5py/defs.pyx in h5py.defs.H5Fopen()

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
   1234             from start of file, current location or end of file, resp.
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')
   1238         if whence == 0:

@martindurant
Member

If you can post the file somewhere public, I can try to find out what's going on.

@pbranson

pbranson commented Apr 3, 2019

I am keen to see a way to do this without a FUSE mount - here is an openly accessible file:

import xarray as xr
import s3fs
fs = s3fs.S3FileSystem(anon=True)
s3path = 'imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2010/05/A.P1D.20100507T000000Z.aust.ipar.nc'

fobj = fs.open(s3path)
ds = xr.open_dataset(fobj, engine='h5netcdf')

Produces roughly the same stack trace

@martindurant
Member

I am getting

      5
      6 fobj = fs.open(s3path)
----> 7 ds = xr.open_dataset(fobj,engine='h5netcdf')

~/anaconda/envs/py36/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs)
    345     else:
    346         if engine is not None and engine != 'scipy':
--> 347             raise ValueError('can only read file-like objects with '
    348                              "default engine or engine='scipy'")
    349         # assume filename_or_obj is a file-like object

ValueError: can only read file-like objects with default engine or engine='scipy'

however, the problem for you seems to be here:

-> 1499         b[:len(data)] = data

The message implies that the data being inserted is the wrong size; it would be good to debug at that point to see what the buffer b and data contain.
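The mismatch can be reproduced without S3 at all: h5py hands readinto a tiny memoryview (8 bytes here, presumably a signature/superblock probe), while s3fs's self.read() was called with no size argument and so returned the whole remainder of the file. A pure-Python sketch of the failing assignment (the byte count is taken from the traceback above):

```python
# h5py allocates a small destination buffer, e.g. 8 bytes
buf = bytearray(8)
b = memoryview(buf)

# readinto() called self.read() with no size limit, so `data`
# holds the rest of the file rather than at most len(b) bytes
data = b"\x00" * 59941567

try:
    b[:len(data)] = data  # the failing line in s3fs/core.py readinto()
except ValueError as err:
    # CPython reports a length mismatch; Cython memoryviews (used by h5py)
    # phrase the same failure as "got differing extents in dimension 0"
    print(type(err).__name__)  # ValueError
```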

@pbranson

pbranson commented Apr 3, 2019 via email

@pbranson

pbranson commented Apr 3, 2019

You are correct - if I check back through the stack trace I am getting the error here:

/opt/conda/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assignment()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview_copy_contents()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 24247921)

I will have a go at setting a breakpoint there and taking a look

@pbranson

pbranson commented Apr 3, 2019

Not entirely clear to me how all these libraries tie together, but it seems that

h5py files.py:170: fid = h5f.open(name, flags, fapl=fapl)

is calling out to the HDF5 C-library function "H5Fopen", which expects a string filepath, whereas name at this point is an s3fs.S3File object. Somehow passing the "name" parameter is invoking

s3fs:core.py

def readinto(self, b):
    data = self.read()
    b[:len(data)] = data
    return len(data)

where b is a <MemoryView of 'array' at 0x1b458a73be0>, but I can't work out where b is instantiated - it only has a length of 8, whereas the binary data read from S3 is much larger - hence the exception.

However, even if that succeeded, I don't know how this would work anyway, given that the C library expects a string file path rather than a binary memory view?

I am surprised that gcsfs worked, rather than that s3fs did not - hdf5 is a C-level reader that likes to have a local real file to read from.

If I get time I will take a look at why gcsfs is working.

@martindurant
Member

OK, so something "new" :)
I would suspect that the memoryview has a complex type other than bytes, and s3fs is trying to fill the buffer with bytes (although it doesn't appear to be an exact multiple). readinto is very rarely used anywhere; I'm surprised to see it, but I suppose the memory must have been allocated in C-land.

btw: the difficulties with hdf are the main reason for interest in libraries like zarr (or zarr as a backend for netcdf), which is known to work well with s3fs/gcsfs/etc. It may or may not be a viable alternative for you.

@pbranson

pbranson commented Apr 4, 2019 via email

@rsignell-usgs

@martindurant, I'm also hitting this. I was hoping to follow up on @rabernat's suggestion to include this in our testing of different options for accessing NetCDF4/HDF5 on S3 (in addition to Zarr and HSDS).

I got the same error you did:

ValueError: can only read file-like objects with default engine or engine='scipy'

when I forgot to install h5netcdf and h5py into my environment.

@martindurant
Member

@pbranson , did you manage to learn anything about this issue?

@martindurant
Member

Hm, with everything updated (h5netcdf, h5py, s3fs), the first invocation just worked:

In [5]: fs = s3fs.S3FileSystem(anon=True)
   ...: s3path = 'imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2010/05/A.P1D.20100507T000000Z.aust.ipar.nc'
   ...:
   ...: fobj = fs.open(s3path)
In [6]: nc = h5netcdf.File(fobj, 'r', invalid_netcdf=True)
In [7]: nc
Out[7]:
<h5netcdf.File 'A.P1D.20100507T000000Z.aust.ipar.nc' (mode r)>
Dimensions:
    latitude: 7001
    longitude: 10001
    time: 1
Groups:
Variables:
    time: ('time',) float64
    latitude: ('latitude',) float64
    longitude: ('longitude',) float64
    ipar: ('time', 'latitude', 'longitude') float32
Attributes:
    history: b'File initialised at 2015-12-17T19:03:50.793738\nInitialised var ipar at 2015-12-17T19:04:36.563452\nAdd Granule A20100507_0230.20150923161152.L2OC_BASE.ipar.nc at 2015-12-17T19:04:38.498914\nAdd Granule A20100507_0235.20150923151200.L2OC_BASE.ipar.nc at 2015-12-17T19:04:38.975299\nAdd Granule A20100507_0240.20150923134822.L2OC_BASE.ipar.nc at 2015-12-17T19:04:39.483551\nAdd Granule A20100507_0245.20150923143121.L2OC_BASE.ipar.nc at 2015-12-17T19:04:39.793043\nAdd Granule A20100507_0405.20150923141146.L2OC_BASE.ipar.nc at 2015-12-17T19:04:40.401902\nAdd Granule A20100507_0410.20150923162326.L2OC_BASE.ipar.nc at 2015-12-17T19:04:40.977119\nAdd Granule A20100507_0415.20150923133857.L2OC_BASE.ipar.nc at 2015-12-17T19:04:41.430398\nAdd Granule A20100507_0420.20150923150036.L2OC_BASE.ipar.nc at 2015-12-17T19:04:41.923474\nAdd Granule A20100507_0540.20150923152402.L2OC_BASE.ipar.nc at 2015-12-17T19:04:42.336277\nAdd Granule A20100507_0545.20150923154421.L2OC_BASE.ipar.nc at 2015-12-17T19:04:43.116328\nAdd Granule A20100507_0550.20150923140042.L2OC_BASE.ipar.nc at 2015-12-17T19:04:43.709527\nAdd Granule A20100507_0555.20150923155628.L2OC_BASE.ipar.nc at 2015-12-17T19:04:44.321537\nAdd Granule A20100507_0600.20150923165701.L2OC_BASE.ipar.nc at 2015-12-17T19:04:44.871419\nAdd Granule A20100507_0720.20150923142308.L2OC_BASE.ipar.nc at 2015-12-17T19:04:45.394833\nAdd Granule A20100507_0725.20150923132636.L2OC_BASE.ipar.nc at 2015-12-17T19:04:46.131246\nAdd Granule A20100507_0730.20150923163350.L2OC_BASE.ipar.nc at 2015-12-17T19:04:46.614609\nAdd Granule A20100507_0735.20150923153102.L2OC_BASE.ipar.nc at 2015-12-17T19:04:47.083167\nAdd Granule A20100507_0740.20150923144622.L2OC_BASE.ipar.nc at 2015-12-17T19:04:47.608014'
    Conventions: b'CF-1.6'
    source_path: b'imos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.ncimos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.ncimos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.nc'

but via xarray it does not.

@rabernat

@martindurant - how are you invoking xarray?

@martindurant
Member

fobj = fs.open(s3path)
ds = xr.open_dataset(fobj, engine='h5netcdf')

@rabernat

That's weird, because it does work with gcsfs.

What xarray error are you getting?

@martindurant
Member

ValueError: can only read file-like objects with default engine or engine='scipy'
(also a few comments higher up the thread)

@rabernat

Could we point this discussion to a public file instead? That would make debugging easier for me. I don't have any credentials to try the file in question.

When I try with the ERA5 public data, I can't even open it with h5py:

fs = s3fs.S3FileSystem(anon=True)
s3path = 'era5-pds/2008/01/data/air_temperature_at_2_metres.nc'
file_obj = fs.open(s3path)
h5 = h5py.File(file_obj, 'r')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

/srv/conda/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assignment()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview_copy_contents()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 1157316538)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

/srv/conda/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

@martindurant
Member

OK, solved it - and it seems this only happens for some specific files!
The reason it works with gcsfs is that it simply doesn't have a readinto method (but it should!), so it seems h5py falls back to read.
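For reference, the io protocol expects readinto(b) to fill at most len(b) bytes; reading the entire remainder of the file is what overflowed the 8-byte probe buffer. Here is an illustrative sketch of a size-respecting readinto built on read(), using a toy DemoFile class (not necessarily the exact change made in #178):

```python
import io

class DemoFile:
    """Toy file object; read() stands in for a remote fetch."""

    def __init__(self, payload: bytes):
        self._buf = io.BytesIO(payload)

    def read(self, size=-1):
        return self._buf.read(size)

    def readinto(self, b):
        # Request no more than the destination buffer can hold
        data = self.read(len(b))
        b[:len(data)] = data
        return len(data)

f = DemoFile(b"\x00" * 59941567)  # a "large" file
buf = bytearray(8)
n = f.readinto(memoryview(buf))
print(n)  # 8, no matter how big the file is
```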

@martindurant
Member

Sorry that it took so long for me to dig this out!

@rabernat mentioned this issue Apr 18, 2019
@rsignell-usgs

rsignell-usgs commented Apr 23, 2019

I was excited to try this out, but my simple test below is not working for some reason:

import xarray as xr
import s3fs
import h5netcdf

print(xr.__version__)
print(s3fs.__version__)
print(h5netcdf.__version__)

fs = s3fs.S3FileSystem(anon=True)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
print(fileObj.info())

produces:

0.12.1
0.2.1
0.7.1
{'ETag': '"79ca97f44f5fed750f6dea35a16f6ac9-4986"', 'Key': 'esip-pangeo/pangeo/adcirc/adcirc_01.nc', 'LastModified': datetime.datetime(2019, 4, 12, 17, 46, 44, tzinfo=tzutc()), 'Size': 26140007264, 'StorageClass': 'STANDARD', 'VersionId': None}

but then this causes the kernel to die:

ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})

@martindurant , any ideas?

@martindurant
Member

@rabernat, you verified this works for some other .nc files, correct?
A dead kernel suggests an exception in the C library, which would be very hard to diagnose.
Running

h5py.File(fileObj, 'r')

has not caused an error for me yet, but it seems to be downloading everything and filling up memory (I know the file is extremely big), so it's possible that the metadata is laid out in a particularly unfriendly way. That still doesn't explain your crash. Perhaps it would be better with default_fill_cache=False for the fs.

I am looking into implementing #177 across all filesystems in fsspec, which would be just the thing for a case like this.

@rsignell-usgs

rsignell-usgs commented Apr 23, 2019

@martindurant, yes!

fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})

works within a few seconds!

@martindurant
Member

Good, but also annoying! Making options that tend to work for most people most of the time is hard...

@martindurant
Member

(I suppose this is why you want to encode all the options required for smooth working of a particular dataset into a catalog...)

@rsignell-usgs

@martindurant, do you think this is an s3fs, xarray, h5netcdf, or h5py issue? 😕

import xarray as xr
import s3fs

fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('nwm-archive/2010/201001110000.CHRTOUT_DOMAIN1.comp')
print(fileObj.size)
ds = xr.open_dataset(fileObj, engine='h5netcdf')

which fails with:

18815129
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-99def1d8f6d1> in <module>()
      5 fileObj = fs.open('nwm-archive/2010/201001110000.CHRTOUT_DOMAIN1.comp')
      6 print(fileObj.size)
----> 7 ds = xr.open_dataset(fileObj, engine='h5netcdf')

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime)
    392 
    393     with close_on_error(store):
--> 394         ds = maybe_decode_store(store)
    395 
    396     # Ensure source filename always stored in dataset object (GH issue #2550)

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in maybe_decode_store(store, lock)
    322             store, mask_and_scale=mask_and_scale, decode_times=decode_times,
    323             concat_characters=concat_characters, decode_coords=decode_coords,
--> 324             drop_variables=drop_variables, use_cftime=use_cftime)
    325 
    326         _protect_dataset_variables_inplace(ds, cache)

/opt/conda/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime)
    468         encoding = obj.encoding
    469     elif isinstance(obj, AbstractDataStore):
--> 470         vars, attrs = obj.load()
    471         extra_coords = set()
    472         file_obj = obj

/opt/conda/lib/python3.6/site-packages/xarray/backends/common.py in load(self)
    118         """
    119         variables = FrozenOrderedDict((_decode_variable_name(k), v)
--> 120                                       for k, v in self.get_variables().items())
    121         attributes = FrozenOrderedDict(self.get_attrs())
    122         return variables, attributes

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in get_variables(self)
    135     def get_variables(self):
    136         return FrozenOrderedDict((k, self.open_store_variable(k, v))
--> 137                                  for k, v in self.ds.variables.items())
    138 
    139     def get_attrs(self):

/opt/conda/lib/python3.6/site-packages/xarray/core/utils.py in FrozenOrderedDict(*args, **kwargs)
    330 
    331 def FrozenOrderedDict(*args, **kwargs):
--> 332     return Frozen(OrderedDict(*args, **kwargs))
    333 
    334 

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in <genexpr>(.0)
    135     def get_variables(self):
    136         return FrozenOrderedDict((k, self.open_store_variable(k, v))
--> 137                                  for k, v in self.ds.variables.items())
    138 
    139     def get_attrs(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in open_store_variable(self, name, var)
    101         data = indexing.LazilyOuterIndexedArray(
    102             H5NetCDFArrayWrapper(name, self))
--> 103         attrs = _read_attributes(var)
    104 
    105         # netCDF4 specific encoding

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in _read_attributes(h5netcdf_var)
     42     # bytes attributes to strings
     43     attrs = OrderedDict()
---> 44     for k, v in h5netcdf_var.attrs.items():
     45         if k not in ['_FillValue', 'missing_value']:
     46             v = maybe_decode_bytes(v)

/opt/conda/lib/python3.6/_collections_abc.py in __iter__(self)
    742     def __iter__(self):
    743         for key in self._mapping:
--> 744             yield (key, self._mapping[key])
    745 
    746 ItemsView.register(dict_items)

/opt/conda/lib/python3.6/site-packages/h5netcdf/attrs.py in __getitem__(self, key)
     17         if key in _HIDDEN_ATTRS:
     18             raise KeyError(key)
---> 19         return self._h5attrs[key]
     20 
     21     def __setitem__(self, key, value):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/opt/conda/lib/python3.6/site-packages/h5py/_hl/attrs.py in __getitem__(self, name)
     79 
     80         arr = numpy.ndarray(shape, dtype=dtype, order='C')
---> 81         attr.read(arr, mtype=htype)
     82 
     83         if len(arr.shape) == 0:

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5a.pyx in h5py.h5a.AttrID.read()

h5py/_proxy.pyx in h5py._proxy.attr_rw()

OSError: Unable to read attribute (no appropriate function for conversion path)

@martindurant
Member

None of that traceback appears to be in s3fs - are you sure the file loads OK locally? If yes, then finding the problem will be tricky, since any underlying exception is apparently being hidden.

@rabernat

rabernat commented May 8, 2019

I'm curious how these caches behave with dask / distributed. Are the cache contents serialized, or is the cache cleared before pickling the file object?

@martindurant
Member

Are the cache contents serialized, or is the cache cleared before pickling the file object?

The files are not sent around at all. What you actually send is an OpenFile object ( https://github.com/dask/dask/blob/master/dask/bytes/core.py#L143 ), which only creates the S3FileSystem object in a with block - so caches do not survive tasks.

@martindurant
Member

@rsignell-usgs , how did you comment from the future? :)

@martindurant
Member

do you think this is an s3fs, xarray, h5netcdf or h5py issue

I would exclude xarray here. What happens within h5py when it calls S3 is a bit of a mystery - perhaps more logging in s3fs would help: set the logger "s3fs.core" to DEBUG and you'll get some.
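For reference, turning on that logging looks something like this (logger name as given above; the handler and format details are just one way to surface the messages on stderr):

```python
import logging

# Assuming s3fs emits debug records from the "s3fs.core" logger, as
# suggested above, this makes them visible so you can see what requests
# h5py is triggering under the hood.
logger = logging.getLogger("s3fs.core")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)
```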

@rsignell-usgs

@martindurant , I can read the file locally with xarray using the netcdf4 engine, but not with the h5netcdf engine. I can also read the file locally using h5py, so I guess that makes it an h5netcdf issue.

Thanks!

@rsignell-usgs

rsignell-usgs commented May 8, 2019

I'm also really confused about how I managed to post a comment 5 hours from now. 🙄

@martindurant
Member

Are the cache contents serialized, or is the cache cleared before pickling the file object?

PS: the file-system is serialised in this process, including directory listings. This can be good or bad: you avoid potentially slow lookups when opening the file, but the pickled instance is bigger. I notice that gcsfs does not preserve the listings cache. gcsfs came later and is, in some ways, better designed (hence my attempt to consolidate such things into fsspec).
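The two behaviours can be illustrated with toy classes (purely illustrative, not the real s3fs/gcsfs code): keeping the listings cache in the pickled state makes the instance bigger but saves a lookup on the other side; dropping it in `__getstate__` does the opposite.

```python
import pickle

# Illustrative only: two pickling strategies for a filesystem's
# directory-listing cache.
class KeepsCacheFS:                  # s3fs-style: listings travel with the pickle
    def __init__(self):
        self.dircache = {}

    def ls(self, path):
        # pretend this is a slow S3 LIST call on a cache miss
        if path not in self.dircache:
            self.dircache[path] = ["%s/key1" % path, "%s/key2" % path]
        return self.dircache[path]

class DropsCacheFS(KeepsCacheFS):    # gcsfs-style: listings re-fetched lazily
    def __getstate__(self):
        state = self.__dict__.copy()
        state["dircache"] = {}       # smaller pickle, lookups redone later
        return state
```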

@pbranson

pbranson commented May 13, 2019 via email

@martindurant
Member

martindurant commented May 13, 2019 via email
