
Feature/hdf4 subdatasets #410

Merged: 3 commits merged into develop from feature/hdf4_subdatasets, Aug 21, 2020

Conversation

mpu-creare (Contributor):

Support reading subdatasets in the rasterio Node.
This is motivated by reading MODIS data downloaded from the source.

@mpu-creare changed the base branch from master to develop, July 3, 2020 02:04
@coveralls commented Jul 3, 2020

Coverage Status

Coverage decreased (-1.5%) to 89.668% when pulling 30e283f on feature/hdf4_subdatasets into cdedd85 on develop.

@mpu-creare requested a review from @jmilloy, July 9, 2020 15:28

@property
def subdatasets(self):
return self.dataset.subdatasets
mpu-creare (Contributor, Author):

We could potentially expose the subdatasets as different outputs and then open multiple datasets to composite the results. That would take another branch in the get_data function, but would result in a much nicer user experience.

jmilloy (Collaborator):

Yeah. A dataset can already have multiple outputs, at least if you open it with the xarray Dataset or H5PY nodes. Could you end up with a situation with several subdatasets that each have multiple outputs, but no way to have "nested" outputs?

mpu-creare (Contributor, Author):

That makes me want to scream -- I don't know. I hope the answer is NO, but I suspect it's a resounding Yes.

jmilloy (Collaborator):

Well, the simple answer then is not to make the subdatasets different outputs -- basically, require one node per subdataset. Hopefully the file can be opened concurrently with multiple read-only file pointers.

mpu-creare (Contributor, Author):

Based on the research I've done, there should not be any multi-band subdatasets.

I think I'll end up making the subdatasets multiple outputs -- make it really easy for users. I've now heard from two people that they gave up on PODPAC because they couldn't read the HDF files and they didn't know why. So, it should just work without them having to know about the intricacies of subdatasets.

Also, the S3 case is not a real use case, since reading an HDF file from S3 seems to involve reading the whole file anyway. In other words, it's very slow.

A related question: right now we read band 1 by default. Should we just go ahead and read all bands by default? What do you think?

jmilloy (Collaborator):

Okay. Yeah, I guess we should read all of the bands by default. The FileKeysMixin that is used by CSV, Zarr, Dataset, and H5PY loads all available data_keys by default as separate outputs, unless there is only one, in which case outputs is None and the node is a standard single-output node.

    @tl.default("data_key")
    def _default_data_key(self):
        if len(self.available_data_keys) == 1:
            return self.available_data_keys[0]
        else:
            return self.available_data_keys

    @tl.default("outputs")
    def _default_outputs(self):
        if not isinstance(self.data_key, list):
            return None
        else:
            return self.data_key

Rasterio should be basically the same, something like allowing band to be either a list or a single value and using this:

    @tl.default("band")
    def _default_band(self):
        if len(self.band_keys) == 1:
            return self.band_keys[0]
        else:
            return self.band_keys

    @tl.default("outputs")
    def _default_outputs(self):
        if not isinstance(self.band, list):
            return None
        else:
            return self.band
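For illustration, the default-selection behavior both snippets describe could be sketched in plain Python, without traitlets (the function and argument names here are hypothetical, not part of PODPAC):

```python
def resolve_defaults(keys, selection=None):
    """Mirror the default band/data_key logic described above:
    a single available key makes a standard single-output node
    (outputs is None); multiple keys become multiple outputs.

    keys      -- the available band_keys or data_keys
    selection -- an explicit user choice, or None for the default
    """
    if selection is None:
        # default: single key -> scalar selection, multiple keys -> list
        selection = keys[0] if len(keys) == 1 else list(keys)
    # outputs mirrors the selection only when it is a list
    outputs = selection if isinstance(selection, list) else None
    return selection, outputs
```

So `resolve_defaults([1])` gives `(1, None)` (a standard single-output node), while `resolve_defaults([1, 2, 3])` gives `([1, 2, 3], [1, 2, 3])`.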

jmilloy (Collaborator):

Do you want to make an issue for this, or fix it now before merging this PR? Do you have time, or shall I?

mpu-creare (Contributor, Author):

I think fixing this inside this PR is fine. And don't worry about timing -- this is complex enough that I think we should do it for the next release and not rush something.

mpu-creare (Contributor, Author):

I ended up creating #420 for this instead. This PR was getting pretty old and it should be in develop to support our team right now.

@jmilloy (Collaborator) commented Jul 13, 2020

I don't really know what to do here.

When I try to open the subdataset using rasterio (no podpac), I get "No such file or directory". Am I doing it correctly?

>>> rasterio.open('HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI"')
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 67, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI": No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/env.py", line 433, in wrapper
    return f(*args, **kwds)
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/__init__.py", line 218, in open
    s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 218, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: HDF4_EOS:EOS_GRID:"MOD13Q1.A2013033.h08v05.006.2015256072248.hdf":MODIS_Grid_16DAY_250m_500m_VI:"250m 16 days NDVI": No such file or directory

When I try to just use rasterio to open the hdf file directly, it doesn't work, either. Should it?

>>> rasterio.open('MOD13Q1.A2013033.h08v05.006.2015256072248.hdf')
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 67, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: 'MOD13Q1.A2013033.h08v05.006.2015256072248.hdf' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/env.py", line 433, in wrapper
    return f(*args, **kwds)
  File "/home/jmilloy/Creare/Pipeline/_podpac38_/lib/python3.8/site-packages/rasterio/__init__.py", line 218, in open
    s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 218, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: 'MOD13Q1.A2013033.h08v05.006.2015256072248.hdf' not recognized as a supported file format.

@mpu-creare (Contributor, Author):

What version of rasterio are you using? Do you have hdf4 installed?

@jmilloy (Collaborator) commented Jul 13, 2020

rasterio 1.1.5

I'm checking hdf4 now. I guess it is not part of the extra dependencies?

@jmilloy (Collaborator) commented Jul 13, 2020

Oh, it's a system library. Okay, I would have expected a better error message from rasterio, but installing hdf4 should help!

@jmilloy (Collaborator) commented Jul 13, 2020

No difference.

@mpu-creare (Contributor, Author) commented Jul 13, 2020

hdf4 should come with rasterio -- if installed through conda. I'm using 1.1.3 and it works fine.

@mpu-creare (Contributor, Author):

What's the cwd and the path of the file?

@mpu-creare (Contributor, Author) commented Jul 13, 2020

So,

df = rasterio.open('MOD13Q1.A2013033.h08v05.006.2015256072248.hdf')
df.datasets

should work.

Also, what's the file size? Maybe something got broken in transfer. It should be 189 MB (198,229,812 bytes).

@jmilloy (Collaborator) commented Jul 13, 2020

What's the cwd and the path of the file?

The file is in the cwd. The fact that the rasterio.open('<filename>.hdf') command fails because of the filetype makes me feel like the path/filename isn't the issue.

@jmilloy (Collaborator) commented Jul 13, 2020

seems okay? 190M, or 198229812 bytes.

md5sum: 1d41e6e86c0a247581b66666bd9c9f9b
sha1sum: d1a79009b63d9cac65e13a42c747407ecec35275

@mpu-creare (Contributor, Author):

Yep, the MD5 sum is correct... hmm, a mystery. Here's some of my output:

In [1]: import rasterio

In [2]: df = rasterio.open("MOD13Q1.A2013033.h08v05.006.2015256072248.hdf")
C:\Anaconda3\envs\soilmap\lib\site-packages\rasterio\__init__.py:219: NotGeoreferencedWarning: Dataset has no geotransform set. The identity matrix may be returned.
  s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

In [3]: df.driver
Out[3]: 'HDF4'

In [4]: df.subdatasets
Out[4]:
['HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days NDVI',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days relative azimuth angle',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days composite day of the year',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days pixel reliability',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days EVI',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days VI Quality',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days red reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days NIR reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days blue reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days MIR reflectance',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days view zenith angle',
 'HDF4_EOS:EOS_GRID:MOD13Q1.A2013033.h08v05.006.2015256072248.hdf:MODIS_Grid_16DAY_250m_500m_VI:250m 16 days sun zenith angle']

Do you have GDAL installed? You could try gdalinfo from the command line to make sure at least that layer works. Then you can move on to `rio info`.

Just ideas...
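As an aside, the unquoted identifiers in the listing above follow a simple colon-delimited pattern; assuming no colons inside the file path (e.g. no Windows drive letter), they can be split apart like this (helper name hypothetical):

```python
def parse_hdf4_subdataset(identifier):
    """Split a GDAL HDF4_EOS subdataset identifier of the form
    HDF4_EOS:EOS_GRID:<path>:<grid>:<layer> into its parts.
    Assumes the unquoted form with no colons inside <path>."""
    scheme, fmt, path, grid, layer = identifier.split(":", 4)
    return {"scheme": scheme, "format": fmt, "path": path,
            "grid": grid, "layer": layer}
```

Note that the layer name may contain spaces ("250m 16 days NDVI"), which is why quoting conventions vary between tools.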

@mpu-creare (Contributor, Author) commented Jul 13, 2020

Try:

with rasterio.Env() as env:
    dk = list(env.drivers().keys())
    dk.sort()
    print(dk)
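Once that driver list is printed, a quick membership check shows whether HDF support made it into the GDAL build; a small sketch (the driver names follow GDAL's conventions, the helper name is hypothetical):

```python
def missing_hdf_drivers(driver_names):
    """Return the HDF-related GDAL driver names absent from the
    given list of registered drivers (as printed by the snippet above)."""
    wanted = {"HDF4", "HDF4Image", "HDF5", "HDF5Image"}
    return sorted(wanted - set(driver_names))
```

An empty result means both HDF4 and HDF5 support are present; a list containing "HDF4" matches the situation below.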

@jmilloy (Collaborator) commented Jul 13, 2020

Yeah, there's no HDF4 driver. I still can't figure out how to install it.

@jmilloy (Collaborator) commented Jul 13, 2020

gdalinfo worked, the file is fine

@jmilloy (Collaborator) commented Jul 13, 2020

I'm trying conda now; if that doesn't work, you might be best off just merging this since it is working for you. I'd prefer an explicit check for the subdataset string instead of a try/except.

@jmilloy (Collaborator) commented Jul 13, 2020

Okay, conda was able to include the hdf4 driver for rasterio. I had no trouble opening the file locally using the Rasterio node.

@jmilloy (Collaborator) left a review comment:

Approved, but with a preference for the one change.

podpac/core/data/rasterio_source.py (review comment, outdated; resolved)