Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue using tiff_to_zarr #313

Open
norlandrhagen opened this issue Mar 6, 2023 · 10 comments
Open

Issue using tiff_to_zarr #313

norlandrhagen opened this issue Mar 6, 2023 · 10 comments

Comments

@norlandrhagen
Copy link
Contributor

norlandrhagen commented Mar 6, 2023

Hi there @martindurant,

I've started playing around with Kerchunk's tiff_to_zarr functionality and ran into an issue when trying to open up the reference file. In short, the tiff_to_zarr works successfully, but I get the error: ContainsArrayError: path '' contains an array. Hopefully this is a simple user error!
I've successfully used tiff_to_zarr on the kerchunk/tests/lcmap_tiny_cog_2019.tif and this Arctic DEM without incident.

Thanks in advance!

I've included a code snippet below and a example of the generated reference file.

Generation script:

import fsspec
import xarray as xr
import ujson
import imagecodecs.numcodecs
imagecodecs.numcodecs.register_codecs()

from kerchunk.tiff import tiff_to_zarr

hansen_url = 'https://storage.googleapis.com/earthenginepartners-hansen/GFC-2021-v1.9/Hansen_GFC-2021-v1.9_gain_50N_130W.tif'
ex_path = 'https://github.com/fsspec/kerchunk/blob/main/kerchunk/tests/lcmap_tiny_cog_2019.tif?raw=true'

hansen_ds = xr.open_dataset(hansen_url,engine='rasterio')


ex_ds = xr.open_dataset(ex_path,engine='rasterio')



zt = tiff_to_zarr(hansen_url)
with open("hansen.json", "wb") as f:
    f.write(ujson.dumps(zt).encode())

m = fsspec.get_mapper("reference://", fo='hansen.json')
hansen_kerchunk_ds = xr.open_dataset(
    m, engine="zarr", backend_kwargs={"consolidated": False}
)

examples of refs for both datasets:

Hansen:

{
	".zattrs": "{\"_ARRAY_DIMENSIONS\":[\"Y\",\"X\"],\"KeyDirectoryVersion\":1,\"KeyRevision\":1,\"KeyRevisionMinor\":0,\"GTModelTypeGeoKey\":2,\"GTRasterTypeGeoKey\":1,\"GeographicTypeGeoKey\":4326,\"GeogCitationGeoKey\":\"WGS 84\",\"GeogAngularUnitsGeoKey\":9102,\"GeogSemiMajorAxisGeoKey\":6378137.0,\"GeogInvFlatteningGeoKey\":298.257223563,\"ModelPixelScale\":[0.00025,0.00025,0.0],\"ModelTiepoint\":[0.0,0.0,0.0,-130.0,50.0,0.0]}",
	".zarray": "{\n \"chunks\": [\n  1,\n  40000\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_lzw\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  40000,\n  40000\n ],\n \"zarr_format\": 2\n}",
	"0.0": ["https:\/\/storage.googleapis.com\/earthenginepartners-hansen\/GFC-2021-v1.9\/Hansen_GFC-2021-v1.9_gain_50N_130W.tif", 320378, 777],
	"1.0": ["https:\/\/storage.googleapis.com\/earthenginepartners-hansen\/GFC-2021-v1.9\/Hansen_GFC-2021-v1.9_gain_50N_130W.tif", 321155, 796], ...}

Kerchunk test tiff:

{
	".zgroup": "{\n \"zarr_format\": 2\n}",
	".zattrs": "{\"multiscales\":[{\"datasets\":[{\"path\":\"0\"},{\"path\":\"1\"},{\"path\":\"2\"}],\"metadata\":{},\"name\":\"\",\"version\":\"0.1\"}],\"OVR_RESAMPLING_ALG\":\"NEAREST\",\"LAYOUT\":\"IFDS_BEFORE_DATA\",\"BLOCK_ORDER\":\"ROW_MAJOR\",\"BLOCK_LEADER\":\"SIZE_AS_UINT4\",\"BLOCK_TRAILER\":\"LAST_4_BYTES_REPEATED\",\"KNOWN_INCOMPATIBLE_EDITION\":\"NO\",\"KeyDirectoryVersion\":1,\"KeyRevision\":1,\"KeyRevisionMinor\":0,\"GTModelTypeGeoKey\":1,\"GTRasterTypeGeoKey\":1,\"GTCitationGeoKey\":\"Albers\",\"GeographicTypeGeoKey\":4326,\"GeogCitationGeoKey\":\"WGS 84\",\"GeogAngularUnitsGeoKey\":9102,\"GeogSemiMajorAxisGeoKey\":6378140.0,\"GeogInvFlatteningGeoKey\":298.256999999996,\"ProjectedCSTypeGeoKey\":32767,\"ProjectionGeoKey\":32767,\"ProjCoordTransGeoKey\":11,\"ProjLinearUnitsGeoKey\":9001,\"ProjStdParallel1GeoKey\":29.5,\"ProjStdParallel2GeoKey\":45.5,\"ProjNatOriginLongGeoKey\":-96.0,\"ProjNatOriginLatGeoKey\":23.0,\"ProjFalseEastingGeoKey\":0.0,\"ProjFalseNorthingGeoKey\":0.0,\"ModelPixelScale\":[30.0,30.0,0.0],\"ModelTiepoint\":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}",
	"0\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y\",\n  \"X\"\n ]\n}",
	"0\/.zarray": "{\n \"chunks\": [\n  512,\n  512\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2048,\n  2048\n ],\n \"zarr_format\": 2\n}",
	"1\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y1\",\n  \"X1\"\n ]\n}",
	"1\/.zarray": "{\n \"chunks\": [\n  128,\n  128\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  1024,\n  1024\n ],\n \"zarr_format\": 2\n}",
	"2\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y2\",\n  \"X2\"\n ]\n}",
	"2\/.zarray": "{\n \"chunks\": [\n  128,\n  128\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  512,\n  512\n ],\n \"zarr_format\": 2\n}",
	"0\/0.0": ["https:\/\/github.com\/fsspec\/kerchunk\/blob\/main\/kerchunk\/tests\/lcmap_tiny_cog_2019.tif?raw=true", 114079, 13584],
	"0\/0.1": ["https:\/\/github.com\/fsspec\/kerchunk\/blob\/main\/kerchunk\/tests\/lcmap_tiny_cog_2019.tif?raw=true", 127671, 19626], ...}

repr's of both ref datasets:

image

image

Traceback:
---------------------------------------------------------------------------
ContainsArrayError                        Traceback (most recent call last)
Cell In[49], line 2
      1 m = fsspec.get_mapper("reference:[//](https://github.com/fsspec/kerchunk/issues/313)", fo='hansen.json')
----> 2 hansen_kerchunk_ds = xr.open_dataset(
      3     m, engine="zarr", backend_kwargs={"consolidated": False}
      4 )

File [~/opt/anaconda3/envs/install/envs/test_env/lib/python3.9/site-packages/xarray/backends/api.py:526](https://file+.vscode-resource.vscode-cdn.net/Users/nrhagen/Documents/carbonplan/global_forest_watch/~/opt/anaconda3/envs/install/envs/test_env/lib/python3.9/site-packages/xarray/backends/api.py:526), in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    514 decoders = _resolve_decoders_kwargs(
    515     decode_cf,
    516     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    522     decode_coords=decode_coords,
    523 )
    525 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 526 backend_ds = backend.open_dataset(
    527     filename_or_obj,
    528     drop_variables=drop_variables,
    529     **decoders,
    530     **kwargs,
    531 )
    532 ds = _dataset_from_backend_dataset(
    533     backend_ds,
    534     filename_or_obj,
...
-> 1442             raise ContainsArrayError(path)
   1443         raise GroupNotFoundError(path)
   1445 elif mode == 'w':

ContainsArrayError: path '' contains an array
@martindurant
Copy link
Member

TiffToZarr does indeed work, but because the data is actually just an array rather than a group of arrays, it is not valid input for xarray. Zarr opens it just fine, including the attributes.

We could have TiffToZarr produce zarr groups

--- a/kerchunk/tiff.py
+++ b/kerchunk/tiff.py
@@ -64,6 +64,9 @@ def tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)
                     if isinstance(v, enum.EnumMeta):
                         meta[k] = v._name_
                 out[".zattrs"] = ujson.dumps(meta)
+    # make into group
+    out = {"data/" + k: v for k, v in out.items()}
+    out[".zgroup"] = '{"zarr_format": 2}'
     if "GTRasterTypeGeoKey" in meta:
         # TODO: make dataset and assign coords for geoTIFF

which makes it loadable by xarray. Whether the attributes belong to the group or the one array is another matter.

There is also code in kerchunk.tiff to make the X/Y coordinates, but we do NOT want to save these (each would be bigger than the original data file), we want to use xarray flexible indexes to generate them on demand at runtime. This would be a perfect project for you @norlandrhagen , if you are willing.

@martindurant
Copy link
Member

Whether the attributes belong to the group or the one array is another matter.

I see rioxarray puts these into a separate no-data variable called "spatial_ref"

Furthermore, you see that for the "test" dataset, you already have subdirectories, because this is a multiscale pyramid; I suppose this would be loaded by xarray-datatree?

@rabernat
Copy link
Contributor

Alternatively, we could implement the ability for xarray to open a single zarr array (rather than a group) using xr.open_dataarray.

@martindurant
Copy link
Member

Kerchunk is happy to provide datasets that match xarray's expectations, which should provide faster turnaround in this kind of situation. Unless, of course, we are of the opinion that the restriction results in a dataset that doesn't accurately reflect the true nature of the data.

@norlandrhagen , were you planning to put my suggested fix into action?

@rabernat
Copy link
Contributor

rabernat commented Apr 11, 2023

There is no logical reason why Xarray should not be able to read a single Zarr. It can already read single array from a netCDF file (see open_dataarray).

Kerchunk is happy to provide datasets that match xarray's expectations, which should provide faster turnaround in this kind of situation

I agree that Kerchunk can provide a shim around almost any type of data by massaging it into a structure compatible with Xarray. That doesn't mean that is the best software architecture. This approach causes tons of weird special cases to be implemented in Kerchunk.

I would advocate for addressing this issue upstream. A "fast turnaround" is not always the most important factor. On behalf of the xarray core devs, we would be happy to accept a PR to implement this an review it in a timely manner.

@maxrjones
Copy link
Contributor

@norlandrhagen has been working on handling the coordinates using Xarray, which would work with the strategy to read single Zarr arrays with Xarray.

On behalf of the xarray core devs, we would be happy to accept a PR to implement this an review it in a timely manner.

Do you have any expectations regarding the difficulty for this task? I'd be interested in helping out but don't want to overcommit.

@rabernat
Copy link
Contributor

Do you have any expectations regarding the difficulty for this task?

Since open_dataarray currently calls open_dataset under the hood, this would probably involve some non-trival refactoring of xarray's backend code. There are different design options that would need to be explored.

Perhaps 5 days of work for a dev already familiar with the xarray backend code? For someone new to xarray, probably much longer.

@martindurant
Copy link
Member

I have just had a look and come to the same conclusion. It may be possible to hack the code in ZarrStore.open_group to make an array look like a one-member group, but it would be ugly and I'm not immediately sure of how one might go about it without kerchunk-style indirection.

I guess some other backends also effectively make DataSet s out of arrays even when the backend format is not really hierarchical?

@rabernat
Copy link
Contributor

rabernat commented Apr 12, 2023

I guess some other backends also effectively make DataSet s out of arrays even when the backend format is not really hierarchical?

Rioxarray comes to mind here. They have defined a convention for how to take geotiffs and represent them as Xarray Datasets. IMO that would also fit better as a DataArray, not Dataset.

@maxrjones
Copy link
Contributor

https://github.com/carbonplan/xrefcoord implements @martindurant's idea to generate the coordinates on demand. https://projectpythia.org/kerchunk-cookbook/notebooks/case_studies/GeoTIFF_FMI.html shows this in action. Any feedback would be welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants