Concatenating datasets with staggered grids #362

TomNicholas opened this issue Sep 25, 2023 · 6 comments
TomNicholas commented Sep 25, 2023

Here are two toy datasets designed to represent sections of a dataset that has variables living on a staggered grid. This type of dataset is common in fluid modelling (handling staggered grids is why xGCM exists).

import xarray as xr

ds1 = xr.Dataset(
    data_vars={
        'a': ('x_center', [1, 2, 3]),
        'b': ('x_outer',  [0.5, 1.5, 2.5, 3.5]),  
    },
)

ds2 = xr.Dataset(
    data_vars={
        'a': ('x_center', [4, 5, 6]),
        'b': ('x_outer',  [4.5, 5.5, 6.5]),  
    },
)

I have netcdf output files from an ocean model (UCLA-ROMS) that have this basic structure.

Combining these types of datasets seems like a bit of a pain to do with kerchunk at the moment.
To concatenate along the x direction, I actually need to concatenate a along x_center, and b along x_outer. So presumably I have to call MultiZarrToZarr once for each variable (or group of variables) that needs to be concatenated along a common dimension. My real dataset is split along multiple dimensions and has multiple staggered grid locations for each dimension, meaning I have to call MultiZarrToZarr something like 6 times.

This problem is analogous to what happens inside xarray.combine_by_coords, which automatically groups variables into sets with common dimensions: it splits the datasets up into sets consisting of the same variable from each dataset, concatenates each set separately (along multiple dimensions in general), then merges the results.

Is that approach (call MultiZarrToZarr multiple times then call kerchunk.combine.merge_vars) the recommended way to handle this case currently? Could we imagine some improvement to the kerchunk.combine API that might make this easier?
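For comparison, the per-variable concat-then-merge pattern described above can be written out by hand in xarray (a minimal sketch using the toy ds1/ds2 from the top of this issue, not a kerchunk solution):

```python
import xarray as xr

# toy staggered-grid datasets from above
ds1 = xr.Dataset({"a": ("x_center", [1, 2, 3]), "b": ("x_outer", [0.5, 1.5, 2.5, 3.5])})
ds2 = xr.Dataset({"a": ("x_center", [4, 5, 6]), "b": ("x_outer", [4.5, 5.5, 6.5])})

# concatenate each variable along its own staggered dimension...
a = xr.concat([ds1["a"], ds2["a"]], dim="x_center")
b = xr.concat([ds1["b"], ds2["b"]], dim="x_outer")

# ...then merge the per-variable results back into one dataset
combined = xr.merge([a, b])
# combined has sizes {"x_center": 6, "x_outer": 7}
```

This is the shape of workflow that would need to be replicated on the reference-file level, one MultiZarrToZarr call per dimension group.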


TomNicholas commented Sep 25, 2023

When I actually tried this on my toy example locally I got an error:

import ujson
from pathlib import Path

import fsspec
import kerchunk.combine
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# save example datasets as netCDF
ds1.to_netcdf('./staggered_kerchunk_test/0.nc')
ds2.to_netcdf('./staggered_kerchunk_test/1.nc')


fs2 = fsspec.filesystem('')  # local file system to save final jsons to


def gen_json(in_filepath: str):
    in_filename = Path(in_filepath).stem
    
    h5chunks = SingleHdf5ToZarr(in_filepath, inline_threshold=300)
    
    out_filepath = f'staggered_kerchunk_test/{in_filename}.json'
    with fs2.open(out_filepath, 'wb') as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())


for file in ['staggered_kerchunk_test/0.nc', 'staggered_kerchunk_test/1.nc']:
    gen_json(file)


# concatenate only variable 'a' along dimension 'x_center'
mzz = MultiZarrToZarr(
    ['staggered_kerchunk_test/0.json', 'staggered_kerchunk_test/1.json'],
    concat_dims=['x_center'],
    preprocess=kerchunk.combine.drop('b'),
)

d = mzz.translate('staggered_kerchunk_test/combined.json')
--------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[99], line 1
----> 1 d = mzz.translate('staggered_kerchunk_test/combined.json')

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:496, in MultiZarrToZarr.translate(self, filename, storage_options)
    490 """Perform all stages and return the resultant references dict
    491 
    492 If filename and storage options are given, the output is written to this
    493 file using ujson and fsspec instead of being returned.
    494 """
    495 if 1 not in self.done:
--> 496     self.first_pass()
    497 if 2 not in self.done:
    498     self.store_coords()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:258, in MultiZarrToZarr.first_pass(self)
    256 z = zarr.open_group(fs.get_mapper(""))
    257 for var in self.concat_dims:
--> 258     value = self._get_value(i, z, var, fn=self._paths[i])
    259     if isinstance(value, np.ndarray):
    260         value = value.ravel()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:226, in MultiZarrToZarr._get_value(self, index, z, var, fn)
    224     o = z[var].attrs[item]
    225 elif selector.startswith("data:"):
--> 226     o = z[selector.split(":", 1)[1]][...]
    227 elif selector.startswith("cf:"):
    228     import cftime

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    498         raise KeyError(item)
    499 else:
--> 500     raise KeyError(item)

KeyError: 'x_center'

This doesn't make sense to me - if I only drop variable 'b', then only 'a' should be left, so why can't it find 'x_center'?

martindurant (Member) commented

MultiZarrToZarr is very much oriented towards a netCDF-style merger of sub-datasets, wherein the variables share coordinates with one-another, as opposed to the linear concat going on here. Other styles of merging are completely reasonable and we should develop ways to run them and to help users to choose between them.

Before going further, I believe the exception you are seeing is because x_center has no defined values (they become [0, 1, 2] in each input and aren't represented in the zarr version of each input at all), so kerchunk has no way to know where each of the constituent datasets is supposed to go. If you define them as coo_map={"x_center": [[1, 2, 3], [4, 5, 6]]}, the combine works.

This particular dataset has variable b with unequal chunks, so it's not possible to represent this via zarr at all until ZEP003 is resolved.

TomNicholas (Author) commented

MultiZarrToZarr is very much oriented towards a netCDF-style merger of sub-datasets, wherein the variables share coordinates with one-another, as opposed to the linear concat going on here.

Is there some other tool that I should be using instead? kerchunk.combine.concatenate_arrays? This is the first time I've tried to use kerchunk myself so I appreciate any pointers!

the combine works.

Using coo_map does work, thank you!

so kerchunk has no way to know where each of the constituent datasets are supposed to go

Why is coo_map required though? I gave kerchunk an ordered list, deliberately without a coordinate variable for x_center. Why do we have to invent a set of integer values to use as the data for that coordinate in order to combine the datasets?

This particular dataset has variable b with unequal chunks, so it's not possible to represent this via zarr at all until ZEP003 is resolved.

In this case presumably it should still work because my chunks are of the form {X, X, X, ... Y}, where Y<=X. I understand this general restriction though.

martindurant (Member) commented

Why is coo_map required though? I gave kerchunk an ordered list, deliberately without a coordinate variable for x_center. Why do we have to invent a set of integer values to use as the data for that coordinate in order to combine the datasets?

Simply that we haven't made a case for it. We can either take the values as they appear ([0, 1, 2] for both sets) or take the index of the filename (which would be [0, 0, 0], [1, 1, 1]!), but not a combination. I agree that we should be able to spot the case where the coord is missing from the input completely and cope with that - but this can only work when concatenating on exactly one axis.
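To illustrate what "cope with a missing coord" could mean for a single concat axis: the combined index is just each input's local arange shifted by the running offset of the files before it. A minimal sketch (implied_index is a hypothetical helper, not a kerchunk API):

```python
import numpy as np

def implied_index(lengths):
    """Given the per-file lengths along the concat axis (in file order),
    return the implied global index values for each input."""
    offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    return [off + np.arange(n) for off, n in zip(offsets, lengths)]

# two files of length 3, as in the toy example above
parts = implied_index([3, 3])
# parts[0] is [0, 1, 2]; parts[1] is [3, 4, 5]
```

This only makes sense when exactly one axis is being concatenated, since with two or more missing coords the file order alone cannot disambiguate the grid layout.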

In this case presumably it should still work because my chunks are of the form {X, X, X, ... Y}, where Y<=X

Unfortunately, no. Zarr does allow the last chunk to be partial, but it still saves a buffer equivalent to a whole chunk and then slices off the needed parts.

TomNicholas (Author) commented

So if I promoted my dimensions to be coordinate variables then I could do a multi-dimensional concatenation like I suggested above?

I agree that we should be able to spot the case that the coord is missing from the input completely and cope with that

That would be nice. It is what xarray.concat/xarray.combine_nested can do.

but this can only work when concatenating on exactly one axis.

The way we dealt with that ambiguity in xarray was to allow combine_nested to accept lists-of-lists, and repeatedly combine along the innermost lists.
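The lists-of-lists pattern can be sketched with a minimal 2x2 example, where the outer list level maps to the first concat_dim and the inner level to the second:

```python
import xarray as xr

# a 2x2 grid of single-cell datasets, laid out as nested lists
grid = [
    [xr.Dataset({"v": (("y", "x"), [[0]])}), xr.Dataset({"v": (("y", "x"), [[1]])})],
    [xr.Dataset({"v": (("y", "x"), [[2]])}), xr.Dataset({"v": (("y", "x"), [[3]])})],
]

# concat_dim order matches list nesting: outermost level -> "y", innermost -> "x"
combined = xr.combine_nested(grid, concat_dim=["y", "x"])
# combined["v"] has shape (2, 2): [[0, 1], [2, 3]]
```

No coordinate values are needed here; the nesting structure itself resolves the ordering ambiguity.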

Unfortunately, no. Zarr does allow the last chunk to be partial, but is still saves a buffer equivalent to a whole chunk and then slices off the needed parts.

Sorry, you're saying that whilst Zarr can represent arrays where the final chunk is partial, you currently cannot build a kerchunk reference from netCDF files where the final chunk is smaller, because of the way Zarr actually handles that partial chunk?

martindurant (Member) commented

because of the way Zarr actually handles that partial chunk?

An example: note how keys "0" and "1" have the same number of bytes. They are expected to be the size of the full chunk on read.

In [1]: store = {}

In [2]: import zarr

In [3]: z = zarr.open_array(store, mode="w", dtype="i4", compression=None, shape=(6,), chunks=(4,))

In [4]: z[:] = 1

In [5]: store
Out[5]:
{'.zarray': b'{\n    "chunks": [\n        4\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        6\n    ],\n    "zarr_format": 2\n}',
 '0': b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00',
 '1': b'\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'}
