
[BUG] MultiZarrToZarr fails to concatenate dimension coordinate variables #386

Open
TomNicholas opened this issue Nov 1, 2023 · 15 comments


@TomNicholas

TomNicholas commented Nov 1, 2023

When combining along a new dimension using coo_map, MultiZarrToZarr fails to concatenate dimension coordinate variables, despite concatenating coordinate variables just fine.

Minimal example:

import xarray as xr

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


# Set up some fake netCDF files containing the dimension coordinate 'time'
time1 = xr.Dataset(coords={'time': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time': ('time', [4, 5, 6])})
time1.to_netcdf('test1.nc')
time2.to_netcdf('test2.nc')


# open both files using kerchunk
single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=300).translate() for filepath in ['./test1.nc', './test2.nc']]

# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
)
combined_test_json = mzz.translate()

# open with xarray to see what the result was
combined_test = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": combined_test_json,
        },
        "consolidated": False,
    }
)
combined_test

[Screenshot: repr of the combined dataset, showing time with only half its values and no id dimension]

This is not what I expected - the time variable should have dimensions (time, id) - we've lost half the time values. The variable time should have been concatenated along id because I did not specify it in identical_dims.

What's weird is that this works as expected for coordinate variables, just not for dimension coordinates. In other words, if I rename the time variable to time_renamed, but have it still be a function of a dimension named time, then the concatenation happens as expected:

time1 = xr.Dataset(coords={'time_renamed': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time_renamed': ('time', [4, 5, 6])})

[Screenshot: repr of the combined dataset, showing time_renamed correctly concatenated along id]

@TomNicholas
Author

I'm trying to find the cause of the bug, but I'm struggling to follow the control flow of the code. I'm suspicious of these lines though:

if v not in dont_skip and v in all_deps:

@martindurant
Member

Is it fair to say that coordinate variables are essentially the same as variables?

If you include time in the concat_dims, you do get all 6 values out, but not as a function of id - I assume this is not what you want.

@TomNicholas
Author

TomNicholas commented Nov 3, 2023 via email

@martindurant
Member

I assume your original situation was thus, but I would explicitly try also having a variable that depends on [id, time] to see if that matters.

it must be more than just the _ARRAY_DIMENSIONS. I will look into
that.

This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.

@TomNicholas
Author

An extra piece of information is needed to distinguish coordinates
variables from variables (xarray calls the latter "data variables"). I do
not actually know how this extra piece of information is stored in Zarr,
because it must be more than just the _ARRAY_DIMENSIONS. I will look into
that.

It's in the .zattrs. JSON snippet below is from a different synthetic example, where there is a coordinate variable "a" (and no dimension called "a"). That's where the extra piece of information necessary to distinguish coordinate variables from data variables is stored.

{
    "metadata": {
        ".zattrs": {
            "coordinates": "a"
        },
        ".zgroup": {
            "zarr_format": 2
        },
...
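For illustration, here is a minimal sketch (plain Python mirroring the snippet above, not kerchunk's internals; the attribute values are hypothetical) of reading that "coordinates" attribute to distinguish coordinate variables from data variables:

```python
# Fragment of consolidated zarr metadata, mirroring the snippet above:
# the root ".zattrs" lists non-dimension coordinate variables,
# space-separated, under the "coordinates" key.
metadata = {
    ".zattrs": {"coordinates": "a"},
    ".zgroup": {"zarr_format": 2},
}


def coordinate_vars(metadata: dict) -> set:
    """Return the names flagged as coordinate variables in the root .zattrs."""
    coords = metadata.get(".zattrs", {}).get("coordinates", "")
    return set(coords.split())


print(coordinate_vars(metadata))  # {'a'}
```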

This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.

So I was going to reply saying this: "I think that whether or not a variable is a coordinate should be irrelevant for how it is concatenated." But actually I've just realised that xarray wouldn't concatenate time in this case either, but for a different reason...

I assume your original situation was thus, but I would explicitly try also having a variable that depends on [id, time] to see if that matters.

Here's what happens if you try this in xarray:

import numpy as np

a_data1 = (np.arange(6) + 10).reshape(2, 3)
a_data2 = (np.arange(6) + 20).reshape(2, 3)
time1 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data1)},
    coords={'time': ('time', [1, 2, 3])},
)
time2 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data2)}, 
    coords={'time': ('time', [4, 5, 6])},
)

In [50]: xr.concat([time1, time2], dim='id')
Out[50]: 
<xarray.Dataset>
Dimensions:  (time: 6, id: 4)
Coordinates:
  * time     (time) int64 1 2 3 4 5 6
Dimensions without coordinates: id
Data variables:
    a        (id, time) float64 10.0 11.0 12.0 nan nan ... nan 23.0 24.0 25.0

That also didn't concatenate along time, hmm. Why is that? The docstring for xr.concat could be a lot clearer, but ultimately it is because time is backed by a pd.Index object, which cannot be concatenated in this way. (I definitely should have realised that earlier, sorry! 😅)

So we have a situation in which kerchunk is (possibly unintentionally?) behaving the same way as xarray. My intuition about what should happen here was wrong at least for xarray, but the reason why xarray has that behaviour does not actually apply to kerchunk! Zarr arrays don't have indexes, so this concatenation could be done perfectly easily by kerchunk.
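To make that concrete, here is a numpy sketch (plain arrays as a stand-in for the underlying zarr chunks, not kerchunk code) of what concatenating the two time chunks along the new id dimension would produce, since zarr arrays carry no index semantics:

```python
import numpy as np

# The two files' 'time' chunks, as raw arrays with no pd.Index attached:
time1 = np.array([1, 2, 3])
time2 = np.array([4, 5, 6])

# Without index semantics, stacking along a new 'id' axis is trivial:
stacked = np.stack([time1, time2])  # dims (id, time)
print(stacked.shape)  # (2, 3)
```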

What's the desired behaviour? I'm not sure now, but it's a question of data models, and this relates to #377 in that clearer abstractions/type systems would help expose these kinds of differences more clearly.

@dcherian
Contributor

dcherian commented Nov 3, 2023

I'm so confused. These both seem sensible to me:
[screenshot]

unless you wanted time to be stacked along the new id dimension too. In which case, yes, that's the Xarray data model for indexed variables: we'd need to destroy the index to do what you want, and we want the user to do that explicitly. And yes, I agree, Kerchunk should not be following Xarray here.

@TomNicholas
Author

TomNicholas commented Nov 3, 2023 via email

@martindurant
Member

I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we will probably have a wider set of cases we want to cover than zarr wants. I am aware that to some extent we end up replicating stuff in xarray (like the encoding stuff), but I don't mind that for the sake of being explicit. What we really need is a good set of test cases to describe expected behaviour.

I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.

@TomNicholas
Author

Thanks for your perspective @martindurant. I feel like your comments should be posted on #377 too?

@martindurant
Member

Er yes, having multiple conversations in motion is part of the problem!

@TomNicholas
Author

Coming back to this, three thoughts:

  1. If kerchunk exposed an array-like API with a concat method defined (i.e. the KerchunkArray suggestion in Refactor MultiZarrToZarr into multiple functions #377), the expected behaviour of this operation would be more obvious.

  2. I would like to submit a PR to fix this upstream in kerchunk, but I spent a while looking at the internals of MultiZarrToZarr and really struggled to understand what was going on. (e.g. Why are there multiple passes over the data? And is there a reason why MultiZarrToZarr doesn't seem to use either concatenate_arrays or merge_vars?)

  3. For now I think I can get around this bug with a hack: I rename the time variable to dummy temporarily (but don't rename the time array dimension), use MultiZarrToZarr with concat_dims=["id"], then rename dummy back to time afterwards. I can do the renaming by passing preprocess and postprocess to MultiZarrToZarr.

import functools
from typing import Literal, NewType, Union

# lower-level dict containing just the information for one zarr array
T_ArrRefs = NewType("T_ArrRefs", dict[Union[Literal['.zattrs'], Literal['.zarray'], str], dict])


def rename_var(refs: T_ArrRefs, old_name: str, new_name: str) -> T_ArrRefs:
    """Rename a variable in a kerchunk array-level reference dict without renaming any corresponding dimension."""
    for old_key in list(refs):
        if old_key.startswith(old_name):
            new_key = old_key.replace(old_name, new_name)
            refs[new_key] = refs.pop(old_key)
    return refs


# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
    preprocess=functools.partial(rename_var, old_name='time', new_name='dummy'),
    postprocess=functools.partial(rename_var, old_name='dummy', new_name='time'),
)
combined_test_json = mzz.translate()

which gives

[Screenshot: repr of the resulting combined dataset]

@martindurant
Member

Quick answers:

Why are there multiple passes over the data?

The first pass is to find all the coordinate values in all of the chunks, so that in the second pass we know, for a particular value, where it fits in the chunk grid. When using concat([...]), this ordering (along exactly one dimension) is implicit in the order of the provided list.
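That two-pass idea can be sketched in a much-simplified form (hypothetical code, not kerchunk's actual implementation):

```python
# Pass 1: gather every coordinate value seen across all inputs.
inputs = [
    {"time": [1, 2, 3]},   # file 1's coordinate chunk
    {"time": [4, 5, 6]},   # file 2's coordinate chunk
]
all_values = sorted({v for inp in inputs for v in inp["time"]})

# Pass 2: for any given value, its position in the combined grid is now
# known, so each chunk can be placed without relying on list order.
position = {v: i for i, v in enumerate(all_values)}
print(position[4])  # 3
```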

is there a reason why MultiZarrToZarr doesn't seem to use either concatenate_arrays or merge_vars

Those are much more specific and inflexible - useful in some limited circumstances. What MZZ does cannot be cleanly decomposed into a set of steps like that.

@martindurant
Member

To the larger point: MZZ tries to form a sensible aggregate dataset out of all the inputs, by finding the set of coordinates in each chunk of each input (where "coordinate" can include "variable name"). This includes the various combine options of open_mfdataset, which was the inspiration.

If you wanted a decomposition of what MZZ does into separate functions, you would need the user to figure out which files belong together for each subconcat along x, which then get concatenated in a separate step along y. The user would need to know the groupings and the order all the way down. That's essentially how dask's stack and concat methods work. The idea, instead, is to be able to infer those values and figure out the whole grid in one operation, whether the origin of the coordinates is variables within each dataset, file names or something else.
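For contrast, the fully explicit, caller-driven decomposition (in the style of dask's stack/concatenate over nested lists) might look like this numpy sketch, where the caller has already worked out the grouping and ordering:

```python
import numpy as np

# The caller pre-groups chunks into a nested list whose structure
# encodes the concat order: inner lists run along x, outer along y.
grid = [
    [np.ones((2, 2)), np.zeros((2, 2))],   # first row along y
    [np.zeros((2, 2)), np.ones((2, 2))],   # second row along y
]

# Concatenate along x within each row, then along y across rows:
rows = [np.concatenate(row, axis=1) for row in grid]
combined = np.concatenate(rows, axis=0)
print(combined.shape)  # (4, 4)
```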

For the specific issue here, it sounds like the logic of "is this a coordinate" isn't quite right, and given the right reproducer for testing, we should be able to address it easily.

@martindurant
Copy link
Member

(None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller)

@TomNicholas
Author

Thanks for those thoughts @martindurant. I think they would be useful over on #377, so I will copy them over and reply to them there.


Coming back to this issue - my workaround exposed other bugs, which I've reported in #434.
