
[BUG] MultiZarrToZarr fails to concatenate dimension coordinate variables #386

Open
TomNicholas opened this issue Nov 1, 2023 · 15 comments


@TomNicholas

TomNicholas commented Nov 1, 2023

When combining along a new dimension using coo_map, MultiZarrToZarr fails to concatenate dimension coordinate variables, despite concatenating coordinate variables just fine.

Minimal example:

import xarray as xr

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


# Set up some fake netCDF files containing the dimension coordinate 'time'
time1 = xr.Dataset(coords={'time': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time': ('time', [4, 5, 6])})
time1.to_netcdf('test1.nc')
time2.to_netcdf('test2.nc')


# open both files using kerchunk
single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=300).translate() for filepath in ['./test1.nc', './test2.nc']]

# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
)
combined_test_json = mzz.translate()

# open with xarray to see what the result was
combined_test = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": combined_test_json,
        },
        "consolidated": False,
    }
)
combined_test

[Screenshot: repr of the combined dataset, showing time with only half its values and no id dimension]

This is not what I expected - the time variable should have dimensions (time, id) - we've lost half the time values. The variable time should have been concatenated along id because I did not specify it in identical_dims.

What's weird is that this works as expected for coordinate variables, just not for dimension coordinates. In other words, if I rename the time variable to time_renamed, but have it still be a function of a dimension named time, then the concatenation happens as expected:

time1 = xr.Dataset(coords={'time_renamed': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time_renamed': ('time', [4, 5, 6])})

[Screenshot: repr of the combined dataset, showing time_renamed correctly concatenated along id]

@TomNicholas
Author

I'm trying to find the cause of the bug, but I'm struggling to follow the control flow of the code. I'm suspicious of these lines though:

if v not in dont_skip and v in all_deps:

@martindurant
Member

Is it fair to say that coordinate variables are essentially the same as variables?

If you include time in the concat_dims, you do get all 6 values out, but not as a function of id - I assume this is not what you want.

@TomNicholas
Author

TomNicholas commented Nov 3, 2023 via email

@martindurant
Member

I assume your original situation was thus, but I would explicitly try also having a variable that depends on [id, time] to see if that matters.

it must be more than just the _ARRAY_DIMENSIONS. I will look into
that.

This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.

@TomNicholas
Author

An extra piece of information is needed to distinguish coordinates
variables from variables (xarray calls the latter "data variables"). I do
not actually know how this extra piece of information is stored in Zarr,
because it must be more than just the _ARRAY_DIMENSIONS. I will look into
that.

It's in the .zattrs. JSON snippet below is from a different synthetic example, where there is a coordinate variable "a" (and no dimension called "a"). That's where the extra piece of information necessary to distinguish coordinate variables from data variables is stored.

{
    "metadata": {
        ".zattrs": {
            "coordinates": "a"
        },
        ".zgroup": {
            "zarr_format": 2
        },
...
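For illustration, here is a minimal sketch (plain Python mirroring the snippet above, not kerchunk's internals; the attribute values are hypothetical) of reading that "coordinates" attribute to distinguish coordinate variables from data variables:

```python
# Fragment of consolidated zarr metadata, mirroring the snippet above:
# the root ".zattrs" lists non-dimension coordinate variables,
# space-separated, under the "coordinates" key.
metadata = {
    ".zattrs": {"coordinates": "a"},
    ".zgroup": {"zarr_format": 2},
}


def coordinate_vars(metadata: dict) -> set:
    """Return the names flagged as coordinate variables in the root .zattrs."""
    coords = metadata.get(".zattrs", {}).get("coordinates", "")
    return set(coords.split())


print(coordinate_vars(metadata))  # {'a'}
```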

This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.

So I was going to reply saying this: "I think that whether or not a variable is a coordinate should be irrelevant for how it is concatenated." But actually I've just realised that xarray wouldn't concatenate time in this case either, but for a different reason...

I assume your original situation was thus, but I would explicitly try also having a variable that depends on [id, time] to see if that matters.

Here's what happens if you try this in xarray:

import numpy as np

a_data1 = (np.arange(6) + 10).reshape(2, 3)
a_data2 = (np.arange(6) + 20).reshape(2, 3)
time1 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data1)},
    coords={'time': ('time', [1, 2, 3])},
)
time2 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data2)}, 
    coords={'time': ('time', [4, 5, 6])},
)

In [50]: xr.concat([time1, time2], dim='id')
Out[50]: 
<xarray.Dataset>
Dimensions:  (time: 6, id: 4)
Coordinates:
  * time     (time) int64 1 2 3 4 5 6
Dimensions without coordinates: id
Data variables:
    a        (id, time) float64 10.0 11.0 12.0 nan nan ... nan 23.0 24.0 25.0

That also didn't concatenate along time, hmm. Why is that? The docstring for xr.concat could be a lot clearer, but ultimately it is because time is backed by a pd.Index object, which cannot be concatenated in this way. (I definitely should have realised that earlier, sorry! 😅)

So we have a situation in which kerchunk is (possibly unintentionally?) behaving the same way as xarray. My intuition about what should happen here was wrong at least for xarray, but the reason why xarray has that behaviour does not actually apply to kerchunk! Zarr arrays don't have indexes, so this concatenation could be done perfectly easily by kerchunk.
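To make that concrete, here is a numpy sketch (plain arrays as a stand-in for the underlying zarr chunks, not kerchunk code) of what concatenating the two time chunks along the new id dimension would produce, since zarr arrays carry no index semantics:

```python
import numpy as np

# The two files' 'time' chunks, as raw arrays with no pd.Index attached:
time1 = np.array([1, 2, 3])
time2 = np.array([4, 5, 6])

# Without index semantics, stacking along a new 'id' axis is trivial:
stacked = np.stack([time1, time2])  # dims (id, time)
print(stacked.shape)  # (2, 3)
```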

What's the desired behaviour? I'm not sure now, but it's a question of data models, and this relates to #377 in that clearer abstractions/type systems would help expose these kinds of differences more clearly.

@dcherian
Contributor

dcherian commented Nov 3, 2023

I'm so confused. These both seem sensible to me:
[screenshot]

unless you wanted time to be stacked along the new id dimension too. In which case, yes, that's the Xarray data model for indexed variables: we'd need to destroy the index to do what you want, and we want the user to do that explicitly. And yes, I agree, Kerchunk should not be following Xarray here.

@TomNicholas
Author

TomNicholas commented Nov 3, 2023 via email

@martindurant
Member

I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we will probably have a wider set of cases we want to cover than zarr wants. I am aware that to some extent we end up replicating stuff in xarray (like the encoding stuff), but I don't mind that for the sake of being explicit. What we really need is a good set of test cases to describe expected behaviour.

I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.

@TomNicholas
Author

Thanks for your perspective @martindurant. I feel like your comments should be posted on #377 too?

@martindurant
Member

Er yes, having multiple conversations in motion is part of the problem!

@TomNicholas
Author

Coming back to this, three thoughts:

  1. If kerchunk exposed an array-like API with a concat method defined (i.e. the KerchunkArray suggestion in Refactor MultiZarrToZarr into multiple functions #377), the expected behaviour of this operation would be more obvious.

  2. I would like to submit a PR to fix this upstream in kerchunk, but I spent a while looking at the internals of MultiZarrToZarr and really struggled to understand what was going on. (e.g. Why are there multiple passes over the data? And is there a reason why MultiZarrToZarr doesn't seem to use either concatenate_arrays or merge_vars?)

  3. For now I think I can get around this bug with a hack: I rename the time variable to dummy temporarily (but don't rename the time array dimension), use MultiZarrToZarr with concat_dims=["id"], then rename dummy back to time afterwards. I can do the renaming by passing preprocess and postprocess to MultiZarrToZarr.

import functools
from typing import Literal, NewType, Union

# lower-level dict containing just the information for one zarr array
T_ArrRefs = NewType("T_ArrRefs", dict[Union[Literal['.zattrs'], Literal['.zarray'], str], dict])


def rename_var(refs: T_ArrRefs, old_name: str, new_name: str) -> T_ArrRefs:
    """Rename a variable in a kerchunk array-level reference dict without renaming any corresponding dimension."""
    for old_key in list(refs):
        if old_key.startswith(old_name):
            new_key = old_key.replace(old_name, new_name)
            refs[new_key] = refs.pop(old_key)
    return refs


# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
    preprocess=functools.partial(rename_var, old_name='time', new_name='dummy'),
    postprocess=functools.partial(rename_var, old_name='dummy', new_name='time'),
)
combined_test_json = mzz.translate()

which gives

[Screenshot: repr of the resulting combined dataset]

@martindurant
Member

Quick answers:

Why are there multiple passes over the data?

The first pass is to find all the coordinate values in all of the chunks, so that in the second pass we know, for a particular value, where it fits in the chunk grid. When using concat([...]), this ordering (along exactly one dimension) is implicit in the order of the provided list.
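That two-pass idea can be sketched in a much-simplified form (hypothetical code, not kerchunk's actual implementation):

```python
# Pass 1: gather every coordinate value seen across all inputs.
inputs = [
    {"time": [1, 2, 3]},   # file 1's coordinate chunk
    {"time": [4, 5, 6]},   # file 2's coordinate chunk
]
all_values = sorted({v for inp in inputs for v in inp["time"]})

# Pass 2: for any given value, its position in the combined grid is now
# known, so each chunk can be placed without relying on list order.
position = {v: i for i, v in enumerate(all_values)}
print(position[4])  # 3
```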

is there a reason why MultiZarrToZarr doesn't seem to use either concatenate_arrays or merge_vars

Those are much more specific and inflexible - useful in some limited circumstances. What MZZ does cannot be cleanly decomposed into a set of steps like that.

@martindurant
Member

To the larger point: MZZ tries to form a sensible aggregate dataset out of all the inputs, by finding the set of coordinates in each chunk of each input (where "coordinate" can include "variable name"). This includes the various combine options of open_mfdataset, which was the inspiration.

If you wanted a decomposition of what MZZ does into separate functions, you would need the user to figure out which files belong together for each subconcat along x, which then get concatenated in a separate step along y. The user would need to know the groupings and the order all the way down. That's essentially how dask's stack and concat methods work. The idea, instead, is to be able to infer those values and figure out the whole grid in one operation, whether the origin of the coordinates is variables within each dataset, file names or something else.
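For contrast, the fully explicit, caller-driven decomposition (in the style of dask's stack/concatenate over nested lists) might look like this numpy sketch, where the caller has already worked out the grouping and ordering:

```python
import numpy as np

# The caller pre-groups chunks into a nested list whose structure
# encodes the concat order: inner lists run along x, outer along y.
grid = [
    [np.ones((2, 2)), np.zeros((2, 2))],   # first row along y
    [np.zeros((2, 2)), np.ones((2, 2))],   # second row along y
]

# Concatenate along x within each row, then along y across rows:
rows = [np.concatenate(row, axis=1) for row in grid]
combined = np.concatenate(rows, axis=0)
print(combined.shape)  # (4, 4)
```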

For the specific issue here, it sounds like the logic of "is this a coordinate" isn't quite right, and given the right reproducer for testing, we should be able to address it easily.

@martindurant
Copy link
Member

(None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller)

@TomNicholas
Author

Thanks for those thoughts @martindurant. I think they would be useful over on #377, so I will copy them over and reply to them there.


Coming back to this issue - my workaround exposed other bugs, which I've reported in #434.
