Single value variable of type int32 in NetCDF becomes float64 in Kerchunk #429

Open
rsignell opened this issue Mar 2, 2024 · 5 comments

Comments

@rsignell

rsignell commented Mar 2, 2024

@martindurant, looks like we still have a single-value variable problem.
In these AWS Open Data NetCDF files, the variable 'spherical' has a single int32 value but it becomes a float64 after kerchunk:
https://nbviewer.org/gist/rsignell-usgs/5971951d348496229ce121b52a2fb750

(I discovered this because the xroms package designed to work with these ROMS NetCDF files bombed -- took me a while to figure out this was the reason...)

@martindurant
Member

I am fairly puzzled; the metadata says int:

>>> fs = fsspec.filesystem("reference", fo=single_json, remote_protocol="s3", remote_options=so)
>>> fs.cat("spherical/.zarray")
b'{"chunks":[],"compressor":null,"dtype":"<i4","fill_value":-2147483647,"filters":null,"order":"C","shape":[],"zarr_format":2}'

and zarr agrees:

>>> g = zarr.open(fs.get_mapper())
>>> g.spherical.dtype
dtype('int32')

xarray has a bunch of "decode*" flags in open_dataset, but I can't immediately see one that might do the right thing here.

The value, by the way, is just 1. This is actually a boolean?

@keewis
Contributor

keewis commented Mar 6, 2024

I believe the reason is the fill_value. At the moment, float* is one of the few data types that can have missing values (using nan), while int* can't represent missing values. mask_and_scale=False should be what you're looking for, and I believe you can convert only the ones you need using:

In [20]: import xarray as xr
    ...: 
    ...: ds = xr.Dataset(
    ...:     {
    ...:         "a": ("x", [0, 1, 2], {"_FillValue": 1}),
    ...:         "b": ("x", [0.1, 0.2, 1.0], {"_FillValue": 1.0}),
    ...:     }
    ...: )
    ...: skipped_variables = [
    ...:     name
    ...:     for name, var in ds.variables.items()
    ...:     if "_FillValue" in var.attrs and var.dtype.kind not in "cfmMO"
    ...: ]
    ...: 
    ...: 
    ...: def decode_with_skip(ds, skip=None):
    ...:     if not skip:
    ...:         return xr.decode_cf(ds)
    ...: 
    ...:     return ds[skip].merge(xr.decode_cf(ds.drop_vars(skip)))
    ...: 
    ...: 
    ...: display(ds)
    ...: display(ds.pipe(decode_with_skip, skip=skipped_variables).compute())
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 1.0
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 nan

(This might change with the custom dtypes in numpy, but it will take some effort to get working "nullable integer" dtypes)
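
For the scalar case from this issue, the effect of mask_and_scale can be sketched roughly as follows. This is a minimal illustration, not the actual kerchunk-backed dataset: the variable is built by hand with the same dtype and fill value as 'spherical', and the names raw, decoded and undecoded are made up for the example.

```python
import numpy as np
import xarray as xr

# Hand-built stand-in for the kerchunk-backed variable (assumption: same
# dtype and fill value as in the .zarray metadata quoted in this thread).
raw = xr.Dataset(
    {"spherical": ((), np.int32(1), {"_FillValue": np.int32(-2147483647)})}
)

# Default CF decoding masks the fill value, which promotes int to float.
decoded = xr.decode_cf(raw)

# With mask_and_scale=False no masking is applied, so the dtype is kept.
undecoded = xr.decode_cf(raw, mask_and_scale=False)

print(decoded["spherical"].dtype)
print(undecoded["spherical"].dtype)
```

Note that mask_and_scale=False applies to every variable in the dataset, so scale_factor / add_offset are not applied either; the skip-list approach above is the more selective option.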

@martindurant
Member

@keewis : but the data here has an int fill_value and no _FillValue. Are you saying that having a fill value of any sort will cause a cast int->float even when there are actually no nulls?

@martindurant
Member

Ah indeed, if I set the fill_value to null in the JSON, you get an int :|
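
A rough sketch of that workaround, assuming the references are available as a plain dict mapping keys to JSON strings (as under "refs" in a kerchunk reference file). The helper name clear_int_fill_values is hypothetical, and dropping the fill value discards genuine missing-value information, so this only makes sense for variables known to have no nulls.

```python
import json


def clear_int_fill_values(refs):
    """Return a copy of ``refs`` with fill_value set to null for integer arrays."""
    patched = dict(refs)
    for key, value in refs.items():
        if not key.endswith(".zarray"):
            continue
        meta = json.loads(value)
        dtype = meta.get("dtype")
        # Simple dtype strings look like "<i4" or "|u1"; the second
        # character is the numpy kind ("i" signed, "u" unsigned integer).
        is_int = isinstance(dtype, str) and len(dtype) > 1 and dtype[1] in "iu"
        if is_int and meta.get("fill_value") is not None:
            meta["fill_value"] = None
            patched[key] = json.dumps(meta)
    return patched


# The .zarray entry quoted earlier in this thread.
refs = {
    "spherical/.zarray": '{"chunks":[],"compressor":null,"dtype":"<i4",'
    '"fill_value":-2147483647,"filters":null,"order":"C","shape":[],'
    '"zarr_format":2}',
}
patched = clear_int_fill_values(refs)
print(patched["spherical/.zarray"])
```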

@keewis
Contributor

keewis commented Mar 6, 2024

zarr's fill_value is translated to the _FillValue attribute. The masking is applied using where, without first checking whether any values actually equal the fill value (that check is potentially expensive), so the dtype is promoted regardless. The mask value and the promoted dtype are decided in xarray.core.dtypes.maybe_promote.
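
The promotion can be illustrated with plain numpy. This is only an analogy for the where-based masking, not xarray's actual code path; the array values are made up, and the fill value is the one from this issue.

```python
import numpy as np

data = np.array([1, 0, 1], dtype="int32")
fill = np.int32(-2147483647)

# The replacement value is nan, so the result has to be a float dtype
# even though no element of data equals the fill value.
masked = np.where(data != fill, data, np.nan)

print(data.dtype)
print(masked.dtype)
```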


3 participants