Losing time coordinate #74

Closed
abkfenris opened this issue Sep 3, 2021 · 22 comments · Fixed by #75

Comments

@abkfenris

I'm trying to make references for OISST, and the time returned from combined refs is off.

Test generation script
import json

import fsspec
from fsspec_reference_maker.hdf import SingleHdf5ToZarr
from fsspec_reference_maker.combine import MultiZarrToZarr

# import xarray as xr

# from oisst_clim_daily.utils.getCurrentDownloads import (
#     getCurrentDownloads,
#     netcdf_links,
#     date_from_url,
# )

# this_year, this_month, past_year, past_month = getCurrentDownloads()

# urls = netcdf_links(this_year, this_month) + netcdf_links(past_year, past_month)

# print(urls)


urls = [
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210801.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210802.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210803.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210804.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210805.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210806.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210807.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210808.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210809.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210810.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210811.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210812.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210813.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210814.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210815.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210816.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210817.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210818.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210819_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210820_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210821_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210822_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210823_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210824_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210825_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210826_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210827_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210828_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210829_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210830_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210831_preliminary.nc",
]

refs = []

for url in urls:
    with fsspec.open(url) as inf:
        h5chunks = SingleHdf5ToZarr(inf, url, inline_threshold=100)
        refs.append(h5chunks.translate())

with open("first.json", "w") as f:
    json.dump(refs[0], f)

mzz = MultiZarrToZarr(refs, remote_protocol="http", xarray_concat_args={"dim": "time"})
combined = mzz.translate(template_count=None)

with open("combined.json", "w") as f:
    json.dump(combined, f)

Trying to load the combined month's worth of data returns dates starting in 2065.

import fsspec
import xarray as xr

fs = fsspec.filesystem("reference", fo="./combined.json", skip_instance_cache=True, remote_protocol="http")
m = fs.get_mapper("")
ds_combined = xr.open_dataset(m, engine="zarr")
ds_combined

[screenshot of the ds_combined output]

Loading the first reference gets the right date.

fs = fsspec.filesystem("reference", fo="./first.json", skip_instance_cache=True, remote_protocol="http")
m = fs.get_mapper("")
ds_first = xr.open_dataset(m, engine="zarr")
ds_first

[screenshot of the ds_first output]

While I'm testing with preliminary data, it does not appear to change the results if I don't include those files.

Package versions
➜ pip list
Package                Version
---------------------- -----------
aiohttp                3.7.4.post0
alembic                1.7.1
aniso8601              7.0.0
argh                   0.26.2
asciitree              0.3.3
async-timeout          3.0.1
attrs                  21.2.0
beautifulsoup4         4.9.3
bleach                 4.1.0
bokeh                  2.3.3
brotlipy               0.7.0
cached-property        1.5.2
certifi                2021.5.30
cffi                   1.14.6
chardet                4.0.0
charset-normalizer     2.0.0
click                  7.1.2
cloudpickle            1.6.0
colorama               0.4.4
coloredlogs            14.0
croniter               1.0.15
cryptography           3.4.7
cytoolz                0.11.0
dagit                  0.12.8
dagster                0.12.8
dagster-graphql        0.12.8
dagster-postgres       0.12.8
dask                   2021.8.1
defusedxml             0.7.1
distributed            2021.8.1
docstring-parser       0.10
entrypoints            0.3
fasteners              0.16
Flask                  1.1.2
Flask-Cors             3.0.10
Flask-GraphQL          2.0.1
Flask-Sockets          0.2.1
fsspec                 2021.8.1
fsspec-reference-maker 0.0.2
geojson                2.5.0
gevent                 21.8.0
gevent-websocket       0.10.1
gql                    2.0.0
graphene               2.1.9
graphql-core           2.3.2
graphql-relay          2.0.1
graphql-server-core    1.2.0
graphql-ws             0.3.1
greenlet               1.1.1
grpcio                 1.38.1
grpcio-health-checking 1.38.1
h5netcdf               0.11.0
h5py                   3.4.0
HeapDict               1.0.1
humanfriendly          9.2
idna                   3.1
importlib-metadata     4.8.1
importlib-resources    5.2.2
ipython-genutils       0.2.0
itsdangerous           2.0.1
Jinja2                 2.11.3
jsonschema             3.2.0
jupyter-core           4.7.1
locket                 0.2.0
Mako                   1.1.5
MarkupSafe             2.0.1
mistune                0.8.4
monotonic              1.5
msgpack                1.0.2
multidict              5.1.0
nbconvert              5.6.0
nbformat               5.1.3
numcodecs              0.9.0
numpy                  1.21.2
olefile                0.46
packaging              21.0
pandas                 1.3.2
pandocfilters          1.4.2
partd                  1.2.0
pendulum               2.1.2
Pillow                 8.3.1
pip                    21.2.4
promise                2.3
protobuf               3.17.2
psutil                 5.8.0
psycopg2               2.9.1
psycopg2-binary        2.9.1
pycparser              2.20
Pygments               2.10.0
pyOpenSSL              20.0.1
pyparsing              2.4.7
pyrsistent             0.17.3
PySocks                1.7.1
python-dateutil        2.8.2
pytz                   2021.1
pytzdata               2020.1
PyYAML                 5.4.1
requests               2.26.0
Rx                     1.6.1
sentry-sdk             1.3.1
setuptools             57.4.0
Shapely                1.7.1
six                    1.16.0
sortedcontainers       2.4.0
soupsieve              2.0.1
SQLAlchemy             1.4.23
tabulate               0.8.9
tblib                  1.7.0
testpath               0.5.0
toolz                  0.11.1
toposort               1.6
tornado                6.1
tqdm                   4.62.2
traitlets              5.1.0
typing-compat          0.1.0
typing-extensions      3.10.0.0
tzlocal                1.5.1
ujson                  4.0.2
urllib3                1.26.6
watchdog               2.1.5
webencodings           0.5.1
Werkzeug               1.0.1
wheel                  0.37.0
xarray                 0.19.0
yarl                   1.6.3
zarr                   2.9.5
zict                   2.0.0
zipp                   3.5.0
zope.event             4.5.0
zope.interface         5.4.0
@martindurant
Member

Do you mind trying with main branch? We've been trying to nail down the thorny time types.

@abkfenris
Author

I'll give it a shot.

@lsterzinger
Collaborator

I just tested this on main on my system and got the same result as @abkfenris

@abkfenris
Author

Yep, still does the same thing on main.

@lsterzinger
Collaborator

So the date that should be 2021-08-01 is 2065-03-01, which is an offset of 15918 days.

The time units in the original .nc file are days since 1978-01-01 12:00:00. Coincidentally, that reference date is also 15918 days before 2021-08-01.

So it seems like MultiZarrToZarr is taking the day offset and applying it to the first datetime in the first data file, instead of applying it to 1978-01-01.
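The arithmetic behind this diagnosis can be checked with the standard library alone. This is a minimal sketch that ignores the 12:00:00 time-of-day in the units and works with whole dates. The variable names are illustrative and not taken from the library.

```python
from datetime import date, timedelta

epoch = date(1978, 1, 1)      # reference date from the file's time units
expected = date(2021, 8, 1)   # date of the first file in the URL list

# Number of days stored in the file for the first time step
offset_days = (expected - epoch).days

# Correct decoding: add the offset to the reference date
correct = epoch + timedelta(days=offset_days)

# Suspected bug: add the same offset to the first file's own date
shifted = expected + timedelta(days=offset_days)

print(offset_days, correct, shifted)
# 15918 2021-08-01 2065-03-01
```

The shifted value matches the 2065-03-01 seen in the combined dataset, which is consistent with the offset being applied to the wrong starting point.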

@martindurant
Member

@lsterzinger do you have time/interest to debug? First, I would set debug logging (e.g., fsspec.utils.setup_logging(logger_name="reference-combine")) and then second set a breakpoint in _build_output where the cftime stuff is to see how the numbers get manipulated.
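For anyone following along without the fsspec helper at hand, roughly the same effect can be had with the standard logging module. This is only a sketch: the logger name "reference-combine" is taken from the comment above, and `enable_debug` is a hypothetical helper, not part of fsspec or fsspec-reference-maker.

```python
import logging


def enable_debug(logger_name: str) -> logging.Logger:
    """Attach a stream handler to the named logger and set it to DEBUG."""
    logger = logging.getLogger(logger_name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger


# Logger name as given in the comment above (assumed to be what combine.py uses)
log = enable_debug("reference-combine")
log.debug("debug logging enabled")
```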

@lsterzinger
Collaborator

I do have interest in debugging, but my time is a bit limited these days. I might have time this afternoon/weekend to play around with things and see what's going on.

@lsterzinger
Collaborator

Turns out fixing this is much more fun than anything else I have to do today 😉

There was a missing calendar attribute that caused the datetime building to fail. I tested it on my end and it seems to work. @abkfenris can you try out the change in #75 and see if that works for you?

@abkfenris
Author

Hmm, I still appear to be seeing that offset when trying with your branch.

image

@lsterzinger
Collaborator

Did you regenerate the .json files? I copy pasted your generation script directly and re-generated the files (I had to comment out the *_preliminary.nc files because of a 404 error). I attached a zip of my combined.json for reference.

Make sure you're actually running on the branch

import fsspec_reference_maker
print(fsspec_reference_maker.__version__)

Should result in 0.0.2+3.gcdb6528

image
combined.zip

@abkfenris
Author

Yours does open correctly.

image

I did regenerate the json after rebuilding my environment on the branch and it still gives dates in 2065.

The 404 happened because 20210819 is no longer preliminary, so removing _preliminary from that URL should let you include it.

>>> import fsspec_reference_maker
>>> fsspec_reference_maker.__version__
'0+untagged.174.gcdb6528'

Here's the generated combined.json.zip and the Dockerfile, environment.yml, and test script: environment_and_test_script.zip

@lsterzinger
Collaborator

@abkfenris I don't think it will make a difference, but I did push another change to that branch. Can you try again?

@abkfenris
Author

I'm still getting the same result with 5072d61

@lsterzinger
Collaborator

That's super weird. I'm also on 5072d61, and I cloned your environment directly, and I get
image

@abkfenris
Author

Aha, I didn't have cftime in my environment, so it wasn't executing the code your branch changed; it was instead executing https://github.com/intake/fsspec-reference-maker/pull/75/files#diff-850b631beff65d5bd4abca60a56ef8308e345e5626bbb8a526f15d31c33a752bL192-L194.

image

And now with cftime, comparing your branch to the released version, your branch does fix the time offset.
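Since the missing package went unnoticed here, a quick check before regenerating references may save some time. The sketch below uses only the standard library; `has_module` is a hypothetical helper name, not something provided by fsspec-reference-maker.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# cftime is the optional dependency discussed in this thread
if not has_module("cftime"):
    print("cftime is not installed; time coordinates may not be decoded")
```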

Awesome, thank you!

@lsterzinger
Collaborator

Great to hear!

@martindurant It seems bad that this fails silently when cftime is not installed, because in that case this code does not parse the dates correctly:
https://github.com/intake/fsspec-reference-maker/blob/5072d614cbb6cfa0f497dece422a953c7c4812ab/fsspec_reference_maker/combine.py#L205-L207

Thoughts?

@martindurant
Member

It's a fair point, but I can't think of another way to say "see if this converts as times" (because most coordinates are not time, but it just so happens that all our examples are time series).

Note that @rabernat says we should just rely on xarray, but I haven't figured out yet how (because we have zarr arrays, not xarray datasets).

@abkfenris
Author

Is that except Exception block catching both the ImportError raised when cftime is not available and, it looks like, the ValueError raised when num2pydate can't convert a date?

Maybe throw a warning in the first case, and use the existing handling otherwise?
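The split proposed here could look roughly like the following. This is a sketch of the idea only, not the code in combine.py: `decode_times_or_passthrough` is a hypothetical name, and returning the input unchanged stands in for whatever the existing handling does.

```python
import warnings


def decode_times_or_passthrough(values, units, calendar="standard"):
    """Try to decode CF-style times; return the values unchanged if not possible."""
    try:
        import cftime
    except ImportError:
        # Case 1: optional dependency missing, so tell the user instead of failing silently
        warnings.warn(
            "cftime is not installed; time coordinates will not be decoded",
            RuntimeWarning,
        )
        return values
    try:
        return cftime.num2pydate(values, units, calendar=calendar)
    except ValueError:
        # Case 2: the coordinate is not a time, so keep the existing silent handling
        return values


# A coordinate whose units are not a time expression is passed through as-is
print(decode_times_or_passthrough([1, 2], "meters"))
```

With this shape, only the missing-dependency case produces a warning, and coordinates that are simply not times stay quiet.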

@lsterzinger
Collaborator

lsterzinger commented Sep 3, 2021 via email

@martindurant
Member

@abkfenris, that would be OK, except it may get annoying for those who have no idea what cftime is :)

@abkfenris
Author

If I'm understanding warning filters right, warnings.filterwarnings('ignore', module='fsspec_reference_maker') would help quiet things down in that case.
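A module-scoped filter of this kind can be tried out with the standard library alone. In the sketch below, `warnings.warn_explicit` simulates warnings coming from two different modules. The module names, file names and messages are made up for the demonstration. The `module` argument of `filterwarnings` is a regular expression matched against the start of the module name.

```python
import warnings


def emit(module_name: str, message: str) -> None:
    """Simulate a RuntimeWarning issued from the given module."""
    warnings.warn_explicit(
        message, RuntimeWarning, filename=module_name + ".py", lineno=1, module=module_name
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only warnings whose originating module name starts with this pattern
    warnings.filterwarnings("ignore", module=r"fsspec_reference_maker")
    emit("fsspec_reference_maker.combine", "cftime is not installed")
    emit("some_other_package", "unrelated warning")

kept = [str(w.message) for w in caught]
print(kept)
# ['unrelated warning']
```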

@rabernat
Contributor

rabernat commented Sep 3, 2021

> Note that @rabernat says we should just rely on xarray, but I haven't figured out yet how (because we have zarr arrays, not xarray datasets).

I have added comments to try to help with this: #70 (comment)

The path we are on will end up re-implementing all of Xarray's coding machinery in fsspec-reference-maker. This is not sustainable. I would suggest refactoring and removing these special-case workarounds as soon as possible, before the technical debt piles up.
