# Create Kerchunk catalog for CMEMS


## Set up credentials on GFTS buckets

Credentials are stored in the `gfts` profile of your `~/.aws/credentials` file. This file is generated automatically on the GFTS JupyterHub.

You can view them with `cat ~/.aws/credentials`.

- Access keys are in the profile named `gfts`.
- `endpoint_url` is `https://s3.gra.perf.cloud.ovh.net`.
- `region_name` is `gra`.

You should have read and write permissions on the bucket, but not delete permission.
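To confirm that the `gfts` profile exists without printing the secret values, a minimal sketch using only the standard library could look like this. The helper name `describe_profile` is hypothetical, and the sketch assumes the usual INI layout of an AWS credentials file.

```python
import configparser
from pathlib import Path


def describe_profile(path, profile="gfts"):
    """Return the option names stored under `profile`, without their values.

    `describe_profile` is a hypothetical helper, not part of any library.
    """
    parser = configparser.ConfigParser()
    # read() silently skips missing files, so the section is checked explicitly
    parser.read(Path(path).expanduser())
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {path}")
    return sorted(parser.options(profile))
```

Calling `describe_profile("~/.aws/credentials")` on the JupyterHub should list the key names of the `gfts` profile and raise `KeyError` if the profile is missing.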

## Rclone to copy/sync data from Copernicus Marine to GFTS bucket



### Set up Rclone config file

[Rclone](https://rclone.org) is configured automatically on the GFTS JupyterHub. You can view the Rclone config file with `cat ~/.config/rclone/rclone.conf`:

```
[cmarine]
type = s3
provider = Other
endpoint = https://s3.waw3-1.cloudferro.com
acl = public-read

[gfts]
type = s3
provider = Other
env_auth = true
region = gra
endpoint = https://s3.gra.perf.cloud.ovh.net
```

## Copy CMEMS files with rclone

We copy the CMEMS files we need to our bucket because they are not yet available on DestinE DEDL.

- 2D and 3D datasets
```
./rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309 gfts:gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/
./rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT15M-i_202309 gfts:gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT15M-i_202309/
./rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309 gfts:gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/
./rclone copy cmarine:mdl-native-10/native/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211 gfts:gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211/

```
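The four commands above follow one pattern: the product and dataset path on the `cmarine` remote is mirrored under `gfts-reference-data` on the `gfts` remote. A minimal sketch that builds the commands from (source bucket, product, dataset) triples is shown below. The helper name `rclone_copy_command` is hypothetical; the `./rclone` location is taken from the commands above.

```python
import shlex

SOURCE_REMOTE = "cmarine"
DEST = "gfts:gfts-reference-data"

# (source bucket, product, dataset) for each copy listed above
DATASETS = [
    ("mdl-native-13", "NWSHELF_ANALYSISFORECAST_PHY_004_013",
     "cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309"),
    ("mdl-native-13", "NWSHELF_ANALYSISFORECAST_PHY_004_013",
     "cmems_mod_nws_phy_anfc_0.027deg-2D_PT15M-i_202309"),
    ("mdl-native-13", "NWSHELF_ANALYSISFORECAST_PHY_004_013",
     "cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309"),
    ("mdl-native-10", "IBI_ANALYSISFORECAST_PHY_005_001",
     "cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211"),
]


def rclone_copy_command(bucket, product, dataset, rclone="./rclone"):
    """Build the argument list for one `rclone copy` call (hypothetical helper)."""
    src = f"{SOURCE_REMOTE}:{bucket}/native/{product}/{dataset}"
    dst = f"{DEST}/{product}/{dataset}/"
    return [rclone, "copy", src, dst]


for bucket, product, dataset in DATASETS:
    # Print the commands for review instead of running them
    print(shlex.join(rclone_copy_command(bucket, product, dataset)))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.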


In [2]:
# !cat ~/.aws/credentials

In [2]:
import s3fs
import xarray as xr

In [3]:
s3 = s3fs.S3FileSystem(
    anon=False,
    profile="gfts",
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
)

## Create catalog for NWSHELF ANALYSIS FORECAST data

In [7]:
bucket_name = "gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309"
s3.ls(bucket_name)

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309']

In [5]:
s3.ls("gfts-reference-data/")

['gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_combined.json',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_combined.parq',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_3D.parq',
 'gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013']

In [6]:
s3path = "s3://gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/*/*/*.nc"

In [8]:
remote_files = s3.glob(s3path)
remote_files

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220402_20220402_R20220418_AN04.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220403_20220403_R20220418_AN05.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220404_20220404_R20220418_AN06.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220405_20220405_R20220418_AN07.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220406_20220406_R20220425_AN01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_n
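Before building references it can help to check that no daily file is missing from the listing. The sketch below parses the first date in each file name and reports gaps. The interpretation of the name (`<prefix>_<start>_<end>_R<run date>_<suffix>.nc`) is an assumption based on the listing above, and `field_date` and `missing_days` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: <prefix>_<start>_<end>_R<run date>_<suffix>.nc
PATTERN = re.compile(r"_(\d{8})_(\d{8})_R(\d{8})_(\w+)\.nc$")


def field_date(path):
    """Return the first date embedded in the file name as a `date`."""
    match = PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def missing_days(paths):
    """List the days between the first and last file that have no file."""
    have = {field_date(p) for p in paths}
    if not have:
        return []
    first, last = min(have), max(have)
    n_days = (last - first).days
    candidates = (first + timedelta(days=i) for i in range(n_days + 1))
    return [day for day in candidates if day not in have]
```

With the listing above, `missing_days(remote_files)` returns an empty list when the series is continuous.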

In [10]:
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

In [11]:
fs = fsspec.filesystem(
    "s3",
    anon=False,
    profile="gfts",
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
)

In [11]:
fs_files = fs.glob(s3path)

In [12]:
fs_files

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220402_20220402_R20220418_AN04.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220403_20220403_R20220418_AN05.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220404_20220404_R20220418_AN06.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220405_20220405_R20220418_AN07.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202309/2022/04/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220406_20220406_R20220425_AN01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_n

In [15]:
import ujson

In [16]:
fs2 = fsspec.filesystem("")  # local file system to save final jsons to
so = dict(
    mode="rb", anon=True, default_fill_cache=False, default_cache_type="first"
)  # args to fs.open()
# default_fill_cache=False avoids caching data between file chunks, which lowers memory usage.

In [17]:
def gen_json(fs, fs2, so, file_url):
    with fs.open(file_url, **so) as infile:
        h5chunks = SingleHdf5ToZarr(infile, file_url, inline_threshold=300)
        # inline_threshold sets the size (in bytes) below which binary blocks are included directly in the output;
        # a higher threshold can give a larger JSON file but faster loading
        name = file_url.split("/")[-1].split(".")[0]
        outf = f"{name}.json"  # file name to save json to
        print(outf)
        with fs2.open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
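To see the effect of `inline_threshold`, the generated reference can be inspected. The sketch below assumes the version 1 reference layout, where the `refs` mapping holds Zarr metadata keys (`.zgroup`, `.zarray`, `.zattrs`), inline chunks stored as strings, and remote chunks stored as `[url, offset, length]` lists. `summarize_refs` is a hypothetical helper and the example dictionary is made up for illustration.

```python
def summarize_refs(reference):
    """Count metadata, inline and remote entries in a reference dict."""
    # Accept either the full document or the bare "refs" mapping
    refs = reference.get("refs", reference)
    summary = {"metadata": 0, "inline": 0, "remote": 0}
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name.startswith(".z"):
            summary["metadata"] += 1  # .zgroup, .zarray, .zattrs
        elif isinstance(value, list):
            summary["remote"] += 1  # [url, offset, length] pointer
        else:
            summary["inline"] += 1  # chunk bytes embedded as a string
    return summary


# Made-up example, not taken from a real CMEMS file
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/.zarray": "{}",
        "time/0": "base64:AAAA",
        "zos/.zarray": "{}",
        "zos/0.0.0": ["s3://bucket/file.nc", 1024, 4096],
    },
}
print(summarize_refs(example))  # {'metadata': 3, 'inline': 1, 'remote': 1}
```

Running it on a file written by `gen_json` (loaded with `json.load`) shows how many small chunks were inlined for the chosen threshold.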

In [14]:
%%time
for file in fs_files:
    gen_json(fs, fs2, so, file)

CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220402_20220402_R20220418_AN04.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220403_20220403_R20220418_AN05.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220404_20220404_R20220418_AN06.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220405_20220405_R20220418_AN07.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220406_20220406_R20220425_AN01.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220407_20220407_R20220425_AN02.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220408_20220408_R20220425_AN03.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220409_20220409_R20220425_AN04.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220410_20220410_R20220425_AN05.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220411_20220411_R20220425_AN06.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220412_20220412_R20220425_AN07.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220413_20220413_R20220502_AN01.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220414_20220414_R20220502_AN02.json
CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220415_20220415_R20220502_AN03.json
CMEMS_v6r1_NWS_PHY_N
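The loop above processes every file on each run. If it is interrupted, a small filter can skip the files whose JSON already exists locally. This is a sketch only: `json_name` and `pending_files` are hypothetical helpers, and `json_name` repeats the naming rule used in `gen_json`.

```python
import os


def json_name(file_url):
    """Same naming rule as gen_json: base name without extension plus .json."""
    return file_url.split("/")[-1].split(".")[0] + ".json"


def pending_files(file_urls, out_dir="."):
    """Return the URLs whose reference JSON is not yet present in out_dir."""
    return [
        url
        for url in file_urls
        if not os.path.exists(os.path.join(out_dir, json_name(url)))
    ]
```

The loop then becomes `for file in pending_files(fs_files): gen_json(fs, fs2, so, file)`.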

In [15]:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_20220402_20220402_R20220418_AN04.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": False},
        },
    },
)
print(ds)

<xarray.Dataset> Size: 693MB
Dimensions:    (latitude: 551, longitude: 936, time: 24)
Coordinates:
  * latitude   (latitude) float32 2kB 46.0 46.03 46.06 ... 61.23 61.25 61.28
  * longitude  (longitude) float32 4kB -16.0 -15.97 -15.94 ... 9.921 9.949 9.977
  * time       (time) datetime64[ns] 192B 2022-04-02T00:30:00 ... 2022-04-02T...
Data variables:
    mlotst     (time, latitude, longitude) float64 99MB ...
    thetao     (time, latitude, longitude) float64 99MB ...
    ubar       (time, latitude, longitude) float64 99MB ...
    uo         (time, latitude, longitude) float64 99MB ...
    vbar       (time, latitude, longitude) float64 99MB ...
    vo         (time, latitude, longitude) float64 99MB ...
    zos        (time, latitude, longitude) float64 99MB ...
Attributes: (12/13)
    Conventions:     CF-1.8
    comment:         
    contact:         https://marine.copernicus.eu/contact
    domain_name:     NWS36
    field_date:      20220402
    field_type:      mean
    ...        

In [21]:
from kerchunk.combine import MultiZarrToZarr

In [17]:
json_list = fs2.glob("CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_*_*_*_*.json")

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={"anon": False},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)

d = mzz.translate()

In [18]:
%%time
backend_args = {
    "consolidated": False,
    "storage_options": {
        "fo": d,
        "remote_protocol": "s3",
        "remote_options": {"anon": False},
    },
}
print(xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args))

<xarray.Dataset> Size: 521GB
Dimensions:    (latitude: 551, longitude: 936, time: 18048)
Coordinates:
  * latitude   (latitude) float32 2kB 46.0 46.03 46.06 ... 61.23 61.25 61.28
  * longitude  (longitude) float32 4kB -16.0 -15.97 -15.94 ... 9.921 9.949 9.977
  * time       (time) datetime64[ns] 144kB 2022-04-02T00:30:00 ... 2024-04-22...
Data variables:
    mlotst     (time, latitude, longitude) float64 74GB ...
    thetao     (time, latitude, longitude) float64 74GB ...
    ubar       (time, latitude, longitude) float64 74GB ...
    uo         (time, latitude, longitude) float64 74GB ...
    vbar       (time, latitude, longitude) float64 74GB ...
    vo         (time, latitude, longitude) float64 74GB ...
    zos        (time, latitude, longitude) float64 74GB ...
Attributes: (12/13)
    Conventions:     CF-1.8
    comment:         
    contact:         https://marine.copernicus.eu/contact
    domain_name:     NWS36
    field_date:      20220402
    field_type:      mean
    ...     
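The sizes that xarray reports are the in-memory sizes of the variables, and they can be checked by hand from the dimensions. The sketch below reproduces the figures for the single-file and combined datasets; the number of daily files assumes that every file holds 24 hourly steps, like the one opened earlier.

```python
LAT, LON = 551, 936
BYTES_PER_VALUE = 8  # float64
N_VARIABLES = 7  # mlotst, thetao, ubar, uo, vbar, vo, zos


def variable_bytes(n_time):
    """Size in bytes of one (time, latitude, longitude) float64 variable."""
    return n_time * LAT * LON * BYTES_PER_VALUE


single = variable_bytes(24)
combined = variable_bytes(18048)

print(f"{single / 1e6:.0f} MB per variable, {N_VARIABLES * single / 1e6:.0f} MB in total")
# 99 MB per variable, 693 MB in total
print(f"{combined / 1e9:.0f} GB per variable, {N_VARIABLES * combined / 1e9:.0f} GB in total")
# 74 GB per variable, 521 GB in total
print(18048 // 24, "daily files")
# 752 daily files
```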

In [19]:
# Do not save the combined references as JSON: the file is too big and slow to load
# with fs2.open('CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.json', 'wb') as f:
#    f.write(ujson.dumps(d).encode())

In [20]:
from kerchunk import df

In [21]:
df.refs_to_dataframe(d, "CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq")

In [22]:
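import fsspec.implementations.reference  # load the submodule used below

In [23]:
# Use a separate name so the authenticated S3 filesystem `fs` is not overwritten
ref_fs = fsspec.implementations.reference.ReferenceFileSystem(
    "CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq",
    remote_protocol="s3",
    target_protocol="file",
    lazy=True,
)
dset = xr.open_dataset(
    ref_fs.get_mapper(), engine="zarr", backend_kwargs={"consolidated": False}
)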

In [24]:
dset

### Copy the Parquet catalog into the remote bucket with rclone

In [None]:
#!rclone copy CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.json gfts:gfts-reference-data/

In [25]:
!rclone copy CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq gfts:gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq

In [26]:
s3.ls("gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq")

['gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/.zmetadata',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/latitude',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/longitude',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/mlotst',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/thetao',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/time',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/ubar',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/uo',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/vbar',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/vo',
 'gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_2D_combined.parq/zos']

## Create catalog for IBI ANALYSISFORECAST

In [6]:
bucket_name = "gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211"
s3.ls(bucket_name)

['gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211/2022',
 'gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211/2023',
 'gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211/2024']

In [7]:
s3path = "s3://gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-2D_PT1H-m_202211/*/*/*.nc"

In [12]:
fs_files = fs.glob(s3path)

In [18]:
fs2 = fsspec.filesystem("")  # local file system to save final jsons to
so = dict(
    mode="rb", anon=True, default_fill_cache=False, default_cache_type="first"
)  # args to fs.open()
# default_fill_cache=False avoids caching data between file chunks, which lowers memory usage.

In [19]:
%%time
for file in fs_files:
    gen_json(fs, fs2, so, file)

CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220402_20220402_R20220418_AN04.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220403_20220403_R20220418_AN05.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220404_20220404_R20220418_AN06.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220405_20220405_R20220418_AN07.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220406_20220406_R20220425_AN01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220407_20220407_R20220425_AN02.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220408_20220408_R20220425_AN03.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220409_20220409_R20220425_AN04.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220410_20220410_R20220425_AN05.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220411_20220411_R20220425_AN06.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220412_20220412_R20220425_AN07.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220413_20220413_R20220502_AN01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220414_20220414_R20220502_AN02.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_20220415_20220415_R20220502_AN03.json
CMEMS_v6r1_IBI_PHY_N

In [22]:
json_list = fs2.glob("CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_*_*_*_*.json")

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={"anon": False},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)

d = mzz.translate()

### Save into parquet

In [23]:
from kerchunk import df

In [24]:
df.refs_to_dataframe(d, "CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_2D_combined.parq")

### Copy to our bucket

In [25]:
!rclone copy CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_2D_combined.parq gfts:gfts-reference-data/CMEMS_v6r1_IBI_PHY_NRT_NL_01hav_2D_combined.parq