## A demo notebook for publishing datacubes and workflows to the EarthCODE catalog
### A DeepESDL example notebook

Please also refer to the [DeepESDL documentation](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/) and visit the platform's [website](https://www.earthsystemdatalab.net/) for further information!

Brockmann Consult, 2025

-----------------

**This notebook runs with the Python environment `users-deep-code-test`. Please check out the documentation for [help on changing the environment](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/#python-environment-selection-of-the-jupyter-kerne).**

### 📘 Prerequisite:
Before using the deep-code CLI or API to publish metadata, users must configure GitHub access by creating a .gitaccess file in the working directory from which deep-code is executed.

1. Generate a Personal Access Token (PAT) from your GitHub account:
    1. Navigate to GitHub → Settings → Developer settings → Personal access tokens.
    2. Click “Generate new token”.
    3. Choose the following scopes to ensure full access:
        - repo (full control of repositories, including fork, pull, push, and read)
    4. Generate the token and copy it immediately; GitHub won’t show it again.

2. Create a .gitaccess File

In the same directory where you run the deep-code commands, create a file named .gitaccess with the following content:
```
github-username: your-git-user
github-token: your-personal-access-token
```
Replace your-git-user and your-personal-access-token with your actual GitHub username and token.

This file is required to allow deep-code to fork the Open Science Metadata repository, commit metadata changes, and open a pull request to the EarthCODE Catalog.
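
If you prefer to create the file from Python rather than by hand, the following is a minimal sketch. The helper name `write_gitaccess` is hypothetical and not part of deep-code; it only writes the two keys shown above and restricts the file permissions, on the assumption that the token should not be readable by other users.

```python
import os
from pathlib import Path


def write_gitaccess(directory, username, token):
    """Write a .gitaccess file with the two keys described above.

    Hypothetical helper for illustration; not part of deep-code.
    """
    path = Path(directory) / ".gitaccess"
    path.write_text(
        f"github-username: {username}\n"
        f"github-token: {token}\n",
        encoding="utf-8",
    )
    # The file holds a secret, so limit access to the current user.
    # (On Windows, os.chmod only toggles the read-only flag.)
    os.chmod(path, 0o600)
    return path
```

A call such as `write_gitaccess(".", "your-git-user", "your-personal-access-token")` creates the file in the current working directory. Avoid hard-coding a real token in a notebook that will be shared, and keep `.gitaccess` out of version control.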

In [1]:
import os
import xcube
import warnings
import deep_code

from xcube.webapi.viewer import Viewer
from xcube.core.store import new_data_store
from deep_code.tools.lint import LintDataset
from deep_code.tools.publish import Publisher

In [2]:
warnings.filterwarnings('ignore')

## Generate starter configuration templates for publishing to the EarthCODE Open Science Catalog

In [None]:
!deep-code generate-config

## Create a small dataset from the xcube-cmems store

In [3]:
store = new_data_store("cmems")
store



<xcube_cmems.store.CmemsDataStore at 0x7ffa2ae076e0>

In [4]:
ds = store.open_data(
    "DMI-BALTIC-SST-L3S-NRT-OBS_FULL_TIME_SERIE",
    variable_names=["sea_surface_temperature"],
    bbox=[9, 53, 20, 62],
    time_range=("2022-01-01", "2022-01-05"),
)
ds

INFO - 2025-09-12T13:17:42Z - Selected dataset version: "201904"
INFO:copernicusmarine:Selected dataset version: "201904"
INFO - 2025-09-12T13:17:42Z - Selected dataset part: "default"
INFO:copernicusmarine:Selected dataset part: "default"


| | Array | Chunk |
|---|---|---|
| Bytes | 9.48 MiB | 1.76 MiB |
| Shape | (5, 451, 551) | (1, 451, 512) |
| Dask graph | 10 chunks in 2 graph layers | 10 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |
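
The sizes reported in the dataset summary can be cross-checked by hand. The sketch below recomputes them from the shape and chunk shape shown above, assuming 8 bytes per `float64` value; the helper names are illustrative only.

```python
import math


def array_nbytes(shape, itemsize=8):
    """Number of bytes for an array of the given shape (float64 = 8 bytes)."""
    return math.prod(shape) * itemsize


def chunk_count(shape, chunks):
    """Number of chunks needed to cover `shape` with blocks of size `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


shape = (5, 451, 551)   # (time, lat, lon) as reported in the summary
chunks = (1, 451, 512)  # chunk shape as reported in the summary

total_mib = array_nbytes(shape) / 2**20
chunk_mib = array_nbytes(chunks) / 2**20
n_chunks = chunk_count(shape, chunks)

print(f"{total_mib:.2f} MiB total, {chunk_mib:.2f} MiB per full chunk, {n_chunks} chunks")
# prints "9.48 MiB total, 1.76 MiB per full chunk, 10 chunks"
```

The 10 chunks come from 5 time steps times 2 blocks along the longitude axis (512 + 39 columns).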


## Lint your in-memory dataset for metadata correctness and completeness before publishing to the EarthCODE Open Science Catalog

In [None]:
linter = LintDataset(dataset=ds)
linter.lint_dataset()

## Fix the errors from the linter

Adding gcmd_keyword_url connects your data to a semantic network of Earth science concepts, enabling:

- Better automated discovery

- Stronger metadata interoperability

- Alignment with international FAIR standards

To find the GCMD URL for your variable, use the GCMD Keyword Viewer: https://gcmd.earthdata.nasa.gov/KeywordViewer/scheme/all?gtm_scheme=all

In [None]:
ds.attrs["description"] = (
    "This is an extracted dataset from the Copernicus Marine Data Store"
)

ds["sea_surface_temperature"].attrs["gcmd_keyword_url"] = "https://gcmd.earthdata.nasa.gov/KeywordViewer/scheme/all/e4d58a7f-7eaa-4f75-996a-18238c698063?gtm_keyword=SEA%20SURFACE%20FOUNDATION%20TEMPERATURE&gtm_scheme=Earth%20Science"
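
For datasets with many variables, it can help to list which ones still lack a `gcmd_keyword_url` before re-running the linter. The following is a minimal sketch; `missing_attr` is a hypothetical helper, not part of deep-code, and it accepts any mapping of names to objects with an `attrs` dictionary, which is how `ds.data_vars` behaves in xarray. The demo uses stand-in objects and a placeholder URL so that it runs without a dataset.

```python
from types import SimpleNamespace


def missing_attr(variables, attr_name="gcmd_keyword_url"):
    """Return the names of variables whose attrs lack a non-empty `attr_name`."""
    return [
        name
        for name, var in variables.items()
        if not var.attrs.get(attr_name)
    ]


# Stand-in variables for demonstration; the URL is a placeholder.
demo_vars = {
    "sea_surface_temperature": SimpleNamespace(
        attrs={"gcmd_keyword_url": "https://example.org/keyword"}
    ),
    "mask": SimpleNamespace(attrs={}),
}

print(missing_attr(demo_vars))
# prints "['mask']"
```

In this notebook the equivalent call would be `missing_attr(ds.data_vars)`.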

## Write the dataset to the team S3 bucket

In [None]:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
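
Reading `os.environ[...]` directly raises a `KeyError` for the first missing variable only. As an optional alternative, the sketch below collects all missing names into one error message. The helper `read_required_env` is hypothetical; the variable names are the ones used in this notebook.

```python
import os

REQUIRED_VARS = (
    "S3_USER_STORAGE_KEY",
    "S3_USER_STORAGE_SECRET",
    "S3_USER_STORAGE_BUCKET",
)


def read_required_env(names, environ=None):
    """Return {name: value} for all names, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Calling `read_required_env(REQUIRED_VARS)` in the DeepESDL environment should return the three values, assuming the platform has set them for your team.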

In [None]:
team_store = new_data_store(
    "s3", 
    root=S3_USER_STORAGE_BUCKET, 
    storage_options=dict(
        anon=False, 
        key=S3_USER_STORAGE_KEY, 
        secret=S3_USER_STORAGE_SECRET
    )
)

In [None]:
team_store.write_data(ds, "cmems_sst_v2.zarr", replace=True)

The user workflow, i.e. the Jupyter notebook, must be pushed to a Git repository, for example: https://github.com/deepesdl/cube-gen/blob/main/Permafrost/Create-CCI-Permafrost-cube-EarthCODE.ipynb

# 📘 Publishing Metadata to the EarthCODE Catalogue

Once the dataset and workflow metadata are prepared and validated, users can start the publishing process with the deep-code CLI or the Python API. A single publish command automates the entire workflow.

## 🔹 The publish command performs the following steps:

1. Generates valid STAC and OGC API Records based on the provided configuration files

2. Forks the open-science-catalog-metadata repository on GitHub

3. Inserts the generated records into the correct directory structure

4. Creates a Pull Request (PR) for review by the Open Science Catalog steward
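
Because these steps depend on the configuration files and on the `.gitaccess` file being present in the working directory, a quick pre-flight check can save a failed run. This is a minimal sketch with a hypothetical helper; the file names are the ones used in this notebook and may differ in your setup.

```python
from pathlib import Path

# File names as used in this notebook; adjust if yours differ.
REQUIRED_FILES = ("dataset-config.yaml", "workflow-config.yaml", ".gitaccess")


def missing_publish_files(workdir=".", required=REQUIRED_FILES):
    """Return the required files that are not present in `workdir`."""
    base = Path(workdir)
    return [name for name in required if not (base / name).is_file()]
```

An empty list from `missing_publish_files()` means all three files were found in the current directory; it does not validate their contents.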

## Publish using the Python function

In [6]:
# publish using the python function
publisher = Publisher(
    dataset_config_path="dataset-config.yaml",
    workflow_config_path="workflow-config.yaml",
    environment="staging",
)
publisher.publish_all()

INFO:root:Forking repository...
INFO:root:Repository forked to tejasmharish/open-science-catalog-metadata-staging
INFO:root:Checking local repository...
INFO:root:Cloning forked repository...
Cloning into '/home/tejas/temp_repo'...
INFO:root:Repository cloned to /home/tejas/temp_repo
INFO:deep_code.tools.publish:Generating STAC collection...
INFO:deep_code.utils.dataset_stac_generator:Attempting to open dataset 'cmems_sst_v2.zarr' with configuration: Public store
INFO:deep_code.utils.dataset_stac_generator:Successfully opened dataset 'cmems_sst_v2.zarr' with configuration: Public store
INFO:deep_code.tools.publish:Variable catalog already exists for sea-surface-temperature, adding product link.
INFO:deep_code.tools.publish:Generating OGC API Record for the workflow...
INFO:root:Creating new branch: add-new-collection-cmems-sst-20250912151834...
Switched to a new branch 'add-new-collection-cmems-sst-20250912151834'
INFO:deep_code.tools.publish:Adding products/cmems-sst/collection.json t

[add-new-collection-cmems-sst-20250912151834 c3381005] Add new dataset collection: cmems-sst and workflow/experiment: lps-demo-cmems-sst-workflow
 9 files changed, 434 insertions(+), 6 deletions(-)
 create mode 100644 experiments/lps-demo-cmems-sst-workflow/record.json
 create mode 100644 products/cmems-sst/collection.json
 create mode 100644 workflows/lps-demo-cmems-sst-workflow/record.json


remote: 
remote: Create a pull request for 'add-new-collection-cmems-sst-20250912151834' on GitHub by visiting:        
remote:      https://github.com/tejasmharish/open-science-catalog-metadata-staging/pull/new/add-new-collection-cmems-sst-20250912151834        
remote: 
To https://github.com/tejasmharish/open-science-catalog-metadata-staging.git
 * [new branch]        add-new-collection-cmems-sst-20250912151834 -> add-new-collection-cmems-sst-20250912151834
INFO:root:Creating a pull request...


Branch 'add-new-collection-cmems-sst-20250912151834' set up to track remote branch 'add-new-collection-cmems-sst-20250912151834' from 'origin'.


INFO:root:Pull request created: https://github.com/ESA-EarthCODE/open-science-catalog-metadata-staging/pull/156
INFO:deep_code.tools.publish:Pull request created: None
INFO:root:Cleaning up local repository...
INFO:deep_code.tools.publish:Pull request created: None


## Publish using the CLI

In [None]:
!deep-code publish dataset-config.yaml workflow-config.yaml -e staging
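
Outside a notebook, where the `!` shell escape is not available, the same command can be launched from a plain Python script. The sketch below only assembles the arguments shown in the cell above and passes them to `subprocess.run`; the helper names are hypothetical, and it assumes the `deep-code` executable is on your `PATH`.

```python
import subprocess


def build_publish_command(dataset_config, workflow_config, environment):
    """Assemble the argument list for the CLI call shown above."""
    return [
        "deep-code",
        "publish",
        dataset_config,
        workflow_config,
        "-e",
        environment,
    ]


def run_publish(dataset_config, workflow_config, environment):
    """Run the publish command; raises CalledProcessError on a non-zero exit."""
    cmd = build_publish_command(dataset_config, workflow_config, environment)
    return subprocess.run(cmd, check=True)
```

For example, `run_publish("dataset-config.yaml", "workflow-config.yaml", "staging")` corresponds to the cell above. Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.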