---
title: Create STAC metadata for MUR SST 
description: Tutorial for creating STAC metadata for a collection in CMR
author: Aimee Barciauskas
date: January 18, 2024
execute:
  freeze: true
  cache: true
---

## Run this notebook

You can launch this notebook in VEDA JupyterHub by clicking the link below.

[Launch in VEDA JupyterHub (requires access)](https://nasa-veda.2i2c.cloud/hub/user-redirect/git-pull?repo=https://github.com/NASA-IMPACT/veda-docs&urlpath=lab/tree/veda-docs/notebooks/veda-operations/publish-cmip6-kerchunk-stac.ipynb&branch=main) 

<details><summary>Learn more</summary>
    
### Inside the Hub

This notebook was written on a VEDA JupyterHub instance.

See [VEDA Analytics JupyterHub Access](https://nasa-impact.github.io/veda-docs/veda-jh-access.html) for information about how to gain access.

### Outside the Hub

You are welcome to run this notebook anywhere you like, for example on https://daskhub.veda.smce.nasa.gov/, on MAAP, or locally. Just make sure that the data is accessible, or contact the VEDA team to enable access.

</details>

## Approach

This notebook creates STAC collection metadata for the [MUR SST](https://search.earthdata.nasa.gov/search/granules?p=C1996881146-POCLOUD) dataset. 

## Step 1: Install and import necessary libraries

In [1]:
%%capture
!pip install xstac

In [2]:
import earthaccess
import json
from datetime import datetime
import pandas as pd
import pystac
import requests
import s3fs
import xstac
import xarray as xr

## Step 2: Get Collection metadata from CMR

In [5]:
earthaccess.login()

EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
No .netrc found in /home/jovyan


Enter your Earthdata Login username:  aimeeb
Enter your Earthdata password:  ········


You're now authenticated with NASA Earthdata Login
Using token with expiration date: 03/25/2024
Using user provided credentials for EDL


<earthaccess.auth.Auth at 0x7fcd00caf640>

In [72]:
collection_identifier = 'MUR-JPL-L4-GLOB-v4.1'
# collection_identifier = 'GPM_3IMERGDF.07'
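# Read the per-collection settings; the context manager closes the file afterwards
with open('collection-configs.json') as f:
    collection_configs = json.load(f)
collection = collection_configs[collection_identifier]
# Positional unpacking: this relies on the key order in collection-configs.json
short_name, version, temporal_step, variables, reference_system = collection.values()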
collection_query = earthaccess.collection_query()
r = collection_query.short_name(short_name).version(version)
cmr_collection = r.get(1)[0]
# cmr_collection
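
The cell above expects each entry in `collection-configs.json` to carry five values in a fixed order. The file itself is not shown in this notebook, so the sketch below uses a hypothetical entry: the key names and every value are assumptions for illustration only. It also shows a key-based lookup, which may be more robust than positional unpacking if the key order in the file ever changes.

```python
import json

# Hypothetical contents of collection-configs.json. Every key name and value
# here is an assumption made for illustration, not the real file.
example_configs = json.loads("""
{
  "MUR-JPL-L4-GLOB-v4.1": {
    "short_name": "MUR-JPL-L4-GLOB-v4.1",
    "version": "4.1",
    "temporal_step": "P1D",
    "variables": {
      "analysed_sst": {"colormap": "viridis", "rescale": [271, 305]}
    },
    "reference_system": 4326
  }
}
""")


def unpack_config(config):
    """Look fields up by key so the result does not depend on key order."""
    keys = ("short_name", "version", "temporal_step", "variables", "reference_system")
    return tuple(config[key] for key in keys)


example = unpack_config(example_configs["MUR-JPL-L4-GLOB-v4.1"])
print(example)
```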

Pick one granule to open. Its dimensions and variables are used later to describe the data cube.

In [73]:
first_result = earthaccess.search_data(
    short_name=short_name,
    version=version,
    cloud_hosted=True,
    count=1
)

Granules found: 7910


In [74]:
first_result

[Collection: {'Version': '4.1', 'ShortName': 'MUR-JPL-L4-GLOB-v4.1'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': -180, 'SouthBoundingCoordinate': -90, 'EastBoundingCoordinate': 180, 'NorthBoundingCoordinate': 90}]}}}
 Temporal coverage: {'RangeDateTime': {'EndingDateTime': '2002-06-01T21:00:00.000Z', 'BeginningDateTime': '2002-05-31T21:00:00.000Z'}}
 Size(MB): 0
 Data: ['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']]

### Sidebar: S3 vs HTTPS access

Right now, earthaccess only supports `open` over HTTPS. Opening the dataset with direct S3 access, as done below, is many times faster.

In [75]:
files = earthaccess.open(first_result)

 Opening 1 granules, approx size: 0.0 GB


QUEUEING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/1 [00:00<?, ?it/s]

Uncomment and run the code below if you want to compare how long it takes to open the data over HTTPS versus S3.

In [76]:
# %%time
# ds_https = xr.open_mfdataset(files)

In [77]:
s3_link = first_result[0].data_links(access='direct')[0]
s3_link

's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
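
Comparing the HTTPS link in the granule output with the S3 link above suggests that, for this granule, the first path segment of the HTTPS URL is the bucket name. The sketch below reproduces that mapping with the standard library only. The helper name `https_to_s3` is hypothetical, and the pattern is an observation from these two links, not a documented rule, so `data_links(access='direct')` remains the safer way to obtain the S3 link.

```python
from urllib.parse import urlparse


def https_to_s3(https_url: str) -> str:
    """Build an s3:// URL, assuming the first path segment is the bucket name."""
    path = urlparse(https_url).path.lstrip("/")
    bucket, _, key = path.partition("/")
    return f"s3://{bucket}/{key}"


https_url = (
    "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/"
    "MUR-JPL-L4-GLOB-v4.1/"
    "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
)
s3_url = https_to_s3(https_url)
print(s3_url)
```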

In [78]:
%%time
fs = s3fs.S3FileSystem(anon=False)
# Open the dataset using xarray
# The 'chunks' parameter enables Dask for lazy loading
ds_s3 = xr.open_dataset(fs.open(s3_link), engine='h5netcdf', chunks={})

CPU times: user 98.3 ms, sys: 7.31 ms, total: 106 ms
Wall time: 285 ms


In [79]:
#ds_s3

## Step 3: Generate STAC metadata

The spatial and temporal extents are extracted from the CMR collection metadata.

In [80]:
spatial_extent = cmr_collection['umm']['SpatialExtent']
bounding_rectangle = spatial_extent['HorizontalSpatialDomain']['Geometry']['BoundingRectangles'][0]
extent_list = [
    bounding_rectangle["WestBoundingCoordinate"],
    bounding_rectangle["SouthBoundingCoordinate"],   
    bounding_rectangle["EastBoundingCoordinate"],
    bounding_rectangle["NorthBoundingCoordinate"],    
]
# Use floats so that fractional bounding coordinates are not truncated
spatial_extent = [float(coordinate) for coordinate in extent_list]

temporal_extent = cmr_collection['umm']['TemporalExtents'][0]['RangeDateTimes'][0]
start = temporal_extent['BeginningDateTime']
end = temporal_extent.get('EndingDateTime', None)

extent = pystac.Extent(
    spatial=pystac.SpatialExtent(bboxes=[spatial_extent]),
    temporal=pystac.TemporalExtent([[pd.to_datetime(start), pd.to_datetime(end)]])
)
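
To make the extraction above easier to follow without a CMR connection, here is a minimal standard-library sketch that runs the same lookups on a mock UMM record. The mock values are illustrative (the bounding box matches the granule output shown earlier, and the start time is an assumption), and the helper names `parse_cmr_time` and `extent_from_umm` are hypothetical. It shows that a missing `EndingDateTime` yields an open-ended interval.

```python
from datetime import datetime, timezone


def parse_cmr_time(value):
    """Parse a timestamp like '2002-05-31T21:00:00.000Z'; pass None through."""
    if value is None:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def extent_from_umm(umm):
    """Return (bbox, interval) from the UMM fields used in this step."""
    geometry = umm["SpatialExtent"]["HorizontalSpatialDomain"]["Geometry"]
    rect = geometry["BoundingRectangles"][0]
    bbox = [
        float(rect[f"{side}BoundingCoordinate"])
        for side in ("West", "South", "East", "North")
    ]
    time_range = umm["TemporalExtents"][0]["RangeDateTimes"][0]
    interval = [
        parse_cmr_time(time_range["BeginningDateTime"]),
        parse_cmr_time(time_range.get("EndingDateTime")),
    ]
    return bbox, interval


# Mock record with illustrative values; not a real CMR response.
mock_umm = {
    "SpatialExtent": {"HorizontalSpatialDomain": {"Geometry": {"BoundingRectangles": [{
        "WestBoundingCoordinate": -180,
        "SouthBoundingCoordinate": -90,
        "EastBoundingCoordinate": 180,
        "NorthBoundingCoordinate": 90,
    }]}}},
    "TemporalExtents": [{"RangeDateTimes": [
        {"BeginningDateTime": "2002-05-31T21:00:00.000Z"},
    ]}],
}

bbox, interval = extent_from_umm(mock_umm)
print(bbox, interval)
```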

Add the provider information from CMR.

In [81]:
cmr_roles_to_pystac_roles = {
    'PROCESSOR': pystac.ProviderRole.PROCESSOR,
    'DISTRIBUTOR': pystac.ProviderRole.HOST
}
def create_providers_from_data_centers(data_centers):
    providers = []

    for center in data_centers:
        # Extracting necessary information from each data center
        short_name = center.get("ShortName", "")
        long_name = center.get("LongName", "")
        roles = []
        for role in center.get("Roles", []):
            if role in cmr_roles_to_pystac_roles:
                roles.append(cmr_roles_to_pystac_roles[role])
        url = next((url_info["URL"] for url_info in center.get("ContactInformation", {}).get("RelatedUrls", []) 
                    if url_info.get("URLContentType") == "DataCenterURL"), None)

        # Creating a PySTAC Provider object
        provider = pystac.Provider(name=short_name, description=long_name, roles=roles, url=url)
        providers.append(provider)

    return providers

data_centers = cmr_collection['umm']['DataCenters']
providers = create_providers_from_data_centers(data_centers)
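
As a worked example of the role mapping, the sketch below applies the same logic to a mock data center using plain dictionaries. The strings `"processor"` and `"host"` stand in for `pystac.ProviderRole.PROCESSOR` and `pystac.ProviderRole.HOST`; the data center record and the helper name `provider_dict` are hypothetical. Note that CMR roles without a mapping, such as `ARCHIVER`, are dropped.

```python
# Plain-string stand-ins for the pystac.ProviderRole values used above.
CMR_TO_STAC_ROLE = {"PROCESSOR": "processor", "DISTRIBUTOR": "host"}


def provider_dict(center):
    """Build a provider-like dict from one CMR data center record."""
    roles = [
        CMR_TO_STAC_ROLE[role]
        for role in center.get("Roles", [])
        if role in CMR_TO_STAC_ROLE
    ]
    related_urls = center.get("ContactInformation", {}).get("RelatedUrls", [])
    # Take the first URL flagged as the data center's own page, if any
    url = next(
        (info["URL"] for info in related_urls
         if info.get("URLContentType") == "DataCenterURL"),
        None,
    )
    return {
        "name": center.get("ShortName", ""),
        "description": center.get("LongName", ""),
        "roles": roles,
        "url": url,
    }


# Mock data center with made-up values, for illustration only.
mock_center = {
    "ShortName": "EXAMPLE-DAAC",
    "LongName": "Example Distributed Active Archive Center",
    "Roles": ["ARCHIVER", "DISTRIBUTOR"],
    "ContactInformation": {"RelatedUrls": [
        {"URLContentType": "DataCenterURL", "URL": "https://example.org/"},
    ]},
}

provider = provider_dict(mock_center)
print(provider)
```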

Put it all together to initialize a `pystac.Collection` instance.

In [82]:
_id = short_name.replace('.', '_')
description = cmr_collection['umm']['Abstract']
concept_id = cmr_collection['meta']['concept-id']
pystac_collection = pystac.Collection(
    id=_id,
    extent=extent,
    description=description,
    providers=providers,
    stac_extensions=['https://stac-extensions.github.io/datacube/v2.0.0/schema.json'],
    license="CC0-1.0",
    extra_fields={'collection_concept_id': concept_id}
)

That collection instance is used by `xstac` to generate additional metadata, specifically for the [`datacube extension`](https://github.com/stac-extensions/datacube) information.

In [83]:
collection_template = pystac_collection.to_dict()

# see https://github.com/stac-utils/xstac/issues/30
for k, v in ds_s3.variables.items():
    attrs = {name: xr.backends.zarr.encode_zarr_attr_value(value) for name, value in v.attrs.items()}
    ds_s3[k].attrs = attrs
        
collection = xstac.xarray_to_stac(
    ds_s3,
    collection_template,
    temporal_dimension='time',
    temporal_step=temporal_step,
    x_dimension="lon",
    y_dimension="lat",
    reference_system=reference_system,
    validate=False
)

collection.validate()

['https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json',
 'https://stac-extensions.github.io/datacube/v2.0.0/schema.json']

In [84]:
# collection

Set the end of the time extent to `None` because the dataset is ongoing. Otherwise, the extent would cover only the single granule opened above.

In [85]:
# The datacube fields are stored in `extra_fields`, so edit them there directly
collection.extra_fields['cube:dimensions']['time']['extent'][1] = None

In [86]:
# The time dimension is open-ended, so leave its length (the first shape entry) unset
cube_variables = collection.extra_fields['cube:variables']
for variable in cube_variables:
    cube_variables[variable]['shape'][0] = None
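
The effect of these two cells can be seen on a plain dictionary. The sketch below uses a mock datacube fragment with illustrative values (the variable name, timestamps, and shape are assumptions, not output from this notebook). Unlike the cell above, it looks up the position of `time` in each variable's dimensions instead of assuming it comes first.

```python
import json

# Mock datacube metadata for a single opened granule; values are illustrative.
cube = {
    "cube:dimensions": {
        "time": {
            "type": "temporal",
            "extent": ["2002-06-01T09:00:00Z", "2002-06-01T09:00:00Z"],
            "step": "P1D",
        },
    },
    "cube:variables": {
        "analysed_sst": {
            "type": "data",
            "dimensions": ["time", "lat", "lon"],
            "shape": [1, 17999, 36000],
        },
    },
}

# Mark the end of the time extent as open-ended
cube["cube:dimensions"]["time"]["extent"][1] = None

# Leave the length of the time axis unset for every variable that has one
for variable in cube["cube:variables"].values():
    if "time" in variable["dimensions"]:
        time_axis = variable["dimensions"].index("time")
        variable["shape"][time_axis] = None

# None is serialized as JSON null
time_extent_json = json.dumps(cube["cube:dimensions"]["time"]["extent"])
print(time_extent_json)
print(cube["cube:variables"]["analysed_sst"]["shape"])
```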

Add a [renders](https://github.com/stac-extensions/render) entry for each configured variable.

In [87]:
collection.extra_fields['renders'] = {}
for vname, vvalue in variables.items():
    collection.extra_fields['renders'][vname] = {
      "title": f"Renders configuration for {vname}",
      "resampling": "average",
      "colormap_name": vvalue['colormap'],
      "rescale": vvalue['rescale'],
      "backend": "xarray"
    }
    # Point the variable at its renders entry (datacube fields are stored in `extra_fields`)
    collection.extra_fields['cube:variables'][vname]['renders'] = vname

## Step 4: Write to JSON

In [88]:
with open(f"{collection.id}.stac.json", "w") as f:
    json.dump(collection.to_dict(), f)
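
A quick way to check the written file is to read it back and compare it with the dictionary that was written. The sketch below does this for a mock collection dictionary in a temporary directory, so it does not touch the real output file. The mock content is an assumption for illustration; the point is that open-ended values stored as `None` survive the round trip as JSON `null`.

```python
import json
import tempfile
from pathlib import Path

# Mock collection dictionary; a stand-in for collection.to_dict().
mock_collection = {
    "type": "Collection",
    "id": "MUR-JPL-L4-GLOB-v4_1",
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["2002-06-01T09:00:00Z", None]},
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / f"{mock_collection['id']}.stac.json"
    # Write the dictionary, then read it back from disk
    with open(out_path, "w") as f:
        json.dump(mock_collection, f)
    with open(out_path) as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == mock_collection
print(round_trip_ok)
```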