# 1. STAC metadata generation

There are several ways to create STAC metadata. We will explore a few of those to give you a sense of the possibilities and provide guidance for your best options in several scenarios.

Methods:
- write JSON by hand
- use `pystac` or `stac-pydantic`
- use `rio-stac`
- use a `stactools` package

We will be using the [Sentinel-2 Cloud-Optimized GeoTIFFs collection on the AWS Open Data Registry](https://registry.opendata.aws/sentinel-2-l2a-cogs/) to explore STAC metadata.

This particular collection is already available in a STAC API maintained by Element 84 ([earth-search.aws.element84.com/v1](https://earth-search.aws.element84.com/v1)), but it is still a great example for showing the advantages of some of the tools that are available for generating STAC metadata.

The files are stored as cloud-optimized GeoTIFFs (COGs) in an S3 bucket in the us-west-2 region. Here is a look at the files available for a single granule.

In [1]:
import boto3
from botocore import UNSIGNED
from botocore.client import Config

region = "us-west-2"
bucket_name = "sentinel-cogs"
prefix = "sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/"

s3 = boto3.client("s3", region_name=region, config=Config(signature_version=UNSIGNED))
s3_keys = []

for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket=bucket_name, Prefix=prefix
):
    if "Contents" in page:
        s3_keys.extend([obj["Key"] for obj in page["Contents"]])

s3_keys

['sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/AOT.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B01.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B02.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B03.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B04.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B05.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B06.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B07.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B08.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B09.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B11.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B12.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B8A.tif',
 'sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_202

Since these files are publicly available in S3, let's convert the S3 keys to HTTPS URLs to make it easier for clients to use them later:

In [2]:
from pathlib import Path


urls = {
    Path(key).stem: f"https://{bucket_name}.s3.{region}.amazonaws.com/{key}"
    for key in s3_keys
}
urls

{'AOT': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/AOT.tif',
 'B01': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B01.tif',
 'B02': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B02.tif',
 'B03': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B03.tif',
 'B04': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B04.tif',
 'B05': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B05.tif',
 'B06': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B06.tif',
 'B07': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B

We can infer some details about the files from the path but we have to make a few assumptions!
- sensor: Sentinel-2B
- MGRS tile: 28GGV
- date: 20250417
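
Those assumptions can be made explicit in a few lines of Python. This is a minimal sketch, assuming every key follows the same naming pattern as the example granule above; `parse_granule_key` is a hypothetical helper, not part of any library.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_granule_key(key: str) -> dict:
    """Guess the sensor, MGRS tile, and date from an S3 key.

    Assumes the granule directory is named like S2B_28GGV_20250417_0_L2A.
    """
    granule_id = PurePosixPath(key).parent.name
    # The last two parts (the "0" and "L2A") are ignored in this sketch
    platform, tile, date_str, _, _ = granule_id.split("_")
    return {
        "sensor": f"Sentinel-2{platform[-1]}",  # "S2B" -> "Sentinel-2B"
        "mgrs_tile": tile,
        "date": datetime.strptime(date_str, "%Y%m%d").date(),
    }


key = "sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B02.tif"
info = parse_granule_key(key)
print(info)
```

A key that does not match the pattern raises a `ValueError`, which is a good reminder of how fragile path-based metadata can be.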

In theory you could figure out which granule covers a specific area of interest on a particular date, but it would require a bunch of S3 path listing and filtering.
There are probably software tools that do this kind of thing to locate Sentinel-2 files, but fortunately there is a better way: STAC!
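
To make the cost of the path-based approach concrete, here is a rough sketch of the logic you would need. It assumes the `<zone>/<band>/<square>/<year>/<month>/` layout seen in the single example prefix above holds for the whole bucket; `month_prefix` and `is_match` are hypothetical helpers.

```python
from datetime import date


def month_prefix(mgrs_tile: str, day: date, root: str = "sentinel-s2-l2a-cogs") -> str:
    """Build the S3 prefix that holds one tile's granules for a month."""
    zone, band, square = mgrs_tile[:-3], mgrs_tile[-3], mgrs_tile[-2:]
    # Assumption: zone and month are not zero-padded, as in ".../28/G/GV/2025/4/"
    return f"{root}/{int(zone)}/{band}/{square}/{day.year}/{day.month}/"


def is_match(key: str, mgrs_tile: str, day: date) -> bool:
    """Check whether a key belongs to a granule for the given tile and date."""
    return f"_{mgrs_tile}_{day:%Y%m%d}_" in key


day = date(2025, 4, 17)
listing_prefix = month_prefix("28GGV", day)
example_key = "sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/B02.tif"
print(listing_prefix, is_match(example_key, "28GGV", day))
```

Even with these helpers, you would still need to know which MGRS tiles intersect your area of interest, which the paths alone cannot tell you.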

### 1.1 Writing STAC "by hand"

Let's try creating the bare minimum STAC item by writing a Python dictionary by hand. We can verify that it is valid by loading it with `pystac` when we are done.

Start by figuring out the extent and CRS of one of the raster files:

In [3]:
import rasterio

with rasterio.open(urls["B02"]) as src:
    crs = src.crs
    bounds = src.bounds
    shape = src.shape

print(crs, bounds, shape)

EPSG:32728 BoundingBox(left=699960.0, bottom=5390200.0, right=809760.0, top=5500000.0) (10980, 10980)


To create the bare minimum STAC item, we only need to populate a few fields (plus a few constants like `type` and `stac_version`):
  - id: unique identifier for the item
  - bbox: item bounding box coordinates (EPSG:4326)
  - geometry: GeoJSON geometry of the item footprint (EPSG:4326)
  - datetime: `YYYY-mm-ddTHH:MM:SSZ`, stored under `properties`
  - assets: dictionary of assets with hrefs and media types
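
The `datetime` format in the list above is easy to get subtly wrong when typing it by hand. As a small sketch, a timestamp in that format can be produced from a Python `datetime`; `to_stac_datetime` is a hypothetical helper, and treating naive datetimes as UTC is an assumption made here for convenience.

```python
from datetime import datetime, timedelta, timezone


def to_stac_datetime(dt: datetime) -> str:
    """Format a datetime as a UTC string like 2025-04-17T00:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is already in UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


midnight = to_stac_datetime(datetime(2025, 4, 17))
# A timestamp with a +02:00 offset is converted to UTC before formatting
shifted = to_stac_datetime(
    datetime(2025, 4, 17, 2, tzinfo=timezone(timedelta(hours=2)))
)
print(midnight, shifted)
```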

In [4]:
import json
from pathlib import Path

from rasterio.warp import transform_bounds
from shapely.geometry import box, mapping

bbox = transform_bounds(crs, "epsg:4326", *bounds)

item_dict = {
    "type": "Feature",
    "stac_version": "1.1.0",
    "id": "S2B_28GGV_20250417_0_L2A",
    "bbox": bbox,
    "geometry": mapping(box(*bbox)),
    "properties": {
        "datetime": "2025-04-17T00:00:00Z",
    },
    "assets": {
        key: {
            "href": url,
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
        for key, url in urls.items()
        if url.endswith(".tif")
    },
}

print(json.dumps(item_dict, indent=2))

{
  "type": "Feature",
  "stac_version": "1.1.0",
  "id": "S2B_28GGV_20250417_0_L2A",
  "bbox": [
    -12.635791766296055,
    -41.61487677746372,
    -11.284390704259561,
    -40.59281342182659
  ],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -11.284390704259561,
          -41.61487677746372
        ],
        [
          -11.284390704259561,
          -40.59281342182659
        ],
        [
          -12.635791766296055,
          -40.59281342182659
        ],
        [
          -12.635791766296055,
          -41.61487677746372
        ],
        [
          -11.284390704259561,
          -41.61487677746372
        ]
      ]
    ]
  },
  "properties": {
    "datetime": "2025-04-17T00:00:00Z"
  },
  "assets": {
    "AOT": {
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/28/G/GV/2025/4/S2B_28GGV_20250417_0_L2A/AOT.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    },
    "

Great! We made a STAC item by hand. It is not verbose, but it has all of the critical information that a client application would need to learn the basics about the data without opening any of the `.tif` files.
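
The `geometry` in that item is simply the four corners of the bounding box written as a closed ring. If `shapely` is not available, a dependency-free sketch could look like the following; `bbox_to_polygon` is a hypothetical helper, and the vertex order is chosen to mirror the output shown above.

```python
def bbox_to_polygon(bbox) -> dict:
    """Return a GeoJSON Polygon for a (west, south, east, north) bbox."""
    west, south, east, north = bbox
    ring = [
        [east, south],
        [east, north],
        [west, north],
        [west, south],
        [east, south],  # repeat the first vertex to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# Bounding box values copied from the item printed above
example_bbox = (
    -12.635791766296055,
    -41.61487677746372,
    -11.284390704259561,
    -40.59281342182659,
)
geometry = bbox_to_polygon(example_bbox)
print(geometry)
```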

Now let's validate it with our next tool: `pystac`

In [5]:
import pystac

pystac.Item.from_dict(item_dict).validate()

['https://schemas.stacspec.org/v1.1.0/item-spec/json-schema/item.json']

Item metadata is important but Collection metadata is also critical! We can create a bare-bones collection document like this:

In [6]:
collection_dict = {
    "type": "Collection",
    "stac_version": "1.1.0",
    "id": "sentinel-2-l2a-cogs",
    "description": "Sentinel-2 L2A cloud-optimized geotiffs",
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [["2015-06-27T10:25:31.456000Z", None]]},
    },
    "links": [],
    "license": "proprietary",
}

pystac.Collection.from_dict(collection_dict)
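
A global bounding box and an open-ended interval are valid, but a tighter extent is usually more helpful to clients. As a rough sketch, the extent could be derived from item dictionaries shaped like the one we built earlier. The `extent_from_items` helper and the sample items are hypothetical, and the sketch ignores items that cross the antimeridian.

```python
def extent_from_items(items: list[dict]) -> dict:
    """Compute a STAC extent object that covers every item in the list."""
    bboxes = [item["bbox"] for item in items]
    # Timestamps in one fixed UTC format sort correctly as plain strings
    datetimes = sorted(item["properties"]["datetime"] for item in items)
    return {
        "spatial": {
            "bbox": [
                [
                    min(b[0] for b in bboxes),
                    min(b[1] for b in bboxes),
                    max(b[2] for b in bboxes),
                    max(b[3] for b in bboxes),
                ]
            ]
        },
        "temporal": {"interval": [[datetimes[0], datetimes[-1]]]},
    }


# Hypothetical items with rounded, made-up values for illustration only
sample_items = [
    {"bbox": [-12.6, -41.6, -11.3, -40.6], "properties": {"datetime": "2025-04-17T00:00:00Z"}},
    {"bbox": [-11.4, -41.6, -10.1, -40.6], "properties": {"datetime": "2025-04-20T00:00:00Z"}},
]
extent = extent_from_items(sample_items)
print(extent)
```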

### 1.2 Using `pystac` to write STAC

Writing STAC JSON by hand can be hard, especially when you want to use some of the more advanced features of STAC metadata. Creating STAC items with `pystac` is more ergonomic than specifying the structure by hand because you get type hints for all of the possible fields and can take advantage of predefined constants (like `pystac.MediaType.COG`) instead of trying to remember the actual values. Plus, you are forced to populate all of the required fields from the STAC spec, which helps you avoid the heartburn induced by invalid STAC items.

In [7]:
from datetime import datetime

item = pystac.Item(
    id="S2B_28GGV_20250417_0_L2A",
    bbox=bbox,
    geometry=mapping(box(*bbox)),
    datetime=datetime(2025, 4, 17),
    properties={},
    assets={
        key: pystac.Asset(
            href=url,
            media_type=pystac.MediaType.COG,
        )
        for key, url in urls.items()
        if url.endswith(".tif")
    },
)

item

That was a little easier, but client applications would appreciate a bit more information in the item, such as the projection of the files. We could have added this in the "by hand" example too, but it would have been a multi-step process: updating the list of `stac_extensions` and then adding the `proj` properties. `pystac` handles these for us.

In [8]:
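# Register the projection extension in the item's `stac_extensions` list
item.ext.add("proj")

# Populate the `proj` fields from the values we read with rasterio
item.ext.proj.apply(
    epsg=crs.to_epsg(),
    bbox=list(bounds),
    shape=list(shape),
)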

item
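
For comparison, here is roughly what the projection step looks like on a plain dictionary. This sketch assumes v2.0.0 of the projection extension, which stores the CRS in `proj:code` (earlier versions used `proj:epsg`), so the exact fields written by your installed `pystac` may differ. `add_projection` is a hypothetical helper.

```python
import copy

# Assumption: schema URL for v2.0.0 of the projection extension
PROJ_SCHEMA = "https://stac-extensions.github.io/projection/v2.0.0/schema.json"


def add_projection(item_dict: dict, code: str, bbox, shape) -> dict:
    """Return a copy of item_dict with projection extension fields added."""
    updated = copy.deepcopy(item_dict)
    # Step 1: declare the extension
    extensions = updated.setdefault("stac_extensions", [])
    if PROJ_SCHEMA not in extensions:
        extensions.append(PROJ_SCHEMA)
    # Step 2: add the proj fields to the item properties
    updated.setdefault("properties", {}).update(
        {"proj:code": code, "proj:bbox": list(bbox), "proj:shape": list(shape)}
    )
    return updated


minimal = {
    "type": "Feature",
    "id": "S2B_28GGV_20250417_0_L2A",
    "properties": {"datetime": "2025-04-17T00:00:00Z"},
}
# CRS, bounds, and shape are the values printed by rasterio earlier
with_proj = add_projection(
    minimal,
    code="EPSG:32728",
    bbox=(699960.0, 5390200.0, 809760.0, 5500000.0),
    shape=(10980, 10980),
)
print(with_proj)
```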

Now you can create the same collection object that we made by hand but this time using `pystac`:

In [9]:
collection = pystac.Collection(
    id="sentinel-2-l2a-cogs",
    description="Sentinel-2 L2A cloud-optimized geotiffs",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180, -90, 180, 90]]),
        temporal=pystac.TemporalExtent(
            [[datetime(2015, 6, 27, hour=10, minute=25, second=31), None]]
        ),
    ),
)
collection

### 1.3 Writing STAC with `rio-stac`

`rio-stac` does some of the work for you that we had to do ourselves to calculate the bounding box, geometry, etc. It will be slower than inferring properties from the storage path schema or an existing metadata file, but it reliably reads the extent, geometry, and other properties from the actual assets, which can be valuable.

- rio-stac docs: <https://developmentseed.org/rio-stac/>

In [10]:
from rio_stac.stac import create_stac_item

item = create_stac_item(
    source=urls["B02"],
    input_datetime=datetime(2025, 4, 17),
    id="S2B_28GGV_20250417_0_L2A",
    with_proj=True,
    with_eo=False,
    assets={
        key: pystac.Asset(
            href=url,
            media_type=pystac.MediaType.COG,
        )
        for key, url in urls.items()
        if url.endswith(".tif")
    },
)

item

That was nice and easy! We didn't have to use `rasterio` for anything ourselves; `rio-stac` took care of that for us. There are more tools in `rio-stac` that can be used to derive metadata for more extensions (like `eo`), so check out the docs!

There are no convenience functions for generating collections in `rio-stac`, so that's all for this section.

### 1.4 Writing STAC with `stactools` packages

You may have been thinking, "surely someone has already done the work to sort out the STAC metadata configuration for the Sentinel-2 dataset". You are right! Sometimes the data providers or the community will produce STAC metadata alongside the data, and the most common method for publishing the STAC metadata generation workflow is via [`stactools` packages](https://github.com/stactools-packages).

Many datasets/collections, including Sentinel-2, have a package that can be used from Python or a CLI to generate collection and item STAC metadata. At this time, the [`stactools-sentinel2`](https://github.com/stactools-packages/sentinel2) package is not configured for the files that we have been working with so far in this notebook, so we will reference files in a different bucket for this section.

In [11]:
from stactools.sentinel2.stac import create_item as create_sentinel2_item


item = create_sentinel2_item(
    granule_href="https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/34/L/BP/2022/4/1/0/",
)

item



## Conclusion

That's it! You have seen a few ways of creating STAC metadata. There is a time and a place for each of these methods so keep them all in mind when considering the best path for cataloging your data.