# STAC EuroSAT

This notebook demonstrates how to convert the annotations provided by the [EuroSAT](https://github.com/phelber/EuroSAT) dataset
into STAC-compatible definitions with extensions relevant for machine learning tasks.
Notably, the STAC [Label](https://github.com/stac-extensions/label) and [Scientific](https://github.com/stac-extensions/scientific)
extensions are used, respectively, to reference the labeled annotations from the train, validation and test splits, and to provide
a citation reference to the original work.

To facilitate parsing of the EuroSAT metadata itself,
the [torchgeo.datasets.EuroSAT](https://torchgeo.readthedocs.io/en/stable/api/datasets.html#torchgeo.datasets.EuroSAT)
class is used to handle the metadata extraction, parse the labeled data hierarchy, and generate the split
definitions along with the samples that belong to each of them.

## First Step

Below are the equivalent `torchgeo.datasets.EuroSAT100` and `torchgeo.datasets.EuroSAT` classes, which define a subset and the complete dataset
respectively. While developing or editing the STAC generation pipeline, it is recommended to work with the `EuroSAT100` subset to speed up
the process and quickly get an overview of the expected result.

In [170]:
from torchgeo.datasets import EuroSAT, EuroSAT100  # noqa

# pick one:
DatasetEuroSAT = EuroSAT100  # subset of (6 train, 2 val, 2 test) image samples per class (10)
# DatasetEuroSAT = EuroSAT  # full dataset

## General configurations

In [171]:
import os

# This is the version applied along the source code tags of this repository.
# If contents are re-generated with new STAC definitions, this should be updated accordingly.
# It will then be possible to have a trace of the source code that produced the samples.
# Note: This is not automatically updated using 'make VERSION=<version> bump' to avoid unnecessary duplicates from other releases.
CATALOG_VERSION = "0.3.3"

DATA_ROOT_DIR = os.path.abspath("../data")
EUROSAT_ROOT_DIR = os.path.join(DATA_ROOT_DIR, "EuroSAT")
EUROSAT_TYPE_DIR = "subset" if issubclass(DatasetEuroSAT, EuroSAT100) else "full"  # need distinct dirs, otherwise download dir name clash
EUROSAT_DATA_DIR = os.path.join(EUROSAT_ROOT_DIR, "data", EUROSAT_TYPE_DIR)
EUROSAT_STAC_DIR = os.path.join(EUROSAT_ROOT_DIR, "stac", EUROSAT_TYPE_DIR)

# base URL where samples would be accessible from (links in STAC)
# by default, this points to where the sample subset EuroSAT100 is expected to be pushed on GitHub once the version is released
# adjust accordingly if you desire to use another location
# Note: The `data|stac/<variant>` path of the URL will be applied automatically under this root URL. It must not be included.
EUROSAT_STAC_URL = f"https://raw.githubusercontent.com/ai-extensions/stac-data-loader/{CATALOG_VERSION}/data/"
CATALOG_ROOT_DIR = DATA_ROOT_DIR  # override as desired, make sure it aligns with the above URL so that locations match the desired output

os.makedirs(EUROSAT_DATA_DIR, exist_ok=True)
os.makedirs(EUROSAT_STAC_DIR, exist_ok=True)
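
To make the relation between the local directories and the published URL more concrete, the following minimal sketch maps a file located under a local root directory to its expected public location. The function `to_public_url` and the example values are hypothetical and only illustrate the idea; the notebook's own helper for this purpose is `get_loc`, defined further below.

```python
import posixpath


def to_public_url(local_path: str, root_dir: str, root_url: str) -> str:
    """Map a file under 'root_dir' to the URL where it would be published under 'root_url'."""
    # path of the file relative to the local root (POSIX-style paths assumed)
    relative = posixpath.relpath(local_path, start=root_dir)
    if relative.startswith(".."):
        raise ValueError(f"[{local_path}] is not located under [{root_dir}]")
    # join with '/' explicitly, since URLs always use forward slashes
    return root_url.rstrip("/") + "/" + relative


# hypothetical values mirroring the layout configured above
example_root = "/repo/data"
example_url = "https://example.com/releases/0.0.0/data/"
example_result = to_public_url("/repo/data/EuroSAT/stac/subset/catalog.json", example_root, example_url)
print(example_result)
```

Only the part of the path below the root directory is appended to the URL, which is why the `data|stac/<variant>` portion must not be repeated in the base URL.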

## STAC Definitions

The following types are **NOT** used for "strong" type checking.
They are provided as a reference of the expected STAC properties and for quick validation of the structure by IDEs.
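
As a small illustration of what this means in practice, the sketch below uses a hypothetical `ExampleLink` type (not part of this notebook) to show that a `TypedDict` is a plain `dict` at runtime. A wrongly typed value is only reported by static checkers or IDEs, never by Python itself.

```python
from typing import TypedDict

# Hypothetical type for illustration only; the notebook defines its own 'Link'.
ExampleLink = TypedDict("ExampleLink", {"rel": str, "href": str})

# 'href' is an int instead of a str: an IDE or static checker should flag this line,
# but Python accepts it at runtime because no validation is performed.
bad_link: ExampleLink = {"rel": "self", "href": 123}  # type: ignore

print(type(bad_link).__name__)  # dict
```

Validation against the STAC JSON schemas remains the only way to confirm that the generated documents are actually valid.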

### Base Definitions

In [172]:
%pip install stac_dataloader

Note: you may need to restart the kernel to use updated packages.


In [173]:
import datetime
import os
from typing import Any, Dict, List, Literal, Optional, Tuple, TypedDict, Union
from typing_extensions import NotRequired, Required

from stac_dataloader.typedef import GeoFeatureCollection, Geometry, Split

# This is the STAC Core version
# It shouldn't change unless something new needs to be supported.
STAC_VERSION = "1.0.0"

STAC_CATALOG_EXTENSIONS = []
STAC_CATALOG_SCHEMAS = [
    f"https://schemas.stacspec.org/v{STAC_VERSION}/catalog-spec/json-schema/catalog.json",
] + STAC_CATALOG_EXTENSIONS

STAC_COLLECTION_EXTENSIONS = [
    "https://stac-extensions.github.io/eo/v1.1.0/schema.json",
    # FIXME: ML-AOI valid only for STAC Item (ie: {type: Feature}) (https://github.com/stac-extensions/ml-aoi/issues/5)
    # "https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json",
    "https://stac-extensions.github.io/scientific/v1.0.0/schema.json",
    "https://stac-extensions.github.io/version/v1.0.0/schema.json",
    "https://stac-extensions.github.io/view/v1.0.0/schema.json",
]
STAC_COLLECTION_SCHEMAS = [
    f"https://schemas.stacspec.org/v{STAC_VERSION}/collection-spec/json-schema/collection.json",
] + STAC_COLLECTION_EXTENSIONS

STAC_ITEM_EXTENSIONS = [
    "https://stac-extensions.github.io/eo/v1.1.0/schema.json",
    "https://stac-extensions.github.io/file/v1.0.0/schema.json",
    "https://stac-extensions.github.io/raster/v1.1.0/schema.json",  # more band metadata, but somewhat overlaps with "EO"
    "https://stac-extensions.github.io/label/v1.0.1/schema.json",
    "https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json",
    "https://stac-extensions.github.io/version/v1.0.0/schema.json",
]
STAC_ITEM_SCHEMAS = [
    f"https://schemas.stacspec.org/v{STAC_VERSION}/item-spec/json-schema/item.json",
] + STAC_ITEM_EXTENSIONS

# technically, tuples would be better for bbox/point, but not the types used in JSON
Number = Union[int, float]
BoundingBox = List[Number]  # 4 values
Point = List[Number]  # 2 values
DateTimeInterval = List[Union[str, None]]
SpatialExtent = TypedDict(
    "SpatialExtent",
    {
        "bbox": Required[List[BoundingBox]],
    }
)
TemporalExtent = TypedDict(
    "TemporalExtent",
    {
        "interval": Required[List[DateTimeInterval]],
    }
)
Extent = TypedDict(
    "Extent",
    {
        "spatial": Required[SpatialExtent],
        "temporal": Required[TemporalExtent],
    }
)
Provider = TypedDict(
    "Provider",
    {
        "name": str,
        "roles": List[str],
        "url": str,
    }
)
Link = TypedDict(
    "Link",
    {
        "rel": str,
        "href": str,
        "type": str,  # media-type
        "title": NotRequired[str],
    },
    total=False,
)
STACMetadata = TypedDict(
    "STACMetadata",
    {
        "stac_version": Required[str],
        "type": Required[Literal["Catalog", "Collection", "Feature"]],  # NB: Feature == STAC Item
        "id": Required[str],
        "title": NotRequired[str],
        "description": Required[str],
        "links": Required[List[Link]],
    }
)

### STAC Extensions

In [174]:
STACExtensionVersion = TypedDict(
    "STACExtensionVersion",
    {
        "version": Required[str],
        "deprecated": NotRequired[bool],
        "experimental": NotRequired[bool],
    }
)
STACLabelRef = TypedDict(
    "STACLabelRef",
    {
        "title": str,
        "href": str,    # URL to GeoJSON FeatureCollection or GeoTiff/COG
        "type": str,    # media-type
    }
)
STACLabelAssets = TypedDict(
    "STACLabelAssets",
    {
        "labels": STACLabelRef,
        "raster": STACLabelRef,
    }
)
STACLabelClass = TypedDict(
    "STACLabelClass",
    {
        # name that can define a "category" of class names
        # those categories should be specified in Features under the keys defined by 'label:properties'
        "name": Required[Union[str, None]],
        # all the applicable classes that should be part of the "category"
        "classes": Required[Union[List[str], List[int]]],
    }
)
STACLabelCount = TypedDict(
    "STACLabelCount",
    {
        "name": str,  # class
        "count": int,
    }
)
STACLabelOverview = TypedDict(
    "STACLabelOverview",
    {
        "property_key": str,
        "counts": List[STACLabelCount],
    }
)
STACLabelProperties = TypedDict(
    "STACLabelProperties",
    {
        # properties in the linked 'labels' asset with GeoJSON FeatureCollection
        # those properties should be provided for each Feature in the GeoJSON
        # the values of those properties should contain the keys from 'label:classes'
        "label:properties": Required[List[str]],
        "label:classes": Required[List[STACLabelClass]],
        "label:type": Required[Literal["raster", "vector"]],
        "label:description": Required[str],
        "label:methods": NotRequired[List[Literal["manual", "automatic"]]],
        "label:tasks": NotRequired[List[Literal["classification", "detection", "segmentation", "regression"]]],
        "label:overviews": NotRequired[List[Dict[str, Any]]],
    }
)
STACExtensionLabel = TypedDict(
    "STACExtensionLabel",
    {
        "assets": Required[STACLabelAssets],
        "properties": Required[STACLabelProperties],
    }
)
STACExtensionMLAOI = TypedDict(
    "STACExtensionMLAOI",
    {
        "ml-aoi:split": NotRequired[Split],
    }
)
STACEOAssets = TypedDict(
    "STACEOAssets",
    {
        "reflectance": NotRequired[Link],
        "temperature": NotRequired[Link],
        "saturation": NotRequired[Link],
        "cloud": NotRequired[Link],
        "cloud-shadow": NotRequired[Link],
    },
    total=False,
)
# Metadata about each band: https://github.com/stac-extensions/eo#common-band-names
STACEOBandCommonName = Literal[
    "coastal",
    "blue",
    "green",
    "red",
    "yellow",
    "pan",
    "rededge",
    "nir",
    "nir08",
    "nir09",
    "cirrus",
    "swir16",
    "swir22",
    "lwir",
    "lwir11",
    "lwir12",
]
# Definition of each field: https://github.com/stac-extensions/eo#band-object
STACEOBand = TypedDict(
    "STACEOBand",
    {
        "name": Required[str],
        "common_name": Required[STACEOBandCommonName],
        "description": NotRequired[str],
        "center_wavelength": NotRequired[Number],
        "full_width_half_max": NotRequired[Number],
        "solar_illumination": NotRequired[Number],
        "assets": NotRequired[STACEOAssets],
    }
)
STACExtensionEO = TypedDict(
    "STACExtensionEO",
    {
        "eo:bands": Required[List[STACEOBand]],
        "eo:cloud_cover": NotRequired[Number],
        "eo:snow_cover": NotRequired[Number],
    }
)
STACExtensionView = TypedDict(
    "STACExtensionView",
    {
        "view:off_nadir": Number,
        "view:incidence_angle": Number,
        "view:azimuth": Number,
        "view:sun_azimuth": Number,
        "view:sun_elevation": Number,
    }
)
STACRasterDataType = Literal[
    "int8",
    "int16",
    "int32",
    "int64",
    "uint8",
    "uint16",
    "uint32",
    "uint64",
    "float16",
    "float32",
    "float64",
    "cint16",
    "cint32",
    "cfloat32",
    "cfloat64",
    "other"
]
STACRasterStatistics = TypedDict(
    "STACRasterStatistics",
    {
        "mean": Number,
        "minimum": Number,
        "maximum": Number,
        "stddev": Number,
        "valid_percent": Number,
    }
)
STACRasterHistogram = TypedDict(
    "STACRasterHistogram",
    {
        "count": int,
        "min": Number,
        "max": Number,
        "buckets": List[int],
    }
)
STACExtensionRaster = TypedDict(
    "STACExtensionRaster",
    {
        "nodata": Union[int, Literal["nan", "inf", "-inf"]],
        "sampling": str,
        "data_type": STACRasterDataType,
        "bits_per_sample": int,
        "spatial_resolution": Number,
        "statistics": STACRasterStatistics,
        "unit": str,
        "scale": Number,
        "offset": Number,
        "histogram": STACRasterHistogram,
    }
)
STACCitation = TypedDict(
    "STACCitation",
    {
        "doi": str,
        "citation": str,
    }
)
STACExtensionScientific = TypedDict(
    "STACExtensionScientific",
    {
        # how to cite this collection
        "sci:doi": Required[str],  # ++ "Link" with 'cite-as' using the DOI reference (RFC-8574)
        "sci:citation": Required[str],
        # related work/citations that use this STAC data collection
        "sci:publications": NotRequired[List[STACCitation]],
    }
)
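
To show how the `label:overviews` structure typed above relates to actual annotations, here is a minimal sketch that counts class names and formats them as one overview entry. The helper `build_label_overview` and the sample class list are hypothetical; in this notebook each STAC Item holds a single sample, so its overview is trivially a count of one.

```python
from collections import Counter
from typing import Any, Dict, List


def build_label_overview(property_key: str, class_names: List[str]) -> Dict[str, Any]:
    """Count the occurrences of each class name and format them like a 'label:overviews' entry."""
    counts = Counter(class_names)
    return {
        "property_key": property_key,
        # sorted by name to keep the output deterministic
        "counts": [{"name": name, "count": count} for name, count in sorted(counts.items())],
    }


# hypothetical class names of three samples
overview = build_label_overview("class", ["Forest", "River", "Forest"])
print(overview)
```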

### STAC Item and Assets

In [175]:
# https://github.com/radiantearth/stac-spec/blob/master/item-spec/common-metadata.md#instrument
STACCoreInstrumentItem = TypedDict(
    "STACCoreInstrumentItem",
    {
        "platform": NotRequired[str],  # satellite
        "instruments": NotRequired[List[str]],
        "constellation": NotRequired[str],
        "mission": NotRequired[str],
        "gsd": NotRequired[Number],
    }
)
STACCoreItemProperties = TypedDict(
    "STACCoreItemProperties",
    {
        "datetime": Required[str],
        "license": Required[str],
    },
    total=False,
)
STACCoreFeatureItem = TypedDict(
    "STACCoreFeatureItem",
    {
        "type": Literal["Feature"],
        "bbox": Required[BoundingBox],
        "geometry": Required[Geometry],
        "assets": Required[Dict[str, Dict[str, Any]]],  # mapping of asset key to asset definition
        "properties": Required[STACCoreItemProperties],
        "collection": str,
    }
)
STACExtendedItemProperties = TypedDict(
    "STACExtendedItemProperties",
    {
        "properties": Union[
            STACExtensionEO,
            STACExtensionView,
            STACExtensionLabel,
            STACExtensionMLAOI,
        ]
    },
    total=False,
)
STACExtendedItem = Union[
    STACMetadata,
    STACCoreInstrumentItem,
    STACCoreFeatureItem,
    STACExtendedItemProperties,
]

### STAC Collection

In [176]:
STACMetadataCollection = TypedDict(
    "STACMetadataCollection",
    {
        "stac_extensions": Required[List[str]],
        "version": Required[str],
        "keywords": NotRequired[List[str]],
        "license": Required[str],  # anything, but commonly "CC-BY-SA-4.0"
    }
)
STACCoreCollection = TypedDict(
    "STACCoreCollection",
    {
        "extent": Required[Extent],
        "providers": NotRequired[List[Provider]],
    }
)
STACExtendedCollectionProperties = TypedDict(
    "STACExtendedCollectionProperties",
    {
        "properties": Union[
            STACExtensionEO,
            STACExtensionView,
            STACExtensionScientific,
            # STACExtensionMLAOI,  # Not valid according to its schema, {type: Feature} only
        ]
    }
)
STACExtendedCollection = Union[
    STACMetadata,
    STACMetadataCollection,
    STACCoreCollection,
    STACExtensionVersion,
    STACExtendedCollectionProperties,
]

### STAC Catalog

In [177]:
STACCatalog = STACMetadata  # only requires "links" to contain the STAC Collection as "child"
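
As a rough sketch of what such a catalog looks like, the hypothetical helper `make_catalog` below builds the minimal set of fields and references each collection with a `child` link. The URLs are placeholders and not the locations used by this notebook.

```python
from typing import Any, Dict, List


def make_catalog(catalog_id: str, description: str, collection_urls: List[str]) -> Dict[str, Any]:
    """Build a minimal STAC Catalog referencing each collection with a 'child' link."""
    return {
        "stac_version": "1.0.0",
        "type": "Catalog",
        "id": catalog_id,
        "description": description,
        "links": [
            {"rel": "child", "href": url, "type": "application/json"}
            for url in collection_urls
        ],
    }


example_catalog = make_catalog(
    "example",
    "Example catalog.",
    [
        "https://example.com/stac/train/collection.json",
        "https://example.com/stac/test/collection.json",
    ],
)
print([link["rel"] for link in example_catalog["links"]])
```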

### STAC Metadata Definition

To make the calling functions more succinct, start by defining constant metadata references used by STAC Collections.

In [178]:
EUROSAT_STAC_COLLECTION_BASE: STACExtendedCollection = {
    "stac_version": STAC_VERSION,
    "stac_extensions": STAC_COLLECTION_EXTENSIONS,
    "type": "Collection",
    "id": None,  # to fill later, just to ensure field ordering
    "title": None,
    "description": None,
    "version": CATALOG_VERSION,
    "license": "MIT",  # https://github.com/phelber/EuroSAT#license
    "experimental": True,
    "sci:doi": "10.1109/JSTARS.2019.2918242",  #  arXiv:1709.00029  # https://github.com/phelber/EuroSAT#references
    "sci:citation": (
        "Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. "
        "Patrick Helber, Benjamin Bischke, Andreas Dengel, Damian Borth. "
        "IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019."
    ),
    "sci:publications": [
        {
            "doi": "10.1109/IGARSS.2018.8519248",
            "citation": (
                "Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. "
                "Patrick Helber, Benjamin Bischke, Andreas Dengel. 2018 "
                "IEEE International Geoscience and Remote Sensing Symposium, 2018."
            ),
        }
    ],
    "links": [
        {
            "rel": "cite-as",
            "href": "https://arxiv.org/abs/1709.00029",
            "title": "EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification",
        }
    ]
    # other items to fill by script
}
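
Because this base definition contains nested containers such as `links`, it should be duplicated with `deepcopy` before being filled in for each collection, as done later in this notebook. The sketch below, which uses a reduced and hypothetical `BASE` template, shows how a shallow copy would leak changes back into the shared definition.

```python
from copy import deepcopy

# hypothetical, reduced stand-in for a shared base definition
BASE = {"type": "Collection", "id": None, "links": []}

shallow = BASE.copy()            # nested 'links' list is still shared with BASE
shallow["links"].append({"rel": "self"})
leaked = len(BASE["links"])      # the template was modified by accident

BASE["links"].clear()            # reset the template
deep = deepcopy(BASE)            # nested containers are duplicated
deep["id"] = "example-train"
deep["links"].append({"rel": "self"})

print(leaked, len(BASE["links"]), BASE["id"])
```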

## Populate STAC Catalog, Collections, Items and Assets from EuroSAT Dataset

**Note** <br>
Because we need the metadata to populate STAC Collections, Items and Assets, the `download=True` parameter is used.
However, we don't need the actual data (imagery pixel values), but instead the metadata that each GeoTiff contains.
Therefore, we override the `__getitem__` method to retrieve only metadata, by bypassing image loading/conversion, and make parsing faster.
Nevertheless, it can take some time to download and extract the ZIP contents on the first run.
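
The idea of overriding `__getitem__` can be sketched independently of `torchgeo`. In the example below, `ImageDataset` is a hypothetical stand-in for a dataset class that would normally load pixel data, and `MetadataOnlyDataset` returns only the references needed to describe each sample.

```python
from typing import Dict, List, Tuple, Union


class ImageDataset:
    """Hypothetical stand-in for a dataset class that normally loads pixel data."""

    classes: List[str] = ["Forest", "River"]

    def __init__(self, samples: List[Tuple[str, int]]) -> None:
        self.samples = samples  # (image path, class index) pairs

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Dict[str, object]:
        path, _ = self.samples[index]
        raise RuntimeError(f"expensive pixel loading of [{path}] is not wanted here")


class MetadataOnlyDataset(ImageDataset):
    def __getitem__(self, index: int) -> Dict[str, Union[str, int]]:
        # return only the references needed to describe the sample, without opening the image
        path, target = self.samples[index]
        return {"image": path, "index": index, "label": str(target), "class": self.classes[target]}


dataset = MetadataOnlyDataset([("tif/Forest_1.tif", 0), ("tif/River_1.tif", 1)])
print(dataset[1])
```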

### Important Design Decisions

#### Separation of STAC Items

Normally, a large set of Polygon labels would be combined into a few GeoJSON FeatureCollection references for corresponding
imagery that spans a large source raster area. However, EuroSAT provides small pre-extracted 64x64 image patches.
Therefore, each annotated sample directly has one label GeoJSON Polygon and one source raster. Since a few shared
GeoJSON and GeoTIFF references cannot be used, there is no alternative: a large list of small STAC Items with duplicated metadata
must be produced.
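
The per-sample label file that results from this decision can be sketched as follows. The helper `make_label_collection` and the coordinates are hypothetical; the point is only that each file holds a single Feature whose `class` property carries the annotation.

```python
from typing import Any, Dict, List


def make_label_collection(class_name: str, ring: List[List[float]]) -> Dict[str, Any]:
    """Wrap a single closed polygon ring and its class into a GeoJSON FeatureCollection."""
    if ring[0] != ring[-1]:
        raise ValueError("a GeoJSON polygon ring must be closed (first point == last point)")
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"class": class_name},
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            }
        ],
    }


# hypothetical patch footprint in longitude/latitude
example_ring = [[10.0, 50.0], [10.01, 50.0], [10.01, 50.01], [10.0, 50.01], [10.0, 50.0]]
example_labels = make_label_collection("Forest", example_ring)
print(len(example_labels["features"]))  # 1
```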

#### STAC Temporal Extent

Sentinel-2 imagery started in 2015. According to the reference
[STAC Collections for Sentinel-2](https://raw.githubusercontent.com/sat-utils/sat-stac-sentinel/master/stac_sentinel/sentinel-s2-l1c.json),
the initial date-time is set to `2015-06-27T10:25:31.456Z`.
The reference [EuroSAT research paper](https://www.researchgate.net/publication/319463676) mentions
data collected up to March 2017. However, since there is no way to precisely validate the last data retrieval date
or whether some adjustments were later made to the dataset, the end of the interval is assumed to be, at the latest, the date when the paper was published,
which is 14 June 2019 on [IEEE](https://ieeexplore.ieee.org/document/8736785) (i.e., `2019-06-14T00:00:00Z`).
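
For reference, a STAC temporal extent is a list of `[start, end]` intervals, where `null` marks an open-ended bound. The sketch below builds such an extent with a hypothetical helper, `make_temporal_extent`, using the Sentinel-2 start date and the 14 June 2019 publication date discussed above, and checks that the bounds are ordered.

```python
from datetime import datetime
from typing import Dict, List, Optional


def parse_utc(value: str) -> datetime:
    # 'fromisoformat' does not accept the 'Z' suffix before Python 3.11
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def make_temporal_extent(start: Optional[str], end: Optional[str]) -> Dict[str, List[List[Optional[str]]]]:
    """Build a STAC temporal extent with one interval; 'None' marks an open-ended bound."""
    if start is not None and end is not None and parse_utc(start) > parse_utc(end):
        raise ValueError("interval start must not be after its end")
    return {"interval": [[start, end]]}


example_extent = make_temporal_extent("2015-06-27T10:25:31.456Z", "2019-06-14T00:00:00Z")
print(example_extent)
```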


In [179]:
import json
from copy import deepcopy
from datetime import datetime, timezone
from functools import cache

import requests
import requests.exceptions
import numpy as np
import rasterio
from rasterio import MemoryFile
from rasterio.crs import CRS
from rasterio.warp import transform_bounds, transform_geom
from shapely.geometry import Polygon, box
from shapely.ops import unary_union
from PIL import Image
from PIL.Image import Resampling  # noqa  # IDE doesn't see it although correct location


RasterIOImage = MemoryFile
SampleMetadata = TypedDict(
    "SampleMetadata",
    {
        "image": str,
        "index": int,
        "label": str,
        "class": str,
    }
)

# required by STAC as standard for all Geo references
EPSG_4326 =  CRS.from_epsg(4326)  # type: ignore  # IDE type stub error


class DataLoaderEuroSAT(DatasetEuroSAT):
    @cache
    def __getitem__(
        self,
        sample_index: int,
    ) -> Optional[SampleMetadata]:  # type: ignore  # mismatch 'Dict[str, Tensor]' on purpose
        img_path, target_index = self.samples[sample_index]
        class_name = self.classes[target_index]
        return {"image": img_path, "index": sample_index, "label": str(target_index), "class": class_name}


def normalize_band(image_band: np.ndarray) -> np.ndarray:
    band_min, band_max = image_band.min(), image_band.max()  # type: ignore  # IDE type stub error
    return (image_band - band_min) / (band_max - band_min)


def brighten_band(image_band: np.ndarray, alpha: float = 0.13, beta: float = 0.0, gamma: float = 2.0) -> np.ndarray:
    return np.clip(np.power(alpha * image_band + beta, 1. / gamma), 0, 255)


def save_thumbnail(src_path: str, png_path: str) -> None:
    """
    Generate a visually adequate RGB preview from the relevant RGB bands.

    Since Sentinel-2 has 13 bands, they cannot all be loaded by traditional image utilities.
    """
    if not os.path.isfile(png_path):
        os.makedirs(os.path.dirname(png_path), exist_ok=True)

        indices = [
            DataLoaderEuroSAT.all_band_names.index(band) + 1  # 1-based indices for bands
            for band in DataLoaderEuroSAT.rgb_bands
        ]
        img = rasterio.open(src_path)
        rgb = Image.fromarray(
            (
                np.dstack([
                    brighten_band(normalize_band(img.read(idx))) for idx in indices
                ])
                * 255
            ).astype(np.uint8)
        )

        w_size = 64
        if rgb.size != (w_size, w_size):
            w_scale = w_size / float(rgb.size[0])  # use the PIL image dimensions (width, height)
            h_size = int(float(rgb.size[1]) * w_scale)
            rgb = rgb.resize((w_size, h_size), Resampling.LANCZOS)

        rgb.save(png_path)


@cache
def get_stac_band_details(loader: DataLoaderEuroSAT) -> Tuple[Union[STACCoreInstrumentItem, STACExtensionEO], STACExtensionRaster]:
    # https://github.com/stac-extensions/eo#common-band-names
    # use the reference STAC Collection for Sentinel-2 to have all validated metadata of the bands and sensors
    data = requests.get("https://raw.githubusercontent.com/sat-utils/sat-stac-sentinel/master/stac_sentinel/sentinel-s2-l1c.json").json()
    stac_eo: Union[STACCoreInstrumentItem, STACExtensionEO] = data["properties"]
    for band in stac_eo["eo:bands"]:
        # fix mapping error between reference and EuroSAT dataset: B8A -> B08A
        if band["name"] == "B8A":
            band["name"] = "B08A"
        # add missing common name, as per recommendation in STAC EO common band names for Sentinel-2
        if "common_name" not in band:
            if band["name"] in ["B05", "B06", "B07"]:
                band["common_name"] = "rededge"
            if band["name"] == "B08A":
                band["common_name"] = "nir08"
            if band["name"] == "B09":
                band["common_name"] = "nir09"
    assert len(set(loader.all_band_names) - set(band["name"] for band in stac_eo["eo:bands"])) == 0, "Missing EO band definitions"
    assert all(band.get("common_name") for band in stac_eo["eo:bands"]), "Missing EO band common names"

    # fix invalid field (eo:gsd), replace by what it should be ("gsd" by itself) and also (raster:bands.spatial_resolution)
    res = stac_eo["gsd"] = stac_eo.pop("eo:gsd")
    stac_raster = {"raster:bands": []}
    for band in stac_eo["eo:bands"]:
        raster_band = {
            "nodata": 0,
            "unit": "m",
            "spatial_resolution": res,
        }
        stac_raster["raster:bands"].append(raster_band)

    return stac_eo, stac_raster


def convert_sample_to_stac_item(
    sample: SampleMetadata,
    split: Split,
    loader: DataLoaderEuroSAT,
    collection_name: str,
) -> STACExtendedItem:
    """
    Generate a STAC Item with relevant extensions from an EuroSAT sample defined by the PyTorch Data-Loader.
    """
    img = rasterio.open(sample["image"])
    rgb_path = sample["image"].replace("/tif/", "/png/").replace(".tif", ".png")
    save_thumbnail(sample["image"], rgb_path)
    stac_eo, stac_raster = get_stac_band_details(loader)
    stac_bands = {band["common_name"]: band for band in stac_eo["eo:bands"]}
    rgb_bands = [stac_bands["red"], stac_bands["green"], stac_bands["blue"]]
    bbox_bounds = transform_bounds(
        img.crs,
        EPSG_4326,
        *img.bounds,
    )
    geom = transform_geom(
        img.crs,
        EPSG_4326,
        box(*img.bounds),  # footprint in the native CRS ('bbox_bounds' is already reprojected to EPSG:4326)
    )
    class_name = sample["class"]
    image_name = os.path.splitext(os.path.split(sample["image"])[-1])[0]
    sample_idx = sample["index"]
    sample_id = f"{collection_name}-sample-{sample_idx}-class-{class_name}"

    label_geojson: Geometry = geom.__geo_interface__ if hasattr(geom, "__geo_interface__") else geom  # GeoJSON
    label_path = sample["image"].replace("/tif/", "/label/").replace(".tif", ".geojson")
    label_features: GeoFeatureCollection = {
        "type": "FeatureCollection",
        "features": [{"type": "Feature", "properties": {"class": class_name}, "geometry": label_geojson}],
    }
    os.makedirs(os.path.dirname(label_path), exist_ok=True)
    save_json(label_features, label_path)

    raster_link = {
        "title": f"Raster {image_name} with {class_name} class",
        "href": get_loc(sample["image"]),
        "type": "image/tiff; application=geotiff",  # add "; profile=cloud-optimized" if applicable
    }
    raster_asset = raster_link.copy()
    raster_asset.update(stac_raster)
    # FIXME: this ML-AOI is breaking schema validation, although this is what it is for!
    # (https://github.com/stac-extensions/ml-aoi/issues/6)
    raster_asset.update({
        "ml-aoi:role": "feature",
        "ml-aoi:reference-grid": True,
    })
    source_link = raster_link.copy()
    source_link.update({
        "rel": "source",
        "label:assets": ["labels", "raster"],
    })
    derived_source_link = raster_link.copy()
    derived_source_link.update({
        "rel": "derived_from",
        "ml-aoi:role": "feature",  # FIXME: rename to "source"? (https://github.com/stac-extensions/ml-aoi/issues/3)
    })
    thumbnail_base = {
        "title": f"Preview of {image_name}.",
        "href": get_loc(rgb_path),
        "type": "image/png",
    }
    thumbnail_link = thumbnail_base.copy()
    thumbnail_link["rel"] = "thumbnail"
    thumbnail_base.update({
        "eo:bands": rgb_bands
    })
    stac_item: STACExtendedItem = {
        "stac_version": STAC_VERSION,
        "stac_extensions": STAC_ITEM_EXTENSIONS,
        "type": "Feature",
        "id": sample_id,
        "title": sample_id.replace("-", " "),
        "description": f"Annotated sample from the {collection_name} collection.",
        "bbox": bbox_bounds,
        "geometry": label_geojson,  # GeoJSON
        "assets": {
            "labels": {
                "title": f"Labels for image {image_name} with {class_name} class",
                "href": get_loc(label_path),
                "type": "application/geo+json",
                # FIXME: this ML-AOI is breaking schema validation, although this is what it is for!
                # (https://github.com/stac-extensions/ml-aoi/issues/6)
                "ml-aoi:role": "label",
            },
            "raster": raster_asset,
            "thumbnail": thumbnail_base,
        },
        "collection": collection_name,
        "properties": {
            "datetime": datetime.now(timezone.utc).isoformat(),
            "license": "MIT",  # https://github.com/phelber/EuroSAT#license
            "version": CATALOG_VERSION,  # Could be managed individually per sample, but here we update everything each time
            "label:properties": [
                "class"
            ],
            "label:tasks": ["segmentation", "classification"],
            "label:type": "vector",  # Raster is the imagery, but labels are vector GeoJSON Polygon that defines it
            "label:methods": ["manual"],
            "label:description": "Land-cover area classification on Sentinel-2 image.",
            "label:classes": [
                {
                    "name": "class",
                    "classes": [sample["class"], sample["label"]],
                }
            ],
            "label:overviews": [
                # basic overview since each sample has its own STAC Item
                {
                    "property_key": "class",
                    "counts": [{"name": sample["class"], "count": 1}],
                }
            ],
            "ml-aoi:split": split,
        },
        "links": [
            thumbnail_link,
            source_link,
            derived_source_link,
            # STAC Catalog/Collection references added by caller
        ]
    }
    stac_prop = stac_item["properties"]
    stac_prop.update(stac_eo)
    return stac_item


@cache
def get_loc(url_path: str, local: bool = False, relative: bool = False) -> str:
    loc_from = CATALOG_ROOT_DIR
    loc_dest = EUROSAT_STAC_URL
    if local:
        loc_from, loc_dest = loc_dest, loc_from
    url_path = url_path.replace(loc_from, "").lstrip("/")
    if relative:
        return url_path
    url_path = os.path.join(loc_dest, url_path)
    return url_path


def save_json(data: Dict[str, Any], json_file_path: str) -> None:
    with open(json_file_path, mode="w", encoding="utf-8") as fs:
        json.dump(data, fs, indent=2, ensure_ascii=False, sort_keys=False)


def load_json(json_file_path: str) -> Dict[str, Any]:
    if json_file_path.split("://")[0] in ["http", "https"]:
        return request_json(json_file_path)
    with open(json_file_path, mode="r", encoding="utf-8") as fs:
        return json.load(fs)


@cache  # avoid rate-limiting as much as possible
def request_json(url: str) -> Dict[str, Any]:
    try:
        return requests.get(url, headers={"Accept": "application/json"}).json()
    except requests.exceptions.RequestException as req_exc:
        raise ValueError(f"Failed retrieval of JSON content from [{url}]") from req_exc


SCHEMA_MAPPING = {}

@cache  # WARNING: don't use for JSON contents that should change over runs to apply new updates (use 'load_json' instead)
def load_schema(json_file_path: str) -> Dict[str, Any]:
    if json_file_path in SCHEMA_MAPPING:
        return SCHEMA_MAPPING[json_file_path]
    SCHEMA_MAPPING[json_file_path] = load_json(json_file_path)
    return SCHEMA_MAPPING[json_file_path]
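
The warning about cached JSON contents deserves a short illustration. In the sketch below, `fetch_definition` is a hypothetical loader standing in for an HTTP request: repeated calls return the very same cached `dict`, so a caller that wants to modify the result should work on a copy.

```python
from copy import deepcopy
from functools import lru_cache
from typing import Any, Dict

CALLS = {"count": 0}


@lru_cache(maxsize=None)  # equivalent to 'functools.cache' on Python 3.9+
def fetch_definition(url: str) -> Dict[str, Any]:
    """Hypothetical loader standing in for an HTTP request."""
    CALLS["count"] += 1
    return {"url": url, "links": []}


first = fetch_definition("https://example.com/schema.json")
second = fetch_definition("https://example.com/schema.json")
same_object = first is second        # the cached dict itself is returned, not a copy

safe = deepcopy(first)               # copy before modifying to keep the cache intact
safe["links"].append({"rel": "self"})

print(CALLS["count"], same_object, first["links"])
```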


In [180]:
from tqdm.notebook import tqdm

CATALOG_PATH = os.path.join(EUROSAT_STAC_DIR, "catalog.json")
CATALOG_URL = get_loc(CATALOG_PATH)
CATALOG_ROOT_LINK = {
  "rel": "root",
  "href": CATALOG_URL,
  "type": "application/json"
}


def generate_stac_collections():
    stac_catalog_collection_links: List[Link] = []
    stac_item_geometries: List[Geometry] = []
    for split_name in tqdm(DatasetEuroSAT.splits, desc="Split"):
        split: Split = "validate" if split_name == "val" else split_name
        stac_collection_name = f"EuroSAT-{EUROSAT_TYPE_DIR}-{split}"
        stac_collection = deepcopy(EUROSAT_STAC_COLLECTION_BASE)
        stac_collection["id"] = stac_collection_name
        stac_collection["title"] = stac_collection_name.replace("-", " ")
        stac_collection["description"] = (
            f"EuroSAT dataset with labeled annotations for land-cover classification and associated imagery. "
            f"This collection represents the samples part of the {split} split set for training machine learning algorithms."
        )
        stac_collection.setdefault("extent", {})  # temporary for ordering
        stac_collection.update({    # FIXME: Not valid according to its schema, {type: Feature} only
           "ml-aoi:split": split,   #   (https://github.com/stac-extensions/ml-aoi/issues/5)
        })                          #   However, this is our only property hint, so leave it for now without the extension specified
        stac_collection_path = os.path.join(EUROSAT_STAC_DIR, split, "collection.json")
        stac_collection_url = get_loc(stac_collection_path)
        stac_collection_link: Link = {
            "rel": "collection",
            "href": stac_collection_url,
            "type": "application/json",
            "title":  f"EuroSAT STAC Collection with samples from '{split}' split.",
        }
        stac_collection_parent = stac_collection_link.copy()
        stac_collection_parent["rel"] = "parent"

        data_loader = DataLoaderEuroSAT(root=EUROSAT_DATA_DIR, split=split_name, download=True)
        os.makedirs(os.path.join(EUROSAT_STAC_DIR, split), exist_ok=True)
        for sample in tqdm(data_loader, desc=f"Sample ({split})"):
            label = sample["label"]
            if not label:  # ignore tiles by themselves without annotations
                continue
            index = sample["index"]
            iloc = os.path.join(EUROSAT_STAC_DIR, split, f"item-{index}.json")
            item = convert_sample_to_stac_item(sample, split, data_loader, stac_collection_name)
            stac_item_geometries.append(item["geometry"])

            # links for each respective item
            item.setdefault("links", [])
            item_link = {
                "rel": "item",
                "href": get_loc(iloc),
                "type": "application/geo+json",
            }
            item_self = item_link.copy()
            item_self["rel"] = "self"
            item["links"].extend([
                CATALOG_ROOT_LINK,
                stac_collection_parent,
                stac_collection_link,
                item_self,
            ])

            stac_collection["links"].append(item_link)
            save_json(item, iloc)

        # full extent considering all samples
        extent_geom = unary_union([Polygon(*poly["coordinates"]) for poly in stac_item_geometries])
        stac_collection["extent"]["spatial"] = {"bbox": [extent_geom.bounds]}
        # FIXME: the paper does not seem to state an official acquisition date range
        # tradeoff: end the interval at the paper publication date (see note in previous cell)
        publication_date = "2017-06-14T00:00:00Z"
        stac_collection["extent"]["temporal"] = {"interval": [["2015-06-27T10:25:31.456Z", publication_date]]}

        # more link references
        stac_collection_self = stac_collection_link.copy()
        stac_collection_self["rel"] = "self"
        stac_catalog_collection_links.append(stac_collection_self)
        stac_collection["links"].extend([
            CATALOG_ROOT_LINK,
            stac_collection_link,
            {
                "rel": "parent",
                "href": CATALOG_URL,
                "type": "application/json",
                "title": "STAC Catalog",
            }
        ])
        stac_eo, _ = get_stac_band_details(data_loader)
        stac_collection.setdefault("properties", {})
        stac_collection["properties"].update(stac_eo)
        stac_collection["links"] = stac_collection.pop("links")  # reinsert last since long list, provide metadata higher up
        save_json(stac_collection, stac_collection_path)

    return stac_catalog_collection_links
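
The spatial extent computed in the function above comes from merging every item geometry with shapely and reading the bounds of the result. As a minimal, dependency-free sketch of the same idea, the bounding box can also be derived directly from the GeoJSON coordinates. The `bbox_of_polygons` helper and the tile coordinates below are hypothetical and only serve as illustration; they are not part of the notebook or of EuroSAT.

```python
from typing import Any, Dict, List, Tuple

GeoJSONPolygon = Dict[str, Any]


def bbox_of_polygons(polygons: List[GeoJSONPolygon]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) covering every GeoJSON Polygon given."""
    xs: List[float] = []
    ys: List[float] = []
    for poly in polygons:
        for ring in poly["coordinates"]:  # exterior ring first, then any holes
            for x, y, *_ in ring:         # tolerate an optional Z coordinate
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("at least one polygon with coordinates is required")
    return min(xs), min(ys), max(xs), max(ys)


# two made-up tiles (illustrative coordinates, not EuroSAT samples)
tiles = [
    {"type": "Polygon", "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]]},
    {"type": "Polygon", "coordinates": [[[2.0, 0.5], [3.0, 0.5], [3.0, 2.0], [2.0, 2.0], [2.0, 0.5]]]},
]

# same layout as the "extent" field of a STAC Collection: a list of bounding boxes
extent = {"spatial": {"bbox": [list(bbox_of_polygons(tiles))]}}
```

A STAC Collection stores `bbox` as a list of boxes, which is why the single overall box is wrapped in an outer list.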


In [181]:
stac_catalog_collections = generate_stac_collections()
catalog_collections_refs = deepcopy(stac_catalog_collections)
for link in catalog_collections_refs:
    link["rel"] = "child"


catalog: STACCatalog = {
    "id": "example",
    "type": "Catalog",
    "title": "Example STAC Catalog",
    "stac_version": STAC_VERSION,
    "description": "Example catalog with annotated label collections.",
    "links": [
        CATALOG_ROOT_LINK,
        {
            "rel": "self",
            "href": CATALOG_URL,
            "type": "application/json"
        }
    ] + catalog_collections_refs
}
save_json(catalog, CATALOG_PATH)


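The catalog cell above turns the links returned by `generate_stac_collections()` into `child` links by changing `rel` on a deep copy. The short sketch below, using made-up `example.com` links, illustrates why the copy matters: each link is a dictionary, so editing it in place would also alter the original list.

```python
from copy import deepcopy

# hypothetical 'self' links, one per split collection (URLs are placeholders)
collection_links = [
    {"rel": "self", "href": "https://example.com/stac/train/collection.json", "type": "application/json"},
    {"rel": "self", "href": "https://example.com/stac/test/collection.json", "type": "application/json"},
]

# deepcopy creates independent dictionaries, so the originals keep rel="self"
child_links = deepcopy(collection_links)
for link in child_links:
    link["rel"] = "child"

print([link["rel"] for link in collection_links])
print([link["rel"] for link in child_links])
```

A plain `list(collection_links)` would only copy the outer list, and the loop would then rewrite the shared dictionaries.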
## STAC Schema Validation

Verify that all generated STAC Catalog, Collection and Item files respect their respective STAC schemas.

**Note** <br>
If using the full `EuroSAT` dataset, this can take a while to complete, since the STAC Item of each of the 27000 samples is validated
against every schema listed below. It is recommended to use the `EuroSAT100` subset to quickly check that modified STAC definitions above
still produce valid STAC Collections and Items.
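
The validation cell below consults `SCHEMA_ALIAS` and `SCHEMA_IGNORE` for each schema reference before loading it. The following is a minimal sketch of that lookup, isolated in a hypothetical `resolve_schema` helper with placeholder `example.com` URLs. It is not the notebook's own code, only an illustration of the rule it applies: substitute the alias when one exists, and skip the schema when either the declared or the aliased reference is ignored.

```python
from typing import Dict, List, Optional


def resolve_schema(schema_ref: str, aliases: Dict[str, str], ignored: List[str]) -> Optional[str]:
    """Return the URL to load for 'schema_ref', or None when its validation is skipped."""
    target = aliases.get(schema_ref, schema_ref)
    # skip when either the declared reference or its alias is listed as ignored
    if schema_ref in ignored or target in ignored:
        return None
    return target


# placeholder references, assumed only for this example
aliases = {"https://example.com/ext/v1/schema.json": "https://example.com/raw/ext/schema.json"}
ignored = ["https://example.com/broken/v1/schema.json"]

for ref in [
    "https://example.com/ext/v1/schema.json",     # aliased
    "https://example.com/broken/v1/schema.json",  # ignored
    "https://example.com/core/v1/schema.json",    # used as-is
]:
    print(ref, "->", resolve_schema(ref, aliases, ignored))
```

Keeping the declared reference in the generated STAC files while loading the alias means the files need no further change once the expected schema is published.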

In [185]:
from jsonschema.exceptions import ValidationError
from jsonschema.validators import validate

# schemas that are not yet officially published, so their expected URL returns 404 or another retrieval error
# temporarily load them from the aliased location, but keep the "expected" schema reference in the STAC Collections, Items, etc.
SCHEMA_ALIAS = {
    "https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json": "https://raw.githubusercontent.com/stac-extensions/ml-aoi/main/json-schema/schema.json",
}

# schemas for which validation is skipped
# avoid this except in special cases where an issue must first be resolved in the referenced schema definition
SCHEMA_IGNORE = [
    # FIXME: Invalid ml-aoi schema causes validation error (https://github.com/stac-extensions/ml-aoi/issues/6)
    #   temporarily disable its validation, but leave its references defined within the STAC Items
    "https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json"
]

stac_catalog_dir = os.path.dirname(CATALOG_PATH)  # limit search to active dataset variant
stac_catalog_files = []
stac_collection_files = []
stac_item_files = []
for root, _, files in os.walk(stac_catalog_dir):
    for file in files:
        file_path = os.path.join(root, file)
        if file == "catalog.json":
            stac_catalog_files.append(file_path)
        elif file == "collection.json":
            stac_collection_files.append(file_path)
        elif file.startswith("item-") and file.endswith(".json"):
            stac_item_files.append(file_path)

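assert stac_catalog_files, f"no 'catalog.json' found under [{stac_catalog_dir}]"
assert stac_collection_files, f"no 'collection.json' found under [{stac_catalog_dir}]"
assert stac_item_files, f"no 'item-*.json' found under [{stac_catalog_dir}]"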

stac_file = None
schema_file = None
try:
    for stac_type, files, schemas in tqdm(
        [
            ("Catalog", stac_catalog_files, STAC_CATALOG_SCHEMAS),
            ("Collection", stac_collection_files, STAC_COLLECTION_SCHEMAS),
            ("Items", stac_item_files, STAC_ITEM_SCHEMAS),
        ],
        desc="Validating STAC schemas",
    ):
        for stac_file in tqdm(sorted(files), desc=f"Validating STAC {stac_type} files"):
            print(f"[{get_loc(stac_file, relative=True)}] testing...")
            content = load_json(stac_file)
            for schema_file in schemas:
                old_schema_file = schema_file
                if schema_file in SCHEMA_ALIAS:
                    schema_file = SCHEMA_ALIAS[schema_file]
                if any(schema_ref in SCHEMA_IGNORE for schema_ref in [schema_file, old_schema_file]):
                    print(f"- [{schema_file}]... IGNORED!")
                    continue
                schema = load_schema(schema_file)
                print(f"- [{schema_file}]... ", end="")
                validate(content, schema)
                print("OK")
            print("")
except ValidationError as exc:
    raise AssertionError(f"Failed [{stac_file}] validation against [{schema_file}]") from exc


Validating STAC schemas:   0%|          | 0/3 [00:00<?, ?it/s]

Validating STAC Catalog files:   0%|          | 0/1 [00:00<?, ?it/s]

[EuroSAT/stac/subset/catalog.json] testing...
- [https://schemas.stacspec.org/v1.0.0/catalog-spec/json-schema/catalog.json]... OK



Validating STAC Collection files:   0%|          | 0/3 [00:00<?, ?it/s]

[EuroSAT/stac/subset/test/collection.json] testing...
- [https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json]... OK
- [https://stac-extensions.github.io/eo/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/scientific/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/version/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/view/v1.0.0/schema.json]... OK

[EuroSAT/stac/subset/train/collection.json] testing...
- [https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json]... OK
- [https://stac-extensions.github.io/eo/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/scientific/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/version/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/view/v1.0.0/schema.json]... OK

[EuroSAT/stac/subset/validate/collection.json] testing...
- [https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json]... OK
- [

Validating STAC Items files:   0%|          | 0/100 [00:00<?, ?it/s]

[EuroSAT/stac/subset/test/item-0.json] testing...
- [https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json]... OK
- [https://stac-extensions.github.io/eo/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/file/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/raster/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/label/v1.0.1/schema.json]... OK
- [https://raw.githubusercontent.com/stac-extensions/ml-aoi/main/json-schema/schema.json]... IGNORED!
- [https://stac-extensions.github.io/version/v1.0.0/schema.json]... OK

[EuroSAT/stac/subset/test/item-1.json] testing...
- [https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json]... OK
- [https://stac-extensions.github.io/eo/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/file/v1.0.0/schema.json]... OK
- [https://stac-extensions.github.io/raster/v1.1.0/schema.json]... OK
- [https://stac-extensions.github.io/label/v1.0.1/schema.json]... OK
- [https://raw.githubuse