This notebook was automatically tested on the EOTDL platform with kernel **eotdl-2023.10**
<details>
    
```
    channels:
    - conda-forge
    dependencies:
    - python=3.9
    - ipykernel
    - geojson
    - ipyleaflet
    - ipywidgets
    - jupyterlab_widgets
    - jupyterlab-geojson
    - libgcc
    - papermill
    - timm
    - geomet
    - matplotlib
    - lightning
    - rioxarray
    - pystac
    - pyproj
    - mlflow
    - pip:
      - pytorch-eo==2023.7.21
      - eotdl==2023.9.14.post4
    description: ''
    name: eotdl-2023.10
    prefix: null
```
</details>

# STAC EuroSAT

In this demo we generate STAC metadata for the [EuroSAT](https://github.com/phelber/EuroSAT) dataset.

In [1]:
import os
from pathlib import Path
path = Path(os.environ.get('EOTDL_DOWNLOAD_PATH'), 'EuroSAT')
assert path.exists()  # call the method; the bare attribute is always truthy

> ***Note:*** The hosted [EOTDL](https://eotdl.com) platform makes onboarded training datasets readily available. To access the datasets from your local machine or other platforms, please check the corresponding [tutorial notebooks](https://notebooks.api.eotdl.com/?search=offline)!

In [2]:
import pandas 
from glob import glob 
from random import sample

The EuroSAT dataset consists of 27,000 Sentinel-2 images with one label per image for scene classification. There are 10 different categories in total. We draw a random sample of 100 images for fast prototyping.

In [3]:
images = glob(str(path) + '/ds/**/*.tif', recursive=True)
images = sample(images, 100)
labels = [x.split('/')[-1].split('_')[0] for x in images]
cats = sorted(os.listdir(path / 'ds/images/remote_sensing/otherDatasets/sentinel_2/tif'))
ixs = [cats.index(x) for x in labels]

df = pandas.DataFrame({'image': images, 'label': labels, 'ix': ixs})
df

Unnamed: 0,image,label,ix
0,/cache/datasets/EuroSAT/ds/images/remote_sensi...,Residential,7
1,/cache/datasets/EuroSAT/ds/images/remote_sensi...,SeaLake,9
2,/cache/datasets/EuroSAT/ds/images/remote_sensi...,AnnualCrop,0
3,/cache/datasets/EuroSAT/ds/images/remote_sensi...,Forest,1
4,/cache/datasets/EuroSAT/ds/images/remote_sensi...,HerbaceousVegetation,2
...,...,...,...
95,/cache/datasets/EuroSAT/ds/images/remote_sensi...,HerbaceousVegetation,2
96,/cache/datasets/EuroSAT/ds/images/remote_sensi...,PermanentCrop,6
97,/cache/datasets/EuroSAT/ds/images/remote_sensi...,Residential,7
98,/cache/datasets/EuroSAT/ds/images/remote_sensi...,HerbaceousVegetation,2


In [4]:
df.ix.unique()

array([7, 9, 0, 1, 2, 3, 8, 4, 6, 5])
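
The label and index columns above come purely from file names. As a minimal sketch of that step, assuming file names of the form `<Class>_<n>.tif` (the paths below are hypothetical, and `label_from_path` is an illustrative helper, not part of the notebook):

```python
from pathlib import PurePosixPath

# Hypothetical file paths following the <Class>/<Class>_<n>.tif layout
example_images = [
    "tif/Residential/Residential_12.tif",
    "tif/SeaLake/SeaLake_7.tif",
    "tif/AnnualCrop/AnnualCrop_301.tif",
]

def label_from_path(image_path):
    """Return the class name encoded before the first underscore of the file name."""
    return PurePosixPath(image_path).stem.split("_")[0]

example_labels = [label_from_path(p) for p in example_images]
# Sorting the class names makes the label -> index mapping reproducible
example_cats = sorted(set(example_labels))
example_ixs = [example_cats.index(label) for label in example_labels]
print(list(zip(example_labels, example_ixs)))
```

In the notebook the category list is read from the dataset folders, so it always contains all 10 classes; the sketch only indexes the classes present in its three example paths, which is why the indexes differ from the table above.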

We start by generating STAC metadata following the core STAC specification, using [PySTAC](https://pystac.readthedocs.io/en/stable/). We generate:

- a STAC item for every image in the dataset,
- a STAC collection to represent the collection of images,
- a STAC catalog to represent the final dataset (which will also include the annotations).
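
To make the item / collection / catalog hierarchy concrete, here is a simplified sketch of the three objects as plain Python dictionaries. All values are illustrative assumptions: the item id, the link targets, the world-wide bounding box, and the `license` placeholder are not taken from the dataset, and the files PySTAC writes contain more fields and links.

```python
import json

# Illustrative values only; real STAC files carry more fields and links.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-image",                      # hypothetical item id
    "geometry": None,                           # the image footprint in practice
    "properties": {"datetime": "2015-10-22T00:00:00Z"},
    "links": [],
    "assets": {},                               # one entry per file belonging to the item
}

collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "sentinel2",
    "description": "EuroSAT Sentinel 2 dataset",
    "license": "proprietary",                   # placeholder, not the dataset's actual license
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2015-10-22T00:00:00Z", "2019-10-22T00:00:00Z"]]},
    },
    # a collection points to its items
    "links": [{"rel": "item", "href": "./example-image/example-image.json"}],
}

catalog = {
    "type": "Catalog",
    "stac_version": "1.0.0",
    "id": "eurosat",
    "description": "EuroSAT dataset",
    # a catalog points to its child collections
    "links": [{"rel": "child", "href": "./sentinel2/collection.json"}],
}

print(json.dumps(catalog, indent=2))
```

The nesting is expressed only through `links`: the catalog references the collection as a `child`, and the collection references each image as an `item`.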

In [5]:
import pystac
from datetime import datetime
import rasterio as rio
import uuid
from shapely.geometry import GeometryCollection, Polygon, box, shape, mapping
from tqdm import tqdm

In [6]:
# create empty catalog

eurosat = pystac.Catalog(id="eurosat", description="EuroSAT dataset")
eurosat

In [7]:
# create collection

# spatial extent (placeholder; it should be computed from the images)
sp_extent = pystac.SpatialExtent([None,None,None,None])

# temporal extent (should compute from images or given by authors)
from_date = datetime.strptime('2015-10-22', '%Y-%m-%d') # unknown
to_date = datetime.strptime('2019-10-22', '%Y-%m-%d') # unknown
tmp_extent = pystac.TemporalExtent([(from_date, to_date)])

extent = pystac.Extent(sp_extent, tmp_extent)

sentinel = pystac.Collection(id='sentinel2', description = 'EuroSAT Sentinel 2 dataset', extent = extent)
eurosat.add_child(sentinel)

eurosat

In [8]:
# creating items

dst_path = path / 'eurosat'
os.makedirs(dst_path, exist_ok=True)

def create_item(image):
    # create a STAC item for one image; each band is written to its own GeoTIFF asset
    params = {}
    params['id'] = image.split('/')[-1].split('.')[0] # use original name
    params['datetime'] = from_date # unknown
    params['properties'] = {}
    with rio.open(image) as src:
        # bounds are in the image CRS, while STAC expects WGS 84 longitude/latitude
        params['bbox'] = list(src.bounds)
        params['geometry'] = mapping(box(*params['bbox']))
        i = pystac.Item(**params)
        for band in src.indexes:
            image_dst_path = dst_path / f"{params['id']}_B{band}.tif"
            out_meta = src.meta.copy()
            out_meta.update({"count": 1})
            with rio.open(image_dst_path, "w", **out_meta) as dest:
                dest.write(src.read(band), 1)
            i.add_asset(key=f'B{band}', asset=pystac.Asset(href=str(image_dst_path), title='Geotiff', media_type=pystac.MediaType.GEOTIFF))
    return i
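
The asset naming used in `create_item` can be isolated in a small sketch that needs no raster data. `band_asset_paths` is a hypothetical helper introduced only for illustration, and the file name and band range below are made up:

```python
from pathlib import Path

def band_asset_paths(image, dst_dir, band_indexes):
    """Map asset keys (B1, B2, ...) to the per-band output files of one image."""
    item_id = Path(image).stem  # file name without extension, used as the item id
    return {
        f"B{band}": Path(dst_dir) / f"{item_id}_B{band}.tif"
        for band in band_indexes
    }

# Hypothetical image with three bands, using 1-based indexes
example_assets = band_asset_paths("Forest_1.tif", "eurosat", range(1, 4))
for key, href in example_assets.items():
    print(key, href)
```

Note that the keys are positional band indexes (rasterio numbers bands from 1), so they do not necessarily match the official Sentinel-2 band names.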

In [9]:
# import multiprocessing
# from concurrent.futures import ProcessPoolExecutor

# num_cores = multiprocessing.cpu_count()
# with ProcessPoolExecutor(max_workers=num_cores) as pool:
#     with tqdm(total=len(images)) as progress:
#         futures = []
#         for image in df.image:
#             future = pool.submit(create_item, image) 
#             future.add_done_callback(lambda p: progress.update())
#             futures.append(future)
#         items = []
#         for future in futures:
#             result = future.result()
#             items.append(result)

items = [create_item(image) for image in tqdm(df.image)]
            
for item in tqdm(items):
    sentinel.add_item(item)

100%|██████████| 100/100 [05:37<00:00,  3.38s/it]
100%|██████████| 100/100 [00:00<00:00, 46015.40it/s]


In [10]:
# reset spatial extent

bounds = [list(GeometryCollection([shape(s.geometry) for s in eurosat.get_all_items()]).bounds)]
sentinel.extent.spatial = pystac.SpatialExtent(bounds)
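
The bounds of the geometry collection are simply the smallest box that contains every item footprint. For axis-aligned boxes the same result can be sketched without shapely; `union_bbox` is an illustrative helper and the coordinates below are made up:

```python
def union_bbox(bboxes):
    """Return the smallest [min_x, min_y, max_x, max_y] box enclosing all input boxes."""
    min_xs, min_ys, max_xs, max_ys = zip(*bboxes)
    return [min(min_xs), min(min_ys), max(max_xs), max(max_ys)]

# Made-up bounding boxes in [min_x, min_y, max_x, max_y] order
example_bboxes = [
    [0.0, 0.0, 1.0, 1.0],
    [0.5, -2.0, 3.0, 0.5],
]
print(union_bbox(example_bboxes))  # [0.0, -2.0, 3.0, 1.0]
```

As in the cell above, the result would be wrapped in an outer list before being passed to `pystac.SpatialExtent`.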

In [11]:
eurosat.normalize_hrefs('eurosat-stac')

In [12]:
# eurosat.validate_all()

In [13]:
eurosat.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

We have created a STAC catalog for our dataset!

In [14]:
eurosat = pystac.Catalog.from_file('eurosat-stac/catalog.json')
eurosat