# Example: using static STAC catalogues with xarray and Dask or Rasterio

This example shows how to use static [STAC](https://stacspec.org/en/about/) (SpatioTemporal Asset Catalog) catalogues with [xarray](https://docs.xarray.dev/en/stable/) and [Dask](https://www.dask.org/) to process large raster datasets, including time series. As an extra, it also shows how to find data URLs from STAC and use them with `Rasterio`. The main idea is to first find the data in a STAC catalogue and then define the processing as a task graph in Dask. Downloading and processing are done lazily at the end, so only the data that is actually needed (the required bands and area) is downloaded. The libraries take care of the data download, so you do not need to handle file paths yourself. These tools work best when the data is provided as [Cloud Optimized GeoTIFFs](https://www.cogeo.org/) (COGs).
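The lazy pattern described above can be illustrated with a minimal Dask sketch that needs no remote data. The array shape and chunk size are arbitrary values chosen for illustration, not taken from this example's data.

```python
import dask.array as da

# Define a chunked array; nothing is computed or loaded yet.
arr = da.ones((4000, 4000), chunks=(1000, 1000))

# Building on it only extends the task graph.
lazy_mean = (arr * 2).mean()
print(type(lazy_mean))  # still a lazy Dask array

# The work happens only when the result is requested.
value = float(lazy_mean.compute())
print(value)
```

The same principle applies to the datacube later in this example: selections and filters are cheap, and data is fetched only when a plot or `.values` asks for it.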

To try out this example, it is recommended to start an interactive [Jupyter session](https://docs.csc.fi/computing/webinterface/jupyter/) in the [Puhti web interface](https://docs.csc.fi/computing/webinterface/), for example with 2 cores and 12 GB of memory.

Dask is used to parallelize the computation. See the [CSC Dask tutorial](https://docs.csc.fi/support/tutorials/dask-python/), which also covers how to use Dask with Jupyter in the
Puhti web interface and how to create batch jobs with Dask.

We will search for Sentinel-2 data overlapping central Helsinki.

The main steps:
* Start a Dask cluster.
* Read the STAC catalogue to copy the metadata of all Sentinel-2 L2A 10-day (dekad) mosaics to memory.
* Create a datacube of the images for the area of interest and the required bands.
* Calculate a time series for a point.
* Filter images from summer 2020 only.
* Finally, plot the individual 10-day mosaic images.
* As an extra, find data URLs from STAC and use them with `Rasterio`.

This example uses the `Sentinel-2_global_mosaic_dekadi` collection of the [FMI Tuulituhohaukka STAC catalogue](https://pta.data.lit.fmi.fi/stac/catalog.json), but there are several [other STAC catalogues available](https://stacspec.org/en/about/datasets/). The FMI data is stored in Sodankylä, Finland, and is openly available without registration.

This example works with the [geoconda module](https://docs.csc.fi/apps/geoconda/) in Puhti; the required libraries can be seen from the imports.

The example is partly based on [Stackstac documentation](https://stackstac.readthedocs.io/en/latest/basic.html) and [Organizing Geospatial data with Spatio Temporal Assets Catalogs — STAC using python](https://towardsdatascience.com/organizing-geospatial-data-with-spatio-temporal-assets-catalogs-stac-using-python-45f1a64ca082).

In [None]:
import numpy as np
from pystac import Catalog, Collection
import pystac_client
import stackstac

Start a Dask cluster.

To follow how Dask works, open the [Dask Dashboard or JupyterLab Dask Extension](https://docs.csc.fi/support/tutorials/dask-python/#dask-with-jupyter).

In [None]:
from dask.distributed import Client

client = Client()
client
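`Client()` without arguments lets Dask choose the number of workers itself. As an alternative to the call above (not in addition to it), the cluster can be sized explicitly. The values below are assumptions chosen to fit a 2-core session, and `processes=False` is used here only to keep the workers as threads inside the notebook process.

```python
from dask.distributed import Client

# Assumed sizing: 2 workers with 1 thread each, for a 2-core Jupyter session.
# processes=False runs the workers as threads in the current process.
sized_client = Client(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    memory_limit="5GB",  # per worker; an assumed value below the session total
)

n_workers = len(sized_client.scheduler_info()["workers"])
print(f"Workers started: {n_workers}")

sized_client.close()
```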

## Read static catalogues with `pystac_client`

**This currently does not work with the FMI STAC catalogue because of minor syntax problems on the FMI side. It should work with valid STAC catalogues.** If you are in a hurry, jump to `Read static catalogues with PyStac`.

Open the static catalogue and read its general info.

In [None]:
URL = "https://pta.data.lit.fmi.fi/stac/catalog.json"
catalog = pystac_client.Client.open(URL)
print(f"ID: {catalog.id}")
print(f"Title: {catalog.title or 'N/A'}")
print(f"Description: {catalog.description or 'N/A'}")
print(f"Links: {catalog.links or 'N/A'}")

Which collections does the catalogue include? This does not currently work with the FMI catalogue.

In [None]:
# collections = list(catalog.get_collections())

# print(f"Number of collections: {len(collections)}")
# print("Collections IDs:")
# for collection in collections:
#     print(f"- {collection.id}")

Select one collection. From the FMI catalogue, currently only the `Tuulituhoriski` collection works.

In [None]:
collection = catalog.get_collection('Tuulituhoriski')
collection

## Read static catalogues with `PyStac`

If you already know the link to the collection's .json page, or want to avoid the FMI problems with listing collections, you can open the collection directly with the PyStac library. Look for other FMI links here: https://pta.data.lit.fmi.fi/stac/catalog.json

In [None]:
collection = Collection.from_file('https://pta.data.lit.fmi.fi/stac/catalog/Sentinel-2_global_mosaic_dekadi/Sentinel-2_global_mosaic_dekadi.json')
# collection = Collection.from_file('https://pta.data.lit.fmi.fi/stac/catalog/Sentinel-1_dekadi_mosaiikki/Sentinel-1_dekadi_mosaiikki.json')
# collection = Collection.from_file('https://pta.data.lit.fmi.fi/stac/catalog/Tuulituhoriski/Tuulituhoriski.json')
# collection = Collection.from_file('https://pta.data.lit.fmi.fi/stac/catalog/Metsavarateema/Metsavarateema.json')

Get a list of all items (images) in the collection.

Depending on the collection size, this step might take some time (even minutes), and for really big collections it is not feasible.

In [None]:
items = list(collection.get_all_items())

See how many items were found and some basic info about the first item.

In [None]:
print(f"Number of items: {len(items)}")

for i, item in enumerate(items[:1]):
    print(f"{i}: {item}", flush=True)
    print(f"{i}: {item.bbox}", flush=True)
    print(f"{i}: {item.properties}", flush=True)
    print(f"{i}: {item.assets}", flush=True)

From this it can be seen that:
* This specific collection included 184 items at the time of writing, but it is updated regularly, so the number will increase.
* Each item has several assets: the different Sentinel-2 band values, plus additional values from the original data and the mosaicking.
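For a more compact overview than the full printout above, the asset keys and file URLs can be listed from an item's dictionary form. The sketch below uses a small hand-made dictionary in place of a real item: its id and URLs are hypothetical, and only the layout (`assets` mapping keys to entries with an `href`) follows the STAC item structure. The helper name `summarize_assets` is also made up for this illustration.

```python
def summarize_assets(item_dict):
    """Return {asset_key: href} for one STAC item given as a dictionary."""
    return {key: asset["href"] for key, asset in item_dict["assets"].items()}


# Hypothetical item, shaped like the output of item.to_dict()
example_item = {
    "id": "example_item",
    "properties": {"start_datetime": "2020-08-01T00:00:00Z"},
    "assets": {
        "b03": {"href": "https://example.com/b03.tif"},
        "b04": {"href": "https://example.com/b04.tif"},
    },
}

asset_urls = summarize_assets(example_item)
for key, href in asset_urls.items():
    print(f"{key}: {href}")
```

With the real data this could be called as `summarize_assets(items[0].to_dict())`.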

To create the `xarray` DataArray, we need to provide the dataset's coordinate system and pixel size manually, because FMI does not provide them in a form that stackstac understands.

To find these out, run `gdalinfo` on one of the bands we will use.

In [None]:
!gdalinfo /vsicurl/https://pta.data.lit.fmi.fi/sen2/s2m_b04/s2m_sdr_20170201-20170210_b04_r20m.tif

The `gdalinfo` output shows that the data is in the EPSG:3067 (Finnish TM35FIN) coordinate system and that the pixel size is 20 meters.

Next, let's create the xarray DataArray from all the found items, limiting it to the area of interest (central Helsinki) and selecting only bands 2 to 4. The band names can be seen in the item metadata printout above.

In [None]:
cube = stackstac.stack(
    items=[item.to_dict() for item in items],    # stackstac needs the items as dictionaries
    assets=['b04', 'b03', 'b02'],
    epsg=3067,
    resolution=20,
    bounds=(385480, 6671940, 387480, 6673940)
).squeeze()
cube
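The `bounds` and `resolution` values above determine the spatial size of the cube. As a quick check, the arithmetic can be worked out directly; the variable names below are only for illustration.

```python
# Bounds used above, in EPSG:3067 meters: (min_x, min_y, max_x, max_y)
min_x, min_y, max_x, max_y = (385480, 6671940, 387480, 6673940)
resolution = 20  # pixel size in meters

width_m = max_x - min_x
height_m = max_y - min_y
n_cols = width_m // resolution
n_rows = height_m // resolution

print(f"Area: {width_m} m x {height_m} m")
print(f"Grid implied by the bounds: {n_cols} x {n_rows} pixels per band and time step")
```

So the area of interest is a 2 km by 2 km square, which keeps the amount of data fetched per image small.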

Unfortunately, the time information is not created correctly automatically either; see NaT on the `time` row above.

The dates are given correctly under `start_datetime` and `end_datetime`, so we will use `start_datetime` below as a replacement for `time`. Because `start_datetime` was not read in a proper datetime format, we also need to convert it from string to the `datetime64` type.

In [None]:
cube2 = cube.assign_coords(time=np.array(cube.start_datetime.values,dtype=np.datetime64))
cube2
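The same replacement can be tried on a small synthetic cube, without any download. This sketch is a variant rather than the method used above: it assumes the date strings carry a trailing `Z` and parses them with pandas, which handles the timezone marker explicitly. The array contents and names are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube: 3 time steps of 2x2 pixels, with unusable NaT time values
start_strings = ["2020-08-01T00:00:00Z", "2020-08-11T00:00:00Z", "2020-08-21T00:00:00Z"]
demo = xr.DataArray(
    np.zeros((3, 2, 2)),
    dims=("time", "y", "x"),
    coords={
        "time": np.full(3, np.datetime64("NaT"), dtype="datetime64[ns]"),
        "start_datetime": ("time", start_strings),
    },
)

# Parse the strings as UTC, then drop the timezone to get plain datetime64 values
parsed = pd.to_datetime(demo.start_datetime.values, utc=True).tz_localize(None)
demo_fixed = demo.assign_coords(time=parsed)
print(demo_fixed.time.values)
```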

Next we will plot a time series for a single pixel from 2017 onwards (to 2022 at the time of writing).
But first, to avoid problems with a specific broken file, remove the mosaic for 2020-01-21 to 2020-01-31.

In [None]:
cube3 = cube2[cube2.id!='Sentinel-2_global_mosaic_dekadi_2020-01-21_2020-01-31']
cube3

Select the data for one pixel for the full time series.

In [None]:
b02_timeseries = cube3.sel(x=386600.0, y=6672680.0, band='b02')
b02_timeseries

So far we have downloaded only the metadata for the datacube. For the next plot the actual data is also downloaded, but only as much as is needed for the plot. Plotting takes a moment, please wait.

In [None]:
b02_timeseries.plot()
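The pixel selection above uses coordinate values that exist exactly in the cube. For an arbitrary point, `sel` accepts `method="nearest"`, which picks the pixel whose coordinate label is closest. A minimal sketch on a synthetic grid follows; the coordinates and values are made up.

```python
import numpy as np
import xarray as xr

# Synthetic single-band grid with 20 m pixels
x = np.arange(385480, 385580, 20)        # 5 columns
y = np.arange(6673940, 6673840, -20)     # 5 rows, descending like a north-up raster
grid = xr.DataArray(
    np.arange(25).reshape(5, 5),
    dims=("y", "x"),
    coords={"y": y, "x": x},
)

# A point that does not fall exactly on a coordinate value
pixel = grid.sel(x=385507.0, y=6673895.0, method="nearest")
print(float(pixel.x), float(pixel.y), int(pixel))
```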

Finally, to plot some summer images from 2020, select the data for this period from the datacube.

In [None]:
cube_2020 = cube3[cube3["time"] > np.datetime64('2020-05-31T00:00:00.000000000')]
cube_2020_summer = cube_2020[cube_2020["time"] < np.datetime64('2020-08-31T00:00:00.000000000')]
cube_2020_summer
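When the `time` coordinate is sorted, a period can also be selected with a label-based slice instead of two comparisons. The sketch below shows this on a synthetic 10-day series; the data is made up. Unlike the strict comparisons above, both ends of a slice are inclusive.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 10-day series for 2020: steps start on day 1, 11 and 21 of each month
times = pd.to_datetime(
    [f"2020-{month:02d}-{day:02d}" for month in range(1, 13) for day in (1, 11, 21)]
)
series = xr.DataArray(np.arange(len(times)), dims="time", coords={"time": times})

# Label-based slice; both end points are inclusive
summer = series.sel(time=slice("2020-06-01", "2020-08-30"))
print(summer.time.values)
```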

Plot the data for each 10-day period.

In [None]:
cube_2020_summer.plot.imshow(row="time", rgb="band", robust=True, size=6)
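The `robust=True` option makes xarray compute the display range from the 2nd and 98th percentiles instead of the minimum and maximum, so a few extreme pixels do not darken the whole image. The effect can be sketched with plain NumPy on synthetic values; the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic reflectance-like values with a few extreme outliers
band = rng.uniform(200, 1500, size=(100, 100))
band[0, :5] = 20000

# Percentile stretch: the 2nd and 98th percentiles become the display limits
vmin, vmax = np.nanpercentile(band, [2, 98])
stretched = np.clip((band - vmin) / (vmax - vmin), 0, 1)

print(f"Display range: {vmin:.0f} to {vmax:.0f}")
print(f"Stretched range: {stretched.min()} to {stretched.max()}")
```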

Some of the images do not look correct, let's check what is wrong.

In [None]:
cube_2020_summer[cube_2020_summer["time"]==np.datetime64('2020-08-21T00:00:00.000000000')].values #OK data

In [None]:
cube_2020_summer[cube_2020_summer["time"]==np.datetime64('2020-08-11T00:00:00.000000000')].values #First band all 0, second all nan, third ok.

In [None]:
cube_2020_summer[cube_2020_summer["time"]==np.datetime64('2020-08-01T00:00:00.000000000')].values #Second band all 0.
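Instead of reading the raw values by eye, the share of zero and NaN pixels can be summarized per time step and band. The sketch below uses a synthetic cube with the same dimension names; the data is made up to mimic the kind of problem seen above.

```python
import numpy as np
import xarray as xr

# Synthetic cube: 2 time steps, 3 bands, 4x4 pixels
data = np.ones((2, 3, 4, 4))
data[0, 0] = 0          # first time step: first band all zeros
data[0, 1] = np.nan     # first time step: second band all NaN
demo = xr.DataArray(
    data,
    dims=("time", "band", "y", "x"),
    coords={
        "time": np.array(["2020-08-11", "2020-08-21"], dtype="datetime64[ns]"),
        "band": ["b04", "b03", "b02"],
    },
)

# Share of zero and NaN pixels for every time step and band
zero_share = (demo == 0).mean(dim=("y", "x"))
nan_share = demo.isnull().mean(dim=("y", "x"))
print(zero_share.to_pandas())
print(nan_share.to_pandas())
```

Applied to the real cube, this triggers the download of the pixels involved, so it is best used on a small subset such as `cube_2020_summer`.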

## Using data with Rasterio

If you want to work with Python packages other than `xarray`, or want to double-check the data problem, you can use the items list created at the beginning of this notebook to find which files belong to a specific date.

In [None]:
# Note: this search compares strings, not proper dates.
def search_items(items, date):
    for item in items:
        if item.properties["start_datetime"] == (date):
            return item
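Because the search above compares strings, it only finds exact matches. A date-aware variant can parse `start_datetime` first and then search within a range. This is a sketch with hypothetical helper names (`parse_stac_datetime`, `search_items_between`), and it uses simple stand-in objects instead of real pystac items.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def parse_stac_datetime(text):
    """Parse an ISO 8601 string such as '2020-08-01T00:00:00Z' into an aware datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def search_items_between(items, start, end):
    """Return the items whose start_datetime falls in the range [start, end)."""
    return [
        item for item in items
        if start <= parse_stac_datetime(item.properties["start_datetime"]) < end
    ]


# Stand-ins for pystac items: only the .id and .properties attributes are mimicked
demo_items = [
    SimpleNamespace(id=f"demo_{day}", properties={"start_datetime": f"2020-08-{day}T00:00:00Z"})
    for day in ("01", "11", "21")
]

found = search_items_between(
    demo_items,
    datetime(2020, 8, 1, tzinfo=timezone.utc),
    datetime(2020, 8, 15, tzinfo=timezone.utc),
)
found_ids = [item.id for item in found]
print(found_ids)
```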

In [None]:
a = search_items(items, '2020-08-01T00:00:00Z')
a.assets

The files can then be checked with `gdalinfo`. Compared to the command at the beginning, which fetched only the stored metadata, the `-stats` flag also calculates statistics for the file, which requires temporarily downloading all of its data. So it takes a moment to finish.

In [None]:
b03_path = '/vsicurl/' + a.assets["b03"].href
b04_path = '/vsicurl/' + a.assets["b04"].href

In [None]:
!gdalinfo {b03_path} -stats

The missing data seems to be limited to the Helsinki area, because at file level the statistics look OK.

Next, plot the data and its histogram for the Helsinki area with rasterio.

In [None]:
import rasterio
import matplotlib.pyplot as plt
from rasterio.windows import from_bounds
from rasterio.plot import show
from rasterio.plot import show_hist

In [None]:
# Create a 2x2 grid of subplots
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(15, 15))

# Add band3 map and histogram, not OK
with rasterio.open(b03_path) as src:
    rst = src.read(1, window=from_bounds(385480, 6671940, 387480, 6673940, src.transform))
    show(rst, ax=ax[0, 0], cmap='viridis', title='b03 map')
    show_hist(rst, bins=50, lw=0.0, stacked=False, alpha=0.3, histtype='stepfilled', ax=ax[1, 0], title="b03 histogram")
    
# Add band4 map and histogram, OK
with rasterio.open(b04_path) as src:
    rst = src.read(1, window=from_bounds(385480, 6671940, 387480, 6673940, src.transform))
    show(rst, ax=ax[0, 1], cmap='viridis', title='b04 map')
    show_hist(rst, bins=50, lw=0.0, stacked=False, alpha=0.3, histtype='stepfilled', ax=ax[1, 1], title="b04 histogram")    
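`from_bounds` converts map coordinates into a pixel window using the raster's transform, so only that part of the file is read. For a north-up raster with square pixels, the underlying arithmetic can be sketched as below. The helper name and the raster origin are hypothetical and not read from the FMI files.

```python
def window_from_bounds(left, bottom, right, top, origin_x, origin_y, pixel_size):
    """Return (col_off, row_off, width, height) in pixels for a north-up raster."""
    col_off = (left - origin_x) / pixel_size
    row_off = (origin_y - top) / pixel_size   # rows count downwards from the top edge
    width = (right - left) / pixel_size
    height = (top - bottom) / pixel_size
    return col_off, row_off, width, height


# Hypothetical raster origin (upper-left corner) with 20 m pixels
window = window_from_bounds(
    385480, 6671940, 387480, 6673940,
    origin_x=50000, origin_y=7800000, pixel_size=20,
)
print(window)
```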