# Example: how to use the CSC STAC API, xarray and Dask

This example shows how to use the [STAC](https://stacspec.org/en/about/) (SpatioTemporal Asset Catalog) API, [xarray](https://docs.xarray.dev/en/stable/) and [Dask](https://www.dask.org/) to process large raster datasets, with good support for time series. The main idea is to first define the search and the processing as a process graph. Downloading and processing are done lazily at the end, so that only the needed data (sufficiently cloud-free images, only the needed bands and area) is downloaded. The libraries take care of the data download, so you do not need to know any file paths. These tools work best when the data is provided as [Cloud-Optimized GeoTIFFs](https://www.cogeo.org/) (COGs).

To try out this example, it is recommended to start an interactive [Jupyter session](https://docs.csc.fi/computing/webinterface/jupyter/) in the [Puhti web interface](https://docs.csc.fi/computing/webinterface/), for example with 1 core and 8 GB of memory.

Dask is used for parallelizing the computation. See the [CSC Dask tutorial](https://docs.csc.fi/support/tutorials/dask-python/), which includes how to use Dask with Jupyter in the Puhti web interface and how to create batch jobs with Dask.

We'll search for two months of Sentinel-2 data overlapping central Helsinki, filter out cloudy scenes based on their metadata, and then create a median composite for each month.

The main steps:
* Start the Dask cluster
* Query the STAC catalogue to find Sentinel-2 L2A images from the area and time of interest with small cloud coverage
* Create the first datacube, defining the required bands and bounding box
* Mosaic the images with the median value for each month
* Finally, compute the result
* Close the Dask cluster

The [CSC STAC catalogue](http://86.50.229.158:8080/geoserver/ogc/stac) is currently in a testing phase, and the URL will change. At the moment one collection, `sentinel2_full_test`, is available, but the plan is to add other collections soon.

This example works with the [geoconda module](https://docs.csc.fi/apps/geoconda/) in Puhti; the required libraries can be seen from the imports.

The example is mostly based on the [stackstac documentation](https://stackstac.readthedocs.io/en/latest/basic.html), and the plotting on https://stacspec.org/en/tutorials/access-sentinel-2-data-aws

In [None]:
import stackstac
from dask.distributed import Client
import pystac_client
import pyproj
from pystac import Catalog, Collection
import geopandas as gpd
import json

Start the Dask cluster.

To follow how Dask works, open the [Dask Dashboard or JupyterLab Dask Extension](https://docs.csc.fi/support/tutorials/dask-python/#dask-with-jupyter).

In [None]:
# Dask is not started by default, but it can be started here.
# Make sure you have reserved several cores first.
# client = Client()
# client

Define the center of the area of interest, in this case Helsinki.

In [None]:
lon, lat = 24.945, 60.173  # Helsinki
# lon, lat = 25.6, 65.1

To see the API calls made by pystac_client, set its logger to DEBUG by uncommenting the lines below.

In [None]:
# import logging
# logging.basicConfig()
# logger = logging.getLogger('pystac_client')
# logger.setLevel(logging.DEBUG)

## Example of working STAC

**ToDo: Look for other working FastAPI and GeoServer services, to have a better comparison.**

Search the STAC API using [pystac-client](https://pystac-client.readthedocs.io/). First, define the STAC catalog endpoint.

In [None]:
URL = "http://86.50.229.158:8080/geoserver/ogc/stac"
catalog = pystac_client.Client.open(URL)
catalog

In [None]:
print(f"ID: {catalog.id}")
print(f"Title: {catalog.title or 'N/A'}")
print(f"Description: {catalog.description or 'N/A'}")
print(f"Links: {catalog.links or 'N/A'}")

Find out which collections are available.

In [None]:
for collection in catalog.get_collections():
    print(collection.id)

In [None]:
collection = catalog.get_collection('sentinel2_full_test')
collection

### Search

STAC provides two different search options:

* Basic search, available criteria: collection, location and time.
* Advanced search with a filter: the basic search criteria plus other attributes provided by STAC. In CSC STAC, Sentinel data has information about cloud coverage.

#### Basic search

Search with a point as location

In [None]:
%%time
search = catalog.search(
    intersects=dict(type="Point", coordinates=[lon, lat]),
    collections=["sentinel2_full_test"],
    datetime="2021-08-01/2021-09-30"
)
print(f"Found items: {search.matched()}")

Search with a bbox as location

In [None]:

%%time
search_bbox = catalog.search(
    bbox=[23.0,60.5,26.0,64.0],
    collections=["sentinel2_full_test"],
    datetime="2021-08-01/2021-08-15"
)
print(f"Found items: {search_bbox.matched()}")

#### Search with filter 

Same as above, but with the cloud coverage criterion added.

In [None]:
params = {
    "intersects": {"type": "Point", "coordinates": [lon, lat]},
    "collections": "sentinel2_full_test",
    "datetime": "2021-08-01/2021-09-30",
    "filter": {
        "op": "<",
        "args": [{"property": "eo:cloud_cover"}, 20]
    }
}

search_filter = catalog.search(**params)
print(f"Found items: {search_filter.matched()}")
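
The `filter` value above is a CQL2 JSON expression. As a minimal sketch, the hypothetical helpers below (`comparison` and `all_of`, not part of pystac-client) build such expressions as plain dictionaries using only the standard library. The upper limit of 20 is taken from the cell above. The lower limit of 5 is made up for illustration, and whether the server accepts combined conditions is an assumption.

```python
import json


def comparison(op, prop, value):
    """Return a CQL2 JSON comparison, e.g. eo:cloud_cover < 20."""
    return {"op": op, "args": [{"property": prop}, value]}


def all_of(*conditions):
    """Combine several CQL2 JSON conditions with a logical AND."""
    return {"op": "and", "args": list(conditions)}


# Same condition as in the search above.
cloud_filter = comparison("<", "eo:cloud_cover", 20)

# Illustrative combination: cloud cover between 5 and 20 percent.
range_filter = all_of(
    comparison(">=", "eo:cloud_cover", 5),
    cloud_filter,
)

print(json.dumps(range_filter, indent=2))
```

A dictionary like `cloud_filter` can be used as the `filter` entry of `params`, as in the cell above.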

## ItemCollection

Get the ItemCollection of the search results. It includes metadata about the found scenes and links to their data. No actual data is downloaded yet. See how Jupyter displays the ItemCollection info.

In [None]:
item_collection = search_filter.item_collection()
item_collection

In [None]:
len(item_collection)

## Plotting search results

Use the results of the bounding box search above, which covers a larger area and a shorter time period, to plot the footprints of the found scenes.

In [None]:
stac_json = search_bbox.get_all_items_as_dict()

In [None]:
# Add Item ID to properties to have access to it in GeoPandas
for feature in stac_json['features']:
    feature['properties']['title'] = feature['id']

In [None]:
gdf = gpd.GeoDataFrame.from_features(stac_json, "epsg:4326")
print(f"Found items: {len(gdf)}")

In [None]:
gdf.head()

In [None]:
# GeoDataFrame.plot returns a matplotlib Axes object
ax = gdf.plot(
    edgecolor="black",
    alpha=0.05,
)
_ = ax.set_title("STAC Query Results")

To plot items on a zoomable map, see this example: https://stacspec.org/en/tutorials/access-sentinel-2-data-aws#Plot-STAC-Items-on-a-Map

## Retrieving data

Create an `xarray` datacube from the items. Using all the defaults, the data will be in its native coordinate reference system, at the finest resolution of all the assets. This is fast, because the actual data is not fetched yet. What does the datacube look like?

In [None]:
# Define a smaller bbox.
# Convert the lon-lat point to the data's UTM coordinate reference system,
# then use that to slice the `x` and `y` dimensions, which are indexed by
# their UTM coordinates.
x_utm, y_utm = pyproj.Proj("EPSG:32635")(lon, lat)
buffer = 10000  # meters
x_utm-buffer
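
The bounds passed to `stackstac.stack` in the next cell are a square around the projected point. As a small sketch of that arithmetic, the hypothetical helper `square_bounds` below builds the `(min_x, min_y, max_x, max_y)` tuple. The coordinates are made-up UTM values in meters, not the output of the `pyproj` call above.

```python
def square_bounds(x, y, buffer_m):
    """Return (min_x, min_y, max_x, max_y) for a square centred on (x, y)."""
    return (x - buffer_m, y - buffer_m, x + buffer_m, y + buffer_m)


# Hypothetical UTM coordinates in meters, for illustration only.
example_x, example_y = 385000.0, 6672000.0
bounds = square_bounds(example_x, example_y, 10000)

# A 10 km buffer on each side gives a 20 km x 20 km square.
width = bounds[2] - bounds[0]
height = bounds[3] - bounds[1]
print(bounds)
print(f"Extent: {width / 1000:.0f} km x {height / 1000:.0f} km")
```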

In [None]:
%%time
cube = stackstac.stack(
    items=item_collection,
    bounds=(x_utm-buffer, y_utm-buffer, x_utm+buffer, y_utm+buffer), 
    assets=["B04_60m", "B03_60m", "B02_60m"],
    epsg=32635
).squeeze() 
# When item_collection contains items in multiple EPSGs, the epsg value must be provided
cube

Use xarray's `resample` to create 1-month median composites.

In [None]:
monthly = cube.resample(time="MS").median("time", keep_attrs=True)
monthly
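
To illustrate what the monthly median composite means for a single pixel, here is a minimal standard-library sketch. It is not how xarray implements `resample`. The dates and reflectance values are made up. The idea is only that observations are grouped by the first day of their month and the median of each group becomes the composite value, which keeps a single cloudy outlier from dominating.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Made-up values of one pixel on different acquisition dates.
observations = [
    (date(2021, 8, 3), 0.12),
    (date(2021, 8, 13), 0.45),  # e.g. a bright, cloudy outlier
    (date(2021, 8, 23), 0.10),
    (date(2021, 9, 2), 0.11),
    (date(2021, 9, 12), 0.09),
    (date(2021, 9, 22), 0.30),
]

# Group the values by the first day of their month (month start).
by_month = defaultdict(list)
for day, value in observations:
    by_month[day.replace(day=1)].append(value)

# The median of each group is the composite value for that month.
monthly_median = {
    month: median(values) for month, values in sorted(by_month.items())
}

for month, value in monthly_median.items():
    print(month, value)
```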

So far no data has been downloaded, and nothing has been computed with actual data. The data size has shrunk to about 7 MB, which is what will actually be downloaded. In this example the final data size is very small, but Dask also handles much larger amounts of data well, including data that does not fit in memory.
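
The size of a dense datacube can be roughly cross-checked by hand. The sketch below is only an estimate under stated assumptions: a 20 km x 20 km area, 60 m pixels, 3 bands, 2 monthly time steps and 8-byte floating point values. The function `estimate_cube_bytes` is hypothetical, and the result only indicates the order of magnitude, because the exact pixel grid and the number of time steps depend on the data.

```python
import math


def estimate_cube_bytes(extent_m, resolution_m, bands, time_steps,
                        bytes_per_value=8):
    """Rough in-memory size of a dense (time, band, y, x) array."""
    pixels_per_side = math.ceil(extent_m / resolution_m)
    return pixels_per_side ** 2 * bands * time_steps * bytes_per_value


# Assumed values, see the text above.
size_bytes = estimate_cube_bytes(20_000, 60, bands=3, time_steps=2)
print(f"{size_bytes / 1e6:.1f} MB")
```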

To start the process, use `compute()`. The process can be followed from the Dask Dashboard or the JupyterLab Dask Extension.

In [None]:
# %%time
# data = monthly.compute()

Show the resulting images.

In [None]:
%%time
monthly.plot.imshow(row="time", rgb="band", robust=True, size=10);

### One item info

In [None]:
i = 0
item = item_collection[i]
print(f"{i}: {item}")
print(f"{i}: {item.bbox}")
print(f"{i}: {item.properties}")
print(f"Available assets: {item.assets.keys()}")
for key in item.assets.keys():
    print(f"{key}: {item.assets[key]}")

## Working with rasterio or other tools

It is also possible to use STAC when working with rasterio or other tools, but then the URLs must be retrieved manually.

In [None]:
import rasterio
from rasterio.plot import show

Select an item and an asset, and retrieve the URL.

In [None]:
url = item_collection[3].assets['B04_60m'].href
url
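
When URLs are needed for many items, they can also be collected from the dictionary form of the search results, such as the `stac_json` dictionary created earlier. In a STAC item, each asset has an `href` field. The sketch below uses a hypothetical helper `asset_hrefs` and made-up items with placeholder URLs, so it runs without a network connection.

```python
def asset_hrefs(feature_collection, asset_key):
    """Map item id to the href of one asset, skipping items that lack it."""
    hrefs = {}
    for feature in feature_collection["features"]:
        asset = feature.get("assets", {}).get(asset_key)
        if asset is not None:
            hrefs[feature["id"]] = asset["href"]
    return hrefs


# Made-up items with placeholder URLs, for illustration only.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "id": "item_a",
            "assets": {"B04_60m": {"href": "https://example.com/a_B04_60m.tif"}},
        },
        {
            "id": "item_b",
            "assets": {"B03_60m": {"href": "https://example.com/b_B03_60m.tif"}},
        },
    ],
}

print(asset_hrefs(sample, "B04_60m"))
```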

In [None]:
%%time
dataset = rasterio.open(url)
print(f"Transform: {dataset.transform}")
print(f"Shape: {dataset.shape}")
print(f"CRS: {dataset.crs}")

In [None]:
%%time
show(dataset.read(), transform=dataset.transform)

Use GDAL with the URL.

In [None]:
%%time
!gdalinfo {url}