# Introduction

## SpatioTemporal Asset Catalogs (STAC)

This lab will demonstrate how to search for and download geospatial data in the cloud. 

It will introduce <a href="https://stacspec.org/en/" target="_blank">SpatioTemporal Asset Catalogs (STAC)</a>, a specification that makes it easy to query and search through large collections of geospatial data assets stored in the cloud. 

You will also learn to use the <a href="https://pystac-client.readthedocs.io/en/stable/index.html" target="_blank">pystac_client</a> package which provides tools for working with STAC in Python.

First, let's briefly outline what the STAC specification is before completing some data querying and downloading tasks to make the concepts concrete.

**spatiotemporal asset:** a file comprising geospatial data for a location and point in time. For example, this could be a Landsat or Sentinel-2 satellite image stored in the cloud, such as in Microsoft Azure or Amazon Web Services. It is a file that we can download and use in our analysis and applications. However, if you look at <a href="https://planetarycomputer.microsoft.com/catalog" target="_blank">Microsoft's Planetary Computer Data Catalog</a>, <a href="https://aws.amazon.com/marketplace/search/results?trk=868d8747-614e-4d4d-9fb6-fd5ac02947a8&sc_channel=el&FULFILLMENT_OPTION_TYPE=DATA_EXCHANGE&CONTRACT_TYPE=OPEN_DATA_LICENSES&filters=FULFILLMENT_OPTION_TYPE%2CCONTRACT_TYPE" target="_blank">Amazon Web Services Open Data</a>, or the <a href="https://explorer.sandbox.dea.ga.gov.au/stac/" target="_blank">Digital Earth Australia Open Data Cube</a> you will see there are lots of spatiotemporal assets available (for free). The challenge is searching through these collections of assets to find the data you need and downloading it. The STAC specification provides a solution for this. 

The STAC specification comprises:

* **STAC Item** - a GeoJSON feature that represents a spatiotemporal asset with links to the spatiotemporal asset and additional metadata fields (e.g. bounding box, thumbnail, datetime, cloud cover).
* **STAC Catalog** - a JSON file of links to STAC Items to support querying and retrieving STAC Items. STAC Catalogs can comprise sub-catalogs that group together related data within a larger structure. For example, Microsoft's Planetary Computer might create a STAC Catalog for all of its spatiotemporal assets and organise these assets in sub-catalogs (e.g. a catalog for Landsat 7, Landsat 8, Sentinel-2, SRTM DEM etc.).
* **STAC Collection** - an extension of a STAC Catalog with additional metadata properties (e.g. extents, licences, providers) to describe STAC Items within the collection. 
* **STAC API** - an API that allows clients to query a STAC collection, search for STAC Items, and retrieve their links for downloading. The search endpoint is designed to receive queries of STAC Catalogs that filter on location, date, and time as well as other fields. It returns a GeoJSON FeatureCollection object of STAC Items that meet the search criteria. 
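
To make the structure of a STAC Item concrete, here is a minimal sketch of one written as a Python dict. The id, coordinates, cloud cover value, and `href` are all made up for illustration; real Items carry many more properties and links.

```python
import json

# A hypothetical, heavily trimmed STAC Item; the id, coordinates, and href are made up.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-s2-scene",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [116.0, -32.0], [117.0, -32.0], [117.0, -31.0], [116.0, -31.0], [116.0, -32.0],
        ]],
    },
    "bbox": [116.0, -32.0, 117.0, -31.0],
    "properties": {
        "datetime": "2019-10-15T02:00:00Z",
        "eo:cloud_cover": 3.2,
    },
    "assets": {
        "B04": {
            "href": "https://example.com/example-s2-scene/B04.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        },
    },
    "links": [],
}

# An Item is plain JSON, so it round-trips through the json module unchanged.
item_roundtrip = json.loads(json.dumps(item))
print(sorted(item_roundtrip.keys()))
print(item_roundtrip["assets"]["B04"]["href"])
```

The `geometry`, `bbox`, and `properties` fields are what a search filters on; the `assets` field is where the links to the actual data files live.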

### Task

We're going to use the pystac_client package to query a range of STAC Catalogs hosted in the cloud. We'll complete the following tasks:

* Find the least cloudy Sentinel-2 image for a field in Western Australia using the Microsoft Planetary Computer.
* Find the least cloudy Sentinel-2 image for a field in Western Australia using Amazon Web Services.

### Tips

Here are some tips for working with STAC.

* use rectangular bounding boxes or area-of-interest geometries to quickly identify STAC Items that intersect with their extent.
* for exploratory work use small areas-of-interest to minimise the size of searches of STAC Collections and the amount of data transmitted over the network. 
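
As a rough illustration of why rectangular extents are quick to test, the sketch below checks whether two bounding boxes intersect using only four comparisons. The function name and coordinates are hypothetical, and the check ignores boxes that cross the antimeridian.

```python
def bboxes_intersect(a, b):
    """Return True if two [min_x, min_y, max_x, max_y] boxes overlap or touch."""
    a_min_x, a_min_y, a_max_x, a_max_y = a
    b_min_x, b_min_y, b_max_x, b_max_y = b
    return (
        a_min_x <= b_max_x and b_min_x <= a_max_x
        and a_min_y <= b_max_y and b_min_y <= a_max_y
    )


# Hypothetical extents in longitude/latitude.
small_aoi = [116.50, -31.60, 116.52, -31.58]
item_footprint = [116.0, -32.0, 117.0, -31.0]
far_away = [150.0, -35.0, 151.0, -34.0]

print(bboxes_intersect(small_aoi, item_footprint))
print(bboxes_intersect(small_aoi, far_away))
```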

### Useful links

* <a href="https://stacspec.org/en" target="_blank">STAC website</a>: the STAC homepage with details about STAC, tutorials, and links to STAC catalogs.
* <a href="https://radiantearth.github.io/stac-browser/#/" target="_blank">STAC Browser</a>: a web browser to search for STAC catalogs.
* <a href="https://stacindex.org/" target="_blank">STAC Index</a>: an index of STAC catalogs and tutorials.
* <a href="https://planetarycomputer.microsoft.com/catalog" target="_blank">Microsoft Planetary Computer Catalog</a>: Microsoft Planetary Computer's STAC catalogs.


## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab be aware that your sessions are temporary and you'll need to take care to save, backup, and download your work.**

<a href="https://colab.research.google.com/github/data-analysis-3300-3003/colab/blob/main/lab-6-self-guided.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet. 

In [None]:
import os

if "week-6" not in os.listdir(os.getcwd()):
    os.system('wget "https://github.com/data-analysis-3300-3003/data/raw/main/data/week-6.zip"')
    os.system('unzip "week-6.zip"')

### Working in Colab

If you're working in Google Colab, you'll need to install the required packages that don't come with the colab environment.

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install geopandas
    !pip install pyarrow
    !pip install mapclassify
    !pip install rasterio
    !pip install planetary-computer
    !pip install pystac-client


### Import modules

In [None]:
import os
import json
import geopandas as gpd
import pandas as pd
import numpy as np
import pystac_client
import planetary_computer as pc
import plotly.express as px
import plotly.io as pio
import rasterio
from rasterio import windows
from rasterio import features
from rasterio import warp
from skimage import io

from pystac.extensions.eo import EOExtension as eo

# setup renderer
if 'google.colab' in str(get_ipython()):
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "jupyterlab"

## Sentinel-2 and Microsoft Planetary Computer

To provide an introduction to the STAC specification and using it to search for spatiotemporal assets, we'll use it to query Microsoft's Planetary Computer to find a cloud free Sentinel-2 satellite image for a field in Western Australia. 

We'll be using the <a href="https://pystac-client.readthedocs.io/en/stable/" target="_blank">pystac_client</a> package which is a STAC Python Client providing classes for working with STAC Catalogs and APIs.

First, we need to create a `pystac_client.Client` object with methods and attributes to interact with a given STAC Catalog. Using the `pystac_client.Client.open()` method we can open a STAC Catalog or API and read the root catalog. 

The `pystac_client.Client.open()` method requires a `url` which points to the STAC Catalog or API. The `url` for the Microsoft Planetary Computer STAC API is `"https://planetarycomputer.microsoft.com/api/stac/v1"`.  

In [None]:
# open a connection to the Microsoft Planetary Computer's root STAC catalog
pc_catalog = pystac_client.Client.open(
    url="https://planetarycomputer.microsoft.com/api/stac/v1",
    # modifier=pc.sign_inplace
)

A `pystac_client.Client` object has a `search()` method that can be used to specify a query to search a STAC Collection for STAC Items that meet certain conditions. The `search()` method has the following parameters that can be used to define the scope of the query:

* `max_items` - maximum number of items to return from the search. 
* `bbox` - a list or tuple of bounding box coordinates. STAC Items that intersect the bounding box will be returned. 
* `intersects` - a str or dict representation of a GeoJSON geometry or Shapely `geometry`. STAC Items that intersect the geometry will be returned. 
* `datetime` - a single datetime or datetime range used to filter STAC Items. 
* `query` - list of JSON or query parameters using the STAC API query extension. 

You can see the full details for the `search()` method <a href="https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.search" target="_blank">here</a>.

#### Area of interest

Before we can `search()` the Planetary Computer STAC Catalog we need to create the geographic extent for our query. 

We're going to start by reading in a geometry for the field boundary stored in a shapefile. We need to convert the shapefile to one of:

* bounding box coordinates
* a GeoJSON geometry
* a Shapely `geometry`

We'll demonstrate how to do each of these conversions for your reference. 

First, let's read the data from file. Then, we'll compute the <a href="https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.envelope.html" target="_blank">`envelope`</a> of the field's geometry. The envelope is the smallest rectangular geometry that covers the field's geometry. It is often beneficial to pass in simpler geometries rather than more complex shapes when identifying STAC Items that intersect with an area-of-interest.

In [None]:
# load field boundary from shapefile
data_path = os.path.join(os.getcwd(), "week-6", "BF66_bdy.shp")
aoi = gpd.read_file(data_path)

# add the field boundary to a map object
m = aoi.explore()
aoi_env = aoi["geometry"].envelope
# draw envelope in red
aoi_env.explore(m=m, color="red", style_kwds={"fillOpacity": 0})

A `GeoSeries` is a sequence of Shapely `geometry` objects. Thus, we can just extract the first and only element of the `aoi_env` `GeoSeries` to obtain a Shapely `geometry`.

In [None]:
# get Shapely geometry object
aoi_shapely = aoi_env[0]
print(aoi_shapely)

The process to obtain a GeoJSON str or dict representation of the envelope is more involved. First, we use the GeoPandas `to_json()` method to convert the `GeoSeries` to a GeoJSON FeatureCollection in str format. 

Then, we use the `json.loads()` function to parse the JSON string data to a Python dict. 

Finally, we can subset the `geometry` property out of the dict.

In [None]:
aoi_json = json.loads(aoi_env.to_json())
print("AOI Envelope as GeoJSON FeatureCollection")
print("")
print(aoi_json)
aoi_geometry = dict(aoi_json["features"][0])["geometry"]
print("")
print("AOI Envelope as GeoJSON Geometry")
print("")
print(aoi_geometry)

Finally, it is simple to obtain a list of coordinates for the bounding box by using the `total_bounds` property of the `GeoSeries` and converting it to a list object.

See the GeoPandas <a href="https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.total_bounds.html" target="_blank">`total_bounds` docs</a>.

In [None]:
bbox = aoi_env.total_bounds.tolist()
bbox
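
To show how the representations relate, the sketch below derives the same kind of `[min_x, min_y, max_x, max_y]` list directly from a GeoJSON Polygon dict using plain Python. The helper name and the coordinates are hypothetical, and it only looks at the exterior ring of a single Polygon.

```python
def geojson_polygon_bounds(geometry):
    """Return [min_x, min_y, max_x, max_y] for a GeoJSON Polygon dict."""
    # The first ring of a Polygon is its exterior boundary.
    exterior_ring = geometry["coordinates"][0]
    xs = [point[0] for point in exterior_ring]
    ys = [point[1] for point in exterior_ring]
    return [min(xs), min(ys), max(xs), max(ys)]


# Hypothetical rectangular envelope in longitude/latitude.
example_geometry = {
    "type": "Polygon",
    "coordinates": [[
        [116.50, -31.60],
        [116.52, -31.60],
        [116.52, -31.58],
        [116.50, -31.58],
        [116.50, -31.60],
    ]],
}

example_bbox = geojson_polygon_bounds(example_geometry)
print(example_bbox)
```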

#### Datetime

Let's specify a datetime range to search. Here, we'll look for all Sentinel-2 STAC Items that intersect our area-of-interest for the month of October 2019. 

In [None]:
time_of_interest = "2019-10-01/2019-11-01"
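
If the end date of a range is treated as inclusive, the range above also takes in 1 November. As a sketch of an alternative, the hypothetical helper below builds a `start/end` string that stops on the last day of a calendar month.

```python
import calendar


def month_interval(year, month):
    """Build a 'YYYY-MM-DD/YYYY-MM-DD' string covering one calendar month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-01/{year:04d}-{month:02d}-{last_day:02d}"


october_2019 = month_interval(2019, 10)
print(october_2019)
```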

#### Extensions

The STAC specification permits extensions which allow for more detailed descriptions of STAC Items in a collection. A commonly used extension is the <a href="https://github.com/stac-extensions/eo" target="_blank">`Electro-Optical Extension Specification`</a> for describing snapshots of the Earth for a point-in-time and designed for data that's captured for one or more wavelengths of the electromagnetic spectrum (i.e. remote sensing data).

It includes the following item properties:

* `eo:bands`: an array of available bands (i.e. different spectral wavebands for a remote sensing image).
* `eo:cloud_cover`: an estimate of cloud cover for the STAC Item.
* `eo:snow_cover`: an estimate of snow and ice cover for the STAC Item.

The `eo:cloud_cover` property could be useful to help with searching a STAC Collection for cloud free scenes.

We can set up a query of `eo` properties as: `{"eo:cloud_cover": {"lt": 10}}`. This will find all STAC Items with a property of `eo:cloud_cover` less than 10%.
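
To illustrate what such a query means, the sketch below evaluates a query dict against an Item's properties in plain Python. It is a simplified local imitation, not how a STAC API implements filtering; the function name and property values are hypothetical, and only a handful of comparison operators are covered.

```python
import operator

# A subset of comparison operators, mapped to Python's own comparisons.
OPERATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}


def matches_query(properties, query):
    """Return True if an Item's properties satisfy every condition in the query."""
    for field, conditions in query.items():
        if field not in properties:
            return False
        for op_name, threshold in conditions.items():
            if not OPERATORS[op_name](properties[field], threshold):
                return False
    return True


query = {"eo:cloud_cover": {"lt": 10}}
print(matches_query({"eo:cloud_cover": 3.2}, query))
print(matches_query({"eo:cloud_cover": 42.0}, query))
```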

#### Search

We're now ready to search the Planetary Computer STAC Catalog's `sentinel-2-l2a` collection for all images with low cloud cover in October 2019 that intersect our area-of-interest. 

The `s2_search` object is an `ItemSearch` instance which represents the search of a STAC API. We can retrieve the STAC Items returned by the search as an `ItemCollection` using the `item_collection()` method.

We can print the `ItemCollection` and interactively explore its contents. This helpfully illustrates the structure of the STAC specification. Our search of the `sentinel-2-l2a` collection returned 2 STAC Items. Each STAC Item corresponds to a Sentinel-2 image.

We can explore each of the STAC Items and see that it has several metadata properties (e.g. Bounding Box, Datetime, platform, proj:epsg, eo:cloud_cover). It also has an Assets slot which stores links to the underlying data referenced by the STAC Item. In this case, these are cloud-optimised GeoTIFF files stored in Microsoft Azure. 

In [None]:
# Search the Planetary Computer's S2 Catalog
s2_search = pc_catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)

# Check how many items were returned
s2_items = s2_search.item_collection()
print(f"Returned {len(s2_items)} Items")

In [None]:
s2_items

### Download data

Now we've completed a search of the STAC API and identified that there are two Sentinel-2 images that meet our search criteria, we're in a position to download these images and use their data. 

As these are optical images of the Earth's surface, we'd like to use the least cloudy image.  We can write a small routine to find the STAC Item with the lowest eo:cloud_cover value and download that item. 

We imported the `EOExtension` class as `eo` at the start of the notebook. We can call the `eo.ext()` method on a STAC Item to extend it with properties from the `eo` extension. This allows us to get the `eo` item properties such as `cloud_cover` easily. 

Let's loop over all the STAC Items in our search, retrieve their `eo:cloud_cover` value, and append that value to a list. 

In [None]:
# empty list
cloud_cover = []
for i in s2_items:
    cloud_cover.append(eo.ext(i).cloud_cover)

Next, we'll find the minimum cloud cover value and that STAC Item's position in our `ItemCollection` `s2_items`. 

In [None]:
min_cloud_cover = min(cloud_cover)
min_cloud_cover_idx = cloud_cover.index(min_cloud_cover)
print(f"The STAC Item with lowest cloud cover had {min_cloud_cover}% cloud cover")
print(f"The index position of the STAC Item with lowest cloud cover in our ItemCollection is {min_cloud_cover_idx}")

Let's subset the STAC Item with the lowest cloud cover from our `ItemCollection`. This should give us a single STAC Item which we can inspect. 

In [None]:
least_cloudy_s2 = s2_items[min_cloud_cover_idx]
least_cloudy_s2
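
The list, minimum, and index steps above can also be collapsed into a single call to `min()` with a `key` function. The sketch below uses hypothetical stand-in objects that mimic only the `id` and `properties` attributes of a STAC Item, so it runs without a search; it reads the `eo:cloud_cover` value straight from the properties dict rather than through the `eo` extension.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for STAC Items: only `id` and `properties` are mimicked.
example_items = [
    SimpleNamespace(id="scene-a", properties={"eo:cloud_cover": 7.5}),
    SimpleNamespace(id="scene-b", properties={"eo:cloud_cover": 0.4}),
    SimpleNamespace(id="scene-c", properties={"eo:cloud_cover": 3.1}),
]


def least_cloudy(items):
    """Return the item with the smallest eo:cloud_cover property."""
    return min(items, key=lambda item: item.properties["eo:cloud_cover"])


best = least_cloudy(example_items)
print(best.id, best.properties["eo:cloud_cover"])
```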

Now we've identified the STAC Item with the lowest cloud cover, we need to download it. This is where we head to the Assets property of the STAC Item where we see a series of `href` properties with hyperlinks to where that data is physically stored (here, this is in Azure Blob Storage as cloud-optimised GeoTIFF files). 

We can print out the list of Assets associated with the STAC Item.

In [None]:
# print assets properties of STAC Item
least_cloudy_s2.assets.keys()

In [None]:
# let's look at the property for B02 - blue band reflectance
least_cloudy_s2.assets["B02"]

Let's download the red band data to a NumPy `ndarray`. The `href` points to a cloud-optimised GeoTIFF (COG) file stored in Azure Blob Storage (i.e. in the cloud). A COG file is similar to a regular GeoTIFF file, but its internal layout lets a client use HTTP range requests to retrieve only the portions of data that correspond to a geographic extent and a particular zoom level. 

Planet (a commercial CubeSat company that makes use of STAC and GeoTIFFs in its products) has a blog post, "An Introduction to Cloud Optimized GeoTIFFS (COGs) Part 1: Overview", that introduces COGs.
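
A back-of-the-envelope calculation gives a feel for the saving from reading part of a file. It assumes a full 10 m Sentinel-2 band is 10980 x 10980 pixels stored as 16-bit integers, and that the field fits inside a hypothetical 100 x 100 pixel window; both figures ignore compression.

```python
# Assumptions: 10980 x 10980 pixels per 10 m band, 2 bytes per pixel,
# and a hypothetical 100 x 100 pixel window around the field.
BYTES_PER_PIXEL = 2
full_tile_bytes = 10980 * 10980 * BYTES_PER_PIXEL
window_bytes = 100 * 100 * BYTES_PER_PIXEL

print(f"Full band: {full_tile_bytes / 1e6:.1f} MB uncompressed")
print(f"Window:    {window_bytes / 1e3:.1f} kB uncompressed")
print(f"Ratio:     {full_tile_bytes / window_bytes:.0f}x")
```

In practice a reader fetches whole internal tiles of the COG, so the real transfer is larger than the window itself but still far smaller than the whole file.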

### Recap quiz

<details>
    <summary><b>Why do these features of a cloud-optimised GeoTIFF make them more suited to working with big geospatial datasets than regular GeoTIFF files?</b></summary>
As geospatial datasets increase in size (e.g. satellites capturing data with ever finer spatial resolutions and with a higher cadence) the amount of data we'd need to store and read into memory increases. This might exceed our computer's capacity or result in long runtimes for our program. COGs allow us to just read the data that corresponds to our area-of-interest and not the entire file. This means we can make use of the larger storage capacity of cloud providers and just retrieve the data we need. 
</details>

<p></p>

To download the data for the red band we need to get its link or `href`.

In [None]:
least_cloudy_red_href = least_cloudy_s2.assets["B04"].href
least_cloudy_red_href

#### Signing links

To download data from the Planetary Computer the link needs to be "signed". This allows Microsoft to manage traffic and use of the Planetary Computer's resources in the cloud. 

The `planetary_computer` package was imported as `pc` and has a `sign()` function we can use to sign links. You should see that a token has been appended to the link to the COG, which shows it has been signed. 

In [None]:
least_cloudy_red_href = pc.sign(least_cloudy_red_href)
least_cloudy_red_href
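
To see which part of a signed link is the file location and which part is the appended token, the link can be split with the standard library. This sketch assumes the token is appended as a URL query string, as the printed link suggests; the host, path, and token values below are placeholders, not a real signed link.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical signed link: the host, path, and token values are placeholders.
signed_href = (
    "https://example.blob.core.windows.net/container/B04.tif"
    "?se=2030-01-01T00%3A00%3A00Z&sig=PLACEHOLDER"
)

parts = urlsplit(signed_href)
# Everything before the "?" is the location of the file itself.
unsigned_href = f"{parts.scheme}://{parts.netloc}{parts.path}"
# Everything after the "?" is the appended token, as key/value pairs.
token_fields = parse_qs(parts.query)

print(unsigned_href)
print(sorted(token_fields))
```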

#### Download COG data

We can read data from COG files in the cloud using rasterio in a similar way to how we've been reading local GeoTIFF files on our machine. 

We use the `rasterio.open()` function to open a file connection to the COG in the cloud and use the connection object's `read()` method to read data from the COG in the cloud to a NumPy `ndarray` on our machine. 

However, to read in a subset of the data we use the `window` argument of `read()` and pass in a `Window` object. 

A `Window` object is a rectangular subset of a raster defined as `Window(column_offset, row_offset, width, height)`. 

The rasterio.windows module has a <a href="https://rasterio.readthedocs.io/en/latest/api/rasterio.windows.html#rasterio.windows.from_bounds" target="_blank">`from_bounds()`</a> function which converts bounding coordinates to a `Window` object. 

rasterio has a `features` module with a `bounds()` function which takes in a GeoJSON geometry or Shapely `geometry` and returns a (left, bottom, right, top) bounding box which we can use to create a `Window`. 

However, our GeoJSON geometry will likely be in the EPSG:4326 (geographic) coordinate system, which could be different from the coordinate reference system of the COG we're trying to read data from. This requires us to use rasterio's `warp.transform_bounds()` function to transform our bounding box coordinates to the coordinate reference system of the COG data. 

To summarise this process: 

1. use `features.bounds()` to convert GeoJSON or Shapely `geometry` to a bounding box.
2. use `warp.transform_bounds()` to convert the bounding box to the CRS of the COG data.
3. use `windows.from_bounds()` to convert the reprojected bounding box to a `Window` object.
4. pass the `Window` object to `read()` to read only data from the COG within the `Window`. 
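
To make step 3 less of a black box, the sketch below works out pixel offsets and sizes by hand for a north-up raster with square pixels. It is a simplified illustration of the arithmetic, not rasterio's implementation; the function name, origin, and bounds are hypothetical.

```python
def window_from_bounds(left, bottom, right, top, x_origin, y_origin, pixel_size):
    """Return (col_off, row_off, width, height) in pixels for a north-up raster.

    x_origin and y_origin are the coordinates of the raster's top-left corner.
    """
    col_off = (left - x_origin) / pixel_size
    # Rows count downwards from the top edge, so measure from y_origin to top.
    row_off = (y_origin - top) / pixel_size
    width = (right - left) / pixel_size
    height = (top - bottom) / pixel_size
    return col_off, row_off, width, height


# Hypothetical raster: top-left corner at (500000, 6500000) in a projected CRS, 10 m pixels.
window = window_from_bounds(
    left=501000, bottom=6498000, right=501500, top=6498400,
    x_origin=500000, y_origin=6500000, pixel_size=10,
)
print(window)
```

The bounds must already be in the raster's coordinate reference system, which is why the reprojection in step 2 comes first.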


In [None]:
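# open a connection to the COG using its signed link
with rasterio.open(least_cloudy_red_href) as ds:
    # bounding box of the AOI envelope (left, bottom, right, top) in EPSG:4326
    aoi_bounds = features.bounds(aoi_shapely)
    # reproject the bounding box to the CRS of the COG
    warped_aoi_bounds = warp.transform_bounds("epsg:4326", ds.crs, *aoi_bounds)
    # convert the reprojected bounding box to a Window
    aoi_window = windows.from_bounds(*warped_aoi_bounds, transform=ds.transform)
    meta = ds.meta
    # read only the data within the Window
    band_data = ds.read(1, window=aoi_window)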

In [None]:
# check the meta object for the COG metadata
meta

In [None]:
# let's visualise the data to check it looks ok
px.imshow(band_data, color_continuous_scale="Reds")

### Recap quiz

**Can you download and visualise near infrared reflectance from the same STAC Item? Near infrared reflectance is band 8.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
least_cloudy_nir_href = least_cloudy_s2.assets["B08"].href
least_cloudy_nir_href = pc.sign(least_cloudy_nir_href)

with rasterio.open(least_cloudy_nir_href) as ds:
    aoi_bounds = features.bounds(aoi_shapely)
    warped_aoi_bounds = warp.transform_bounds("epsg:4326", ds.crs, *aoi_bounds)
    aoi_window = windows.from_bounds(*warped_aoi_bounds, transform=ds.transform)
    meta = ds.meta
    band_data = ds.read(1, window=aoi_window)
    
px.imshow(band_data, color_continuous_scale="viridis")
```
</details>

## Sentinel-2 and Amazon Web Services

One of the advantages of the STAC specification is that it's a common format for describing spatiotemporal assets. This means we can repeat our workflow to retrieve data from other locations (e.g. other cloud providers). Let's demonstrate this by downloading Sentinel-2 data for the same location and datetime from Amazon Web Services instead.
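
One way to picture this reuse is a small function that takes any client and collection name and returns the matching Items ordered by cloud cover. The sketch below is hypothetical: `search_sorted_by_cloud` and `FakeClient` are illustrative names, and the fake client mimics only `client.search(...).item_collection()` so the example runs offline.

```python
from types import SimpleNamespace


def search_sorted_by_cloud(client, collection, bbox, datetime_range, max_cloud=10):
    """Search one collection and return its Items sorted from least to most cloudy."""
    search = client.search(
        collections=[collection],
        bbox=bbox,
        datetime=datetime_range,
        query={"eo:cloud_cover": {"lt": max_cloud}},
    )
    return sorted(
        search.item_collection(),
        key=lambda item: item.properties["eo:cloud_cover"],
    )


class FakeClient:
    """Offline stand-in that mimics only client.search(...).item_collection()."""

    def search(self, **kwargs):
        # Remember the search arguments so they can be inspected afterwards.
        self.last_search = kwargs
        items = [
            SimpleNamespace(id="scene-a", properties={"eo:cloud_cover": 6.0}),
            SimpleNamespace(id="scene-b", properties={"eo:cloud_cover": 1.5}),
        ]
        return SimpleNamespace(item_collection=lambda: items)


fake = FakeClient()
ranked = search_sorted_by_cloud(
    fake, "example-collection", [116.0, -32.0, 117.0, -31.0], "2019-10-01/2019-11-01"
)
print([item.id for item in ranked])
```

With a real `pystac_client.Client`, the same function could be given the Planetary Computer client with `"sentinel-2-l2a"`, or a client opened on another provider's STAC API with that provider's collection name.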

Free Sentinel-2 cloud-optimised GeoTIFFs can be found on AWS <a href="https://registry.opendata.aws/sentinel-2-l2a-cogs/" target="_blank">here</a> and the URL for the STAC API is `"https://earth-search.aws.element84.com/v0"`.

First, we create a `pystac_client.Client` object and open the STAC API. 

In [None]:
aws_catalog = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v0"
)

A `pystac_client.Client` object has a `get_collections()` method which lists the STAC Collections within the STAC Catalog. Let's use the `aws_catalog`'s `get_collections()` method to list the STAC Collections in AWS's Earth Search. 

In [None]:
# search a catalog by listing its collections
collections = list(aws_catalog.get_collections())

print(f"Number of collections: {len(collections)}")
print("Collections IDs:")
for collection in collections:
    print(f"- {collection.id}")

We're after the `sentinel-s2-l2a-cogs` collection. However, this is an example of how you can query a root catalog to find out which sub-catalogs it contains and which of them could be useful for your analysis. For example, we can also see there is some Landsat 8 data available on AWS. 

Let's create a search of the AWS STAC Catalog using the `sentinel-s2-l2a-cogs` collection.

In [None]:
aws_search = aws_catalog.search(
    collections=["sentinel-s2-l2a-cogs"],
    bbox=bbox,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)

# Check how many items were returned
aws_items = aws_search.item_collection()
print(f"Returned {len(aws_items)} Items")

In [None]:
aws_items

Inspecting the `ItemCollection` from AWS, you'll notice that its STAC Items are organised in a similar way to those returned from Microsoft's Planetary Computer. One of the Assets is a Thumbnail which is a preview image in PNG format. This is useful if we want to visually inspect the satellite image without needing to download all the raw data. Let's demonstrate how to download and visualise the Thumbnail image. 

We can use `io` from the scikit-image package to read a PNG file.

Here, we'll get the first STAC Item from `aws_items` and access its Assets to get the link to the Thumbnail PNG file. Note, this thumbnail is for the Sentinel-2 tile and not just the area-of-interest for our field. 

In [None]:
img = io.imread(aws_items[0].assets["thumbnail"].href)
px.imshow(img)