# Download Landsat data using STAC API

## Landsat data

Landsat-5, 7, 8 and 9 [collection 2](https://www.usgs.gov/landsat-missions/landsat-collection-2) products are managed by USGS. USGS makes Landsat data available via a number of services, including:

- [Earth Explorer](https://earthexplorer.usgs.gov) - USGS data browser and viewer
- [Landsat Look](https://landsatlook.usgs.gov/stac-browser/collection02) - Landsat scene browser (STAC)
- [AWS OpenData](https://registry.opendata.aws/usgs-landsat) - Cloud-hosted data (STAC)
- [ESPA](https://espa.cr.usgs.gov) - USGS On-demand processing
- Google Earth Engine
- Microsoft Planetary Computer

#### Data source and documentation

[Surface Reflectance](https://www.usgs.gov/landsat-missions/landsat-collection-2-level-2-science-products), [Surface Temperature](https://www.usgs.gov/landsat-missions/landsat-collection-2-level-2-science-products) and [Level-1 (top of atmosphere)](https://www.usgs.gov/landsat-missions/landsat-collection-2-level-1-data) products for each of [Landsat-5, 7, 8 and 9](https://www.usgs.gov/landsat-missions/landsat-satellite-missions) are available.

## SpatioTemporal Asset Catalogs (STAC)
The STAC specification is a common language to describe geospatial information. A STAC API provides a search and selection interface to a catalog of items and files. See https://stacindex.org/catalogs#/ for a list of providers using STAC.

While the STAC specification allows for consistent searching and access to available files, how these files are used and interpreted can still be a challenge or at least specific to each custodian.
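
As a rough illustration of what a catalog search returns, the sketch below builds a minimal STAC Item by hand as a plain Python dictionary. The top-level keys follow the STAC Item layout, but every value here (ID, URLs, times) is made up for illustration and does not describe a real scene.

```python
# A hand-written, minimal STAC Item. All values are illustrative, not a real scene.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "collection": "example-collection",
    "bbox": [106.5, 20.5, 107.2, 21.1],  # west, south, east, north
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [106.5, 20.5], [107.2, 20.5], [107.2, 21.1], [106.5, 21.1], [106.5, 20.5],
        ]],
    },
    "properties": {"datetime": "2022-01-15T03:20:00Z", "platform": "LANDSAT_8"},
    "assets": {
        "red": {"href": "https://example.com/example-scene-001_red.tif"},
        "qa_pixel": {"href": "https://example.com/example-scene-001_qa_pixel.tif"},
    },
    "links": [],
}

# Where and when the scene was acquired
print(item["id"], item["bbox"], item["properties"]["datetime"])

# The files (assets) that belong to the scene
for name, asset in item["assets"].items():
    print(f"{name}: {asset['href']}")
```

The custodian-specific part is mostly in `properties` and in the asset names, which is why the band names and flags used later in this notebook are specific to the USGS catalog.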

This notebook demonstrates how to search, download, visualise and export Landsat and Sentinel-2 satellite imagery.
- USGS Landsat Level-1 and -2 products on AWS, https://registry.opendata.aws/usgs-landsat/
- Element-84 Sentinel-2 "sen2cor"-corrected surface reflectance on AWS, https://registry.opendata.aws/sentinel-2/

## Open Data Cube

The Open Data Cube (ODC) records product and scene information in a database and provides tools for transforming and aggregating scene data into geospatial Python `xarray` "cubes". In this context, the STAC API can replace some parts of the ODC database while providing the core geospatial information required for transforming and aggregating the scene data into cubes.

The ODC [odc-stac](https://github.com/opendatacube/odc-stac) and [odc-geo](https://github.com/opendatacube/odc-geo) packages provide the core functionality for reading, transforming and aggregating files. In particular, the odc-stac library takes a list of STAC items as input and reads these into an `xarray` cube compatible with ODC functions.

## More information

This notebook was adapted from https://github.com/opendatacube/odc-stac/tree/develop/notebooks.

In [None]:
# Minimal packages
import os, sys
from pystac_client import Client
from odc.stac import configure_s3_access, stac_load

# Python packages
import re
import json
import pandas as pd
import numpy as np
from pathlib import Path

# ODC packages
from dea_tools.plotting import display_map, rgb
from datacube.utils import masking
from odc.algo import mask_cleanup, erase_bad, to_f32
from odc.ui import image_aspect

# EASI packages
repo = Path.home() / 'eocsi-hackathon-2022'  # No easy way to get repo directory
if str(repo) not in sys.path: sys.path.append(str(repo))  # sys.path holds strings, so compare as str
from tools.notebook_utils import heading, xarray_object_size, initialize_dask, localcluster_dashboard
from tools.stac_utils import stac_landsat_assets_df, stac_landsat_flags_to_dc

In [None]:
# Setup

# Does this work stand-alone or require an AWS account?
configure_s3_access(requester_pays=True)

# Optional: use EASI SE Asia caching-proxy service
os.environ["AWS_HTTPS"] = "NO"
os.environ["GDAL_HTTP_PROXY"] = "easi-caching-proxy.caching-proxy:80"
print(f'Will use caching proxy at: {os.environ.get("GDAL_HTTP_PROXY")}')

# Optional: Dask
cluster, client = initialize_dask(use_gateway=True, workers=(1,7), wait=False)
if cluster: display(cluster)
if cluster is None or 'LocalCluster' in str(type(cluster)): display(localcluster_dashboard(client))

## Select an area of interest

Here are some example areas of interest. For simplicity they are defined with a bounding box. The `pystac-client` and `odc-stac` packages both accept a geopolygon as well.
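
If a polygon is preferred, a bounding box can be converted to a GeoJSON geometry with a few lines of Python. This is a minimal sketch; `bbox_to_polygon` is a hypothetical helper, not part of either package. The resulting dictionary is the kind of geometry that the `intersects` argument of a `pystac-client` search expects in place of `bbox`.

```python
def bbox_to_polygon(bbox):
    """Convert [west, south, east, north] to a GeoJSON Polygon dictionary."""
    west, south, east, north = bbox
    # Counter-clockwise exterior ring, closed by repeating the first point
    ring = [
        [west, south], [east, south], [east, north], [west, north], [west, south],
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# Same Ha Long extent as used in this notebook
latitude = (20.5, 21.1)
longitude = (106.5, 107.2)
bbox = [longitude[0], latitude[0], longitude[1], latitude[1]]

polygon = bbox_to_polygon(bbox)
print(polygon)
```

This simple conversion assumes the box does not cross the antimeridian, where west would be numerically greater than east.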

In [None]:
# Vietnam - Ha Long
latitude = (20.5, 21.1)
longitude = (106.5, 107.2)
time=('2022-01-01', '2022-06-01')

# PNG Milne Bay
# latitude = (-10.8, -10)
# longitude = (149.7, 150.8)  
# time=('2022-01-01', '2022-03-01')

# Fiji - blows up JHub memory because the area touches the antimeridian
# latitude = (-17.1, -16.2)
# longitude = (178.2, 180.0)
# time=('2020-02-01', '2020-02-20')

# west, south, east, north
bbox = [longitude[0], latitude[0], longitude[1], latitude[1]]

# Display bounding box on a map
display_map(longitude, latitude)

## Search USGS STAC catalog

The following cell contains a fair amount of detail, which has been accumulated from various sources and our own experience with the odc-stac package.
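
One of those details is the URL rewrite. The USGS catalog lists assets under an HTTPS address, and the search cell replaces that prefix with the `s3://usgs-landsat/` bucket so the files are read directly from S3. The sketch below repeats the same string replacement on a made-up asset path (the path after the prefix is hypothetical) to show its effect.

```python
def landsat_patch(uri: str) -> str:
    """Return the S3 version of a LandsatLook HTTPS URI."""
    return uri.replace('https://landsatlook.usgs.gov/data/', 's3://usgs-landsat/')


# Hypothetical asset path, for illustration only
https_uri = 'https://landsatlook.usgs.gov/data/collection02/example/scene_SR_B4.TIF'
s3_uri = landsat_patch(https_uri)
print(s3_uri)

# URIs without the LandsatLook prefix pass through unchanged
other_uri = landsat_patch('https://example.com/file.tif')
print(other_uri)
```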


In [None]:
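do_landsat = True  # Assumed flag (not defined elsewhere in this notebook): set False to skip the Landsat search
if do_landsat: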

    # STAC catalog and query
    catalog = Client.open('https://landsatlook.usgs.gov/stac-server/')
    product = 'landsat-c2l2-sr'
    query_cfg = ["platform=LANDSAT_8", "landsat:collection_category=T1"]

    # Search for available items
    query = catalog.search(
        collections=[product], datetime=f'{time[0]}/{time[1]}', bbox=bbox, query=query_cfg
    )
    items = list(query.get_items())
    print(f"Found: {len(items):d} datasets")

    # Rewrite URLs to use S3
    def landsat_patch(uri: str) -> str:
        """Return the S3 version of the URI"""
        return uri.replace('https://landsatlook.usgs.gov/data/', 's3://usgs-landsat/')

    # Change or update STAC information for use by ODC
    with open(f'{repo}/stac_cfgs/aws/{product}.json') as cfg_file:
        stac_cfg = json.load(cfg_file)

    # `stac_load` parameters
    bands = ('blue', 'green', 'red', 'nir08', 'qa_pixel')
    stac_call = {
        'bands': bands,                            # Optional: selected bands
        'bbox': bbox,                              # Bounding box. Also Geopolygon, GeoBox or None (full extent of items)
        'chunks': {'x': 4096, 'y': 4096},          # Optional: if using Dask
        'groupby': "solar_day",                    # "solar_day" = group scenes on same solar day into same time layer in cube
        'stac_cfg': stac_cfg,
        'patch_url': landsat_patch,
    }

    # Additional Landsat band specifications
    band_specs = {
        'red': {
            'scale': 0.0000275,
            'offset': -0.2
        },
        'nir08': {
            'scale': 0.0000275,
            'offset': -0.2
        },
    }
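
The `groupby="solar_day"` option matters when a single satellite pass straddles midnight UTC, because grouping by UTC date would split the pass into two time layers. The sketch below approximates the idea by shifting UTC time by one hour per 15 degrees of longitude before taking the date. This is an assumption-laden illustration: the `solar_day` function and the acquisition times are hypothetical, and the exact rounding used by odc-stac may differ.

```python
from datetime import datetime, timedelta, timezone


def solar_day(utc_time, longitude):
    """Approximate local solar date: shift UTC by 1 hour per 15 degrees of longitude."""
    offset = timedelta(hours=longitude / 15.0)
    return (utc_time + offset).date()


# Two hypothetical scenes from the same pass near Milne Bay (longitude ~150 E),
# acquired seconds apart on either side of midnight UTC
t1 = datetime(2022, 1, 14, 23, 59, 50, tzinfo=timezone.utc)
t2 = datetime(2022, 1, 15, 0, 0, 14, tzinfo=timezone.utc)

print("UTC dates:  ", t1.date(), t2.date())
print("Solar dates:", solar_day(t1, 150.0), solar_day(t2, 150.0))
```

The two UTC dates differ, while the two approximate solar dates agree, so both scenes would land in the same time layer.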

## Sentinel-2 configuration and settings

In [None]:
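do_sentinel = False  # Assumed flag (not defined elsewhere in this notebook): set True to search Sentinel-2 instead
if do_sentinel: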
    
    # STAC catalog and query
    catalog = Client.open('https://earth-search.aws.element84.com/v0')
    product = 'sentinel-s2-l2a-cogs'
    
    # Search for available items
    query = catalog.search(
        collections=[product], datetime=f'{time[0]}/{time[1]}', bbox=bbox,
    )
    items = list(query.get_items())
    print(f"Found: {len(items):d} datasets")
    
    # Rewrite URLs to use S3
    def patch(uri: str) -> str:
        """Return the Sentinel-2 S3 version of the URI"""
        return uri.replace('https://sentinel-cogs.s3.us-west-2.amazonaws.com/', 's3://sentinel-cogs/')
    
    # Change or update STAC information for use by ODC 
    stac2odc_cfg = {
        "sentinel-s2-l2a-cogs": {
            "assets": {
                "*": {"data_type": "uint16", "nodata": 0},
                "SCL": {"data_type": "uint8", "nodata": 0},
                "visual": {"data_type": "uint8", "nodata": 0},
            },
            "aliases": {"red": "B04", "green": "B03", "blue": "B02"},
        },
        "*": {"warnings": "ignore"},
    }
    
    # `stac_load` parameters
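    stac_call = {
        'bands': ("B04",),
        'crs': crs,                  # NOTE: `crs` is not defined in this notebook; set it first (e.g. an EPSG code)
        'resolution': 30,
        # 'chunks': {},              # <-- use Dask
        # 'groupby': "solar_day",
        'stac_cfg': stac2odc_cfg,    # Configuration dictionary defined above
        'patch_url': patch,
    }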

In [None]:
# Optional: Explore the structure of a STAC item
display(items[0])

# Optional: List available band names and selected details
display(stac_landsat_assets_df(items[0]))

## Load the selected items into an `xarray` cube

In [None]:
xx = stac_load(items, **stac_call)

heading(xarray_object_size(xx))
display(xx)
display(xx.odc.geobox)
aspect = image_aspect(xx)

## Apply pixel quality masking and scaling

Following [Cloud_and_pixel_quality_masking.ipynb](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/blob/main/Frequently_used_code/Cloud_and_pixel_quality_masking.ipynb), we choose the "cloud mask filtered" method to remove false-positive cloud features before further analysis.

> Note that there are some differences in the key or value labels between the Landsat flags definition in the referenced notebook and available from STAC, which is just due to human interpretation. See the Landsat product links at the top of this notebook for further information.
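
The masking cell relies on `qa_pixel` being a bit field, where each bit position carries one flag. The sketch below shows the underlying arithmetic with plain integers. The bit positions (cirrus = 2, cloud = 3, cloud shadow = 4, clear = 6, water = 7) are taken from the USGS Collection 2 QA_PIXEL definition as an assumption; in the notebook they come from the STAC `classification:bitfields` metadata instead. The pixel values are synthetic.

```python
# Bit positions assumed from the USGS Collection 2 QA_PIXEL definition
CIRRUS_BIT, CLOUD_BIT, SHADOW_BIT = 2, 3, 4
CLEAR_BIT, WATER_BIT = 6, 7

# One combined mask with a bit set for each flag of interest
mask = (1 << CIRRUS_BIT) | (1 << CLOUD_BIT) | (1 << SHADOW_BIT)


def is_cloudy(qa_value: int) -> bool:
    """True if ANY of the cirrus, cloud or shadow bits is set."""
    return (qa_value & mask) != 0


# Synthetic pixel values, each with only the named bits set
pixels = {
    "clear": 1 << CLEAR_BIT,
    "cloud": 1 << CLOUD_BIT,
    "shadow over water": (1 << SHADOW_BIT) | (1 << WATER_BIT),
}
for label, value in pixels.items():
    print(f"{label}: qa={value:016b} cloudy={is_cloudy(value)}")
```

Because the test is "bitwise AND is non-zero", a pixel is flagged when any one of the selected bits is set, which is the logical OR behaviour wanted for a cloud mask.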

In [None]:
# Convert pixel quality flag descriptions to ODC format
flags_def = stac_landsat_flags_to_dc(
    items[0].assets['qa_pixel'].to_dict().get('classification:bitfields')
)
# heading('"qa_pixel" flags definition')
# display(flags_def)

# Cloud mask flags (combined below as a logical OR: a pixel is masked if any flag is set)
quality_flags = {
    'cloud': 'cloud',    # True where there is cloud
    'cirrus': 'cirrus',  # True where there is cirrus cloud
    'shadow': 'shadow',  # True where there is cloud shadow 
}

# Set bit mask: True=cloud, False=non-cloud
mask, _= masking.create_mask_value(flags_def, **quality_flags)

# Add the cloud mask to our dataset
xx['cloud_mask'] = (xx['qa_pixel'] & mask) != 0  # bitwise-and != 0 simulates an 'or'

# Apply morphological processing on the cloud mask
filters = [("opening", 2),("dilation", 2)]
xx['cloud_mask_filtered'] = mask_cleanup(xx['cloud_mask'], mask_filters=filters)

# Apply the cloud mask to the data variables
clear_filtered = erase_bad(xx.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'qa_pixel']),
                           xx['cloud_mask_filtered'])

# Apply scale and offset
scale = stac_cfg[product]['assets']['*']['scale']
offset = stac_cfg[product]['assets']['*']['offset']
clear_filtered = (clear_filtered * scale + offset).astype(np.float32)
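
As a worked example of the scale and offset step, the sketch below applies the surface reflectance values listed earlier in this notebook (scale 0.0000275, offset -0.2) to a few integer pixel values. The pixel values themselves are chosen for illustration.

```python
SCALE = 0.0000275
OFFSET = -0.2


def to_reflectance(dn: int) -> float:
    """Convert a stored integer value (digital number) to surface reflectance."""
    return dn * SCALE + OFFSET


# Illustrative digital numbers: two near the ends of the 0..1 reflectance range, one in between
for dn in (7273, 20000, 43636):
    print(dn, round(to_reflectance(dn), 4))

# A stored value of 0 (the nodata value) maps to the offset itself
print(0, to_reflectance(0))
```

The last line is a reminder that nodata pixels become -0.2 after scaling unless they are masked or replaced with NaN first.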

## Dask: compute results

If using Dask then the above calculations may have been queued but not yet calculated. Dask will run the calculations when required, for example when creating an image or writing to a file. In this case, use `persist()` to force the calculations.
- https://distributed.dask.org/en/stable/manage-computation.html

In [None]:
xx = xx.persist()

clear_filtered = clear_filtered.persist()

In [None]:
# Optional: Check the data structures and sizes

# heading(xarray_object_size(xx))
# display(xx)

# heading(xarray_object_size(clear_filtered))
# display(clear_filtered)

## Optional plot summaries

These simpler "matplotlib backend" plot functions (`rgb`, `xx.plot`) render images, from the full data arrays, on the Jupyter notebook server. This may exceed the available Jupyter kernel memory depending on the size of the area of interest (size of the dataset). As such these methods are best suited to demonstrating smaller areas of interest.

To visualise larger areas of interest consider either subsampling the data arrays ([xarray indexing](https://docs.xarray.dev/en/stable/user-guide/indexing.html)) or try the [Holoviz](https://holoviz.org/) stack:
- https://datashader.org/getting_started/Pipeline.html
- https://examples.pyviz.org/landsat/landsat.html

In [None]:
# Optional: Plot an RGB time series
rgb(xx, ['red', 'green', 'blue'], col='time', col_wrap=4)

# Optional: plot the masked and scaled data
rgb(clear_filtered, ['red', 'green', 'blue'], col='time', col_wrap=4)

## Export to netCDF or Geotiff

## Close the Dask cluster

It's good practice to shut down the Dask cluster when the work is complete (no further work on the loaded and processed data).

In [None]:
# client.close()
# if cluster: cluster.close()