# GeoStacks: a library for efficient query and stacking of satellite remote sensing data sets

## Author(s)

- Author1 = {"name": "Shane Grigsby",     "affiliation": "UMD/NASA Goddard", "email": "grigsby@umd.edu", "orcid": "0000-0003-4904-7785"}
- Author2 = {"name": "Whyjay Zheng",      "affiliation": "UC Berkeley",      "email": "whyjz@berkeley.edu", "orcid": "0000-0002-2316-2614"}
- Author3 = {"name": "Jonathan Taylor",   "affiliation": "Stanford",         "email": "jonathan.taylor@stanford.edu", "orcid": "0000-0002-1716-7160"}
- Author4 = {"name": "Facundo Sapienza",  "affiliation": "UC Berkeley",      "email": "fsapienza@berkeley.edu", "orcid": "0000-0003-4252-7161"}
- Author5 = {"name": "Tasha Snow",        "affiliation": "Mines",            "email": "tsnow@mines.edu", "orcid": "0000-0001-5697-5470"}
- Author6 = {"name": "Fernando Pérez",    "affiliation": "UC Berkeley",      "email": "fernando.perez@berkeley.edu", "orcid": "0000-0002-1725-9815"}
- Author7 = {"name": "Matthew Siegfried", "affiliation": "Mines",            "email": "siegfried@mines.edu", "orcid": "0000-0002-0868-4633"}

## Purpose
This notebook demonstrates two related tasks for Earth Science investigations using remote sensing data:

 1. Data discovery
 2. Scriptable and reproducible data retrieval
 
These tasks are broadly the 'data ingest' portion of Earth science research. Although reproducible science tends to focus on publicly available code that produces consistent and traceable outputs, most data ingest tasks are manual. A Jupyter notebook might execute on a directory of data (Landsat or MODIS files), but that subset of data is often collected by hand. This breaks the chain of reproducibility: is a collaborator or other investigator using the same input files, from the same data collection? Furthermore, it adds substantial overhead to validating or expanding an analysis.

In this notebook we demonstrate a method of data retrieval that is scriptable, so that an analysis notebook can contain both the analytic code and the calls that manifest the data it uses, in a consistent and reproducible way.

## Technical contributions

We demonstrate a simple use case using Landsat 8 data---pulling a time series of data for a location and calculating estimated cloud cover over the area. Although we use Level 3 Landsat data, the process would be similar for Level 3 MODIS data and can be adapted for similar retrieval of Level 2 swath data products. The process shown includes an interactive data discovery stage (i.e., determining the scene or scenes of interest). If the study area is already known, these first steps can be omitted and just the retrieval code can be included.

Part of reproducible and open science is reducing barriers to adoption. Some of the techniques that we show have mature analogues that already exist in the community. We use Landsat data in this demonstration, so a clear question is: why not use the USGS and NASA data discovery tools? There are multiple online viewers that facilitate data discovery and download (a confusing number, at times); it is also possible to use file lists and/or generated download scripts to recreate dataset directories in an automated way. So why this library? Several reasons motivate development:

 - LandLook and similar interactive viewers are not scriptable in terms of data download
 - EarthData portals can produce download scripts, but require login credentials
 - All of the viewers and existing retrieval methods are multi-step:
   - File lists must be parsed and pointed to data source repositories
   - Auto-generated scripts must be configured with credentials
   - Retrievals are only to disk and not to memory, requiring further scripting
 - None of the existing methods pull data in a way conducive to a cloud workflow

We weave several modern projects together for a cohesive remote sensing data retrieval workflow. Specifically, we use the following libraries:

 - Intake for catalog templating and semantic dataset access
 - Xarray and Dask for distributed representation of data rasters (via Intake-xarray)
 - ipyleaflet for display of scene footprints and data exploration
 - Pandas for metadata access and management
 - sklearn for fast spatial indexing in spherical coordinates
 - GeoStacks, our library to glue everything together


## Methodology

Landsat 8 data is stored on the fixed Worldwide Reference System 2 (WRS-2) grid, which uses path and row coordinates to grid data acquisitions. We use a multi-year catalog of corrected scene center and corner locations to index available data scenes. Given a query lat/lon point, a two-step search is conducted: haversine distances first prune the spatial index tree in an approximate manner, and point-in-polygon checks against the scene footprints then refine the result. This search process accomplishes several objectives:

 - Finds all relevant scenes. Scene overlap increases towards the poles, so a single site in the polar regions may be covered by multiple data acquisitions.
 - Removes false positives by ensuring that selected scenes have actual data at the point of interest. This matters because the metadata of the raster files describe the bounds of the data grid, but the bearing of the flight path leaves large no-data areas in the corners of the data grid when the image is projected onto the Earth's surface; we use the acquisition footprints to select only granules that have valid data at the query point.
 - Finds only valid scenes. Landsat does not acquire data at all path/row combinations; some combinations are over ocean, or fall late enough on the night side that no data is telemetered for that path/row.

Data can be selected for either single time steps, or time ranges. 
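
The two-step search described above can be sketched with the standard library alone. This is a minimal illustration, not the GeoStacks implementation: a brute-force haversine filter stands in for the tree-based index, the point-in-polygon test is a planar ray-casting check in lon/lat space (it ignores the antimeridian and the poles), and the 200 km radius and the names `haversine_km`, `point_in_polygon`, and `query` are assumptions made for this example. The two scenes are rows 695 and 0 of the footprint table shown later in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_in_polygon(lat, lon, corners):
    """Planar ray-casting test; corners is a ring of (lat, lon) pairs."""
    inside = False
    n = len(corners)
    for i in range(n):
        lat_i, lon_i = corners[i]
        lat_j, lon_j = corners[(i + 1) % n]
        # Does this edge straddle the query latitude?
        if (lat_i > lat) != (lat_j > lat):
            lon_cross = lon_i + (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i)
            if lon < lon_cross:
                inside = not inside
    return inside


# Scene centers and footprint corners (ring order: UL, UR, LR, LL)
scenes = [
    {"path": 8, "row": 11, "center": (69.60647, -47.813174),
     "corners": [(70.756261, -49.045943), (69.997303, -44.451408),
                 (68.439813, -46.725544), (69.154025, -51.050718)]},
    {"path": 1, "row": 2, "center": (80.002493, -4.197763),
     "corners": [(81.205697, -2.730017), (79.717460, 2.594563),
                 (78.789287, -5.405237), (80.144957, -11.291900)]},
]


def query(lat, lon, scenes, radius_km=200.0):
    """Step 1: coarse distance filter. Step 2: exact footprint check."""
    nearby = [s for s in scenes
              if haversine_km(lat, lon, *s["center"]) <= radius_km]
    return [(s["path"], s["row"]) for s in nearby
            if point_in_polygon(lat, lon, s["corners"])]


print(query(69, -50, scenes))      # [(8, 11)]
print(query(70.7, -44.6, scenes))  # []
```

The second query point lies within 200 km of the path 8 / row 11 scene center and inside the bounding box of its corners, but outside the rotated footprint, so the polygon check rejects it. This is the no-data-corner case that the footprint test is meant to catch.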


In [1]:
# Autoreload extension
%load_ext autoreload
%autoreload 2

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
cd ../..

/home/espg/software/blanktest/GeoStacks


In [3]:
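# Load geostacks
from geostacks import SpatialIndexLS8, GeoStacksUI

# Standard library
import time
import urllib
from datetime import date

# Data handling (pd and xr are used in the retrieval cells below)
import pandas as pd
import xarray as xr

# Load dask
from dask.distributed import Client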

## Spatial Index Object

The base spatial indexing object combines several features of the sensor data catalog. It includes the footprint database of all valid Landsat 8 path/row combinations:

In [4]:
ls8 = SpatialIndexLS8()
ls8.data

| | path | row | lat_CTR | lon_CTR | lat_UL | lon_UL | lat_UR | lon_UR | lat_LL | lon_LL | lat_LR | lon_LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 80.002493 | -4.197763 | 81.205697 | -2.730017 | 79.717460 | 2.594563 | 80.144957 | -11.291900 | 78.789287 | -5.405237 |
| 1 | 1 | 3 | 79.111023 | -10.561457 | 80.332344 | -9.994770 | 78.957946 | -4.156684 | 79.130143 | -17.075937 | 77.882976 | -11.061501 |
| 2 | 1 | 4 | 78.118527 | -15.970556 | 79.344246 | -16.045440 | 78.079664 | -10.018548 | 78.034189 | -21.909139 | 76.886141 | -15.950291 |
| 3 | 1 | 5 | 77.048224 | -20.471403 | 78.269010 | -20.978442 | 77.105158 | -14.988030 | 76.879455 | -25.870470 | 75.819367 | -20.086304 |
| 4 | 1 | 6 | 75.902095 | -24.338152 | 77.113394 | -25.133782 | 76.041404 | -19.307509 | 75.661913 | -29.246411 | 74.680825 | -23.697966 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21898 | 233 | 242 | 80.008794 | 44.207091 | 80.154264 | 51.282248 | 78.799314 | 45.389529 | 81.215211 | 42.713538 | 79.726284 | 37.379350 |
| 21899 | 233 | 243 | 80.760793 | 36.728885 | 81.052819 | 44.273388 | 79.584925 | 38.823265 | 81.923478 | 34.042857 | 80.327864 | 29.693888 |
| 21900 | 233 | 244 | 81.338812 | 28.123821 | 81.798007 | 35.890003 | 80.221703 | 31.325044 | 82.424145 | 24.005498 | 80.744047 | 21.177876 |
| 21901 | 233 | 245 | 81.705630 | 18.551148 | 82.326232 | 26.541583 | 80.663323 | 23.323882 | 82.678136 | 12.476933 | 80.951763 | 11.702585 |


It also includes the Intake catalog (more on that later), and a spatial index data structure to query the catalog. A query on the object returns a subset of the path/row combinations that bound the query point:

In [5]:
idxs = ls8.query_pathrow(69, -50)   # lat, lon
idxs.data

| | path | row | lat_CTR | lon_CTR | lat_UL | lon_UL | lat_UR | lon_UR | lat_LL | lon_LL | lat_LR | lon_LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 695 | 8 | 11 | 69.60647 | -47.813174 | 70.756261 | -49.045943 | 69.997303 | -44.451408 | 69.154025 | -51.050718 | 68.439813 | -46.725544 |
| 696 | 8 | 12 | 68.279699 | -49.50613 | 69.419648 | -50.743468 | 68.700888 | -46.367352 | 67.801954 | -52.529493 | 67.122236 | -48.403704 |
| 782 | 9 | 11 | 69.606482 | -49.355231 | 70.757156 | -50.589565 | 69.99752 | -45.990513 | 69.153716 | -52.595528 | 68.438861 | -48.266305 |
| 783 | 9 | 12 | 68.279697 | -51.046291 | 69.420484 | -52.285488 | 68.700913 | -47.904523 | 67.801817 | -54.072478 | 67.121321 | -49.942257 |
| 866 | 10 | 11 | 69.606439 | -50.895321 | 70.756467 | -52.1327 | 69.996319 | -47.530333 | 69.154836 | -54.136166 | 68.439445 | -49.803544 |
| 6845 | 81 | 233 | 69.606467 | -48.202833 | 69.158069 | -44.965508 | 68.442808 | -49.295142 | 70.763699 | -46.977141 | 70.002875 | -51.580448 |
| 6915 | 82 | 233 | 69.60644 | -49.749759 | 69.160805 | -46.513479 | 68.444886 | -50.847126 | 70.761497 | -48.518743 | 70.00013 | -53.125468 |
| 7017 | 83 | 232 | 68.279726 | -49.604053 | 67.806027 | -46.581815 | 67.125546 | -50.709373 | 69.426907 | -48.374276 | 68.706648 | -52.756036 |
| 7018 | 83 | 233 | 69.606447 | -51.293647 | 69.158697 | -48.052981 | 68.442311 | -52.389424 | 70.764101 | -50.06405 | 70.002091 | -54.674559 |
| 7087 | 84 | 232 | 68.279691 | -51.145138 | 67.808695 | -48.124313 | 67.127863 | -52.255251 | 69.424457 | -49.910379 | 68.70397 | -54.294948 |


In [6]:
cpanel = GeoStacksUI(spatial_index=ls8)
cpanel.gen_ui()

AppLayout(children=(VBox(children=(HTML(value='<h2>Drag the marker to your region of interest</h2>'), Select(d…

In [7]:
# helper functions
def JD(year, month, day):
    "converts a calendar date to its day of year (1-366)"
    # date arithmetic avoids the local-time/UTC mismatch of mktime + gmtime
    return date(year, month, day).timetuple().tm_yday

def pad(number, length):
    "casts number to a string, left-padded with zeros to the given length"
    return str(number).zfill(length)
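
The helpers above exist to build the zero-padded strings that the catalog request later in this notebook takes (`doy`, `path`, `row`). As a hedged sketch of the same idea using only the standard library, the snippet below assembles those keyword arguments for one date; `catalog_params` is a hypothetical helper name introduced for illustration, not part of GeoStacks.

```python
from datetime import date


def catalog_params(day, path, row):
    """Hypothetical helper: keyword arguments for one scene request."""
    return {
        "year": day.year,
        "doy": day.strftime("%j"),  # zero-padded day of year
        "path": f"{path:03d}",      # WRS-2 path as a 3-character string
        "row": f"{row:03d}",        # WRS-2 row as a 3-character string
    }


params = catalog_params(date(2014, 4, 1), path=3, row=10)
print(params)
# {'year': 2014, 'doy': '091', 'path': '003', 'row': '010'}
```

Formatting with `strftime("%j")` and `:03d` keeps the padding rule in one place, so path and row can stay integers (as they appear in the footprint table) until the request is made.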


In [8]:
client = Client(processes=True, n_workers=4, threads_per_worker=1)
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:39325 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 4 |
| | Memory: 16.55 GB |


In [7]:
idxs.intake

'/home/whyj/Projects/Github/GeoStacks/geostacks/sensors/ls8.yaml'

In [22]:

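import intake

# Assumption: `cat` is never defined in this notebook; here it is taken to be
# the Intake catalog opened from the YAML path that `idxs.intake` returns.
cat = intake.open_catalog(idxs.intake)

date_range = pd.date_range(start='2014-01-01', end='2014-04-01')


# Clean the data: label the single band with its acquisition time, then rename band -> time
def preprocess(ds, value):
    ds["band"] = [value.to_numpy()]
    ds = ds.rename({'band': 'time'})
    return ds


def retrieve_dataset(value):
    "lazily opens the path 003 / row 010 scene for one date; returns None if it cannot be opened"
    try:
        doy_ = pad(JD(value.year, value.month, value.day), 3)
        ds = cat.landsat8aws(year=value.year, doy=doy_, path="003", row="010").to_dask()
        return preprocess(ds, value)
    except Exception:
        # Dates with no acquisition for this path/row fail to open and are skipped
        return None


# Try every date in parallel, keep the dates that returned data, and stack them along time
datasets = client.map(retrieve_dataset, date_range)
datasets = client.gather(datasets)
datasets = [dataset for dataset in datasets if dataset is not None]
ds = xr.concat(datasets, dim='time', compat='override', coords='minimal').squeeze()
ds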


| | Array | Chunk |
|---|---|---|
| Bytes | 2.42 GB | 2.10 MB |
| Shape | (4, 8701, 8691) | (1, 512, 512) |
| Count | 16462 Tasks | 3332 Chunks |
| Type | float64 | numpy.ndarray |


In [23]:
import hvplot.xarray

width = 800
height = 400
widget_type = 'scrubber'
widget_location = 'bottom'


ds.hvplot.image(
    rasterize=True,
    aspect='equal',
    x="x",
    y="y",
    cmap='gray',
    clim=(4000, 6500),
    width=width,
    height=height,
    widget_type=widget_type,
    widget_location=widget_location,
)