# Geocube SDK Tutorial

-------

#### Short description

This tutorial introduces the SDK of the Geocube Python Client through a complete Geocube workflow. You will see a typical workflow and how to parallelize an image processing algorithm.
The tutorial is split into two parts: the first presents the SDK, and the second gives an example of a workflow.

-------

#### Requirements

- Python 3.7
- The Geocube Python Client library: https://github.com/airbusgeo/geocube-client-python.git
- The url of a [Geocube Server](https://github.com/airbusgeo/geocube.git) & its Client ApiKey (for the purpose of this notebook, the `GEOCUBE_SERVER` and `GEOCUBE_CLIENTAPIKEY` environment variables) - [Installation](https://github.com/airbusgeo/geocube/blob/main/INSTALL.MD)
- The url of a [Geocube Downloader](https://github.com/airbusgeo/geocube/blob/main/INSTALL.MD#Downloader) (for the purpose of this notebook, `GEOCUBE_DOWNLOADER` environment variable)

-------

#### Table of content

- [Part 1 - SDK](#SDK)
  * [Connection Parameters](#Connection)
  * [Downloader Service](#Downloader-Service)
  * [Catalogue Functions](#Catalogue-functions)
  * [Multiprocess and Dask](#Multiprocess-and-Dask)
  * [XArray and Collection](#XArray-and-Collection)
  
  
- [Part 2 - Geocube Workflow by example: Abnormal change detection](Geocube-Client-SDK-2.ipynb#Table-of-content) (Geocube-Client-SDK-2.ipynb)


#### Notebook initialisation

In [None]:
from datetime import datetime
from shapely import geometry
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import os

from geocube import utils, entities, Consolidater, sdk

# Define the connection to the server
secure = False  # False for local use; set to True to use TLS
geocube_client_server  = os.environ['GEOCUBE_SERVER']        # e.g. 127.0.0.1:8080 for local use
geocube_client_api_key = os.environ['GEOCUBE_CLIENTAPIKEY']  # Usually empty for local use
geocube_downloader_server = os.environ['GEOCUBE_DOWNLOADER']


# SDK
Geocube Client Python provides several functions to easily scale up an image processing pipeline.
In particular, Geocube implements an [xarray](https://xarray.pydata.org/en/stable/) backend to access geocube images using the standard `xarray.Dataset`.

As scaling up a pipeline implies parallel computing, the Geocube entities are picklable so that they can be transferred between processes.

### Connection
A `geocube.Client` is not picklable, so it cannot be sent to another process as is. `geocube.sdk` provides a convenient way to pass connection parameters instead: `sdk.ConnectionParams`. This class takes the same parameters as `geocube.Client` and has a method `new_client()` that connects to the server.

In [None]:
# Define the parameters to connect to the Geocube server
connection_params = sdk.ConnectionParams(geocube_client_server, secure, geocube_client_api_key)

# Connect to the server
client = connection_params.new_client()
print("Connected to server: ", client.version())
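
The reason for passing parameters rather than a client can be illustrated with the standard library alone. The sketch below is not the SDK's implementation: `FakeClient` and `FakeConnectionParams` are hypothetical stand-ins, and the lock only simulates the kind of unpicklable resource a real client may hold.

```python
import pickle
import threading
from dataclasses import dataclass


class FakeClient:
    """Hypothetical stand-in for a client that owns an unpicklable resource."""

    def __init__(self, server: str):
        self.server = server
        self._lock = threading.Lock()  # locks cannot be pickled


@dataclass(frozen=True)
class FakeConnectionParams:
    """Hypothetical stand-in for sdk.ConnectionParams: plain, picklable data."""

    server: str
    secure: bool = False
    api_key: str = ""

    def new_client(self) -> FakeClient:
        # Each process builds its own client from the parameters.
        return FakeClient(self.server)


params = FakeConnectionParams("127.0.0.1:8080")

# The parameters survive a pickle round trip...
restored = pickle.loads(pickle.dumps(params))

# ...whereas the client itself does not.
try:
    pickle.dumps(params.new_client())
    client_is_picklable = True
except (TypeError, pickle.PicklingError):
    client_is_picklable = False
```

In a parallel workflow, each worker would receive the parameters and call `new_client()` locally.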

### Downloader Service
Because the Geocube Server can be easily overwhelmed by a lot of connections, two solutions are possible to massively download data:
- deploy several Geocube servers
- run (locally or remotely) a lighter service that is in charge of downloading the images using metadata returned by the Geocube Server.

The latter case is handled by the [Geocube Downloader service](https://github.com/airbusgeo/geocube/blob/main/cmd/downloader). It can be run as a Docker container:
```bash
export STORAGE=[...]
docker run --rm -p 127.0.0.1:8081:8081/tcp -v $STORAGE:$STORAGE geocube-downloader -local -port 8081
```
Don't forget to mount all the local storage folders and to grant the proper access rights to the remote storages.
More details are available in the [Geocube INSTALL.MD](https://github.com/airbusgeo/geocube/blob/main/INSTALL.MD#downloader).

The Downloader Service needs a `CubeMetadata` to request a cube. It can be retrieved from the Geocube Server using the `get_cube_metadata()` function.

In [None]:
from geocube import Downloader
downloader = Downloader(geocube_downloader_server)


cube_params = entities.CubeParams.from_tile(
    tile          = entities.Tile.from_bbox((573440., 6184960.,  593920., 6205440), crs=32632, resolution=20),
    instance      = client.variable("RGB").instance("master"),
    tags          = {"source":"tutorial"},
    from_time     = datetime(2019, 1, 1),
    to_time       = datetime(2019, 5, 1),
)

# Request metadata
metadata = client.get_cube_metadata(cube_params)

# Download the cube described by metadata
images, _ = downloader.get_cube(metadata)

print("Finished !")

#### Metadata
`CubeMetadata` and `CubeMetadata.slices` contain all that is necessary to download and format the cube and the 2d-slices of data: the underlying files, their internal data format, the mapping to `CubeMetadata.ref_dformat`, the records, the transform and the crs, the resampling algorithm...

This metadata can be used to access the underlying files directly or to change the resampling algorithm.

In [None]:
for i, s in enumerate(metadata.slices):
    print(f"Image {i}:")
    for file_metadata in s.metadata:
        print(f"   - file {file_metadata.container_uri}, subdir {file_metadata.container_subdir}.")

A client can be linked to a downloader and the former will automatically use the latter to download a cube of data.

In [None]:
client.use_downloader(downloader)

# Or using ConnectionParams
connection_params = sdk.ConnectionParams(geocube_client_server, secure, geocube_client_api_key,
                                         downloader=sdk.ConnectionParams(geocube_downloader_server))
client = connection_params.new_client()

_ = client.get_cube(cube_params, verbose=True)

### Catalogue functions
`sdk.get_cube()` is a convenient function to process a cube of data in a parallel workflow.
It takes `ConnectionParams` and `CubeParams`, downloads the cube and processes it.

Two callback functions can be passed as parameters of this function:
- `image_callback` will be called for every image received by `get_cube` (see `sdk.image_callback_t`). It has to return an image and takes as parameters:
  * `image`
  * `grouped_records` (optional)
  * `crs` or `projection`  (optional)
  * `transform` (optional)
- `cube_callback` will be called with the results of a `get_cube` (see `sdk.cube_callback_t`). The result of this function will be the returned value of `get_cube()`. It takes as parameters:
  * `timeseries` or `cube`
  * `grouped_records` (optional)
  * `crs` or `projection` (optional)
  * `transform` (optional): geotransform

All these fields will be automatically provided by `get_cube()` if they are defined.

`sdk.get_cube` is equivalent to:
```python
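# Simplified pseudocode for sdk.get_cube()
def get_cube():
    cube = []
    for image in client.get_cube():
        # The bracketed arguments are optional: they are passed only if defined
        image = image_callback(image, [grouped_records, crs, projection, transform])
        cube.append(image)
    return cube_callback(cube, [grouped_records, crs, projection, transform])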
```
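
The optional callback parameters described above can be emulated by inspecting the callback's signature. The following is a minimal sketch of that idea, not the SDK's actual code: `call_with_declared` and `get_cube_sketch` are hypothetical names, and plain lists stand in for images.

```python
import inspect


def call_with_declared(callback, first, **optional):
    """Call callback(first, ...) with only the optional keyword
    arguments that the callback's signature declares."""
    declared = inspect.signature(callback).parameters
    kwargs = {k: v for k, v in optional.items() if k in declared}
    return callback(first, **kwargs)


def get_cube_sketch(images, image_callback=None, cube_callback=None, **optional):
    """Apply image_callback to each image, then cube_callback to the whole cube."""
    cube = []
    for image in images:
        if image_callback is not None:
            image = call_with_declared(image_callback, image, **optional)
        cube.append(image)
    if cube_callback is None:
        return cube
    return call_with_declared(cube_callback, cube, **optional)


def double(image):
    # Declares no optional parameter: receives the image only
    return [2 * v for v in image]


def total(timeseries, crs):
    # Declares `crs`: receives it, but not `transform`
    return crs, sum(sum(image) for image in timeseries)


result = get_cube_sketch(
    [[1, 2], [3, 4]],
    image_callback=double,
    cube_callback=total,
    crs=32632,
    transform=None,
)
```

Because `inspect.signature` also understands `functools.partial` objects, a callback with extra bound arguments can be dispatched the same way.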

In [None]:
import numpy as np
from geocube import utils
import functools

def mean(timeseries, projection, transform, instance):
    """ Compute the mean of a timeseries over time and save it as a GeoTiff"""
    if len(timeseries) != 0:
        m = np.nanmean(timeseries, axis=-1)
        filename="outputs/" + ("_".join([f"{v:.2f}" for v in transform.to_gdal()]))+".tiff"
        utils.image_to_geotiff(m, transform, projection, instance.dformat.no_data, filename)
        return filename
    return None

# Download cube and call mean().
sdk.get_cube(connection_params, cube_params, cube_callback=functools.partial(mean, instance=client.variable("RGB").instance("master")))
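
`np.nanmean` only ignores values that are actually `NaN`. The sketch below shows a temporal mean on a tiny array, assuming (this is not stated by the SDK) that no-data pixels are encoded with a sentinel value and that time is the last axis, as in `mean()` above. `NO_DATA` is a hypothetical value.

```python
import numpy as np

NO_DATA = 0.0  # hypothetical no-data sentinel

# 2x2 pixels, 3 dates; time is the last axis
timeseries = np.array([
    [[10.0, 20.0, 0.0], [4.0, 4.0, 4.0]],
    [[0.0, 6.0, 8.0], [1.0, 2.0, 3.0]],
])

# Replace the sentinel by NaN so that nanmean skips those observations
masked = np.where(timeseries == NO_DATA, np.nan, timeseries)

# Per-pixel mean over time, ignoring no-data observations
temporal_mean = np.nanmean(masked, axis=-1)
```

Without the masking step, the sentinel values would be averaged like any other observation.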



### Multiprocess and Dask
Like any picklable user-defined function, `sdk.get_cube()` can be used for parallel processing.
With `dask.delayed`, a pool of tasks can be easily created:

In [None]:
import dask
import functools

# Define the AOI, the records and the instance
aoi = utils.read_aoi('inputs/Denmark.json')
records = client.list_records(aoi=aoi, tags={"source":"tutorial"})
instance = client.variable("RGB").instance("master")

# Tile the AOI and create the pool of tasks
tiles = client.tile_aoi(aoi, resolution=20, crs="epsg:32632", shape=(1024,1024))
tasks = [dask.delayed(sdk.get_cube)(
    connection_params=connection_params,
    cube_params=entities.CubeParams.from_tile(t, records=records, instance=instance),
    cube_callback=functools.partial(mean, instance=instance))
         for t in tiles]


plt.rcParams['figure.figsize'] = [20, 16]
print(f"Plot {len(tiles)} tiles...")
entities.Tile.plot(tiles)
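
The tile-then-map pattern used above can be sketched without Geocube or Dask. The example below swaps in the standard-library `ThreadPoolExecutor` for Dask, and `tile_bbox` and `process_tile` are hypothetical helpers: `client.tile_aoi` may align and clip tiles differently.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def tile_bbox(bbox, resolution, shape):
    """Split (xmin, ymin, xmax, ymax) into tiles of `shape` pixels at `resolution`."""
    xmin, ymin, xmax, ymax = bbox
    width, height = shape[0] * resolution, shape[1] * resolution
    nx = math.ceil((xmax - xmin) / width)
    ny = math.ceil((ymax - ymin) / height)
    return [
        (xmin + i * width, ymin + j * height,
         xmin + (i + 1) * width, ymin + (j + 1) * height)
        for j in range(ny)
        for i in range(nx)
    ]


def process_tile(tile):
    """Placeholder for the per-tile work (e.g. download a cube and reduce it)."""
    xmin, ymin, xmax, ymax = tile
    return (xmax - xmin) * (ymax - ymin)  # tile area in square map units


# A 50 km x 30 km box, tiled in 1024x1024-pixel tiles at 20 m resolution
tiles = tile_bbox((0, 0, 50_000, 30_000), resolution=20, shape=(1024, 1024))

# One task per tile, executed by a pool of workers
with ThreadPoolExecutor(max_workers=4) as pool:
    areas = list(pool.map(process_tile, tiles))
```

Each tile is independent of the others, which is what makes the per-tile tasks straightforward to distribute.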

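The tasks can be computed with one of Dask's local schedulers, selected through `dask.config.set(scheduler=...)` (here the single-threaded `synchronous` scheduler, which is convenient for debugging):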

In [None]:
dask.config.set(scheduler='synchronous')
dask.compute(tasks)
print("done")


Or with `dask.distributed`, which can be installed with:
```shell
python -m pip install "dask[distributed]"
```

In [None]:
from dask.distributed import Client
from tqdm import tqdm
import time
c = None

try:
    c = Client(n_workers=4)
    futures = c.compute(tasks)

    with tqdm(total=len(futures)) as pbar:
        while len(futures) > 0:
            nf = []
            for f in futures:
                if f.status=="finished":
                    pbar.update(1)
                else:
                    nf.append(f)
                    if f.status == "error":
                        print("retry", f)
                        f.retry()
            time.sleep(0.5)
            futures = nf
    print("done")
finally:
    if c is not None:
        c.close()
    


### XArray and Collection
A `Collection` describes a collection of datasets corresponding to several variables and a set of records.
It is readable by `xarray`:

In [None]:
import xarray
from geocube import sdk, entities
import matplotlib.pyplot as plt
import math

# Create a tile
record = client.list_records("S2B_MSIL1C_20190105T103429_N0207_R108_T32UNG_20190105T122413", with_aoi=True)[0]
tile = entities.Tile.from_record(record=record, crs="epsg:32632", resolution=20)

# Select variables
instances = [
    client.variable("RGB").instance("master"),
    client.variable("NDVI").instance("master")
]

# Select records
records = client.list_records(tags={'source':'tutorial','constellation':'SENTINEL2'})

# Create collection
collection = sdk.Collection.from_tile(tile, records=records, instances=instances)

# Open collection
#ds = xarray.open_dataset(collection, connection_params=connection_params, block_size=(256, 256), engine=sdk.GeocubeBackendEntrypoint)
ds = sdk.open_geocube(collection, connection_params=connection_params, block_size=(256, 256))

# Get timeseries of RGB
images = ds["RGB:master"][0:1000,3000:4000].compute()
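
To get a feel for what `block_size` implies for a window such as `[0:1000, 3000:4000]`, the sketch below counts the blocks that the window touches. It assumes that blocks form a regular grid aligned to the array origin, as with Dask chunks; this is an assumption about the backend, not something this notebook confirms, and `blocks_for_window` is a hypothetical helper.

```python
def blocks_for_window(first_axis, second_axis, block_size):
    """Return the (i, j) indices of every block touched by a window.

    The window is given as two half-open slices; blocks are assumed to be
    aligned to the array origin.
    """
    size_i, size_j = block_size
    first_i, last_i = first_axis.start // size_i, (first_axis.stop - 1) // size_i
    first_j, last_j = second_axis.start // size_j, (second_axis.stop - 1) // size_j
    return [
        (i, j)
        for i in range(first_i, last_i + 1)
        for j in range(first_j, last_j + 1)
    ]


# Same window and block size as in the cell above
blocks = blocks_for_window(slice(0, 1000), slice(3000, 4000), block_size=(256, 256))
```

Under that assumption, the 1000x1000 window overlaps 4 x 5 = 20 blocks of 256x256 pixels, so slightly more data than the window itself would be fetched.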


In [None]:

nbimages=images.shape[3]
plt.rcParams['figure.figsize'] = [20, 16]
for f in range(0, nbimages):
    plt.subplot(math.ceil(nbimages/4), 4, f+1).set_axis_off()
    plt.imshow(images[:,:,:, f]/255)

# Next step: demonstration of a full workflow by example
In the [next notebook](Geocube-Client-SDK-2.ipynb), the SDK is used to address a simple but typical use case: abnormal change detection.