# Geocube Data Indexation Tutorial

-------

**Short description**

This notebook introduces you to the Geocube Python Client. You will learn how to index images in the Geocube.

-------

**Requirements**

-------

- Python 3.7
- The Geocube Python Client library : https://github.com/airbusgeo/geocube-client-python.git
- The Geocube Server & Client ApiKey (for the purpose of this notebook, set the GEOCUBE_SERVER and GEOCUBE_CLIENTAPIKEY environment variables)
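
Since the notebook reads both settings from the environment, checking them before connecting gives a clearer error than a bare `KeyError`. The sketch below is one possible way to do that. `read_connection_settings` is a hypothetical helper, not part of the Geocube client, and treating a missing API key as an empty string is an assumption based on the key usually being empty for local use.

```python
import os


def read_connection_settings(env=None):
    """Return (server, api_key) read from an environment-like mapping."""
    env = os.environ if env is None else env
    server = env.get("GEOCUBE_SERVER", "").strip()
    if not server:
        raise RuntimeError(
            "GEOCUBE_SERVER is not set (e.g. 127.0.0.1:8080 for local use)"
        )
    # Assumption: an absent API key is equivalent to an empty one
    api_key = env.get("GEOCUBE_CLIENTAPIKEY", "")
    return server, api_key


# Example with an explicit mapping instead of the real environment
print(read_connection_settings({"GEOCUBE_SERVER": "127.0.0.1:8080"}))
# ('127.0.0.1:8080', '')
```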

-------

**Installation**

-------

Install Python client:

```shell
pip install --user git+https://github.com/airbusgeo/geocube-client-python.git
```

Run docker (example):
```shell
docker run --rm --network=host -v $(pwd)/inputs:$(pwd)/inputs geocube -dbConnection=postgresql://user:password@localhost:5432/geocube -local
```

## 1 - Connect to the Geocube


In [None]:
import glob
import math
import os
from datetime import datetime
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 16]

In [None]:
from geocube import Client, entities, utils

# Define the connection to the server
secure = False  # False for a local server, True to use TLS
geocube_client_server  = os.environ['GEOCUBE_SERVER']        # e.g. 127.0.0.1:8080 for local use
geocube_client_api_key = os.environ['GEOCUBE_CLIENTAPIKEY']  # Usually empty for local use

# Connect to the server
client = Client(geocube_client_server, secure, geocube_client_api_key)

## 2 - Indexation in a nutshell

In the Geocube, an image (a **dataset**) is indexed by a **record** and an **instance** of a **variable**.
These concepts will soon be defined in detail, but in short, a record defines the data-take and the variable defines the kind of data.

<img src="GetImage.png" width=400>

Adding an image in the Geocube is a process called *indexation*.

In [None]:
print("It's very easy to add an image in the Geocube:")

print("Create the AOI of the record")
aoi_id = client.create_aoi(utils.read_aoi('inputs/UTM32UNG.json'), exist_ok=True)

print("Create the record")
record = client.create_record(aoi_id, "MyFirstRecord", {"source":"tutorial"}, datetime.now(), exist_ok=True)

print("Create the variable and instantiate it")
instance = client.create_variable("MyFirstVariable", "i2,-1,0,255", [''], exist_ok=True).instantiate("AnInstance", {})

print("And finally, index the image")
client.index_dataset(os.getcwd()+'/inputs/myFirstImage.tif', record, instance, instance.dformat, bands=[1])

print("You did it !")


## 3 - Records
A record defines a data-take by its geometry, its sensing time and user-defined tags that describe the context in more detail.
A record is usually linked to an image of a satellite. For example, the image taken by S2A over the 31TDL tile on the 1st of April 2018 is described by the record:
- **S2A_MSIL1C_20180401T105031_N0206_R051_T31TDL_20180401T144530**
    * **AOI** : _31TDL tile (POLYGON ((2.6859855651855 45.680294036865, 2.811126708984324 45.680294036865, 2.811126708984324 45.59909820556617, 2.6859855651855 45.59909820556617, 2.6859855651855 45.680294036865)))_
    * **DateTime** : _2018-04-01 10:50:31_

But a record can also describe any other product, such as a mosaic over a country or a decade mosaic:
- **Mosaic of France January 2020**
    * **AOI** : _France_
    * **DateTime** : _2020-01-31 00:00:00_

<img src="RecordsSeveralLayers.png" width=400>

### Create an AOI

A record is linked to an AOI (that can be shared between several records). Before creating the record, its AOI must be created with the function `create_aoi` taking a geometry in **geographic coordinates** as input.

If the AOI already exists, `create_aoi` raises an error. The ID of the existing AOI can be retrieved from the details of the error.
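
Retrieving the ID this way relies on it being the last space-separated token of the error details. A minimal sketch of that extraction follows. The message text is made up for illustration, the real wording returned by the server may differ, and `last_token` is a hypothetical helper.

```python
def last_token(details):
    """Return the text after the last space (the whole string if there is none)."""
    return details.rsplit(" ", 1)[-1]


# Hypothetical error details, for illustration only
details = "aoi already exists with id 0a1b2c3d"
print(last_token(details))  # 0a1b2c3d
```

Unlike `str.rindex`, `rsplit` does not raise when the message contains no space, so the helper degrades to returning the full message.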

In [None]:
aoi = utils.read_aoi('inputs/UTM32UNG.json')
try:
    aoi_id = client.create_aoi(aoi)
    print("AOI created with id "+aoi_id)
except utils.GeocubeError as e:
    aoi_id = e.details[e.details.rindex(' ')+1:]
    print("AOI already exists with id: "+aoi_id)


### Create a record
The `create_record` function creates a new record (`create_records` creates several at once). A record is uniquely defined by:
- `name`
- `tags`: user-defined, depending on the project (currently, no standard is implemented).
- `datetime`
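
Sentinel-2 product names such as the one used in this tutorial embed the sensing time right after the product level, so the `datetime` of a record can be derived from its `name`. The sketch below assumes that layout (`<mission>_<level>_<YYYYMMDDTHHMMSS>_...`). `sensing_time` is a hypothetical helper, not a client function.

```python
from datetime import datetime


def sensing_time(product_name):
    """Parse the third underscore-separated field as YYYYMMDDTHHMMSS."""
    timestamp = product_name.split("_")[2]
    return datetime.strptime(timestamp, "%Y%m%dT%H%M%S")


name = "S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528"
print(sensing_time(name))  # 2019-01-18 10:43:59
```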

In [None]:
name = "S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528"
tags = {"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2B", "user-defined-tag": "whatever is necessary to search for this record"}
date = datetime(2019, 1, 18, 10, 43, 59, 0, None)

record_id = client.create_record(aoi_id, name, tags, date, exist_ok=True)
print("Record created with id: ", record_id)


## 4 - Variables
A variable describes the kind of data stored in a product, for example _a spectral band, the NDVI, an RGB image, the backscatter, the coherence_...

This entity has what is needed to **describe**, **process** and **visualize** the product.

In particular, the variable has a `dformat` (for _data format_ ):
- dformat.dtype   : _data type_
- dformat.min     : theoretical minimum value
- dformat.max     : theoretical maximum value
- dformat.no_data : the NoData value
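
Elsewhere in this notebook the same four fields are also written as a compact string, for example `"i2,-1,0,255"`. Judging from those calls, the order appears to be `dtype,no_data,min,max`. The sketch below converts such a string into the dictionary form under that assumption. `parse_dformat` is a hypothetical helper, not a client function.

```python
def parse_dformat(text):
    """Split 'dtype,no_data,min,max' into a dictionary (assumed field order)."""
    dtype, no_data, min_value, max_value = (part.strip() for part in text.split(","))
    return {
        "dtype": dtype,
        "no_data": float(no_data),
        "min_value": float(min_value),
        "max_value": float(max_value),
    }


print(parse_dformat("i2,-1,0,255"))
# {'dtype': 'i2', 'no_data': -1.0, 'min_value': 0.0, 'max_value': 255.0}
```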

In the Geocube database, the internal data format of an indexed image may differ from that of the variable (for example, to reduce storage costs). When the data is retrieved, the Geocube maps the internal format to the data format of the variable. This mapping may produce values below the minimum or above the maximum of the variable; in that case, the values are not clipped.

### Create a variable

In [None]:
print("Create a variable that describes an RGB product")
variable_name = "RGB"
variable = client.create_variable(
    name=variable_name,
    dformat={"dtype":"i2", "no_data": -1, "min_value": 0, "max_value": 255},
    bands=['R', 'G', 'B'],
    description="",
    unit="",
    resampling_alg=entities.Resampling.bilinear, exist_ok=True)

print(variable)

print("Create a variable that describes an NDVI product")
variable_name = "NDVI"
try:
    variable = client.create_variable(
        name=variable_name,
        dformat={"dtype":"f4", "no_data": np.nan, "min_value": -1, "max_value": 1},
        bands=[''],
        description="Normalized Difference Vegetation Index",
        resampling_alg=entities.Resampling.bilinear)
except utils.GeocubeError as e:
    print(e.codename + " " + e.details)
    variable = client.variable(name=variable_name)
print(variable)

### Instantiate a variable

An instance is a variant of a variable, defined by a specific set of processing parameters.

For example, an RGB variable can be defined with different spectral bands (the RGB bands of Sentinel-2 are not the same as Landsat's), and a Label variable can have different mappings. SAR products can be processed with different processing graphs or software, but they all belong to the same variable.

The processing parameters can be provided in the metadata field of the instance.

`client.variable("RGB").instantiate("Sentinel2-Raw-Bands", {"R":"664.6", "G":...})`
`client.variable("LandUseLabels").instantiate("v1", {"0":"Undefined","1":"Urban", "2":...})`
`client.variable("Sigma0VV").instantiate("terrain-corrected", {"snap_graph_name":"mygraph.xml", ...})`


In [None]:
try:
    instance=client.variable("RGB").instantiate("master", {"any-metadata": "(for information purpose)"})
except utils.GeocubeError as e:
    print(e.codename + " " + e.details)
    instance = client.variable("RGB").instance("master")
print(instance)

try:
    instance=client.variable("NDVI").instantiate("master", {"any-metadata": "(for information purpose)"})
except utils.GeocubeError as e:
    print(e.codename + " " + e.details)
    instance = client.variable("NDVI").instance("master")
print(instance)

## 5 - Dataset

As we saw in the introduction, we have all we need to index an image in the Geocube. Such an image is called a **dataset**.

To index an image, we also have to define:
- which band(s) are indexed (usually all the bands, but it can be a subset)
- how to map the value of its pixels to the dataformat of the variable.

For the second point, we define:
- the dataformat of the dataset (`dformat.[no_data, min, max]`), which describes the pixels of the image
- the mapping from each pixel to the data format of the variable (`variable.dformat`). This mapping is defined as `[MinOut, MaxOut, Exponent]`. See the diagram below:

NB:
- **`dataset.min` and `dataset.max` are NOT necessarily the minimum and maximum values of the pixels, but the minimum and maximum possible values.**
- `index_dataset()` **does not perform any transformation on the image** (all the information provided during the indexation is for the interpretation of the image - by the Geocube or the user) and is idempotent.

<a name="diagram"></a><img src="InternalDFormatToVariableDFormat.png" width=800>
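
The mapping can be sketched in a few lines of Python. The formula used here is an assumption inferred from the parameter names `[MinOut, MaxOut, Exponent]`: the pixel is normalized over the range of the dataset, raised to the exponent, then rescaled to `[min_out, max_out]`. `to_variable_value` is a hypothetical helper and does not come from the Geocube code base.

```python
def to_variable_value(pixel, ds_min, ds_max, min_out, max_out, exponent=1.0):
    """Map a stored pixel value to the data format of the variable (assumed formula)."""
    normalized = (pixel - ds_min) / (ds_max - ds_min)
    return min_out + (max_out - min_out) * normalized ** exponent


# int16 data in [-10000, 10000] mapped linearly to [-1, 1]
print(to_variable_value(5000, -10000, 10000, -1, 1))  # 0.5

# A pixel above the declared maximum is extrapolated, not clipped
print(to_variable_value(300, 0, 255, 0, 1) > 1)  # True
```

The second call illustrates why the declared minimum and maximum should be the possible bounds of the data: values outside them are mapped outside the range of the variable.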

### Index a dataset (common case)
The dataformat of the dataset is generally the same as the one of the variable.

In [None]:
# Define URI, record and variable.instance
uri = os.getcwd() + "/inputs/S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528.tif"
record = client.list_records("S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528")[0]
instance = client.variable(name="RGB").instance("master")

# In that case, the dformat of the dataset is the same as the one of the variable
dataset_dformat = instance.dformat

client.index_dataset(uri, record, instance, dataset_dformat, bands=[1,2,3])
print("Done !")

### Index a dataset (Storage optimisation)
In order to optimize the storage of a large volume of data, it can be decided to reduce the size of the data type (for example from float32 to int16) and/or scale the data.

So, the dataformat of the dataset can differ from that of the variable in some ways:
- **For compression purposes**:
     1. the data type is smaller. For example, the data is encoded in bytes [0, 255] that map to floats [0, 1] in the variable.
- **To optimize accuracy**: the range of values is smaller than that of the variable. Two examples:
     2. Given a variable between -1 and 1, the data in a given image is known to lie in [0, 1]. To optimize accuracy, the data is encoded between 0 and 255 and `[min_out, max_out]` is [0, 1].
     3. Given a variable between 0 and 100, 90% of the data is known to be between 0 and 10. To optimize accuracy, the data is encoded between 0 and 255 and mapped non-linearly to [0, 100] with an exponent of 2, following the non-linear scaling in the [diagram](#diagram):

<img src="DataFormatExample.png" width=800>
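
To see why a non-linear mapping helps in the third case, one can count how many of the 256 byte codes land in the interval [0, 10] of a variable defined on [0, 100]. The sketch below assumes the mapping `min_out + (max_out - min_out) * (pixel / 255) ** exponent` and ignores the code reserved for NoData. `decode` and `codes_below` are hypothetical helpers.

```python
def decode(pixel, exponent, max_in=255, min_out=0.0, max_out=100.0):
    """Decode a byte value to the variable range (assumed formula)."""
    return min_out + (max_out - min_out) * (pixel / max_in) ** exponent


def codes_below(threshold, exponent):
    """Count the byte codes whose decoded value is at most `threshold`."""
    return sum(1 for pixel in range(256) if decode(pixel, exponent) <= threshold)


print(codes_below(10, exponent=1))  # 26
print(codes_below(10, exponent=2))  # 81
```

With an exponent of 2, roughly three times as many codes describe the interval where most of the data lies, at the cost of a coarser step near the maximum.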

NB: in the cells below, `dformat.dtype` is read from the file, hence the `"auto"` keyword.

In [None]:
print("NDVI variable is defined in the range [-1, 1]")
instance = client.variable(name="NDVI").instance("master")

print("Example 1: the datatype of the dataset has been encoded in int16, mapping [-10000, 10000] to [-1, 1]")
internal_dformat = {"dtype":"auto", "no_data": -10001, "min_value": -10000, "max_value": 10000}

try:
    tags = {"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2A"}
    date = datetime(2019, 2, 24, 10, 30, 19, 0, None)
    client.create_record(aoi_id, "S2A_MSIL1C_20190224T103019_N0207_R108_T32UNG_20190224T141253", tags, date)
except utils.GeocubeError:
    pass

uri = os.getcwd() + "/inputs/ndviS2A_MSIL1C_20190224T103019_N0207_R108_T32UNG_20190224T141253.tif"
record = client.list_records("S2A_MSIL1C_20190224T103019_N0207_R108_T32UNG_20190224T141253")[0]


client.index_dataset(uri, record, instance, internal_dformat, bands=[1])
print("Done !")


In [None]:
print("Example 2: this NDVI dataset is known to have no value below 0. Therefore, it has been encoded in uint8, mapping [0, 255] to [0, 1]")
internal_dformat = {"dtype":"auto", "no_data": 0, "min_value": 0, "max_value": 255}
min_out, max_out = 0, 1

uri = os.getcwd() + "/inputs/ndviS2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528.tif"
record = client.list_records("S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528")[0]

client.index_dataset(uri, record, instance, internal_dformat, bands=[1], min_out=min_out, max_out=max_out)
print("Done !")

In [None]:
print("Example 3: this NDVI dataset has most of its values in [0, 0.1]. Therefore, it has been encoded in uint8, mapping [0, 255] to [0, 1] with an exponent=2")
internal_dformat = {"dtype":"auto", "no_data": 0, "min_value": 0, "max_value": 255}
min_out, max_out, exponent = 0, 1, 2

uri = os.getcwd() + "/inputs/ndviS2A_MSIL1C_20190224T103019_N0207_R108_T32UNG_20190224T141253_2.tif"
record = client.list_records("S2A_MSIL1C_20190224T103019_N0207_R108_T32UNG_20190224T141253")[0]

client.index_dataset(uri, record, instance, internal_dformat, bands=[1], min_out=min_out, max_out=max_out, exponent=exponent)
print("Done !")

## 6 - Index a list of datasets

In [None]:
filepaths = list(glob.glob(os.getcwd() + "/inputs/S2B_MSIL1C*.tif"))

print("Create all records")
records_name = []
records_tags = []
records_date = []
records_aoi = []

for filepath in filepaths:
    # This record already exists
    if 'S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528' in filepath:
        continue
    # splitext drops only the extension; str.strip(".tif") would strip characters, not a suffix
    record_name = os.path.splitext(os.path.basename(filepath))[0]
                 
    records_name.append(record_name)
    # The sensing time (YYYYMMDDTHHMMSS) follows the "S2B_MSIL1C_" prefix
    records_date.append(datetime.strptime(record_name[11:26], "%Y%m%dT%H%M%S"))
    records_aoi.append(aoi_id)
    records_tags.append({"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2B"})

try:
    record_ids = client.create_records(records_aoi, records_name, records_tags, records_date)
    print(f"{len(record_ids)} records added")
except utils.GeocubeError as e:
    print(e.codename + " " + e.details)

print("Index all datasets")
records = client.list_records(tags={"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2B"})
record_map = {record.name: record.id for record in records}
instance = client.variable(name="RGB").instance("master")
containers = []
for filepath in filepaths:
    # Find the record
    record_id = record_map[os.path.splitext(os.path.basename(filepath))[0]]
    # Create the container (dformat is the one of the variable)
    containers.append(entities.Container.new(filepath, record_id,
                                             instance = instance.instance_id, 
                                             bands=[1, 2, 3],
                                             dformat=instance.dformat,
                                             min_out=0, max_out=255))

client.index(containers)
print("Done !")
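
Record names in this section are derived from file names by dropping the `.tif` extension. A common shortcut, `str.strip(".tif")`, removes any of the characters `.`, `t`, `i` and `f` from both ends of the string rather than the suffix, so it only works when the stem neither starts nor ends with one of them. A sketch of a safer helper follows (`record_name_from_path` is a hypothetical name):

```python
import os


def record_name_from_path(filepath):
    """Return the file name without its directory and extension."""
    return os.path.splitext(os.path.basename(filepath))[0]


path = "inputs/S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528.tif"
print(record_name_from_path(path))
# S2B_MSIL1C_20190118T104359_N0207_R008_T32UNG_20190118T123528

# str.strip removes characters, not a suffix: this stem disappears entirely
print(repr("fit.tif".strip(".tif")))  # ''
print(record_name_from_path("fit.tif"))  # fit
```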

## 7 - Conclusion
In this notebook, you have learnt to create AOIs, records and variables, instantiate a variable and index a dataset.


To populate the Geocube for the next tutorial, we will index some additional datasets:

In [None]:
instance = client.variable("NDVI").instance("master")
try:
    aoi_id = client.create_aoi(utils.read_aoi('inputs/UTM32VNH.json'))
    record = client.create_record(aoi_id, "S2B_MSIL1C_20190105T103429_N0207_R108_T32VNH_20190105T122413", {"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2B"}, datetime(2019,1,5,10,34,29))
    client.index_dataset(os.getcwd()+'/inputs/ndviS2B_MSIL1C_20190105T103429_N0207_R108_T32VNH_20190105T122413.tif', record, instance, "auto,-1001,-1,1", bands=[1])
except utils.GeocubeError as e:
    print("It seems that you already did it !")

try:
    aoi_id = client.create_aoi(utils.read_aoi('inputs/UTM32VNJ.json'))
    record = client.create_record(aoi_id, "S2B_MSIL1C_20190105T103429_N0207_R108_T32VNJ_20190105T122413", {"source":"tutorial", "constellation":"SENTINEL2", "satellite":"SENTINEL2B"}, datetime(2019,1,5,10,34,29))
    client.index_dataset(os.getcwd()+'/inputs/ndviS2B_MSIL1C_20190105T103429_N0207_R108_T32VNJ_20190105T122413.tif', record, instance, "auto,-1001,-1,1", bands=[1])
except utils.GeocubeError as e:
    print("It seems that you already did it !")