## Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import geopandas as gpd
import json
import matplotlib.pyplot as plt
import numpy as np
import semantique as sq
import xarray as xr

In [3]:
# Load a mapping.
with open("../files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

# Represent an EO data cube.
with open("../files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "../files/layers_gtiff.zip")

# Set the spatio-temporal extent.
space = sq.SpatialExtent(gpd.read_file("../files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

## How the cache works

Data layers should only be cached in RAM if they are needed again when evaluating downstream parts of the recipe. Deciding this requires foresight about the execution order of the recipe, which in turn requires a simulated run before the actual execution. This simulated run is performed by the `FakeProcessor`. It resolves the data references and stores them in the cache as a list, in the order in which they are evaluated. During the actual execution of the recipe, this list is used to decide which data layers to keep in the cache and to read them from there when they are needed again.
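The idea can be illustrated with a minimal, self-contained sketch. It is not semantique's implementation: the class `LookAheadCache`, its `load` method and the `reader` callback are hypothetical names chosen for illustration. The sketch only assumes that the sequence of references is known in advance, as it is after a simulated run.

```python
from collections import Counter


class LookAheadCache:
    """Toy cache driven by a pre-recorded sequence of data references.

    Hypothetical illustration only, not the semantique cache class.
    """

    def __init__(self, seq):
        # Number of times each reference will still be requested.
        self.remaining = Counter(seq)
        self.data = {}

    def load(self, ref, reader):
        self.remaining[ref] -= 1
        if ref in self.data:
            # Cache hit: keep the layer only if a later step needs it.
            if self.remaining[ref] > 0:
                return self.data[ref]
            return self.data.pop(ref)
        # Cache miss: read from the source.
        layer = reader(ref)
        if self.remaining[ref] > 0:
            # Store only layers that will be requested again.
            self.data[ref] = layer
        return layer


# Sequence as a simulated run might record it (made-up example).
seq = [("entity", "water"), ("entity", "vegetation"), ("entity", "water")]

reads = []  # tracks which references actually hit the data source


def reader(ref):
    reads.append(ref)
    return f"layer for {ref[1]}"


cache = LookAheadCache(seq)
layers = [cache.load(ref, reader) for ref in seq]
print(reads)       # water and vegetation are each read from the source once
print(cache.data)  # empty: nothing is kept beyond its last use
```

In this toy version a layer occupies memory only between its first and its last use, which is the benefit of knowing the evaluation order beforehand.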

In [4]:
from semantique.processor.core import FakeProcessor, QueryProcessor

# define a simple recipe for a cloudfree composite
recipe = sq.QueryRecipe()
red_band = sq.reflectance("s2_band04")
green_band = sq.reflectance("s2_band03")
blue_band = sq.reflectance("s2_band02")
recipe["composite"] = sq.collection(red_band, green_band, blue_band).\
    filter(sq.entity("cloud").evaluate("not")).\
    reduce("median", "time").\
    concatenate("band")

# define context 
context = {
    "datacube": dc, 
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035, 
    "tz": "UTC", 
    "spatial_resolution": [-10, 10],
}

In [5]:
# step I: fake run
fp = FakeProcessor.parse(recipe, **context)
fp.optimize().execute()
fp.cache.seq

[('reflectance', 's2_band04'),
 ('reflectance', 's2_band03'),
 ('reflectance', 's2_band02'),
 ['atmosphere', 'colortype']]

In [6]:
# step II: query processor execution
qp = QueryProcessor.parse(recipe, **{**context, "cache": fp.cache})
result = qp.optimize().execute()
result["composite"].shape

(3, 563, 576)

As you can see, the FakeProcessor run resolves the references to the data layers by looking up the entities' definitions in mapping.json. In the current case the result is not that interesting, though, because the recipe requires four different data layers and none of them is referenced more than once. There is therefore nothing to cache during recipe execution, and the QueryProcessor loads all data layers from the referenced sources without storing any of them in the cache.
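Whether a recorded sequence offers anything to cache can be checked by counting repeated references. The sketch below is only an illustration: it copies the sequence printed above into a plain list instead of reading it from `fp.cache.seq`, so it runs without semantique, and it converts every entry to a tuple because the last entry is a list and therefore not hashable.

```python
from collections import Counter

# Sequence copied from the output of the simulated run above.
seq = [
    ("reflectance", "s2_band04"),
    ("reflectance", "s2_band03"),
    ("reflectance", "s2_band02"),
    ["atmosphere", "colortype"],
]

# Count how often each reference occurs.
counts = Counter(tuple(ref) for ref in seq)

# References occurring more than once are the only caching candidates.
repeated = [ref for ref, n in counts.items() if n > 1]
print(repeated)  # [] -> no layer is requested twice
```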

As a user, you do not have to run both steps yourself. You can initiate the entire caching workflow (preview run and full-resolution recipe execution) directly by passing `cache_data = True` when calling `recipe.execute(...)`. This is enabled by default.

In [7]:
# same as above in a single step 
result = recipe.execute(**{**context, "cache_data": True})

## Assessment of cache performance

Now let's analyse the timing differences between executing a recipe with and without caching. The difference depends mainly on...
* the redundancy of the data references in the recipe, i.e. if layers are referenced multiple times, loading them from the cache reduces the overall time significantly
* the data source (EO data cube) from which the layers are loaded

Regarding the latter, note that this demo only analyses data loaded from a locally stored GeoTIFF archive (i.e. the `GeotiffArchive` layout). This is more or less the worst case for demonstrating the benefits of caching, since the data is stored locally and is therefore quickly accessible.

Consequently, you will observe that in most of the following cases caching actually adds a small computational overhead. The exception is recipe III, which references the same layers more than once. Keep in mind, however, that caching is designed for, and particularly beneficial with, the STACCube, where data is loaded over the internet.

In [8]:
# function to compare timing for given recipe 
def eval_timing(recipe, caching=False):
    context = {
        "datacube": dc, 
        "mapping": mapping,
        "space": space,
        "time": time,
        "crs": 3035, 
        "tz": "UTC", 
        "spatial_resolution": [-10, 10],
        "cache_data": caching
    }
    # return the result so that `_ = eval_timing(...)` receives it
    return recipe.execute(**context)

In [9]:
# recipe I
recipe_I = sq.QueryRecipe()
red_band = sq.reflectance("s2_band04")
green_band = sq.reflectance("s2_band03")
blue_band = sq.reflectance("s2_band02")
recipe_I["composite"] = sq.collection(red_band, green_band, blue_band).\
    filter(sq.entity("cloud").evaluate("not")).\
    reduce("median", "time").\
    concatenate("band")

In [10]:
%%timeit
# without caching
_ = eval_timing(recipe_I, False)

640 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%%timeit
# with caching
_ = eval_timing(recipe_I, True)

703 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
# recipe II
recipe_II = sq.QueryRecipe()
recipe_II["dates"] = sq.entity("vegetation").\
    filter(sq.self()).\
    assign_time().\
    reduce("first", "time")

In [13]:
%%timeit
# without caching
_ = eval_timing(recipe_II, False)

5.28 s ± 72.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%%timeit
# with caching
_ = eval_timing(recipe_II, True)

5.51 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
# recipe III
recipe_III = sq.QueryRecipe()
recipe_III["water_count_time"] = sq.entity("water").reduce("count", "time")
recipe_III["vegetation_count_time"] = sq.entity("vegetation").reduce("count", "time")
recipe_III["water_count_space"] = sq.entity("water").reduce("count", "space")
recipe_III["vegetation_count_space"] = sq.entity("vegetation").reduce("count", "space")

In [16]:
%%timeit
# without caching
_ = eval_timing(recipe_III, False)

495 ms ± 7.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%%timeit
# with caching
_ = eval_timing(recipe_III, True)

283 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
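To put the measurements in perspective, the relative change caused by caching can be computed from the mean timings reported above. The snippet below is plain arithmetic on those reported means (converted to milliseconds). It does not re-run the recipes, and the variable names are only illustrative.

```python
# Mean timings from the %%timeit outputs above, in milliseconds:
# (without caching, with caching)
timings_ms = {
    "recipe I": (640.0, 703.0),
    "recipe II": (5280.0, 5510.0),
    "recipe III": (495.0, 283.0),
}

# Relative change in percent; positive values mean caching was slower.
changes = {
    name: (cached - uncached) / uncached * 100
    for name, (uncached, cached) in timings_ms.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f} %")
# recipe I: +9.8 %
# recipe II: +4.4 %
# recipe III: -42.8 %
```

Only recipe III, which references the water and vegetation layers twice each, runs faster with caching. The other two recipes pay a small overhead for the additional simulated run.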


More expressive examples for the STACCube are provided below. Whether caching brings significant advantages when loading data from a well-indexed OpenDataCube on quickly accessible hot storage remains to be assessed.