# Verbs

In this notebook we describe the different verbs that semantique offers. Remember that result instructions in query recipes can be formulated by combining basic building blocks into processing chains. These processing chains start with a reference. For a description of those, see the [References notebook](references.ipynb). At the query recipe construction stage, the reference is nothing more than a small piece of text. When executing the recipe, the query processor resolves this reference and evaluates it internally into a multi-dimensional array filled with data values. Several actions can then be applied to this array. These actions are all labeled with an action word that should intuitively describe the operation they perform. That is why we call them *verbs*. The same building blocks can also be used when constructing a set of mapping rules according to semantique's native mapping configuration.

## Content

- [Verbs for single arrays](#Verbs-for-single-arrays)
- [Verbs for collections of arrays](#Verbs-for-collections-of-arrays)
- [Split-apply-combine structures](#Split-apply-combine-structures)
- [Utility verbs](#Utility-verbs)

## Prepare

Import packages:

In [1]:
import semantique as sq

In [2]:
import geopandas as gpd
import pandas as pd
import numpy as np
import xarray as xr
import json
import copy

Set the context of query execution:

In [3]:
# Load a mapping.
with open("files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

# Represent an EO data cube.
with open("files/layout.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers.zip")

# Set the spatio-temporal extent.
space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

# Collect the full context.
# Including additional configuration parameters.
context = {
    "datacube": dc, 
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035, 
    "tz": "UTC", 
    "spatial_resolution": [-1800, 1800]
}

## Verbs for single arrays

Most verbs in semantique are verbs that apply an action to a single array. The currently implemented verbs in this category are:

- [Evaluate](#Evaluate): Evaluates an expression for each pixel in an array.
- [Extract](#Extract): Extracts dimension coordinates as a new one-dimensional array.
- [Filter](#Filter): Filters values from an array.
- [Assign](#Assign): Assigns a new value to each pixel in an array.
- [Reduce](#Reduce): Reduces the dimensionality of an array.
- [Shift](#Shift): Shifts array values a given number of steps along a dimension.
- [Smooth](#Smooth): Smooths array values by applying a moving window function.
- [Trim](#Trim): Trims the dimensions of an array.
- [Groupby](#Groupby): Splits an array into multiple groups.

### Evaluate

The evaluate verb evaluates an expression for each pixel in an array. These expressions can take many different forms, but each of them accepts the value of a specific pixel in the input array and applies some operator to it. The result of that operation is the new value of that particular pixel in the output array. That is, the output array always has the *same shape* as the input array. Below, we will show different forms of expressions, and the built-in [operators](https://zgis.github.io/semantique/reference.html#operator-functions) that semantique offers for them. Advanced users can also define their own custom operators, which is explained in the notebook on [Internal query processing](processor.ipynb#Adding-custom-operators).

You can specify an operator function simply by its name:

```python
sq.entity("water").evaluate("not")
```

To be autocomplete-friendly, you can also use built-in constants that refer to an operator function. These are stored in the operators module of semantique, and are nothing more than the name of the function:

```python
sq.entity("water").evaluate(sq.operators.NOT)
```

When using the evaluate verb, it is important to be aware of the *value type* of the input array(s). For example, they may contain nominal, ordinal, binary or numerical data. Different operators may only support specific (combinations of) value types. For details, see [here](processor.ipynb#Tracking-value-types).

#### Univariate expressions

The simplest form of expressions are the univariate expressions. For each pixel value $x_{i}$ in input array $X$, these expressions *only* consider $x_{i}$, and apply an operator to it. That is, the expression is of the following form.

$$
expression = operator(x_{i})
$$

This means that the output array has the same shape as the input array, and that each pixel value in the output array is the result of the univariate expression evaluated on the value of the corresponding pixel in the input array.

![evaluate_univariate](figures/evaluate_uni.png)

The built-in operators for this purpose include the **numerical univariate operators**, which are intended for usage on numerical arrays:

- `absolute`: Computes the absolute value of $x_{i}$.
- `floor`: Returns the largest integer $k$ such that $k \leq x_{i}$.
- `ceiling`: Returns the smallest integer $k$ such that $k \geq x_{i}$.
- `exponential`: Computes the exponential value of $x_{i}$, i.e. $e^{x_{i}}$.
- `natural_logarithm`: Computes the natural logarithm of $x_{i}$.
- `square_root`: Computes the square root of $x_{i}$.
- `cube_root`: Computes the cube root of $x_{i}$.

Next to those there are the **boolean univariate operators**, which are intended for usage on binary arrays:

- `not`: Returns $1$ if $x_{i} = 0$, and $0$ if $x_{i} \neq 0$.

Finally, there are two operators that allow you to locate nodata values in an array:

- `is_missing`: Returns $1$ if $x_{i}$ is a missing observation, and $0$ otherwise.
- `not_missing`: Returns $1$ if $x_{i}$ is a valid observation, and $0$ otherwise.

For example, applying the `not` operator to the translated semantic concept *water* marks all non-water pixels as 1 and all water pixels as 0.

In [4]:
recipe = sq.QueryRecipe()

In [5]:
recipe["water"] = sq.entity("water")
recipe["not_water"] = sq.result("water").evaluate("not")

In [6]:
out = recipe.execute(**context)

In [7]:
out["water"]

In [8]:
out["not_water"]
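
The per-pixel semantics of the univariate operators can be mimicked in plain Python. The sketch below is only an illustration and not how semantique implements them: the helper names `op_not`, `op_is_missing` and `apply_univariate` are hypothetical, nodata is assumed to be represented as NaN, and it is an assumption that `not` leaves nodata pixels untouched.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

def op_not(x):
    """Return 1 if x equals 0 and 0 otherwise; nodata stays nodata (assumption)."""
    if math.isnan(x):
        return NODATA
    return 1.0 if x == 0 else 0.0

def op_is_missing(x):
    """Return 1 if x is a missing observation and 0 otherwise."""
    return 1.0 if math.isnan(x) else 0.0

def apply_univariate(operator, array):
    """Apply the operator to every pixel; the output has the same shape as the input."""
    return [[operator(x) for x in row] for row in array]

water = [[1.0, 0.0], [NODATA, 1.0]]
not_water = apply_univariate(op_not, water)
missing = apply_univariate(op_is_missing, water)
```

Each output pixel depends only on the pixel at the same position in the input, which is why the shape never changes.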

#### Bivariate expressions

Bivariate expressions consist of a left-hand side value, a right-hand side value and an operator that in some way combines these values. For each pixel value $x_{i}$ in input cube $X$, the left-hand side value of the expression is $x_{i}$, and the right-hand side value of the expression is another value $y_{i}$.

$$
expression = x_{i} \; operator \; y_{i}
$$

The right-hand side value $y_{i}$ can be a constant, meaning that the same right-hand side value is used for all pixels in $X$.

![evaluate_constant](figures/evaluate_const.png)

The right-hand side value $y_{i}$ can also be a pixel value from another array $Y$. In that case, $Y$ should have the same shape as $X$ (or it should be possible to align it to that shape, see [below](#Aligning-two-arrays)), such that each pixel $x_{i} \in X$ has a *matching* pixel $y_{i} \in Y$, i.e. a pixel that has exactly the same coordinates for each dimension.

![evaluate_multivariate](figures/evaluate_multi.png)

The built-in operators for bivariate expressions can be subdivided into different categories. The **algebraic operators** are intended for usage on numerical arrays and perform an operation of arithmetic:

- `add`: Adds $y_{i}$ to $x_{i}$.
- `subtract`: Subtracts $y_{i}$ from $x_{i}$.
- `multiply`: Multiplies $x_{i}$ by $y_{i}$.
- `divide`: Divides $x_{i}$ by $y_{i}$.
- `power`: Raises $x_{i}$ to the $y_{i}$th power.
- `normalized_difference`: Calculates $\frac{x_{i} - y_{i}}{x_{i} + y_{i}}$.
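
To make the two forms of the right-hand side concrete, here is a minimal plain-Python sketch. It is not semantique's implementation: the `evaluate` helper and the band values are hypothetical, and arrays are simplified to flat lists of equal length.

```python
def normalized_difference(x, y):
    """Compute (x - y) / (x + y), as defined in the list above."""
    return (x - y) / (x + y)

def multiply(x, y):
    """Multiply x by y."""
    return x * y

def evaluate(operator, X, Y):
    """Evaluate a bivariate expression for each pixel value in X.

    Y is either a constant (reused for every pixel) or a list with
    the same length as X (matched pixel by pixel).
    """
    if isinstance(Y, (int, float)):
        return [operator(x, Y) for x in X]
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same shape")
    return [operator(x, y) for x, y in zip(X, Y)]

band_a = [0.75, 0.5, 0.25]  # hypothetical values
band_b = [0.25, 0.5, 0.25]  # hypothetical values
nd = evaluate(normalized_difference, band_a, band_b)
twice = evaluate(multiply, [1, 2, 3], 2)
```

Note that `normalized_difference` is undefined where $x_{i} + y_{i} = 0$; the sketch does not handle that case.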

In [9]:
recipe = sq.QueryRecipe()

In [10]:
recipe["count"] = sq.entity("vegetation").reduce("time", "count")
recipe["twice"] = sq.result("count").evaluate("multiply", 2)

In [11]:
out = recipe.execute(**context)

In [12]:
out["count"]

In [13]:
out["twice"]

The **relational operators** evaluate a condition. The result is always binary, i.e. true (1) when the condition holds and false (0) when it doesn't. Some of the conditions test for equality, and hence can be used on any array whenever the value type of the left-hand values is the same as the value type of the right-hand values:

- `equal`: Returns $1$ if $x_{i} = y_{i}$, and $0$ otherwise.
- `not_equal`: Returns $1$ if $x_{i} \neq y_{i}$, and $0$ otherwise.

The other conditions imply a fixed order among the values, and hence should not be used on nominal arrays:

- `greater`: Returns $1$ if $x_{i} > y_{i}$, and $0$ otherwise.
- `less`: Returns $1$ if $x_{i} < y_{i}$, and $0$ otherwise.
- `greater_equal`: Returns $1$ if $x_{i} \geq y_{i}$, and $0$ otherwise.
- `less_equal`: Returns $1$ if $x_{i} \leq y_{i}$, and $0$ otherwise.

In [14]:
recipe = sq.QueryRecipe()

In [15]:
recipe["count"] = sq.entity("vegetation").reduce("time", "count")
recipe["high"] = sq.result("count").evaluate("greater_equal", 2)

In [16]:
out = recipe.execute(**context)

In [17]:
out["count"]

In [18]:
out["high"]

The **boolean operators** are intended to be used in expressions involving two binary values.

- `and`: Returns $1$ when both $x_{i} \neq 0$ *and* $y_{i} \neq 0$, and $0$ otherwise.
- `or`: Returns $1$ when $x_{i} \neq 0$, $y_{i} \neq 0$, or both, and $0$ otherwise.
- `exclusive_or`: Returns $1$ when either $x_{i} \neq 0$ *or*  $y_{i} \neq 0$, but *not* both, and $0$ otherwise.
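
The truth tables of these three operators can be written down in a few lines of plain Python. This is only a sketch of the definitions above, with hypothetical helper names and arrays simplified to flat lists.

```python
def op_and(x, y):
    """Return 1 when both x and y are non-zero."""
    return int(x != 0 and y != 0)

def op_or(x, y):
    """Return 1 when x, y, or both are non-zero."""
    return int(x != 0 or y != 0)

def op_exclusive_or(x, y):
    """Return 1 when exactly one of x and y is non-zero."""
    return int((x != 0) != (y != 0))

water = [1, 1, 0, 0]
vegetation = [1, 0, 1, 0]

both = [op_and(x, y) for x, y in zip(water, vegetation)]
either = [op_or(x, y) for x, y in zip(water, vegetation)]
only_one = [op_exclusive_or(x, y) for x, y in zip(water, vegetation)]
```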

In [19]:
recipe = sq.QueryRecipe()

In [20]:
recipe["water"] = sq.entity("water")
recipe["vegetation"] = sq.entity("vegetation")
recipe["both"] = sq.result("water").evaluate("or", sq.result("vegetation"))

In [21]:
out = recipe.execute(**context)

In [22]:
out["water"]

In [23]:
out["vegetation"]

In [24]:
out["both"]

The **membership operators** are a special category of operators that test for set membership. That is, they test for each pixel value $x_{i}$ in cube $X$ if it is or is not member of a set of values $Y$. $Y$ may be another array, but in that case it will be treated as being a set. That is, for each pixel value $x_{i} \in X$ it will be tested if it occurs anywhere in array $Y$.

- `in`: Returns $1$ if $x_{i} \in Y$, and $0$ otherwise. Here, $Y$ is a finite set of values that remains constant for each $x_{i}$.
- `not_in`: Returns $1$ if $x_{i} \notin Y$, and $0$ otherwise. Here, $Y$ is a finite set of values that remains constant for each $x_{i}$.

To represent a set, you can use the [set()](https://zgis.github.io/semantique/_generated/semantique.set.html) function of semantique. Alternatively, you can use Python's built-in set, list or tuple objects.

In [25]:
recipe = sq.QueryRecipe()

In [26]:
recipe["colors"] = sq.appearance("colortype")
recipe["water"] = sq.result("colors").evaluate("in", sq.set(21, 22, 23, 24))

In [27]:
out = recipe.execute(**context)

In [28]:
out["colors"]

In [29]:
out["water"]

To make your life easier, semantique also includes the [interval()](https://zgis.github.io/semantique/_generated/semantique.value_label.html) function, allowing you to specify a set of values as an interval between a lower bound and an upper bound. The interval is assumed to be closed, meaning that both the lower and upper bounds are included. Do note that intervals are only supported for numerical or ordinal data.

In [30]:
recipe["water"] = sq.result("colors").evaluate("in", sq.interval(21, 24))

In [31]:
out = recipe.execute(**context)

In [32]:
out["water"]
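
The difference between an explicit set and a closed interval can be sketched in plain Python. The `ClosedInterval` class and the `op_in` and `op_not_in` helpers are hypothetical stand-ins for illustration only; they are not semantique objects.

```python
class ClosedInterval:
    """Hypothetical stand-in for an interval that includes both bounds."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def __contains__(self, x):
        # Closed interval: the lower and the upper bound are both members.
        return self.lower <= x <= self.upper

def op_in(x, members):
    """Return 1 if x is a member of the given set, and 0 otherwise."""
    return int(x in members)

def op_not_in(x, members):
    """Return 1 if x is not a member of the given set, and 0 otherwise."""
    return int(x not in members)

colors = [20, 21, 24, 25]
in_set = [op_in(x, {21, 22, 23, 24}) for x in colors]
in_interval = [op_in(x, ClosedInterval(21, 24)) for x in colors]
outside = [op_not_in(x, ClosedInterval(21, 24)) for x in colors]
```

For integer category indices the set and the interval select the same pixels, since the bounds 21 and 24 are both included.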

When the values in $X$ have labels (i.e. when category indices are mapped to category labels), you can also use the labels as set members, instead of the indices. For that, use the [label()](https://zgis.github.io/semantique/_generated/semantique.label.html) function:

In [33]:
labels = [sq.label(x) for x in ["DPWASH", "SLWASH", "TWASH", "SASLWA"]]
recipe["water"] = sq.result("colors").evaluate("in", labels)

In [34]:
out = recipe.execute(**context)

In [35]:
out["water"]

The **temporal relational operators** are relational operators specifically designed to deal with *time instants* and *time intervals* as operand values. Hence, they require each pixel value $x_{i}$ in input array $X$ to be a time instant. The right-hand side of the expression can either be a single time instant, a time interval (i.e. a list of two time instants representing the start and end of an interval) or another array $Y$ in which each pixel value $y_{i}$ is a time instant. The latter will in practice be evaluated as a time interval with the earliest time instant being the start of the interval, and the latest time instant the end of the interval.

The currently implemented temporal relational operators are:

- `after`: When $y$ is a time instant: returns $1$ if $x_{i} > y$, and $0$ otherwise. When $y$ is a time interval: returns $1$ if $x_{i} > max(y)$, and $0$ otherwise.
- `before`: When $y$ is a time instant: returns $1$ if $x_{i} < y$, and $0$ otherwise. When $y$ is a time interval: returns $1$ if $x_{i} < min(y)$, and $0$ otherwise.
- `during`: Returns $1$ if $min(y) \leq x_{i} \leq max(y)$, and $0$ otherwise. Only intended for time intervals as $y$.
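
The definitions above can be sketched with the standard library's `datetime` objects. This is an illustration under simplifying assumptions, not semantique's implementation: the helper names are hypothetical, and a time interval is represented as a list or tuple of two instants.

```python
from datetime import datetime

def _is_interval(y):
    # Assumption: an interval is a list or tuple of two time instants.
    return isinstance(y, (list, tuple))

def after(x, y):
    """Return 1 if x lies after instant y, or after the end of interval y."""
    return int(x > (max(y) if _is_interval(y) else y))

def before(x, y):
    """Return 1 if x lies before instant y, or before the start of interval y."""
    return int(x < (min(y) if _is_interval(y) else y))

def during(x, y):
    """Return 1 if x lies within the closed interval y."""
    return int(min(y) <= x <= max(y))

times = [datetime(2019, 6, 1), datetime(2020, 1, 1), datetime(2020, 12, 31)]
year_2020 = (datetime(2020, 1, 1), datetime(2020, 12, 31))

early = [before(t, datetime(2019, 12, 31)) for t in times]
late = [during(t, year_2020) for t in times]
past = [after(t, year_2020) for t in times]
```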

To construct time instants and time intervals for usage as right-hand operand, you can use the [time_instant()](https://zgis.github.io/semantique/_generated/semantique.time_instant.html) and [time_interval()](https://zgis.github.io/semantique/_generated/semantique.time_interval.html) functions that semantique provides. The first expects a single datetime value, while the latter expects two datetime values (i.e. the start and end of the interval, with the interval being closed at both sides). You can provide datetimes in formats such as "2020-12-31" or "2020/12/31", but also complete ISO8601 timestamps such as "2020-12-31T14:37:22". As long as the [Timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html) initializer of the [pandas](https://pandas.pydata.org/) package can understand it, it is supported by semantique. Any additional keyword arguments will be forwarded to this initializer.

In [36]:
recipe = sq.QueryRecipe()

In [37]:
recipe["times"] = sq.entity("water").extract("time")
recipe["early"] = sq.result("times").evaluate("before", sq.time_instant("2019-12-31"))
recipe["late"] = sq.result("times").evaluate("during", sq.time_interval("2020-01-01", "2020-12-31"))

In [38]:
out = recipe.execute(**context)

In [39]:
out["early"]

In [40]:
out["late"]

The **spatial relational operators** are relational operators specifically designed to deal with *geometries* as operand values. They require each pixel value $x_{i}$ in input array $X$ to be a tuple of spatial (x,y) coordinates. The right-hand side of the expression can either be one or more spatial geometries, or another array $Y$ in which each pixel value $y_{i}$ is a coordinate tuple. The latter will in practice be evaluated as a geometry covering the spatial bounding box of the array.

The currently implemented spatial relational operators are:

- `intersects`: Returns $1$ if the spatial point with the coordinates of $x_{i}$ spatially intersects with geometry $y$, and $0$ otherwise.

To construct a spatial geometry for usage as right-hand operand, you can use the [geometry()](https://zgis.github.io/semantique/_generated/semantique.geometry.html) function that semantique offers. It expects an object that can be read by the [GeoDataFrame](https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html) initializer of the [geopandas](https://geopandas.org/en/stable/) package. Any additional keyword arguments will be forwarded to this initializer. In practice, this means you can read any GDAL-supported file format with [geopandas.read_file()](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html), and then use that object to create spatial geometries.

In [41]:
recipe = sq.QueryRecipe()

In [42]:
parcels = gpd.read_file("files/parcels.geojson")
recipe["coords"] = sq.entity("water").extract("space")
recipe["in_parcel"] = sq.result("coords").evaluate("intersects", sq.geometry(parcels))

In [43]:
out = recipe.execute(**context)

In [44]:
out["in_parcel"]

#### Aligning two arrays

In bivariate expressions involving two arrays, the second array $Y$ does not necessarily have to be of the same shape as input array $X$. Instead, it should be possible to *align* it to that shape. This can be done in two ways. 

First consider the case where $Y$ has the same dimensions as $X$, but not all coordinate values of $X$ are present in $Y$. In that case, we can align $Y$ with $X$ such that pixel values at position $i$ in both arrays, i.e. $x_{i}$ and $y_{i}$ respectively, belong to pixels with the *same coordinates*. If $y_{i}$ was not originally part of $Y$, we assign it a nodata value. This also works vice-versa, with coordinate values in $Y$ that are not present in $X$.

![evaluate_align_same_dims](figures/evaluate_align1.png)

Secondly, consider a case where $Y$ has one or more dimensions with exactly the same coordinate values as $X$, but does not have *all* the dimensions that $X$ has. In that case, we can align $Y$ with $X$ by duplicating its values along those dimensions that are missing. This does *not* work vice versa. When cube $Y$ has more dimensions than cube $X$, there is no clear way to define how to subset the values in $Y$ to match the shape of $X$.

![evaluate_align_missing_dims](figures/evaluate_align2.png)

Alignment is something you have to be aware of to understand how bivariate expressions are evaluated. However, it is not something you have to do for yourself. Internally, the query processor takes care of it.
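
For intuition, the two alignment cases can be mimicked in plain Python by representing an array as a dictionary that maps coordinates to values. This is a sketch under simplifying assumptions, not the query processor's actual code: the helper names are hypothetical and nodata is assumed to be NaN.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

def align_same_dims(X, Y):
    """Case 1: same dimension, different coordinates.

    Both arrays are extended to the union of their coordinates,
    and pixels that were not originally present get a nodata value.
    """
    coords = sorted(set(X) | set(Y))
    X_aligned = {c: X.get(c, NODATA) for c in coords}
    Y_aligned = {c: Y.get(c, NODATA) for c in coords}
    return X_aligned, Y_aligned

def align_missing_dim(X, Y):
    """Case 2: Y only has the time dimension, X has time and space.

    The values of Y are duplicated along the dimension it is missing.
    """
    return {(t, s): Y[t] for (t, s) in X}

# Case 1: one-dimensional arrays keyed by year.
X1, Y1 = align_same_dims({2019: 1.0, 2020: 2.0}, {2020: 10.0, 2021: 20.0})

# Case 2: X is keyed by (year, location), Y only by year.
X2 = {(2019, "a"): 1.0, (2019, "b"): 2.0, (2020, "a"): 3.0, (2020, "b"): 4.0}
Y2 = align_missing_dim(X2, {2019: 10.0, 2020: 20.0})
```

After alignment, every key in `X1` has a counterpart in `Y1`, and every key in `X2` has a counterpart in `Y2`, so a bivariate expression can be evaluated pixel by pixel.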

### Extract

The extract verb extracts the coordinates of a specified dimension from an array, and returns them as a new, one-dimensional array.

![extract](figures/extract.png)

In [45]:
recipe = sq.QueryRecipe()

In [46]:
recipe["time"] = sq.entity("water").extract("time")
recipe["space"] = sq.entity("water").extract("space")

In [47]:
out = recipe.execute(**context)

In [48]:
out["time"]

In [49]:
out["space"]

Coordinate values of some dimensions may consist of multiple components. For example, the spatial dimension contains coordinate tuples that consist of x and y coordinates, and the time dimension contains timestamps that consist of a year, a month, a day, an hour, etc. If you are only interested in a single component of a dimension, you can specify that through the second, optional argument of the extract verb.

In [50]:
recipe = sq.QueryRecipe()

In [51]:
recipe["years"] = sq.entity("water").extract("time", "year")
recipe["xcoords"] = sq.entity("water").extract("space", "x")

In [52]:
out = recipe.execute(**context)

In [53]:
out["years"]

In [54]:
out["xcoords"]

### Filter

The filter verb filters values from an array. That is, the output array is a *subset* of the input array. Which values in the input array are kept, and which are removed, is defined by a second, binary array which we call the *filterer*. The filterer should have the same shape as the input array (or it should be possible to align it to that shape, see [below](#Filtering-by-dimension-coordinates)), such that each pixel $x_{i}$ in input cube $X$ has a *matching* pixel $y_{i}$ in filterer $Y$, i.e. a pixel that has exactly the same coordinates for each dimension. Then:

- $x_{i}$ is kept if $y_{i} \neq 0$.
- $x_{i}$ is removed if $y_{i} = 0$.

Here, *kept* means that the pixel value remains unchanged, while *removed* means that the pixel gets a nodata value assigned.

![filter](figures/filter.png)
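
The keep/remove rule can be sketched in plain Python as follows. The `filter_array` helper is hypothetical, arrays are simplified to flat lists, and nodata is assumed to be NaN; the sketch does not cover nodata values inside the filterer itself.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

def filter_array(X, Y):
    """Keep x where the matching filterer value y is non-zero, else set nodata."""
    return [x if y != 0 else NODATA for x, y in zip(X, Y)]

water = [1.0, 0.0, 1.0, 1.0]
not_cloud = [1, 1, 0, 1]
filtered = filter_array(water, not_cloud)
```

The output has the same length as the input: the third pixel is not dropped, it only loses its value.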

For example, we may filter only those pixels in the translated semantic concept *water* that were not covered by clouds.

In [55]:
recipe = sq.QueryRecipe()

In [56]:
recipe["water"] = sq.entity("water")
recipe["cloud"] = sq.entity("cloud")
recipe["not_cloud"] = sq.result("cloud").evaluate("not")
recipe["filtered"] = sq.result("water").filter(sq.result("not_cloud"))

In [57]:
out = recipe.execute(**context)

In [58]:
out["water"]

In [59]:
out["not_cloud"]

In [60]:
out["filtered"]

We could also filter a vegetation count map keeping only those pixels with a value above 1. We first evaluate that condition using the [evaluate()](#Evaluate) verb, and use that output as the filterer.

In [61]:
recipe = sq.QueryRecipe()

In [62]:
recipe["count"] = sq.entity("vegetation").reduce("time", "count")
recipe["high"] = sq.result("count").filter(sq.self().evaluate("greater", 1))

In [63]:
out = recipe.execute(**context)

In [64]:
out["count"]

In [65]:
out["high"]

#### Filtering by dimension coordinates

The filterer does not have to be of the same shape as the input array, as long as we can align it to that shape. See [this section](#Aligning-two-arrays) for more details on how this works. In practice, it means that we can also filter values in an array based on the coordinates of a dimension. All we have to do is construct a filterer for that dimension, that is, a binary, one-dimensional array that specifies for each coordinate whether the values of pixels having that coordinate should be kept (1) or removed (0).

![filter_align](figures/filter_align.png)
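
As a rough plain-Python sketch of this idea, assume a small array with one row per timestamp and one column per location. The one-dimensional filterer is built from the time coordinates and then applied to every pixel in the matching row. All names and values here are hypothetical, and nodata is assumed to be NaN.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

# Hypothetical input: rows are timestamps, columns are locations.
years = [2019, 2020, 2020]
values = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# One-dimensional binary filterer: one value per time coordinate.
filterer = [int(year == 2020) for year in years]

# Aligning the filterer duplicates each of its values along the space dimension.
filtered = [
    [value if keep else NODATA for value in row]
    for row, keep in zip(values, filterer)
]
```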

For example, when we only want to keep pixel values observed in 2020:

In [66]:
recipe = sq.QueryRecipe()

In [67]:
recipe["2020"] = sq.entity("water").filter(sq.self().extract("time", "year").evaluate("equal", 2020))

In [68]:
out = recipe.execute(**context)

In [69]:
out["2020"]

You can also use a handy shortcut for the above formulation: the **filter_time** verb. This verb allows you to apply a temporal filter *without* having to explicitly extract the time coordinates from the array and evaluate a comparison expression on them.

The filter_time verb is only a "shortcut" verb, not an independent verb on its own. This means that when calling the filter_time verb, it is internally translated into a textual query recipe containing the self reference and the extract and evaluate verbs instead. In the same way, you can also use the shortcut verb **filter_space** for spatial filters.

In [70]:
recipe["2020"] = sq.entity("water").filter_time("year", "equal", 2020)

In [71]:
out = recipe.execute(**context)

In [72]:
out["2020"]

#### Self-filtering

A special type of a filtering operation is self-filtering, i.e. filtering an array by itself. In this case, the input array should be binary. In the output, the "true" values will be preserved, while the "false" values are removed.

![filter_self](figures/filter_self.png)

In [73]:
recipe = sq.QueryRecipe()

In [74]:
recipe["vegetation"] = sq.entity("vegetation")
recipe["true_vegetation"] = sq.result("vegetation").filter(sq.self())

In [75]:
out = recipe.execute(**context)

In [76]:
out["vegetation"]

In [77]:
out["true_vegetation"]

### Assign

The assign verb assigns each pixel in an array a new value that is not a function of the original value. Missing observations (i.e. pixels with a nodata value) in the input are always preserved: they never get assigned a new value.

The new value can be a constant, meaning that the same right-hand side value is used for all pixels in the input array. 

The new value can also be a pixel value from another array $Y$. In that case, $Y$ should have the same shape as input array $X$ (or it should be possible to align it to that shape, see [here](#Aligning-two-arrays)), such that each pixel $x_{i} \in X$ has a *matching* pixel $y_{i} \in Y$, i.e. a pixel that has exactly the same coordinates for each dimension.

A trivial example:

In [78]:
recipe = sq.QueryRecipe()

In [79]:
recipe["vegetation"] = sq.entity("vegetation")
recipe["foo"] = sq.result("vegetation").assign(-99)

In [80]:
out = recipe.execute(**context)

In [81]:
out["vegetation"]

In [82]:
out["foo"]

Optionally, you can assign a new value to only a subset of the pixels in the input array. In that case, the assigned values should always be of the same value type as the input array! The subset of pixels can be specified by providing a binary array as parameter *at*:

In [83]:
recipe = sq.QueryRecipe()

In [84]:
recipe["count"] = sq.entity("vegetation").reduce("time", "count")
recipe["foo"] = sq.result("count").assign(0, at = sq.self().evaluate("less", 2))

In [85]:
out = recipe.execute(**context)

In [86]:
out["count"]

In [87]:
out["foo"]
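
The two rules of the assign verb, nodata preservation and the optional *at* subset, can be sketched in plain Python. The `assign` helper below is hypothetical and works on flat lists with NaN as an assumed nodata value; it is not how semantique implements the verb.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

def assign(X, value, at=None):
    """Assign a constant to each pixel of X, optionally only where `at` is non-zero."""
    out = []
    for i, x in enumerate(X):
        if math.isnan(x):
            out.append(x)  # missing observations are always preserved
        elif at is None or at[i] != 0:
            out.append(value)  # pixel is selected, so it gets the new value
        else:
            out.append(x)  # pixel is not selected, so it keeps its value
    return out

count = [0.0, 3.0, NODATA, 1.0]
everywhere = assign(count, -99.0)

# Binary array marking the pixels with a count below 2.
low = [int(x < 2) for x in count]
subset = assign(count, 0.0, at=low)
```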

#### Assigning dimension coordinates

When the values to be assigned are taken from another array, this array does not have to be of the same shape as the input array, as long as we can align it to that shape. See [this section](#Aligning-two-arrays) for more details on how this works. In practice, it means that we can also assign dimension coordinates as new values. Hence, each pixel gets its coordinate for a specific dimension as its new pixel value. For example, for each observation, we want to store the month in which the observation was made:

In [88]:
recipe = sq.QueryRecipe()

In [89]:
recipe["vegetation"] = sq.entity("vegetation")
recipe["months"] = sq.result("vegetation").assign(sq.self().extract("time", "month"))

In [90]:
out = recipe.execute(**context)

In [91]:
out["vegetation"]

In [92]:
out["months"]

You can also use a handy shortcut for the above formulation: the **assign_time** verb. This verb allows you to assign temporal coordinates *without* having to explicitly extract them from the array.

This verb is only a "shortcut" verb, not an independent verb on its own. This means that when calling it, it is internally translated into a textual query recipe containing the self reference and the extract verb. In the same way, you can also use the shortcut verb **assign_space** to assign (components of) spatial coordinates.

In [93]:
recipe["months"] = sq.result("vegetation").assign_time("month")

In [94]:
out = recipe.execute(**context)

In [95]:
out["months"]

### Reduce

The reduce verb applies a reducer function along a dimension and subsequently drops the reduced dimension. That is, the output array always has one dimension less than the input array. Hence, the reduce verb reduces the dimensionality of an array.

To reduce a dimension, the reducer function operates on each slice of values along the axis of the dimension. Such a *slice* contains one value for each coordinate label of the dimension to reduce over, while the coordinate labels of all other dimensions are *constant* within each slice. The reducer function always returns a single value, such that each slice gets reduced from $n$ values to one value.

For example: an array with a spatial and a temporal dimension contains for each location in space $n$ values, where $n$ is the number of timestamps in the temporal dimension. When we reduce the temporal dimension of this array, the reducer function reduces these $n$ values for each location in space to one value. The resulting array has a single value per location in space, and no temporal dimension anymore.

![reduce](figures/reduce.png)
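
The notion of a slice can be illustrated with a small plain-Python sketch. It assumes a hypothetical array with one row per timestamp and one column per location, and a simplified `count` reducer that ignores nodata handling; none of these helpers are part of semantique.

```python
def count(values):
    """Count the number of non-zero values in a slice."""
    return sum(1 for v in values if v != 0)

def reduce_time(array, reducer):
    """Reduce over time: one slice per location, i.e. per column."""
    return [reducer(list(column)) for column in zip(*array)]

def reduce_space(array, reducer):
    """Reduce over space: one slice per timestamp, i.e. per row."""
    return [reducer(row) for row in array]

# Hypothetical input: 2 timestamps (rows) by 3 locations (columns).
vegetation = [
    [1, 0, 1],
    [1, 1, 0],
]

count_map = reduce_time(vegetation, count)      # one value per location
count_series = reduce_space(vegetation, count)  # one value per timestamp
```

In both cases each slice of $n$ values collapses to a single value, and the reduced dimension disappears from the result.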

There are many different types of reducers available in semantique. Advanced users can also define their own custom reducers, which is explained in the notebook on [Internal query processing](processor.ipynb#Adding-custom-reducers).

You can specify a reducer function simply by its name:

```python
sq.entity("water").reduce("time", "mean")
```

To be autocomplete-friendly, you can also use built-in constants that refer to a reducer function. These are stored in the reducers module of semantique, and are nothing more than the name of the function:

```python
sq.entity("water").reduce("time", sq.reducers.MEAN)
```

The built-in reducer functions of semantique currently are:

- `mean`: Returns the average value of each slice $S$.
- `median`: Returns the median value of each slice $S$.
- `mode`: Returns the most frequently occurring value in each slice $S$.
- `max`: Returns the largest value in each slice $S$.
- `min`: Returns the smallest value in each slice $S$.
- `range`: Returns the difference between the largest and smallest value in each slice $S$.
- `n`: Returns the number of observations in each slice $S$.
- `product`: Returns the product of the values in each slice $S$.
- `sum`: Returns the sum of the values in each slice $S$.
- `standard_deviation`: Returns the standard deviation of the values in each slice $S$.
- `variance`: Returns the variance of the values in each slice $S$.
- `all`: For each slice $S$, returns $1$ if all $x_{i} \in S \neq 0$, and $0$ otherwise.
- `any`: For each slice $S$, returns $1$ if any $x_{i} \in S \neq 0$, and $0$ otherwise.
- `none`: For each slice $S$, returns $1$ if all $x_{i} \in S = 0$, and $0$ otherwise.
- `count`: Counts the number of non-zero values in each slice $S$.
- `percentage`: Calculates the percentage of non-zero values in each slice $S$.
- `first`: Returns the first value of each slice $S$.
- `last`: Returns the last value of each slice $S$.

It is important to mention that nodata values are **ignored** by the reducer functions! That is, for example, when a slice has the values `[1, 1, nan, 1]` the `all` reducer will return "true" and the `percentage` reducer will return 100. A reducer will only return a nodata value when *all* values in a slice are nodata values.
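
The effect of ignoring nodata can be reproduced with a small plain-Python sketch of the `all` and `percentage` reducers. The helper names are hypothetical, nodata is assumed to be NaN, and the code only mirrors the behaviour described above rather than semantique's implementation.

```python
import math

NODATA = float("nan")  # assumption: nodata is represented as NaN

def valid_values(values):
    """Drop the nodata values from a slice."""
    return [v for v in values if not math.isnan(v)]

def reduce_all(values):
    """Return 1 if all valid values are non-zero; nodata if nothing is valid."""
    valid = valid_values(values)
    if not valid:
        return NODATA
    return int(all(v != 0 for v in valid))

def reduce_percentage(values):
    """Return the percentage of non-zero values among the valid values."""
    valid = valid_values(values)
    if not valid:
        return NODATA
    return 100 * sum(1 for v in valid if v != 0) / len(valid)

slice_a = [1.0, 1.0, NODATA, 1.0]
slice_b = [NODATA, NODATA]

all_a = reduce_all(slice_a)
pct_a = reduce_percentage(slice_a)
all_b = reduce_all(slice_b)
```

The percentage is computed relative to the three valid values in `slice_a`, not relative to its full length of four.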

Also, when using the reduce verb, it is important to be aware of the *value type* of the input array(s). For example, they may contain nominal, ordinal, binary or numerical data. Different reducers may only support specific value types. For details, see [here](processor.ipynb#Tracking-value-types).

Having said that, let's show some examples. The reduce verb takes as first argument the name of the dimension to be reduced over, and as second argument the reducer function to be applied.

In [96]:
recipe = sq.QueryRecipe()

In [97]:
recipe["vegetation"] = sq.entity("vegetation")
recipe["map"] = sq.result("vegetation").reduce("time", "count")
recipe["series"] = sq.result("vegetation").reduce("space", "count")

In [98]:
out = recipe.execute(**context)

In [99]:
out["vegetation"]

In [100]:
out["map"]

In [101]:
out["series"]

### Shift

### Smooth

### Trim

### Groupby

The groupby verb splits an array into multiple smaller subsets, called groups. That is, the output array is a collection of multiple subsets of the input array. Grouping is always done *along* a dimension. That means that first the coordinate labels of this dimension are divided into distinct groups. Then, the input array is split such that for each of these groups there is a subset of the input array containing all pixels that have a coordinate for the given dimension which matches a label in the group. How the coordinate labels are grouped is defined by a second, categorical array which we call the *grouper*. The grouper should be a one-dimensional array with a dimension that *matches* an existing dimension in the input array. Then, coordinate labels $\theta_{i}$ and $\theta_{j}$ of grouper $Y$ are in the same group if and only if $y_{i} = y_{j}$.

![groupby_single](figures/groupby_single.png)
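
The grouping rule can be sketched in plain Python. The sketch assumes a hypothetical array with one row of pixels per timestamp and a one-dimensional grouper along the time dimension; the `groupby` helper is an illustration and not semantique's implementation.

```python
def groupby(coords, array, grouper):
    """Split an array along a dimension into groups defined by a grouper.

    Coordinates with equal grouper values end up in the same group. Each
    group keeps the coordinates and values of the pixels that belong to it.
    """
    groups = {}
    for coord, row, label in zip(coords, array, grouper):
        groups.setdefault(label, []).append((coord, row))
    return groups

# Hypothetical input: one row of pixel values per timestamp.
times = ["2019-01-15", "2019-07-15", "2020-01-15", "2020-07-15"]
water = [[1, 0], [0, 0], [1, 1], [0, 1]]
seasons = ["winter", "summer", "winter", "summer"]  # the grouper

groups = groupby(times, water, seasons)
```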

For example, we may group the translation of the semantic concept *water* along the time dimension such that pixels observed in different seasons end up in different groups. The result of this operation is a collection of multiple arrays, i.e. one for each season. These array collection objects have specific verbs to combine their elements into a single array again. For a description of those, see [the next section](#Verbs-for-collections-of-arrays).

In [102]:
recipe = sq.QueryRecipe()

In [103]:
recipe["seasons"] = sq.entity("water").groupby(sq.self().extract("time", "season"))

In [104]:
out = recipe.execute(**context)

In [105]:
out["seasons"][0]

In [106]:
out["seasons"][1]

You can also use a handy shortcut for the above formulation: the **groupby_time** verb. This verb allows you to group along the temporal dimension *without* having to explicitly extract the time coordinates from the array.

This is only a "shortcut" verb, not an independent verb on its own. This means that when calling it, it is internally translated into a textual query recipe containing the self reference and the extract verb. In the same way, you can also use the shortcut verb **groupby_space** for grouping directly along the spatial dimension.

In [107]:
recipe["seasons"] = sq.entity("water").groupby_time("season")

In [108]:
out = recipe.execute(**context)

In [109]:
out["seasons"][0]

In [110]:
out["seasons"][1]

#### Multiple groupers

It is also possible to provide a collection of groupers to the groupby verb, as long as their dimensions match. In that case, groups of coordinate labels are formed as follows: given grouper $Y$ and grouper $Z$ with matching coordinates, coordinate labels $\theta_{i}$ and $\theta_{j}$ are in the same group if and only if $y_{i} = y_{j}$ *and* $z_{i} = z_{j}$.

![groupby_multi](figures/groupby_multi.png)
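
With several groupers, the group key effectively becomes the tuple of all grouper values. A minimal plain-Python sketch of that rule, in which the helper `group_labels_multi` and the sample data are hypothetical and not part of semantique:

```python
from collections import defaultdict

def group_labels_multi(labels, *groupers):
    """Group labels by the tuple of values taken from all groupers.

    Hypothetical helper: two labels share a group only if they agree
    on every grouper.
    """
    groups = defaultdict(list)
    for label, key in zip(labels, zip(*groupers)):
        groups[key].append(label)
    return dict(groups)

# Time coordinates of a toy array with their year and month.
times = ["2019-01-15", "2019-01-25", "2020-01-20", "2020-02-05"]
years = [2019, 2019, 2020, 2020]
months = [1, 1, 1, 2]

groups = group_labels_multi(times, years, months)
```

Here January 2019 and January 2020 form separate groups, because their year values differ even though their month values match.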

That means, for example, that we can group the input array along the time dimension such that two pixels observed in the same month but in different years end up in different groups.

In [111]:
recipe = sq.QueryRecipe()

In [112]:
multigrouper = sq.collection(sq.self().extract("time", "year"), sq.self().extract("time", "month"))
recipe["groups"] = sq.entity("water").groupby(multigrouper)

In [113]:
out = recipe.execute(**context)

In [114]:
out["groups"][0]

In [115]:
out["groups"][1]

In [116]:
out["groups"][2]

A shorter formulation for the above statement would be:

In [117]:
recipe["groups"] = sq.entity("water").groupby_time(["year", "month"])

## Verbs for collections of arrays

When constructing a query recipe, you can start a processing chain with a reference to a [collection of multiple arrays](references.ipynb#Referencing-collections). These array collections have specific verbs that all in some way combine the elements of the collections back into a single array. The currently implemented verbs in this category are:

- [Concatenate](#Concatenate): Concatenates multiple arrays over a new or an existing dimension.
- [Compose](#Compose): Creates a categorical composition of multiple binary arrays.
- [Merge](#Merge): Merges values of corresponding pixels in multiple arrays into one by applying a reducer function.

It is important to mention that the verbs are intended for arrays that all have the *same* dimensions (but not necessarily the same coordinates)! They will also work on arrays that do not have the same dimensions, as long as they can all be aligned to each other. However, in these cases you should be aware of the peculiarities of alignment, see [here](#Aligning-two-arrays) for details. To summarize: When the arrays in the collection have the same dimensions, but don't share all of their coordinate labels, they get aligned to each other by filling the missing pixels in either of them with nodata values. When one of the arrays in the collection (e.g. $C_{1}$) is missing a dimension that is present in another array in the collection (e.g. $C_{2}$), they get aligned to each other by duplicating the values of $C_{1}$ for each coordinate of the missing dimension. In any case, the coordinates of the output array are always the *union* of all coordinates from the input arrays. The verb throws an error only when the input arrays cannot be aligned to each other in any way.
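
The two alignment cases can be illustrated with toy arrays stored as dictionaries. This is a rough sketch of the behaviour summarized above, not the code semantique runs: the helpers `align` and `broadcast` are hypothetical, and `None` stands in for a nodata value.

```python
def align(a, b, nodata=None):
    """Align two one-dimensional toy arrays (dicts of label -> value).

    Both outputs cover the union of the input labels; pixels missing
    from an input are filled with `nodata`.
    """
    labels = sorted(set(a) | set(b))
    aligned_a = {k: a.get(k, nodata) for k in labels}
    aligned_b = {k: b.get(k, nodata) for k in labels}
    return aligned_a, aligned_b

def broadcast(a, new_labels):
    """Duplicate a one-dimensional toy array along a missing dimension."""
    return {(new, k): v for new in new_labels for k, v in a.items()}

# Case 1: same dimension, partly different coordinate labels.
c1 = {"x1": 1, "x2": 0}
c2 = {"x2": 1, "x3": 1}
a1, a2 = align(c1, c2)

# Case 2: c1 lacks a dimension with labels t1 and t2.
stacked = broadcast(c1, ["t1", "t2"])
```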

### Concatenate

The concatenate verb concatenates multiple arrays along a given dimension. There are two main ways in which you can do this: either you concatenate along a *new* dimension, or you concatenate along an *existing* dimension.

#### Concatenating along a new dimension

Concatenating multiple arrays along a new dimension is a relatively simple process. Each of the input arrays becomes one coordinate along the new dimension of the output array. Let's consider a collection of two two-dimensional arrays $A$ and $B$ that have matching coordinates along the dimensions $\Gamma$ and $\Delta$. Concatenating them along a new dimension $E$ will result in a new three-dimensional array $C$ with dimensions $\Gamma$, $\Delta$ and $E$. A pixel with coordinates $(\gamma_{i}, \delta_{i})$ in array $A$ becomes a pixel with coordinates $(\gamma_{i}, \delta_{i}, \epsilon = A)$ in array $C$, while the pixel with the same coordinates $(\gamma_{i}, \delta_{i})$ in array $B$ becomes a pixel with coordinates $(\gamma_{i}, \delta_{i}, \epsilon = B)$ in array $C$.

![concatenate_new](figures/concat_new.png)
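
The coordinate mapping can be sketched with toy arrays stored as dictionaries keyed by their coordinates. The helper `concatenate_new` is hypothetical and only mirrors the description above; it is not semantique's implementation.

```python
def concatenate_new(named_arrays):
    """Stack toy arrays (dicts keyed by (gamma, delta)) along a new dimension.

    The name of each input array becomes its coordinate label on the
    new dimension.
    """
    out = {}
    for name, array in named_arrays.items():
        for (gamma, delta), value in array.items():
            out[(gamma, delta, name)] = value
    return out

# Two arrays with matching coordinates along both dimensions.
A = {("g1", "d1"): 1, ("g1", "d2"): 0}
B = {("g1", "d1"): 0, ("g1", "d2"): 1}

C = concatenate_new({"A": A, "B": B})
```

Every pixel keeps its original two coordinates and gains a third one that records which input array it came from.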

All you have to provide to the concatenate verb is the name of the new dimension. Be careful with using "time" and "space" as names for your new dimension. The dimension name "time" is reserved for a temporal dimension with coordinate labels that are datetime objects, while the dimension name "space" is reserved for a spatial dimension with coordinate labels that are (x,y) coordinate pairs. Also, the names "x" and "y", "longitude" and "latitude" and "lon" and "lat" are generally used for the individual coordinate dimensions that make up the stacked "space" dimension.

The coordinate labels of the new dimension will be the names of the input arrays.

In [118]:
recipe = sq.QueryRecipe()

In [119]:
recipe["concepts"] = sq.collection(sq.entity("water"), sq.entity("snow"), sq.entity("vegetation")).\
    concatenate("concept")

In [120]:
out = recipe.execute(**context)

In [121]:
out["concepts"]

#### Concatenating over an existing dimension

Concatenating over an existing dimension is mainly meant for cases where each of the input arrays has different coordinate labels for that dimension. For example, we have one array with a time dimension containing dates in 2019, and another one with a time dimension containing dates in 2020. Then, concatenating them over the time dimension gives us a single array with a time dimension containing both the dates from 2019 and 2020.

![concatenate_existing](figures/concat_existing.png)

In [122]:
recipe = sq.QueryRecipe()

In [123]:
recipe["water"] = sq.entity("water").\
    groupby_time("year").\
    concatenate("time")

In [124]:
out = recipe.execute(**context)

In [125]:
out["water"]

We can also concatenate arrays that share coordinate labels along the dimension to concatenate over. However, for each shared coordinate, only the values of the *first* array in the collection that contains that coordinate end up in the output array. The values of the other arrays at that coordinate are simply dropped.
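
The "first array wins" rule for shared coordinates can be sketched as follows. The helper `concatenate_existing` and the toy data are hypothetical; the snippet only illustrates the rule and is not semantique's implementation.

```python
def concatenate_existing(arrays):
    """Concatenate toy arrays (dicts of time label -> value) over time.

    For a label present in several inputs, the first input that
    contains it wins; later values for that label are dropped.
    """
    out = {}
    for array in arrays:
        for label, value in array.items():
            out.setdefault(label, value)
    return out

# Both arrays contain the label 2019-12-31, with different values.
a = {"2019-06-01": 1, "2019-12-31": 0}
b = {"2019-12-31": 1, "2020-06-01": 1}

merged = concatenate_existing([a, b])
```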

### Compose

The compose verb is primarily meant for collections of binary arrays, i.e. arrays that only have "true" (i.e. 1) and "false" (i.e. 0) values. Then, a pixel in the output array gets a value of 1 if it was "true" in the first array of the collection, a value of 2 if it was "true" in the second array of the collection, a value of 3 if it was "true" in the third array of the collection, et cetera. Hence, with the compose verb you convert a set of binary arrays into one categorical array.

When a pixel is "true" in more than one array in the collection, it gets the index of the array that comes first in the collection. Hence, if a pixel is "true" in both the second and third array in a collection, it gets a value of 2 in the output array. When a pixel is not "true" in any of the arrays in the collection, it gets a nodata value in the output array.

![compose](figures/compose.png)
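
The composition rule can be sketched on flat lists of binary pixel values. The helper `compose` below is hypothetical, assumes the arrays are already aligned, and uses `None` as a stand-in for nodata; it is not semantique's implementation.

```python
def compose(arrays, nodata=None):
    """Turn aligned binary toy arrays (lists of 0/1) into one categorical list.

    Each pixel gets the 1-based index of the first array in which it
    is true, or `nodata` if it is true in none of them.
    """
    out = []
    for pixel_values in zip(*arrays):
        index = next(
            (i for i, v in enumerate(pixel_values, start=1) if v == 1),
            nodata,
        )
        out.append(index)
    return out

water = [1, 0, 0, 0]
snow = [0, 1, 1, 0]
vegetation = [0, 0, 1, 0]

land_cover = compose([water, snow, vegetation])
```

The third pixel is "true" for both snow and vegetation and therefore gets the index of snow, which comes first in the collection.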

In [126]:
recipe = sq.QueryRecipe()

In [127]:
recipe["land_cover"] = sq.collection(sq.entity("water"), sq.entity("snow"), sq.entity("vegetation")).\
    compose()

In [128]:
out = recipe.execute(**context)

In [129]:
out["land_cover"]

### Merge

The merge verb is actually a combination of two other verbs. First, it [concatenates](#Concatenate) the arrays in the collection along a new dimension, and then it [reduces](#Reduce) the output of that over this new dimension. In practice, that means that the merge verb applies a reduction function to each set of pixels that have the same dimension coordinates but are stored in different arrays in the collection. For example, if we merge the translated semantic concepts *water*, *snow* and *vegetation* using the `any` reducer, we get an output cube that contains a "true" value (i.e. 1) for a pixel if the value of that pixel was "true" in at least one of the water, snow or vegetation arrays, and a "false" value (i.e. 0) if the value of that pixel was not "true" in any of those.

![merge](figures/evaluate_multi.png)

The only argument you need to provide to the verb is the reducer function. See the [reduce()](#Reduce) verb for an overview of them.
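
Conceptually, merging applies the reducer to the values that corresponding pixels have across the arrays. A minimal sketch on flat lists of binary values, where the helpers `merge` and `any_reducer` are hypothetical, the arrays are assumed to be aligned already, and nodata handling is ignored:

```python
def merge(arrays, reducer):
    """Apply a reducer to the values that corresponding pixels have in each array."""
    return [reducer(pixel_values) for pixel_values in zip(*arrays)]

def any_reducer(values):
    """Return 1 if at least one value is true, else 0."""
    return int(any(values))

water = [1, 0, 0, 0]
snow = [0, 1, 1, 0]
vegetation = [0, 0, 1, 0]

any_concept = merge([water, snow, vegetation], any_reducer)
```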

In [130]:
recipe = sq.QueryRecipe()

In [131]:
recipe["any_concept"] = sq.collection(sq.entity("water"), sq.entity("snow"), sq.entity("vegetation")).\
    merge("any")

In [132]:
out = recipe.execute(**context)

In [133]:
out["any_concept"]

Note that merging a collection of two arrays can usually also be modelled with the [evaluate()](#Evaluate) verb. For example, the following lines produce identical results:

```python
sq.collection(A, B).merge("any")
A.evaluate("or", B)
```

However, whereas the evaluate verb can only "merge" one other array into a given input cube, the merge verb allows you to combine an unrestricted number of arrays in one go.

## Split-apply-combine structures

All [verbs for single arrays](#Verbs-for-single-arrays) (except the groupby verb) can also be applied to array collections. In that case, they will simply be applied to each element of the collection separately. Hence, the output will again be an array collection, with the same number of members.

This allows you to model well-known "split-apply-combine" processes, such as aggregation. You start with a single array, split it with the [groupby()](#Groupby) verb into a collection, apply one of the verbs for single arrays to each of its members, and then combine them back together using one of the dedicated verbs for array collections.

For example: we want to know the average water count over space for each year in our time dimension separately.

In [134]:
recipe = sq.QueryRecipe()

In [135]:
recipe["avg_count"] = sq.entity("water").\
    groupby_time("year").\
    reduce("space", "count").\
    reduce("time", "mean").\
    concatenate("year")

In [136]:
out = recipe.execute(**context)

In [137]:
out["avg_count"]
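
The three stages of this recipe can be mimicked in plain Python on toy data. This is only a sketch of the split-apply-combine logic: the sample observations are made up, and it assumes that the `count` reducer counts the "true" pixels.

```python
from collections import defaultdict
from statistics import mean

# One binary "water" list over space per timestamp (toy data).
observations = {
    "2019-03-01": [1, 0, 1],
    "2019-09-01": [1, 1, 1],
    "2020-03-01": [0, 0, 1],
}

# Split: group the timestamps by year.
by_year = defaultdict(dict)
for timestamp, pixels in observations.items():
    by_year[timestamp[:4]][timestamp] = pixels

# Apply and combine: count water pixels per timestamp, average the
# counts per year, and collect the results under a "year" label.
avg_count = {}
for year, group in by_year.items():
    counts = [sum(pixels) for pixels in group.values()]
    avg_count[year] = mean(counts)
```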

Another example: when we have a spatial extent consisting of multiple distinct spatial features, we might want to know the water count at each timestamp for each feature separately.

In [138]:
recipe = sq.QueryRecipe()

In [139]:
recipe["count_per_feat"] = sq.entity("water").\
    groupby_space("feature").\
    reduce("space", "count").\
    concatenate("feat")

In [140]:
parcels = gpd.read_file("files/parcels.geojson")
parcels.explore()

In [141]:
new_context = copy.deepcopy(context)
new_context["space"] = sq.SpatialExtent(parcels)
new_context["spatial_resolution"] = [-100, 100]

out = recipe.execute(**new_context)

In [142]:
out["count_per_feat"]

## Utility verbs

An additional set of verbs are the utility verbs. These verbs do not affect the values of an array, but rather update its attributes. The currently implemented verbs in this category are:

- [Name](#Name): Give a (new) name to an array.

### Name

The name verb gives a name to an array. This can be particularly useful in some cases. For example, when [concatenating](#Concatenate) multiple arrays together along a new dimension, the names of these arrays will be used as coordinate labels of this new dimension.

In [143]:
recipe = sq.QueryRecipe()

In [144]:
water = sq.entity("water").name("W")
snow = sq.entity("snow").name("S")
vegetation = sq.entity("vegetation").name("V")

recipe["concepts"] = sq.collection(water, snow, vegetation).concatenate("concept")

In [145]:
out = recipe.execute(**context)

In [146]:
out["concepts"]