pres.qmd

---
title: "The R-spatial package ecosystem and openEO for analysing Earth Observation data"
author: "E. Pebesma, M. Mohr, F. Lahn, P. Zellner, M. Rossi, A. Jacob, P. Griffiths"
format: 
  beamer:
    incremental: false
classoption: "aspectratio=169"
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, echo = TRUE)
```

## Why do we use Data Science languages, and Open Source?

The aim of science is communication:

- DS languages summarise and abstract problems, and allow reproducing the computational steps of a study
- open source lets anyone scrutinize computation, it allows to 
    - learn, 
    - understand, 
    - critisize and 
    - extend

\scriptsize 
`https://github.com/edzer/LPS22`

## What is R, and R-Spatial?

Why do people use R?

- .. it is a free software environment for statistical computing and graphics
- it is extendible
- it runs on Win/OSX/Linux, and has no package install hell
- it has strong support for spatial data, and spatial statistics

Why do people use R-Spatial?

- it solves problems, they know how
- many problems need _statistical_ analysis, e.g. for model inference or assessing prediction errors
- complex graphs are easy to make
- friendly and diverse commmunity support

---------

![](sf_deps.png)

## Data cubes, image collections

What we have:

![](L7-sparsetime.png)

\scriptsize 
Appel, M., Lahn, F., Buytaert, W. & Pebesma, E. (2018). Open and scalable analytics of large Earth observation datasets: from scenes to multidimensional arrays using SciDB and GDAL. ISPRS Journal of Photogrammetry and Remote Sensing, 138, 47-56. 

------

What we want:

![](cube2.png)

------

Or even:

![https://twitter.com/miguelmahechag1/status/1528653878798991360](esdc.png)

------

## How to cube?

Need to choose:

* spatial resolution: not everything needs to be done always at observation resolution
* target CRS, in case e.g. multiple UTM zones are covered
* spatial and/or temporal aggregation/interpolation _methods_

It seems unlikely that major EO datasets are _distributed_ as cubes, because

* there is no one-fits-all
* the "best" cube seems application dependent

## Creating cubes from image collections

* on-the-fly, GEE, openEO: data cube _views_
* explicitly: e.g. using R package `gdalcubes`

## package `gdalcubes`

```{r eval=TRUE}
library(gdalcubes)
L8.files = list.files("/home/edzer/data/L8_cropped", pattern = ".tif",
				recursive = TRUE, full.names = TRUE)
L8.col = create_image_collection(L8.files, format = "L8_SR", 
				out_file = "L8.db")
v.overview = cube_view(srs="EPSG:3857", extent=L8.col,
		dx = 500, dy = 500, dt = "P1Y", resampling = "nearest", 
		aggregation = "median")
```

------

```{r eval=TRUE}
v.overview
L8.cube.overview = raster_cube(L8.col, v.overview)
L8.cube.overview.rgb = select_bands(L8.cube.overview, 
    c("B02", "B03", "B04"))
# write_ncdf(L8.cube.overview.rgb, "L8.nc")
```
`gdalcubes` understands STAC collections, using R package `rstac`

## package `stars`

```{r}
library(stars)
file_name = system.file("tif/L7_ETMs.tif", package = "stars")
(r = read_stars(file_name))
```

```{r echo=FALSE}
read_stars(file_name, proxy = TRUE) |> 
   st_set_dimensions("band", c("B1", "B2", "B3", "B4", "B5", "B6")) -> r
```

-----

```{r fig.width=8, fig.height=4}
plot(r)
```

-----

```{r}
read_stars(file_name, proxy = TRUE) |> 
   st_set_dimensions("band", c("B1", "B2", "B3", "B4", "B5", "B6")) -> r
r
```

- lazy: delays reading pixels and subsequent computations until needed (plot, download)

## Vector data cubes

```{r}
r |> st_bbox() |> st_as_sfc() |> st_sample(10) -> pts
(e <- st_extract(r, pts))
```

--------------

```{r}
as.data.frame(e) |> dim()
as.data.frame(e) |> head(3)               # "long"
st_as_sf(e) |> as.data.frame() |> dim()
st_as_sf(e) |> as.data.frame() |> head(3) # "wide"
```

## openEO: R client

Connect, load collection:

```{r eval=FALSE}
library(openeo)
con = connect("https://openeo.cloud")
login()
# list_collections()
collection = "SENTINEL2_L2A"
bbox = list(west = 7, east = 7.01, south = 52, north = 52.01)
bands = c("B04", "B08")
time_range = list("2018-01-01", "2019-01-01")
p = processes()
data = p$load_collection(id = collection, spatial_extent = bbox,
          temporal_extent = time_range, bands = bands) 
```

-----

Process (asynchronous):

```{r eval=FALSE}
ndvi = function(data, context) { 
  red = data[1]; nir = data[2]; (nir-red)/(nir+red) 
}
calc_ndvi = p$reduce_dimension(data = data, dimension = "bands",
    reducer = ndvi)
intervals = ... ; labels = ...
temp_period = p$aggregate_temporal(data = calc_ndvi,
    reducer = function(data, context) p$median(data),
    intervals = intervals, labels = labels, dimension = "t")
result = p$save_result(data = temp_period, format="NetCDF")
job = create_job(graph = result,
    title = "ndvi.nc", description = "ndvi.nc",
    format = "netCDF")
start_job(job = job$id) # use the id of the job (job$id) to start the job
status(job)
dwnld = download_results(job = job$id, folder = "./") # when finished
```

--------

```{r fig.height=4}
r = read_stars("openEO.nc")
plot(r)
```

## UDF: user-defined functions

local exploration, remote execution

1. obtain a small cube section, using `openeo::get_sample()`, as `stars` object
2. explore locally, developing an analysis function 
3. push _exactly_ the same function as UDF to the openEO backend
4. show, explore, or download the results

- At the back-end, the "full" data cube is chunked, and pulled through
the UDF, again as `stars` sub-cubes.
- Currently works with `reduce_dimension` and `apply_dimension`, so
that the backend _knows_ which chunking strategy to use

## Vision

1. Minimize modification of the script

* extend `stars_proxy` objects to "cubed" remote image collections
* running locally or in cloud, with minimal adaptation
* use UDF: test locally, deploy remotely

2. Add syntactic sugar 

```{r eval = FALSE}
# after connect()/login():
get_collection("SENTINEL2_L2A") |>
	filter(bbox, time_range, bands) |>
    st_apply(~band, ndvi) |>
    aggregate("months", median) |>
	mapview()
```

## Further packages:

* `raster`, now `terra`: stop at 3 dimensions, directly read GDAL datasets, no vector data cubes; focus on scalable and high performance
* `sits`: trains ML models and predicts using EO _time series_ ; used operationally by INPE for mapping land use change in Brazil; see https://e-sensing.github.io/sitsbook/

## Discussion

* vector data cubes arise naturally from raster data cubes (by sampling, aggregating over polygons)
* Data Science is multi-lingual; language cross fertilization is useful
* R-spatial welcomes new contributions, and developers willing to take responsibility
* use or search for `#rspatial` on Twitter
* Get involved, get in touch!