day1.qmd

# Introduction to spatial data

### Learning goals

-   learn the different data types considered in spatial statistics, and get familiar with simple features, as well as with the difference between discrete and continuous variability in *space* and in *attributes*
-   learn what spatial dependence is about, and cases where it can be ignored
-   introduce spatial point patterns and processes

### Reading materials

From [Spatial Data Science: with applications in R](https://r-spatial.org/book/):

-   Preface
-   Chapter 1: Getting started
-   Chapter 2: Coordinates
-   Chapter 10: Statistical modelling of spatial data

### Exercises that can be prepared for the first day:

-   Section 10.6: Exercises

::: {.callout-tip title="Summary"}
-   Introduction to spatial data, support, coordinate reference systems
-   Introduction to spatial statistical data types: point patterns, geostatistical data, lattice data; imagery, tracks/trajectories
-   Is spatial dependence a fact? And is it a curse, or a blessing?
-   Spatial sampling, design-based and model-based inference
-   Intro to point patterns and point processes, observation window, first and second order properties
-   Checklist for your spatial data
:::

## What is special about spatial data?

-   **Location**. Does location always involve coordinates? Relative/absolute, qualitative/quantitative
-   **Coordinates**. What are coordinates? Dimension(s), unit
-   **Time**. If not explicit, there is an implicit time reference. Dimension(s), unit, datetime
-   **Attributes**. *at* specific locations we measure (observe) specific properties
-   Quite often, we want to know *where* things change (**space-time interactions**).
-   **Reference systems** for space, time, and attributes: what are they?
-   **Support**: if we have an attribute value associated with a line, polygon or grid cell:
    -   does the value summarise all values at points? (line/area/cell support), or
    -   is the value constant throughout the line/area/cell (point support)?
-   **Continuity**:
    -   is a variable *spatially* continuous? Yes for geostatitical data, no for point patterns
    -   is an *attribute variable* continuous? [Stevens's measurement scales](https://www.science.org/doi/pdf/10.1126/science.103.2684.677?casa_token=H8am2h3sIYUAAAAA:eUd2ZU6ZRyJNqx4jdRv0E9WG7k3OBXAVbqgZ2O-Bl7pHNJSI0L2h9TM6i3YXve2nY5rD_4RbI2aecQ): yes if *Interval* or *Ratio*.

### Support: examples

-   Road properties
    -   road type: gravel, brick, asphalt (point support: everywhere on the whole road)
    -   mean width: block support (summary value)
    -   minimum width: block support (although the minimum width may be the value at a single (point) location, it summarizes all widths of the road--we no longer know the width at any specific point location)
-   Land use/land cover
    -   when we classify e.g. 30 m x 30 m Landsat pixels into a single class, this single class is not constant throughout this pixel
    -   road type is a land cover type, but a road never covers a 30 m x 30 m pixel
    -   a land cover type like "urban" is associated with a positive (non-point) support: we don't say a point in a garden or park is urban, or a point on a roof, but these are part of a (block support) urban fabric
-   Elevation
    -   in principle, we can measure elevation at a point; in practice, every measuring device has a physical (non-point) size
-   Further reading: [Chapter 5: Attributes and Support](https://r-spatial.org/book/05-Attributes.html)

## Spatial vs. Geospatial

-   Spatial refers (physical) spaces, 2- or 3-dimensional ($R^2$ or $R^3$)
    -   Most often spatial statistics considers 2-dimensional problems
    -   3-d: meteorology, climate science, geophysics, groundwater hydrology, aeronautics, ...
-   "Geo" refers to the Earth
-   For Earth coordinates, we always need a *datum*, consisting of an ellipsoid (shape) and the way it is fixed to the Earth (origin)
    -   The Earth is modelled by an ellipsoid, which is nearly round
    -   If we consider Earth-bound areas as flat, for larger areas we get the distances wrong
    -   We can (and do) also work on $S^2$, the surface of a sphere, rather than $R^2$, to get distances right, but this creates a number of challenges (such as plotting on a 2D device)
-   Non-geospatial spaces could be:
    -   Associated with other bodies (moon, Mars)
    -   Astrophysics, places/directions in the universe
    -   Locations in a building (where we use "engineering coordinates", relative to a building corner and orientation)
    -   Microscope images
    -   MRT scans (3-D), places in a human body
    -   locations on a genome?

```{r}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "world map, with longitude and latitude map linearly to x and y ([Plate Caree](https://en.wikipedia.org/wiki/Equirectangular_projection))"
library(rnaturalearth)
library(sf)
par(mar = c(2,2,0,0) + .1)
ne_countries() |> st_geometry() |> plot(axes=TRUE)
```

::: {.callout-tip title="What is Statistics"}
... or what *are* statistics?

-   statistic: singular; a descriptive measure summarising some data
-   statistics: plural; "minimum and maximum are two statistics"
-   Statistics: singular; a scientific discipline aiming at modelling data, using probability theory
    -   where does randomness come from? Design-based vs. model-based
    -   are parameters random or fixed? Bayesian vs. frequentist
    -   inference, prediction, simulations
-   Typical approach: observation = signal + noise, noise modelled by random variables
:::

### Design-based statistics

In design-based statistics, randomness comes from random sampling. Consider an area $B$, from which we take samples $$z(s),
s \in B,$$ with $s$ a location for instance two-dimensional: $s_i =
\{x_i,y_i\}$. If we select the samples *randomly*, we can consider $S \in B$ a random variable, and $z(S)$ a random sample. Note the randomness in $S$, not in $z$.

Two variables $z(S_1)$ and $z(S_2)$ are *independent* if $S_1$ and $S_2$ are sampled independently. For estimation we need to know the inclusion probabilities, which need to be non-negative for every location.

If inclusion probabilities are constant (simple random sampling; or complete spatial randomness: day 2, point patterns) then we can estimate the mean of $Z(B)$ by the sample mean $$\frac{1}{n}\sum_{j=1}^n
z(s_j).$$ This also predicts the value of a *randomly* chosen observation $z(S)$. It cannot be used to predict the value $z(s_0)$ for a non-randomly chosen location $s_0$; for this we need a model.

### Model-based statistics

Model-based statistics assumes randomness in the measured responses; consider a regression model $y = X\beta + e$, where $e$ is a random variable and as a consequence $y$, the response variable is a random variable. In the spatial context we replace $y$ with $z$, and capitalize it to indicate it is a random variable, and write $$Z(s) = X(s)\beta + e(s)$$ to stress that

-   $Z(s)$ is a random function (random variables $Z$ as a function of $s$)
-   $X(s)$ is the matrix with covariates, which depend on $s$
-   $\beta$ are (spatially) constant coefficients, not depening on $s$
-   $e(s)$ is a random function with mean zero and covariance matrix $\Sigma$

In the regression literature this is called a (linear) mixed model, because $e$ is not i.i.d. If $e(s)$ contains an iid component $\epsilon$ we can write this as

$$Z(s) = X(s)\beta + w(s) + \epsilon$$

with $w(s)$ the spatial signal, and $\epsilon$ a noise compenent e.g. due to measurement error.

Predicting $Z(s_0)$ will involve (GLS) estimation of $\beta$, but also prediction of $e(s_0)$ using correlated, nearby observations (day 3: geostatistics).

### Design- or model-based?

-   design-based requires a random sample, if that is the case it needs no further assumptions
-   model-based requires stationarity assumptions to estimate $\Sigma$
-   model-based is typically more effective for interpolation problems
-   design-based can be most effective when estimating, e.g. average mapping errors

### Using coordinates as covariates?

-   (day 4)

## Spatial statistics: data types

### Point Patterns

-   Points (locations) + observation window
-   Example from [here](https://opendata-esri-de.opendata.arcgis.com/datasets/dc6d012f47d94fde99deacc316721f30/explore?location=51.099061%2C10.453852%2C7.45)

```{r fig-gdal-fig-nodetails, echo = FALSE}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "Wind turbine parks in Germany"
knitr::include_graphics("turbines.png")
```

-   The locations contain the information
-   Points may have (discrete or continuous) *marks* (attributes)
-   The observation window is, apart from the points, *empty*

### Geostatistical data: locations + measured values

```{r}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "NO2 measurements at rural background stations (EEA)"
library(sf)
no2 <- read.csv(system.file("external/no2.csv",
    package = "gstat"))
crs <- st_crs("EPSG:32632")
st_as_sf(no2, crs = "OGC:CRS84", coords =
    c("station_longitude_deg", "station_latitude_deg")) |>
    st_transform(crs) -> no2.sf
library(ggplot2)
# plot(st_geometry(no2.sf))
"https://github.com/edzer/sdsr/raw/main/data/de_nuts1.gpkg" |>
  read_sf() |>
  st_transform(crs) -> de
ggplot() + geom_sf(data = de) +
    geom_sf(data = no2.sf, mapping = aes(col = NO2))
```

-   The value of interest is measured at a set of sample locations
-   At other location, this value exists but is *missing*
-   The interest is in estimating (predicting) this missing value (interpolation)
-   The actual sample locations are not of (primary) interest, the signal is in the measured values

### Areal data

-   polygons (or grid cells) + polygon summary values

```{r}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "NO2 rural background, average values per NUTS1 region"
# https://en.wikipedia.org/wiki/List_of_NUTS_regions_in_the_European_Union_by_GDP
de$GDP_percap = c(45200, 46100, 37900, 27800, 49700, 64700, 45000, 26700, 36500, 38700, 35700, 35300, 29900, 27400, 32400, 28900)
ggplot() + geom_sf(data = de) +
    geom_sf(data = de, mapping = aes(fill = GDP_percap)) + 
	geom_sf(data = st_cast(de, "MULTILINESTRING"), col = 'white')
```

-   The polygons contain polygon summary (polygon support) values, not values that are constant throughout the polygon (as in a soil, lithology or land cover map)
-   Neighbouring polygons are typically related: spatial correlation
-   neighbour-neighbour correlation: Moran's I
-   regression models with correlated errors, spatial lag models, CAR models, GMRFs, ...
-   see Ch 14-17 of [SDSWR](https://r-spatial.org/book/)
-   briefly addressed on Friday

## Data types that received less attention in the spatial statistics literature

### Image data

```{r}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "RGB image from a Landsat scene"
library(stars)
plot(L7_ETMs, rgb = 1:3)
```

-   are these geostatistical data, or areal data?
-   If we identify objects from images, can we see them as point patterns?

### Tracking data, trajectories

```{r}
#| code-fold: true
#| out.width: '100%'
#| fig.cap: "Storm/hurricane trajectories colored by year"
# from: https://r-spatial.org/r/2017/08/28/nest.html
library(tidyverse)
storms.sf <- storms %>%
    st_as_sf(coords = c("long", "lat"), crs = 4326)
storms.sf <- storms.sf %>% 
    mutate(time = as.POSIXct(paste(paste(year,month,day, sep = "-"), 
                                   paste(hour, ":00", sep = "")))) %>% 
    select(-month, -day, -hour)
storms.nest <- storms.sf %>% group_by(name, year) %>% nest
to_line <- function(tr) st_cast(st_combine(tr), "LINESTRING") %>% .[[1]] 
tracks <- storms.nest %>% pull(data) %>% map(to_line) %>% st_sfc(crs = 4326)
storms.tr <- storms.nest %>% select(-data) %>% st_sf(geometry = tracks)
storms.tr %>% ggplot(aes(color = year)) + geom_sf()
```

-   A temporal snapshot (time slice) of a set of moving things forms a point pattern
-   We often analyse trajectories by
    -   estimating densities, for space-time blocks, per individual or together
    -   analysing interactions (alibi problem, mating animals, home range, UDF etc)

## Checklist if you have spatial data

-   Do you have the spatial coordinates of your data?
-   Are the coordinates Earth-bound?
-   If yes, do you have the coordinate reference system of them?
-   What is the support (physical size) of your observations?
-   Were the data obtained by random sampling, and if yes, do you have sampling weights?
-   Do you know the *extent* ($B$) from which your data were sampled, or collected?

## Exercises

-   What is the coordinate reference system of the `ne_countries()` dataset, imported above?
-   Look up the "Equidistant Cylindrical (Plate Carrée)" projection on the <https://proj.org> website.
-   Why is this projection called *The simplest of all projections*?
-   Project `ne_countries` to Plate Carrée, and plot it with `axes=TRUE`. What has changed? (Hint: `st_crs()` accepts a *proj string* to define a coordinate reference system (CRS); `st_transform()` transforms a dataset to a new CRS.)
-   Project the same dataset to Eckert IV projection. What has changed?
-   Also try plotting this dataset after transforming it to an orthographic projection with `+proj=ortho`; what went wrong?

## Further reading

-   Ripley, B. 1981. Spatial Statistics. Wiley.
-   Cressie, N. 1993. Statistics for Spatial Data. Wiley.
-   Cochran, W.G. 1977. Sampling Techniques. Wiley.