forked from r-spatial/stars
-
Notifications
You must be signed in to change notification settings - Fork 0
/
stars7.Rmd
290 lines (234 loc) · 11.2 KB
/
stars7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
---
title: "7. Statistical modelling with stars objects"
author: "Edzer Pebesma"
output:
html_document:
toc: true
toc_float:
collapsed: false
smooth_scroll: false
toc_depth: 2
vignette: >
%\VignetteIndexEntry{7. Statistical modelling with stars objects}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
suppressPackageStartupMessages(library(dplyr))
EVAL = suppressWarnings(require(starsdata, quietly = TRUE))
```
We will first fix the random number seed, to get identical results for procedures that involve random sampling. Remove this command if you want the random effect in outcomes.
```{r}
set.seed(131)
```
## Training and prediction with stars objects
The usual way of statistical modelling in R uses `data.frame`s (or tibbles), and proceeds like
```{r eval=FALSE}
m = model(formula, data)
pr = predict(m, newdata)
```
where `model` is a function like `lm`, `glm`, `randomForest` etc. that returns a classed object, such that the `predict` generic can choose the right prediction function based on that class. `formula` looks like `y ~ x1+x2` and specifies the dependent variable (`y`) and predictors (`x1`, `x2`), which are found as columns in `data`. `newdata` needs to have the predictors in its columns, and returns the predicted values for `y` at these values for predictors.
### stars objects as data.frames
The analogy of stars objects to `data.frame` is this:
* each attribute (array) becomes a single column
* dimensions become added (index) columns
To see how this works with the 6-band example dataset, consider this:
```{r}
library(stars)
l7 = system.file("tif/L7_ETMs.tif", package = "stars") %>%
read_stars()
l7
as.data.frame(l7) %>% head()
```
We see that we get **one** single variable with the object (array) name, and
added columns with the dimension values (x, y, band). In a typical case, we
would like to have the six bands distributed over six variables, and have
a single observation (row) for each x/y pair.
For this, we _could_ use e.g. `utils::unstack` or `dplyr::pivot_wider` on this data.frame, but
a more efficient way is to use the dedicated `split` method for `stars` objects,
which resolves a dimension and splits it over attributes, one for each dimension value:
```{r}
l7 %>% split("band") %>%
as.data.frame() %>%
head()
```
The reason that `split` is more efficient than the mentioned alternatives is that (i) `split` does not have to match records based on dimensions (x/y), and (ii) it works for out-of-memory (stars_proxy) arrays, in the chunked process/write loop of `write_stars()`.
### Predict for `stars` objects
The pattern to obtain predictions for all pixels of a `stars` objects is:
* use the full dataset or a sample of it to train the model, using `as.data.frame()` (possibly after a `split`)
* use `predict(star_object, model)` to predict for all pixels of `stars_object`, using the stars-wrapper of the `predict` method for `model`.
* if there is no `predict` method for `model`, provide one (see the `kmeans` example below)
This works both for `stars` objects (in-memory) as `stars_proxy` objects (out-of memory). For plotting `stars_proxy` objects, downsampling is done _before_ prediction (predicting only the pixels that are shown), full rasters can be written to disk with `write_stars()`, which will carry out predictions on chunks being read and written.
## models fitted for every pixel
We can run models in many different ways on array data.
One way is to run a single model to all pixels, where the model operates
e.g. on the spectral (band) or temporal dimension. An example was given
[in vignette 2](https://r-spatial.github.io/stars/articles/stars2.html#plotting-with-changed-evaluation-order), where NDVI was computed from the red and near infrared band. NDVI does not involve estimating parameters, but reducing two bands to one.
An example where we fit a model to every pixel would be fit a time series model to each pixel time series, and output one or more model coefficients for each pixel; this is shown next.
### Linear regression on pixel time series
We can read in the avhrr dataset, containing only 9 days:
```{r eval=EVAL}
library(stars)
x = c("avhrr-only-v2.19810901.nc",
"avhrr-only-v2.19810902.nc",
"avhrr-only-v2.19810903.nc",
"avhrr-only-v2.19810904.nc",
"avhrr-only-v2.19810905.nc",
"avhrr-only-v2.19810906.nc",
"avhrr-only-v2.19810907.nc",
"avhrr-only-v2.19810908.nc",
"avhrr-only-v2.19810909.nc")
file_list = system.file(paste0("netcdf/", x), package = "starsdata")
y = read_stars(file_list, sub = "sst", quiet = TRUE, proxy = TRUE)
(t = st_get_dimension_values(y, 4))
```
We will use a function that computes the slope of the regression
line for temperature with time. We get temperatures as a vector in
the first argument of the function supplied to `st_apply`, and have
`t` already defined. The function could look like
```{r}
slope = function(x) {
if (any(is.na(x)))
NA_real_
else
coeffients(lm(x~t))[2]
}
```
but we will optimize this a bit, using `anyNA` and `lm.fit`
rather than `lm`:
```{r}
slope = function(x) {
if (anyNA(x))
NA_real_
else
lm.fit(cbind(1, t), x)$coefficients[2]
}
```
The result is lazily defined by (`adrop` drops the singular dimension)
```{r eval=EVAL}
out = st_apply(adrop(y), c(1,2), slope)
```
but only _computed_ by the following command, where the
computations are restricted to the pixels plotted:
```{r eval=EVAL}
plot(out, breaks = "equal", main = "9-day time trend (slope)")
```
An interisting pattern appears (despite the very short time series!): where SST reveals a main signal of colder when getting further from the equator, _changes_ in SST show much more fine grained structures of areas going up, and others going down. A diverging color ramp would be a better choice here, to distinguis positive from negative trends.
## Unsupervised learners
### Principal components
In the first example, we build principal components on the entire data set,
because it is rather small.
```{r}
tif = system.file("tif/L7_ETMs.tif", package = "stars")
r = split(read_stars(tif))
pc = prcomp(as.data.frame(r)[,-(1:2)]) # based on all data
out = predict(r, pc)
plot(merge(out), breaks = "equal", join_zlim = FALSE)
```
We see, amongst others, that PC1 picks up the difference between sea (dark) and land, and PC2 and 3 structures in sea and coastal waters.
In the second example, we build principal components from a sample
of the entire data set, because the entire dataset is rather large.
We apply it, using `predict`, to pixels shown in the plot (i.e. at
reduced rather than full resolution)
```{r eval=EVAL}
granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip",
package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule,
"/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
p = read_stars(s2, proxy = TRUE, NA_value = 0) %>%
split()
r = st_sample(p, 1000)
pc = prcomp(na.omit(as.data.frame(r))[,-(1:2)]) # based on all data
out = predict(p, pc)
```
Before plotting this, we'll add country borders that delineate sea, obtained from the `mapdata` package:
```{r eval=EVAL}
bb = st_bbox(p) %>%
st_as_sfc() %>%
st_transform(4326) %>%
st_bbox()
library(maps)
library(mapdata)
m = map("worldHires", xlim = bb[c(1,3)], ylim = bb[c(2,4)], plot=F,fill=TRUE) %>%
st_as_sfc() %>%
st_transform(st_crs(r))
```
We plot the results with independent color ranges, so every PC is stretched over the entire grey scale.
```{r eval=EVAL}
plt_boundary = function() plot(m, border = 'orange', add = TRUE)
plot(merge(out), hook = plt_boundary, join_zlim = FALSE)
```
This suggests that PC1 picks up the difference cloud signal (difference between clouds and non-clouds), PC2 the difference between sea and land areas, and PC4 some sensor artefacts (striping in swath direction).
To compute full resolution (10000 x 10000 pixels) results and write them to a file, use
```{r eval=FALSE}
write_stars(merge(out), "out.tif")
```
### K-means clustering
```{r}
library(clue)
predict.kmeans = function(object, newdata, ...) {
unclass(clue::cl_predict(object, newdata[, -c(1:2)], ...))
}
```
For a small dataset:
```{r}
tif = system.file("tif/L7_ETMs.tif", package = "stars")
i = read_stars(tif, proxy = TRUE) %>%
split()
nclus = 5
sam = st_sample(i, 1000)
k = kmeans(na.omit(as.data.frame(sam)[, -c(1:2)]), nclus)
out = predict(i, k)
plot(out, col = sf.colors(nclus, categorical=TRUE))
```
This seems to pick up a fair number of land cover classes: water (5), rural (3), and densely populated (1, 2).
For the large(r) dataset:
```{r eval=EVAL}
i = read_stars(s2, proxy = TRUE, NA_value = 0) %>%
split()
sam = st_sample(i, 1000)
k = kmeans(na.omit(as.data.frame(sam)[, -c(1:2)]), nclus)
out = predict(i, k)
plot(out, col = sf.colors(nclus, categorical=TRUE), reset = FALSE)
plot(m, add = TRUE)
```
we see that class 1 and 3 identify with the unclouded area, 3 with land, the other classes seem to mainly catch aspects of the cloud signal.
## Supervised learners
### Random Forest land use classification
The following example is purely for educational purposes; the classified "land use" is just a rough approximation from what seems to be easily visible on the image: sea, land, and areas with both but partially covered by clouds. We opted therefore for four classes: sea, land, clouds over sea, clouds over land.
We have polygon areas where the land use was classified, residing in a GeoPackage file. (This file was created using QGIS, using the instructions found [here](https://www.qgistutorials.com/en/docs/digitizing_basics.html).)
```{r eval=EVAL}
# for all, multi-resolution, use:
bands = c("B04", "B03", "B02", "B08", "B01", "B05", "B06", "B07", "B8A", "B09", "B10", "B11", "B12")
# bands = c("B04", "B03", "B02", "B08")
s2 = paste0("/vsizip/", granule,
"/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/GRANULE/L1C_T32ULE_A013919_20180220T105539/IMG_DATA/T32ULE_20180220T105051_", bands, ".jp2")
r = read_stars(s2, proxy = TRUE, NA_value = 0) %>%
setNames(bands)
cl = read_sf(system.file("gpkg/s2.gpkg", package = "stars")) %>%
st_transform(st_crs(r))
plot(r, reset = FALSE)
plot(cl, add = TRUE)
plot(m, add = TRUE, border = 'orange')
```
Next, we need points, sampled inside these polygons, for which we need to extract the satellite spectral data
```{r eval=EVAL}
pts = st_sample(cl, 1000, "regular") %>%
st_as_sf() %>%
st_intersection(cl)
train = st_extract(r, pts)
train$use = as.factor(pts$use) # no need for join, since the order did not change
train
```
```{r eval=EVAL}
library(randomForest)
train = as.data.frame(train)
train$x = NULL # remove geometry
rf = randomForest(use ~ ., train) # ~ . : use all other attributes
pr = predict(r, rf)
plot(pr, key.width = lcm(5), reset = FALSE, key.pos = 4)
# add country outline:
plot(m, add = TRUE)
```
This comes with the rather trivial finding that land and sea can be well predicted when there are no clouds, and the less trivial finding that they can be reasonably distinguished through patchy clouds of this kind. Note that predictions of this kind are pure pixel-based: for each prediction only the spectral bands for this pixel are considered, not for instance of any neighboring pixels.