Postmerge fixes spatial CV chapter #777

Merged: 29 commits, Apr 21, 2022

Commits
ec9fb3c
fix spcv refs
jannes-m Apr 19, 2022
68f9f94
delete superfluous code chunk and point again to the excercises
jannes-m Apr 19, 2022
fc8460f
make cv boxplotting more generic and add fallback learner explanation
jannes-m Apr 19, 2022
209ae67
add fallback learners and make sure benchmark does neither store the …
jannes-m Apr 19, 2022
f6f6fd1
fix wrong labeling
jannes-m Apr 19, 2022
d6bbb67
rename bmr output
jannes-m Apr 19, 2022
ac167ae
Merge branch 'main' into spcv_fixes
jannes-m Apr 19, 2022
4092502
resize lsl initiation point figure
jannes-m Apr 19, 2022
3030e1e
fix reference to gis chapter
jannes-m Apr 19, 2022
aa2ea14
resize prediction map figure
jannes-m Apr 19, 2022
b5ed4d9
add spatiotempcv ref again
jannes-m Apr 19, 2022
00edc9d
define a fallback learner and don't store model for follow-up analyses
jannes-m Apr 19, 2022
34b6578
replace parameter by object
jannes-m Apr 19, 2022
41b9774
polish refs
jannes-m Apr 19, 2022
efa9186
polish exercises
jannes-m Apr 19, 2022
c1f763b
add hyphen and state benchmark_grid() and benchmark() in the right order
jannes-m Apr 19, 2022
631b7c4
harmonize with main
jannes-m Apr 20, 2022
05e1c18
evaluate code chunks and use lazy.eval=FALSE knitr option where appro…
jannes-m Apr 20, 2022
d788bfe
start using the freshly computed benchmark values for glm and svm
jannes-m Apr 20, 2022
a4ce0d7
delete paragraph on optimal hyperpars for each fold at the performanc…
jannes-m Apr 20, 2022
4d70dff
evaluate boxplot chunk again
jannes-m Apr 20, 2022
5c1ab0e
save benchmark score output
jannes-m Apr 20, 2022
295e1c3
delete no longer needed rds files
jannes-m Apr 20, 2022
3beaa62
evaluate score code chunks
jannes-m Apr 20, 2022
72db49a
slightly rewrite mlr3 glm output part
jannes-m Apr 20, 2022
2f4e075
add further packages to the first chunk and slightly rewrite svm scor…
jannes-m Apr 20, 2022
24fe84c
fix typo
jannes-m Apr 20, 2022
1a6c1d9
update code/12-cv.R
jannes-m Apr 20, 2022
7f44b73
read in score output
jannes-m Apr 20, 2022
162 changes: 72 additions & 90 deletions 12-spatial-cv.Rmd

Large diffs are not rendered by default.

57 changes: 30 additions & 27 deletions 15-eco.Rmd
@@ -7,7 +7,7 @@ In it you will also make use of R's\index{R} interfaces to dedicated GIS\index{G

The chapter uses the following packages:

```{r 15-eco-1, message=FALSE, eval=FALSE}
```{r 15-eco-1, message=FALSE}
library(data.table)
library(dplyr)
library(mlr3)
@@ -70,7 +70,7 @@ To guarantee an optimal prediction, it is advisable to tune beforehand the hyper

All the data needed for the subsequent analyses is available via the **spDataLarge** package.

```{r 15-eco-2, eval=FALSE}
```{r 15-eco-2}
# spatial vector objects
data("study_area", "random_points", "comm", package = "spDataLarge")
# spatial raster objects
@@ -229,13 +229,13 @@ ep$carea = log10(ep$carea)

As a convenience to the reader, we have added `ep` to **spDataLarge**:

```{r 15-eco-9, eval=FALSE}
```{r 15-eco-9, cache.lazy=FALSE}
ep = terra::rast(system.file("raster/ep.tif", package = "spDataLarge"))
```

Finally, we can extract the terrain attributes to our field observations (see also Section \@ref(raster-extraction)).

```{r 15-eco-10, eval=FALSE}
```{r 15-eco-10, cache=TRUE, cache.lazy=FALSE, message=FALSE, warning=FALSE}
random_points[, names(ep)] =
# terra::extract automatically adds an ID column, which is unnecessary for our purposes
terra::extract(ep, terra::vect(random_points)) |>
@@ -272,7 +272,7 @@ Often ordinations\index{ordination} using presence-absence data yield better res
Ordination techniques such as NMDS\index{NMDS} require at least one observation per site.
Hence, we need to dismiss all sites in which no species were found.

```{r 15-eco-11, eval=FALSE}
```{r 15-eco-11}
# presence-absence matrix
pa = vegan::decostand(comm, "pa") # 100 rows (sites), 69 columns (species)
# keep only sites in which at least one species was found
@@ -303,7 +303,7 @@ nmds$stress
saveRDS(nmds, "extdata/15-nmds.rds")
```

```{r 15-eco-14, include=FALSE, eval=FALSE}
```{r 15-eco-14, include=FALSE}
nmds = readRDS("extdata/15-nmds.rds")
```

@@ -326,7 +326,13 @@ plot(y = sc[, 1], x = elev, xlab = "elevation in m",
ylab = "First NMDS axis", cex.lab = 0.8, cex.axis = 0.8)
```

```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%"}
```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%", message=FALSE, echo=FALSE}
elev = dplyr::filter(random_points, id %in% rownames(pa)) |>
dplyr::pull(dem)
# rotating NMDS in accordance with altitude (proxy for humidity)
rotnmds = vegan::MDSrotate(nmds, elev)
# extracting the first two axes
sc = vegan::scores(rotnmds, choices = 1:2)
knitr::include_graphics("figures/15_xy_nmds.png")
```

@@ -393,7 +399,7 @@ We refer the reader to @james_introduction_2013 for a more detailed description
To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS\index{NMDS} scores to the field observations (`random_points`).
We will also use the resulting data frame for the **mlr3**\index{mlr3 (package)} modeling later on.

```{r 15-eco-16, eval=FALSE}
```{r 15-eco-16, message=FALSE}
# construct response-predictor matrix
# id- and response variable
rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1])
@@ -478,7 +484,7 @@ Having already constructed the input variables (`rp`), we are all set for specif
For specifying a spatial task, we use again the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021 & Section \@ref(spatial-cv-with-mlr3)].
Since our response (`sc`) is numeric, we use a regression\index{regression} task.

```{r 15-eco-20, eval=FALSE}
```{r 15-eco-20}
# create task
task = mlr3spatiotempcv::TaskRegrST$new(
id = "mongon", backend = dplyr::select(rp, -id, -spri), target = "sc")
@@ -488,7 +494,7 @@ Using an `sf` object as the backend automatically provides the geometry informat
Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling.
Next, we construct a random forest\index{random forest} learner from the **ranger** package.

```{r 15-eco-21, eval=FALSE}
```{r 15-eco-21}
lrn_rf = lrn("regr.ranger", predict_type = "response")
```

@@ -507,7 +513,7 @@ Hyperparameter\index{hyperparameter} combinations will be selected randomly but
<!-- (`r # ncol(rp) - 1`), -->
(4), `sample.fraction` should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10.

```{r 14-eco-23, eval=FALSE}
```{r 15-eco-23}
# specifying the search space
search_space = paradox::ps(
mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),
@@ -517,12 +523,12 @@ search_space = paradox::ps(
```

Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function.
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)).
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr3)).
Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`).
In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/optimization.html#autotuner, @becker_mlr3_2021].
In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`search_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/optimization.html#autotuner, @becker_mlr3_2022].
The performance measure is the root mean squared error (RMSE\index{RMSE}).
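For reference, the RMSE\index{RMSE} of $n$ predictions $\hat{y}_i$ against observations $y_i$ is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$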

```{r 15-eco-22, eval=FALSE}
```{r 15-eco-22}
autotuner_rf = mlr3tuning::AutoTuner$new(
learner = lrn_rf,
# spatial partitioning
@@ -540,7 +546,7 @@ autotuner_rf = mlr3tuning::AutoTuner$new(

Calling the `train()` method of the `AutoTuner` object finally runs the hyperparameter\index{hyperparameter} tuning and finds the optimal hyperparameter\index{hyperparameter} combination for the specified parameters.

```{r 14-eco-24, eval=FALSE}
```{r 15-eco-24, eval=FALSE}
# hyperparameter tuning
set.seed(0412022)
autotuner_rf$train(task)
@@ -561,33 +567,29 @@ autotuner_rf$train(task)
saveRDS(autotuner_rf, "extdata/15-tune.rds")
```

```{r 15-eco-26, echo=FALSE, eval=FALSE}
```{r 15-eco-26, echo=FALSE, eval=TRUE, cache.lazy=FALSE}
autotuner_rf = readRDS("extdata/15-tune.rds")
```

An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of
<!-- `r # round(autotuner_rf$tuning_result$regr.rmse, 2)` -->
0.38
An `mtry` of `r autotuner_rf$tuning_result$mtry`, a `sample.fraction` of `r round(autotuner_rf$tuning_result$sample.fraction, 2)`, and a `min.node.size` of `r autotuner_rf$tuning_result$min.node.size` represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of `r round(autotuner_rf$tuning_result$regr.rmse, 2)`
is relatively good when considering the range of the response variable which is
<!-- `r # round(diff(range(rp$sc)), 2)` -->
3.04
(`diff(range(rp$sc))`).
`r round(diff(range(rp$sc)), 2)` (`diff(range(rp$sc))`).
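To make "relatively good" concrete, one can express the error as a share of the response range. A minimal sketch, assuming the `autotuner_rf` and `rp` objects from the chunks above (the numbers are the values reported in this section):

```r
# normalized RMSE: prediction error as a share of the response range
rmse = autotuner_rf$tuning_result$regr.rmse  # about 0.38
nrmse = rmse / diff(range(rp$sc))            # about 0.38 / 3.04 = 0.125
nrmse
```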

### Predictive mapping

The tuned hyperparameters\index{hyperparameter} can now be used for the prediction.
To do so, we only need to run the `predict` method of our fitted `AutoTuner` object.

```{r 15-eco-27, eval=FALSE}
```{r 15-eco-27, eval=TRUE, cache.lazy=FALSE}
# predicting using the best hyperparameter combination
autotuner_rf$predict(task)
```

The `predict` method will apply the model to all observations used in the modeling.
Given a multilayer `SpatRaster` containing rasters named as the predictors used in the modeling, `terra::predict()` will also make spatial predictions, i.e., predict to new data.

```{r 15-eco-28, eval=FALSE}
```{r 15-eco-28, eval=TRUE, cache.lazy=FALSE}
pred = terra::predict(ep, model = autotuner_rf, fun = predict)
```
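For a quick visual check before the polished map shown below, the `SpatRaster` can be plotted directly; a minimal sketch (the title string is our own):

```r
# quick look at the spatial prediction (`pred` comes from the previous chunk)
terra::plot(pred, main = "Predicted first NMDS axis")
```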

@@ -627,7 +629,7 @@ knitr::include_graphics("figures/15_rf_pred.png")

In case `terra::predict()` does not support a model algorithm, you can still make the predictions manually.

```{r 15-eco-29, eval=FALSE}
```{r 15-eco-29, cache.lazy=FALSE}
newdata = as.data.frame(as.matrix(ep))
colSums(is.na(newdata)) # 0 NAs
# but assuming there were 0s results in a more generic approach
Expand All @@ -637,7 +639,8 @@ newdata[ind, "pred"] = data.table::as.data.table(tmp)[["response"]]
pred_2 = ep$dem
# now fill the raster with the predicted values
pred_2[] = newdata$pred
identical(values(pred), values(pred_2)) # TRUE
# check if the terra prediction and our manual prediction are the same
all(values(pred - pred_2) == 0)
```

The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)).
18 changes: 15 additions & 3 deletions _12-ex.Rmd
@@ -1,4 +1,6 @@
```{asis, message=FALSE}
The solutions assume the following packages are attached (other packages will be attached when needed):
```

```{r 12-ex-e0, message=FALSE, warning=FALSE}
library(dplyr)
@@ -16,6 +18,7 @@ library(tmap)
```

E1. Compute the following terrain attributes from the `elev` dataset loaded with `terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))$elev` with the help of R-GIS bridges (see this [Chapter](https://geocompr.robinlovelace.net/gis.html#gis)):

- Slope
- Plan curvature
- Profile curvature
@@ -116,9 +119,11 @@ tm_shape(hs, bbox = bbx) +
legend.title.size = 0.9)
```

E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv).
E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv)).

Hint: You need to specify a non-spatial resampling strategy.
Another hint: You might want to Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking).

Another hint: You might want to solve Exercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking).
When doing so, keep in mind that the computation can take a very long time, possibly several days.
This, of course, depends on your system.
Computation time is shorter the more RAM and cores you have at your disposal.
@@ -140,11 +145,14 @@ task = TaskClassifST$new(
# construct learners (for all subsequent exercises)
# GLM
lrn_glm = lrn("classif.log_reg", predict_type = "prob")
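# the featureless learner serves as a fallback: if the GLM errors on a fold,
# its predictions are used instead and the resampling keeps running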
lrn_glm$fallback = lrn("classif.featureless", predict_type = "prob")

# SVM
# construct SVM learner (using ksvm function from the kernlab package)
lrn_ksvm = lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot",
type = "C-svc")
lrn_ksvm$fallback = lrn("classif.featureless", predict_type = "prob")

# specify nested resampling and adjust learner accordingly
# five spatially disjoint partitions
tune_level = rsmp("spcv_coords", folds = 5)
@@ -167,6 +175,7 @@ at_ksvm = AutoTuner$new(

# QDA
lrn_qda = lrn("classif.qda", predict_type = "prob")
lrn_qda$fallback = lrn("classif.featureless", predict_type = "prob")

# SVM without tuning hyperparameters
vals = lrn_ksvm$param_set$values
@@ -194,7 +203,10 @@ library(future)
future::plan(list("sequential", "multisession"),
workers = floor(availableCores() / 2))
set.seed(021522)
bmr = benchmark(grid, store_backends = FALSE)
bmr = benchmark(grid,
                # keep the result object small
                store_backends = FALSE,
                store_models = FALSE,
                # encapsulate model errors so that the fallback learners can step in
                encapsulate = "evaluate")
# stop parallelization
future:::ClusterRegistry("stop")
# save your result, e.g. to