Review of ch12 #782
Changes from all commits
6916cb6
3a7a011
64adfd4
09314a3
4bddbdc
5ea0419
cec166a
b56b773
@@ -30,9 +30,9 @@ Required data will be attached in due course.
 Statistical learning\index{statistical learning} is concerned with the use of statistical and computational models for identifying patterns in data and predicting from these patterns.
 Due to its origins, statistical learning\index{statistical learning} is one of R's\index{R} great strengths (see Section \@ref(software-for-geocomputation)).^[
-Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of Geostatistics, Spatial Statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
+Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of geostatistics, spatial statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
 ]
-Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and its methods can be categorized into supervised and unsupervised techniques.
+Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and can be categorized into supervised and unsupervised techniques.

Review comment: One question: could we rename the chapter Geostatistical learning?

Review comment: No, wouldn't do that, geostatistics is basically a field of its own and we are not doing geostatistics here.

 Both are increasingly used in disciplines ranging from physics, biology and ecology to geography and economics [@james_introduction_2013].
 This chapter focuses on supervised techniques in which there is a training dataset, as opposed to unsupervised techniques such as clustering\index{clustering}.
@@ -79,7 +79,7 @@ data("lsl", "study_mask", package = "spDataLarge")
 ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
 ```
-This should load three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.
+The above code loads three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.

Review comment: 👍

 `lsl` contains a factor column `lslpts` where `TRUE` corresponds to an observed landslide 'initiation point', with the coordinates stored in columns `x` and `y`.^[
 The landslide initiation point is located in the scarp of a landslide polygon. See @muenchow_geomorphic_2012 for further details.
 ]
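As a quick check that the chunk above behaves as described (a hedged sketch; it assumes the **spDataLarge** package is installed and the object names follow the chapter), one could inspect the loaded objects:

```r
# Sanity check of the loaded objects; exact output depends on the
# installed spDataLarge version
data("lsl", "study_mask", package = "spDataLarge")
ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
class(lsl)         # should be a data.frame
class(study_mask)  # should be an sf object
names(ta)          # terrain attribute layers stored in the SpatRaster
table(lsl$lslpts)  # balance of landslide vs. non-landslide points
```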
@@ -90,28 +90,26 @@ The 175 non-landslide points were sampled randomly from the study area, with the
 # library(tmap)
 # data("lsl", package = "spDataLarge")
 # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
-# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
+# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = "EPSG:32717")

Review comment: 👍

 # hs = terra::shade(slope = ta$slope * pi / 180,
 # terra::terrain(ta$elev, v = "aspect", unit = "radians"))
 # # so far tmaptools does not support terra objects
 # rect = tmaptools::bb_poly(raster::raster(hs))
-# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
+# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
 # ylim = c(-0.0001, 1), relative = TRUE)
 # map = tm_shape(hs, bbox = bbx) +
 # tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
-# labels.rot = c(0, 90)) +
+# labels.rot = c(0, 90), lines = FALSE) +
 # tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) +
 # tm_shape(ta$elev) +
 # tm_raster(alpha = 0.5, palette = terrain.colors(10), legend.show = FALSE) +
 # tm_shape(lsl_sf) +
-# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
+# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
 # title.col = "Landslide: ") +
 # qtm(rect, fill = NULL) +
-# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE) +
+# tm_layout(inner.margins = 0) +
 # tm_legend(bg.color = "white")
-# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
+# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
 # height = 11, units = "cm")
 knitr::include_graphics("figures/lsl-map-1.png")
 ```
@@ -121,21 +119,20 @@ The first three rows of `lsl`, rounded to two significant digits, can be found i
 ```{r lslsummary, echo=FALSE, warning=FALSE}
 lsl_table = lsl |>
-  mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2))) |>
-  head(3)
-knitr::kable(lsl_table, caption = "Structure of the lsl dataset.",
+  mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2)))
+knitr::kable(lsl_table[c(1, 2, 350), ], caption = "Structure of the lsl dataset.",
              caption.short = "`lsl` dataset.", booktabs = TRUE) |>
   kableExtra::kable_styling(latex_options = "scale_down")
 ```
 To model landslide susceptibility, we need some predictors.
 Since terrain attributes are frequently associated with landsliding [@muenchow_geomorphic_2012], we have already extracted the following terrain attributes from `ta` to `lsl`:

-- `slope`: slope angle (°).
-- `cplan`: plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow.
-- `cprof`: profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle.
-- `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area.
-- `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location.
+- `slope` - slope angle (°)
+- `cplan` - plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow
+- `cprof` - profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle
+- `elev` - elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area
+- `log10_carea` - the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location

Review comment: I prefer the previous style here, but without the full stops.

Review comment: @Robinlovelace The new style is consistent with the style in chapters 1-8.

Review comment: Which parts @Nowosad? Had a quick look at the output below.

Review comment: I based it on the example from ch5.

Review comment: I think the style is fine but the dash does not seem standard. Also, if there are full stops (periods) in the bullet points there should be one at the end: https://www.instructionalsolutions.com/blog/bulleted-list-punctuation. Regarding colons vs dashes, I think both would be right, with the former being an 'em dash'.

Review comment: @Robinlovelace I am fine with colons -- we just need to use them consistently.

 It might be a worthwhile exercise to compute the terrain attributes with the help of R-GIS bridges (see Chapter \@ref(gis)) and extract them to the landslide points (see the Exercise section at the end of this chapter).
@@ -158,7 +155,7 @@ It is worth understanding each of the three input arguments:

 - A formula, which specifies landslide occurrence (`lslpts`) as a function of the predictors
 - A family, which specifies the type of model, in this case `binomial` because the response is binary (see `?family`)
-- The data frame which contains the response and the predictors
+- The data frame which contains the response and the predictors (as columns)

 The results of this model can be printed as follows (`summary(fit)` provides a more detailed account of the results):
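The fitting call itself falls outside this hunk; a sketch consistent with the three arguments listed above (formula, family, data frame) might look like the following, where the predictor set matches the terrain attributes described earlier and `fit` is an assumed object name:

```r
# Hedged sketch of the GLM fit; predictor names follow the lsl dataset
fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
          family = binomial(),
          data = lsl)
summary(fit)  # detailed account of the fitted coefficients
```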
@@ -179,7 +176,7 @@ head(pred_glm)

 Spatial predictions can be made by applying the coefficients to the predictor rasters.
 This can be done manually or with `terra::predict()`.
-In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).
+In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors (raster layers) named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).

 ```{r 12-spatial-cv-9, eval=FALSE}
 # making the prediction
@@ -191,16 +188,15 @@ pred = terra::predict(ta, model = fit, type = "response")
 # data("lsl", "study_mask", package = "spDataLarge")
 # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
 # study_mask = terra::vect(study_mask)
 # # white raster to only plot the axis ticks, otherwise gridlines would be visible
 # lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
 # hs = terra::shade(ta$slope * pi / 180,
 # terra::terrain(ta$elev, v = "aspect", unit = "radians"))
 # rect = tmaptools::bb_poly(raster::raster(hs))
-# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.00001, 1),
-# ylim = c(-0.00001, 1), relative = TRUE)
+# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
+# ylim = c(-0.0001, 1), relative = TRUE)
 # map = tm_shape(hs, bbox = bbx) +
 # tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
-# labels.rot = c(0, 90)) +
+# labels.rot = c(0, 90), lines = FALSE) +
 # tm_raster(palette = "white", legend.show = FALSE) +
 # # hillshade
 # tm_shape(terra::mask(hs, study_mask), bbox = bbx) +
@@ -210,12 +206,10 @@ pred = terra::predict(ta, model = fit, type = "response")
 # tm_shape(terra::mask(pred, study_mask)) +
 # tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE,
 # title = "Susceptibility") +
-# # rectangle and outer margins
-# qtm(rect, fill = NULL) +
-# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE,
-# legend.position = c("left", "bottom"),
-# legend.title.size = 0.9)
-# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
+# tm_layout(legend.position = c("left", "bottom"),
+# legend.title.size = 0.9,
+# inner.margins = 0)
+# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
 # height = 11, units = "cm")
 knitr::include_graphics("figures/lsl-susc-1.png")
 ```
@@ -313,12 +307,12 @@ Third, the **resampling** approach assesses the predictive performance of the mo
 To implement a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data.
 Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `TaskClassifST$new()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::TaskClassif$new()` or `mlr3::TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavy use of **data.table** and **R6** classes. Although you can use **mlr3** without knowing the specifics of **data.table** or **R6**, some familiarity with them helps. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].]
 The first essential argument of these `Task*$new()` functions is `backend`.
-`backend` expects the data to be used for the modeling including the response and predictor variables.
+`backend` expects that the input data includes the response and predictor variables.
 The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`).
 All other variables of the `lsl` dataset will serve as predictors.
 For spatial CV, we need to provide a few extra arguments (`extra_args`).
 The `coordinate_names` argument expects the names of the coordinate columns (see Section \@ref(intro-cv) and Figure \@ref(fig:partitioning)).
-Additionally, one should indicate the used CRS (`crs`) and if one wishes to use the coordinates as predictors in the modeling (`coords_as_features`).
+Additionally, we should indicate the used CRS (`crs`) and decide if we want to use the coordinates as predictors in the modeling (`coords_as_features`).

 ```{r 12-spatial-cv-11, eval=FALSE}
 # create task
@@ -330,17 +324,16 @@ task = mlr3spatiotempcv::TaskClassifST$new(
   extra_args = list(
     coordinate_names = c("x", "y"),
     coords_as_features = FALSE,
-    crs = 32717)
+    crs = "EPSG:32717")
 )
 ```

 Note that `TaskClassifST$new()` also accepts an `sf`-object as input for the `backend` parameter.
 In this case, you might only want to specify the `coords_as_features` argument of the `extra_args` list.
-We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` just converts it back into a non-spatial `data.table` object in the background.
+We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` would just turn it back into a non-spatial `data.table` object in the background.
 For a short data exploration, the `autoplot()` function of the **mlr3viz** package might come in handy since it plots the response against all predictors and all predictors against all predictors (not shown).

 ```{r autoplot, eval=FALSE}
 library(mlr3viz)
 # plot response against each predictor
 mlr3viz::autoplot(task, type = "duo")
 # plot all variables against each other
@@ -353,7 +346,6 @@ All classification\index{classification} **learners** start with `classif.` and
 To find out about learners that are able to model a binary response variable, we can run:

 ```{r 12-spatial-cv-12, eval=FALSE}
 library(mlr3extralearners)
 mlr3extralearners::list_mlr3learners(
   filter = list(class = "classif", properties = "twoclass"),
   select = c("id", "mlr3_package", "required_packages")) |>
@@ -384,7 +376,6 @@ We opt for the binomial classification\index{classification} method used in Sect
 Additionally, we need to specify the `predict.type` which determines the type of the prediction with `prob` resulting in the predicted probability for landslide occurrence between 0 and 1 (this corresponds to `type = response` in `predict.glm`).

 ```{r 12-spatial-cv-13, eval=FALSE}
 library(mlr3learners)
 learner = mlr3::lrn("classif.log_reg", predict_type = "prob")
 ```
@@ -468,6 +459,7 @@ mean(score_spcv_glm$classif.auc) |>
 ```

 To put these results in perspective, let us compare them with AUROC\index{AUROC} values from a 100-repeated 5-fold non-spatial cross-validation (Figure \@ref(fig:boxplot-cv); the code for the non-spatial cross-validation\index{cross-validation} is not shown here but will be explored in the exercise section).
+<!--JN: why "as expected"? I think it would be great to explain this expectation in a few sentences here...-->
 As expected, the spatially cross-validated result yields lower AUROC values on average than the conventional cross-validation approach, underlining the over-optimistic predictive performance due to spatial autocorrelation\index{autocorrelation!spatial} of the latter.

 ```{r boxplot-cv, echo=FALSE, out.width="75%", fig.cap="Boxplot showing the difference in GLM AUROC values on spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values."}
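The non-spatial comparison is not shown in the diff; a hedged sketch of the two resampling set-ups, using the resampling ids provided by **mlr3** and **mlr3spatiotempcv** (the object names are illustrative), could be:

```r
# Spatial vs. conventional resampling objects for the 100-repeated
# 5-fold comparison described above
library(mlr3)
library(mlr3spatiotempcv)
rsmp_sp    = rsmp("repeated_spcv_coords", folds = 5, repeats = 100)
rsmp_nonsp = rsmp("repeated_cv", folds = 5, repeats = 100)
```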
@@ -499,7 +491,7 @@ Random forest\index{random forest} models might be more popular than SVMs; howev
 Since (spatial) hyperparameter tuning is the major aim of this section, we will use an SVM.
 For those wishing to apply a random forest model, we recommend reading this chapter and then proceeding to Chapter \@ref(eco), in which we apply the concepts and techniques covered here to make spatial predictions based on a random forest model.

-SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to allow for non-linear boundaries between classes [@james_introduction_2013].
+SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to create non-linear boundaries between classes [@james_introduction_2013].
 Hyperparameters\index{hyperparameter} should not be confused with coefficients of parametric models, which are sometimes also referred to as parameters.^[
 For a detailed description of the difference between coefficients and hyperparameters, see the 'machine mastery' blog post on the subject.
 <!-- For a more detailed description of the difference between coefficients and hyperparameters, see the [machine mastery blog](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/). -->
@@ -516,9 +508,9 @@ The classification\index{classification} task remains the same, hence we can sim
 Learners implementing SVM can be found using `listLearners()` as follows:

 ```{r 12-spatial-cv-23}
-mlr3extralearners::list_mlr3learners() %>%
-  .[class == "classif" & grepl("svm", id),
-    .(id, class, mlr3_package, required_packages)]
+mlr3_learners = list_mlr3learners()
+mlr3_learners[class == "classif" & grepl("svm", id),
+              .(id, class, mlr3_package, required_packages)]
 ```

 Of the options illustrated above, we will use `ksvm()` from the **kernlab** package [@karatzoglou_kernlab_2004].
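The learner construction itself is outside this hunk; based on the `classif.ksvm` learner id surfaced by the listing above, a sketch could look like the following (the `kernel` and `type` values are assumptions matching kernlab's C-classification with an RBF kernel, and `lrn_ksvm` is an illustrative name):

```r
# Hedged sketch of the kernlab-based SVM learner
library(mlr3extralearners)
lrn_ksvm = mlr3::lrn("classif.ksvm", predict_type = "prob",
                     kernel = "rbfdot", type = "C-svc")
```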
@@ -574,14 +566,13 @@ talk in person (see also exercises):
 -->

 ```{r 12-spatial-cv-26, eval=FALSE}
 library("mlr3tuning")
 # five spatially disjoint partitions
 tune_level = mlr3::rsmp("spcv_coords", folds = 5)
 # use 50 randomly selected hyperparameters
 terminator = mlr3tuning::trm("evals", n_evals = 50)
 tuner = mlr3tuning::tnr("random_search")
 # define the outer limits of the randomly selected hyperparameters
-seach_space = paradox::ps(
+search_space = paradox::ps(
   C = paradox::p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x),
   sigma = paradox::p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x)
 )
@@ -601,17 +592,17 @@ at_ksvm = mlr3tuning::AutoTuner$new(
 ```

 The tuning is now set-up to fit 250 models to determine optimal hyperparameters for one fold.
-Repeating this for each fold, we end up with 1250 (250 \* 5) models for each repetition.
+Repeating this for each fold, we end up with 1,250 (250 \* 5) models for each repetition.
 Repeated 100 times means fitting a total of 125,000 models to identify optimal hyperparameters (Figure \@ref(fig:partitioning)).
 These are used in the performance estimation, which requires the fitting of another 500 models (5 folds \* 100 repetitions; see Figure \@ref(fig:partitioning)).
 To make the performance estimation processing chain even clearer, let us write down the commands we have given to the computer:

-1. Performance level (upper left part of Figure \@ref(fig:inner-outer)): split the dataset into five spatially disjoint (outer) subfolds.
-1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)): use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
-   Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models.
-1. Performance estimation: Use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC}).
-1. Repeat steps 2 and 3 for the remaining four outer folds.
-1. Repeat steps 2 to 4, 100 times.
+1. Performance level (upper left part of Figure \@ref(fig:inner-outer)) - split the dataset into five spatially disjoint (outer) subfolds
+1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)) - use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
+   Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models
+1. Performance estimation - use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC})
+1. Repeat steps 2 and 3 for the remaining four outer folds
+1. Repeat steps 2 to 4, 100 times

 The process of hyperparameter tuning and performance estimation is computationally intensive.
 To decrease model runtime, **mlr3** offers the possibility to use parallelization\index{parallelization} with the help of the **future** package.
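The model counts quoted in the hunk above can be verified with a few lines of arithmetic (variable names are illustrative):

```r
# Model counts for the nested spatial CV described above
per_outer_fold  = 50 * 5                 # 50 hyperparameter draws x 5 inner folds = 250
per_repetition  = per_outer_fold * 5     # x 5 outer folds = 1250
tuning_total    = per_repetition * 100   # x 100 repetitions = 125000
performance_fit = 5 * 100                # 5 outer folds x 100 repetitions = 500
```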
Review comment: 👍